Abstract

This dataset was created from a collection of 56,403 multidisciplinary book titles from Springer, available through the Hellenic Academic Libraries Link (https://www.heal-link.gr/en/home-2/) subscription. To build it, a parser was written to extract the relevant information from each book, such as the title, subtitle, and table of contents (ToC). The extracted information was stored in a database for further processing. Each book record in the database includes the bookid, title, and ToC. As a next step, a team of librarians working in the NTUA Digital Library manually added the subject field information. The dataset uses the primary subject field as the label of each book. The 5 categories subset also includes an additional field that contains the secondary labels for each book in the collection.

This dataset can serve as a basis for multiclass classification problems and/or content recommendation. The 5 categories subset can also be used for multilabel classification tasks. By utilizing information from the ToC, we can better capture the topics in each book, thereby facilitating the identification of similar books. The dataset contains 2 subsets: (a) 26 categories and (b) 5 general categories, as detailed below:

26 categories: Anthropology, Art, Computer Science, Culture, Economics, Education, Engineering, Environment, Food, History, Humanities, Law, Life Sciences, Linguistics, Literature, Management, Mathematics, Medicine, Music, Organization, Physical Sciences, Popular works, Religion, Social Sciences, Science, Transportation

5 categories: Computer Science, Engineering, Mathematics, Medicine, Physics
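For the multilabel use of the 5 categories subset mentioned above, a book's primary and secondary labels can be combined into a single multi-hot target vector. The following is a minimal sketch, not the authors' method: the function name to_multi_hot is hypothetical, and it assumes the secondary labels are available as an iterable of category names spelled exactly as in the list above. The actual field names and label format in the dataframe should be checked after loading.

```python
# The five general categories, in a fixed order that defines the vector layout.
FIVE_CATEGORIES = [
    "Computer Science",
    "Engineering",
    "Mathematics",
    "Medicine",
    "Physics",
]


def to_multi_hot(primary, secondary=(), categories=FIVE_CATEGORIES):
    """Return a 0/1 list with a 1 for every category assigned to the book.

    Assumption: `primary` is a single category name and `secondary` is an
    iterable of category names (possibly empty).
    """
    labels = {primary, *secondary}
    unknown = labels - set(categories)
    if unknown:
        # Fail loudly on misspelled or unexpected labels instead of dropping them.
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return [1 if category in labels else 0 for category in categories]
```

For example, to_multi_hot("Mathematics", ["Physics"]) marks the third and fifth positions of the vector, while a book with no secondary labels yields a vector with a single 1.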

Instructions:

This dataset is distributed as a set of pickle (.p) files, which can be loaded as dataframes in any Python script or Jupyter notebook using the following commands:

import pickle

# 26 categories
with open("springer_dataframe_26_categories.p", "rb") as f:
    new_data_26_cat = pickle.load(f)

# 5 categories
with open("springer_dataframe_5_categories.p", "rb") as f:
    new_data_5_cat = pickle.load(f)
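The loading step above can be wrapped in a small helper that reports a missing file clearly and always closes the file handle. This is a sketch under stated assumptions: the function name load_pickled_dataframe is hypothetical, the .p files are assumed to have been extracted from the archive into the working directory, and pandas is assumed to be installed, since unpickling a dataframe requires it. Pickle files can execute arbitrary code when loaded, so only files obtained from a trusted source should be opened this way.

```python
import pickle
from pathlib import Path


def load_pickled_dataframe(path):
    """Load one of the dataset's pickle files and return the unpickled object."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found; extract the dataset archive first"
        )
    # The context manager closes the file handle even if unpickling fails.
    with path.open("rb") as f:
        return pickle.load(f)
```

A typical call would be df = load_pickled_dataframe("springer_dataframe_26_categories.p"), followed by df.columns and df.head() to check which fields the dataframe actually contains before building a classifier on top of it.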

Dataset Files

TOC_Springer_26_and_5_categories.zip (24.85 MB)


Open Access

A dataset containing the table of contents of 56K ebook titles extracted from Springer
