Open Access
A dataset containing the table of contents of 56K ebook titles extracted from Springer
- Submitted by: Eleni Giannopoulou
- Last updated: Tue, 05/17/2022 - 22:21
- DOI: 10.21227/vjg3-em74
Abstract
This dataset was created from a collection of 56,403 multidisciplinary book titles from Springer, available through the Hellenic Academic Libraries Link (https://www.heal-link.gr/en/home-2/) subscription. To build it, a parser was written to extract the relevant information, such as the title, subtitle and table of contents (ToC), from each book. The extracted information was stored in a database for further processing. Each book entry in the database includes the bookid, title, and ToC. As a next step, a team of librarians working at the NTUA Digital Library manually added the subject field information. The dataset contains the primary subject field as each book’s label. The 5 categories subset also includes an additional field that contains the secondary labels for each book.
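As an illustration of the record layout described above, the following is a minimal sketch of one book entry. The field names `bookid`, `title` and `toc` follow the description, while `label`, `secondary_labels` and the example values are assumptions made for illustration; the actual column names in the files may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BookRecord:
    # Hypothetical field names; check the real DataFrame columns before use.
    bookid: str
    title: str
    toc: str
    label: str  # primary subject field
    # Secondary labels exist only in the 5 categories subset.
    secondary_labels: List[str] = field(default_factory=list)

    def all_labels(self) -> List[str]:
        """Return the primary label first, then secondary labels, without duplicates."""
        labels = [self.label]
        for name in self.secondary_labels:
            if name not in labels:
                labels.append(name)
        return labels


# Invented example record, not taken from the dataset.
example = BookRecord(
    bookid="hypothetical-001",
    title="An Invented Title",
    toc="Introduction; Linear Models; Neural Networks",
    label="Computer Science",
    secondary_labels=["Mathematics", "Computer Science"],
)
print(example.all_labels())
```

Combining the primary and secondary labels in this way gives the label set a multilabel classifier would be trained on.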
This dataset can serve as a basis for multiclass classification problems and/or content recommendation. The 5 categories subset can also be used for multilabel classification tasks. Using information from the ToC makes it possible to better capture the topics in each book, which helps identify similar books. The dataset contains 2 subsets: (a) 26 categories and (b) 5 general categories, as detailed below:
26 categories: Anthropology, Art, Computer Science, Culture, Economics, Education, Engineering, Environment, Food, History, Humanities, Law, Life Sciences, Linguistics, Literature, Management, Mathematics, Medicine, Music, Organization, Physical Sciences, Popular works, Religion, Social Sciences, Science, Transportation
5 categories: Computer Science, Engineering, Mathematics, Medicine, Physics
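One way to see how ToC text could help identify similar books is to compare TF-IDF vectors with cosine similarity. The sketch below uses only the Python standard library and three invented ToC strings; the helper names (`tokenize`, `tfidf_vectors`, `cosine`, `most_similar`) are hypothetical, and this is not necessarily the approach used by the dataset authors.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using a smoothed IDF."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(index, vectors):
    """Return the index of the document closest to vectors[index], excluding itself."""
    scores = [(cosine(vectors[index], vec), j)
              for j, vec in enumerate(vectors) if j != index]
    return max(scores)[1]


# Invented ToC strings for illustration only.
tocs = [
    "Introduction Neural Networks Training Optimization",
    "Neural Networks Deep Learning Optimization Methods",
    "Cardiology Clinical Trials Patient Care",
]
vectors = tfidf_vectors(tocs)
print(most_similar(0, vectors))
```

On the real data, the same idea would be applied to the ToC field of each book, most likely with a library implementation of TF-IDF rather than this hand-rolled one.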
The dataset is distributed as a set of pickle (.p) files, which can be loaded as dataframes in any Python script or Jupyter notebook using the following commands:
import pickle

# 26 categories
with open("springer_dataframe_26_categories.p", "rb") as f:
    new_data_26_cat = pickle.load(f)

# 5 categories
with open("springer_dataframe_5_categories.p", "rb") as f:
    new_data_5_cat = pickle.load(f)
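Because the column names of the dataframes are not listed here, it can help to inspect an object right after loading it. The following is a minimal sketch with two hypothetical helpers, `load_pickle` and `summarize`. It assumes the files have been extracted into the working directory and that pandas is installed, since unpickling a dataframe requires it.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Load a pickled object, failing with a clear message if the file is missing."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Expected dataset file at {path}")
    with path.open("rb") as handle:
        # Only unpickle files from a source you trust.
        return pickle.load(handle)


def summarize(obj):
    """Report the type, row count and column names (if any) of a loaded object."""
    columns = [str(col) for col in getattr(obj, "columns", [])]
    return {"type": type(obj).__name__, "rows": len(obj), "columns": columns}
```

For example, `summarize(load_pickle("springer_dataframe_26_categories.p"))` would list the columns actually present in the 26 categories subset.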
Dataset Files
- TOC_Springer_26_and_5_categories.zip (24.85 MB)