A dataset containing the table of contents of 56K ebook titles extracted from Springer

Citation Author(s):
Eleni Giannopoulou, National Technical University of Athens (NTUA)
Nikolaos Mitrou, National Technical University of Athens (NTUA)
Submitted by: Eleni Giannopoulou
Last updated: Sat, 11/14/2020 - 13:59
DOI: 10.21227/vjg3-em74

Abstract 

This dataset was created from a collection of 56,403 multidisciplinary book titles from Springer, available through the Hellenic Academic Libraries Link (https://www.heal-link.gr/en/home-2/) subscription. A parser was written to extract the relevant information, such as the title, subtitle and table of contents (ToC), from each book, and the extracted information was stored in a database for further processing. Each book entry in the database includes the bookid, title, and ToC. As a next step, a team of librarians working in the NTUA Digital Library manually added the subject field information. This dataset contains the primary subject field as each book's label. The 5-category subset also has an additional field that contains the secondary labels for each book.

This dataset can serve as a basis for multiclass classification problems and/or content recommendation. The 5-category subset can also be used for multilabel classification tasks. Using the information in the ToC makes it possible to capture the topics of each book more accurately, which helps in identifying similar books. The dataset contains 2 subsets: (a) 26 categories and (b) 5 general categories, as detailed below:

26 categories: Anthropology, Art, Computer Science, Culture, Economics, Education, Engineering, Environment, Food, History, Humanities, Law, Life Sciences, Linguistics, Literature, Management, Mathematics, Medicine, Music, Organization, Physical Sciences, Popular works, Religion, Social Sciences, Science, Transportation

5 categories: Computer Science, Engineering, Mathematics, Medicine, Physics
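
To illustrate how ToC text could help identify similar books, here is a minimal sketch that compares tables of contents by word overlap (Jaccard similarity). It is not the authors' method. The function names `tokenize`, `jaccard` and `most_similar` are hypothetical helpers, and the ToC strings are made up for illustration rather than taken from the dataset.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets: |a & b| / |a | b|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_similar(query_toc, tocs):
    """Return (index, score) of the entry in `tocs` closest to `query_toc`.

    `tocs` must be a non-empty list of ToC strings.
    """
    query_tokens = tokenize(query_toc)
    scores = [jaccard(query_tokens, tokenize(t)) for t in tocs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Made-up ToC strings for illustration only; they are not taken from the dataset.
example_tocs = [
    "Introduction. Linear Algebra. Eigenvalues. Matrix Decompositions.",
    "Introduction. Cardiology. Clinical Trials. Patient Care.",
]
index, score = most_similar("Matrix Algebra and Eigenvalues", example_tocs)
print(index, round(score, 3))
```

Word overlap is only a simple baseline. On the real data, a weighted representation such as TF-IDF would likely separate topics better, since frequent words like "Introduction" appear in most tables of contents.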

Instructions: 

This dataset is a set of pickle (.p) files and can be loaded in any Python script or Jupyter notebook as a dataframe using the following commands:

import pickle  # pandas must also be installed so the dataframes can be unpickled

# 26 categories
with open("springer_dataframe_26_categories.p", "rb") as f:
    new_data_26_cat = pickle.load(f)

# 5 categories
with open("springer_dataframe_5_categories.p", "rb") as f:
    new_data_5_cat = pickle.load(f)
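
The loading step can also be wrapped in a small helper that fails early with a clear message when a file is missing and reports what was loaded. This is a minimal sketch. `load_subset` and `describe` are hypothetical names, not part of the dataset, and the commented usage lines assume the two .p files sit in the working directory.

```python
import pickle
from pathlib import Path


def load_subset(path):
    """Load one pickled subset, raising a clear error if the file is missing."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Dataset file not found: {path}")
    with path.open("rb") as f:
        # Unpickling can run arbitrary code, so only load files from a trusted source.
        return pickle.load(f)


def describe(frame):
    """Return the shape and column names if the object has them (a pandas DataFrame does)."""
    columns = getattr(frame, "columns", None)
    return {
        "shape": getattr(frame, "shape", None),
        "columns": None if columns is None else list(columns),
    }


# Usage (assumes the dataset files are in the current directory):
# new_data_26_cat = load_subset("springer_dataframe_26_categories.p")
# print(describe(new_data_26_cat))
```

Printing the column names first is useful because the exact field names in the dataframes are not listed in the description above.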