Standard Dataset
CIRDC
- Submitted by: Yuhang Zhang
- Last updated: Wed, 09/18/2024 - 09:39
- DOI: 10.21227/6514-ay49
Abstract
The IEEE Xplore database is vital in democratizing access to high-quality research datasets, fostering global collaboration, and promoting interdisciplinary studies. Insights from the IEEE Xplore database support applications in academic collaboration networks, predictive research trends, recommendation systems, and the evolution of scientific discourse. Our CIRDC dataset extracts key metadata for all articles in the IEEE Xplore database using web data mining methods. The source code and scripts used for data collection are provided to promote transparency and reproducibility. For a comprehensive description of the database, please refer to the article: Y. Zhang, Y. Li, S. Makonin, and R. Kumar. Descriptor: Comprehensive IEEE Research Data Collections (CIRDC). IEEE Data Description.
Database Structure
The database is in the CIRDC folder. Each directory in CIRDC corresponds to one journal or conference and is named by its publication number. Each publication-number directory contains multiple JSON files, one per year, named <year>.json. Each JSON file holds the information of all papers published in that journal/conference in that year. An example is shown below, where 10 and 100 are publication numbers and 1964 and 1965 are years.
CIRDC/
├── 10
│   ├── 1964.json
│   ├── 1965.json
│   ├── ...
├── 100
│   ├── ...
├── ...
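The layout above can be walked with a few lines of Python. This is a minimal sketch, not part of the dataset's own scripts: the function name iter_year_files is hypothetical, and it assumes only the directory structure described here (publication-number directories that directly contain <year>.json files).

```python
from pathlib import Path


def iter_year_files(cirdc_root):
    """Yield (publication_number, year, path) for every <year>.json file.

    Assumes the layout CIRDC/<publication number>/<year>.json described above.
    Names are returned as strings, exactly as they appear on disk.
    """
    for pub_dir in sorted(Path(cirdc_root).iterdir()):
        if not pub_dir.is_dir():
            continue  # skip stray files at the top level
        for json_file in sorted(pub_dir.glob("*.json")):
            yield pub_dir.name, json_file.stem, json_file
```

For example, list(iter_year_files("CIRDC")) gives one tuple per publication and year, which is a convenient starting point for counting files or processing the collection in batches.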
Data File Structure
Each JSON file contains a list, and each entry in the list holds the metadata of one paper. The paper metadata includes:
- publicationNumber: Identifier for the journal/conference
- doi: Digital Object Identifier of the paper
- publicationYear: Year the paper was published
- publicationDate: Full date of publication
- articleNumber: A unique number assigned to the paper
- articleTitle: Title of the paper
- volume: Volume number
- issue: Issue number
- startPage: Starting page number
- endPage: Ending page number
- publisher: Name of the publisher
- articleContentType: Type of the paper, i.e., journal, conference, magazine, or early access article
- publicationTitle: Name of the journal/conference
- authors: A list of authors
Each author entry in the authors field contains the following data:
- id: ID number of the author in the IEEE system
- preferredName: Full name of the author
- firstName: First name of the author
- lastName: Last name of the author
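A minimal sketch of reading one data file follows. The field names (articleTitle, authors, preferredName) come from the description above; the helper names load_papers and author_names are hypothetical, and the code assumes that some records may lack an authors list, which is handled defensively rather than known to occur.

```python
import json


def load_papers(json_path):
    """Load one <year>.json file and return its list of paper records."""
    with open(json_path, encoding="utf-8") as f:
        return json.load(f)


def author_names(paper):
    """Return the preferredName of each author of a paper record.

    A missing or empty authors field yields an empty list (an assumption
    made for robustness, not a documented property of the data).
    """
    return [author.get("preferredName", "") for author in paper.get("authors") or []]
```

With these helpers, titles and authors for a file such as CIRDC/10/1964.json can be printed by looping over load_papers(...) and reading paper["articleTitle"] alongside author_names(paper).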
Publication Number Index
The publication_number_index.csv file provides an easy-to-navigate index of publication numbers, allowing users to quickly look up and cross-reference the corresponding publication number for specific journals and conferences by their names.
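The lookup described above might be sketched as follows. The column names of publication_number_index.csv are not stated here, so the defaults publicationTitle and publicationNumber are assumptions; pass the actual header names if they differ. The function name find_publication_numbers is also hypothetical.

```python
import csv


def find_publication_numbers(index_path, name_fragment,
                             title_column="publicationTitle",    # assumed header name
                             number_column="publicationNumber"):  # assumed header name
    """Return (publication_number, title) pairs whose title contains name_fragment.

    Matching is case-insensitive. Rows without the title column are skipped.
    """
    needle = name_fragment.lower()
    matches = []
    with open(index_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            title = row.get(title_column) or ""
            if needle in title.lower():
                matches.append((row.get(number_column), title))
    return matches
```

The returned publication numbers correspond to the directory names in the CIRDC folder.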
Scripts for Data Collection
The scripts for collecting CIRDC are in the scripts folder. Because IEEE Xplore returns at most 10,000 entries for a single query, the collection is done in two stages. The first stage collects the publication numbers of all journals and conferences. The second stage collects the paper data for each publication number, one year at a time. Because search results are returned across multiple pages, we process the pages sequentially.
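The year-by-year, page-by-page pattern described above can be illustrated with a small sketch. It is not the code in the scripts folder and does not call IEEE Xplore: fetch_page_for_year is a hypothetical callable supplied by the caller that takes (year, page_number) and returns (records, total_pages).

```python
def collect_all_pages(fetch_page):
    """Fetch pages 1..N sequentially and return all records in one list.

    fetch_page is a hypothetical callable: fetch_page(page_number) returns
    (records_on_that_page, total_number_of_pages).
    """
    records = []
    page = 1
    total_pages = 1  # updated from the first response onward
    while page <= total_pages:
        page_records, total_pages = fetch_page(page)
        records.extend(page_records)
        page += 1
    return records


def collect_by_year(years, fetch_page_for_year):
    """Run one paginated query per year and return {year: records}.

    Splitting by year keeps each query smaller, which is how the text above
    describes staying under the per-query result limit.
    """
    return {
        year: collect_all_pages(lambda page, y=year: fetch_page_for_year(y, page))
        for year in years
    }
```

Keeping the network call behind a callable makes the paging logic easy to test with a stub before running it against a real service.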
Follow the steps below to collect the data:
1. Run mkdir tmp.
2. Run get_journal_info.py and get_conference_info.py. These scripts download all journal and conference information and generate the temporary folders json_conference_year and json_journal_year.
3. Run get_all_publication_pubnumber.py. This processes the downloaded conference and journal information and collects all publication numbers in the temporary files all_journals.json and all_conferences.json.
4. Run download_journal_paper_info.py and download_conference_paper_info.py. These scripts download the data of IEEE Xplore papers, based on the publication numbers, to the download_source_json folder.
5. Run post_process.py. This post-processes the downloaded JSON files.
The intermediate files generated during the process are saved in the tmp folder. The final output is saved in the processed_json folder.
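The steps above could be chained in a small driver, sketched below. The script names are taken from the steps; everything else is an assumption: that the scripts take no command-line arguments, that they are run from the folder containing them, and that tmp is created in that same folder. The name run_pipeline and the runner parameter are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

# Script names as listed in the steps above, in execution order.
PIPELINE = [
    "get_journal_info.py",
    "get_conference_info.py",
    "get_all_publication_pubnumber.py",
    "download_journal_paper_info.py",
    "download_conference_paper_info.py",
    "post_process.py",
]


def run_pipeline(script_dir, runner=subprocess.run):
    """Create tmp and run each collection script in order.

    Assumes each script runs without arguments from script_dir.
    With the default runner, check=True stops at the first failing script.
    """
    script_dir = Path(script_dir)
    (script_dir / "tmp").mkdir(exist_ok=True)  # equivalent of "mkdir tmp"
    for script in PIPELINE:
        runner([sys.executable, script], cwd=str(script_dir), check=True)
```

Calling run_pipeline("scripts") would then execute the whole collection, which can take a long time since it downloads metadata for every publication.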
Dependencies
The scripts were tested with Python 3.6. The requests library (2.27.1) is required. Other versions may also work but have not been tested.
License
This repository is licensed under the terms of the Creative Commons Attribution 4.0 International License.