RiPAMI

Citation Author(s):
Penghai Zhao, Nankai University
Submitted by:
Penghai Zhao
Last updated:
Mon, 12/16/2024 - 05:09
DOI:
10.21227/c5yg-ys33
Data Format:
License:

Abstract 

To ensure reproducible experiments and to avoid overburdening the server, we construct an SQL-based database dubbed RiPAMI (Reviews in Pattern Analysis and Machine Intelligence, pronounced /riːpæmi/). The database stores paper-level information such as the title, abstract, publication date, venue, citation counts, and reference details. From initial keyword selection to the final SQL-based RiPAMI snapshot, three key steps are implemented to ensure the data in RiPAMI is clean, accurate, and reliable. For simplicity, the polar diagram shows only a subset of the keywords used for retrieval. Since citation counts fluctuate over time, the retrieval date for citation-related information is set to October 2024.
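A minimal sketch of what the per-paper table could look like is given below, assuming SQLite; the column names are illustrative and may differ from the actual schema of the released snapshot.

```python
import sqlite3

# Illustrative schema mirroring the fields listed in the abstract
# (title, abstract, publication date, venue, citations, references).
conn = sqlite3.connect("ripami.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS papers (
        arxiv_id        TEXT PRIMARY KEY,
        title           TEXT NOT NULL,
        abstract        TEXT,
        published_date  TEXT,   -- ISO 8601 date string
        venue           TEXT,
        citation_count  INTEGER,
        reference_count INTEGER
    )
    """
)
conn.commit()
conn.close()
```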

Instructions: 

We implement a structured process to construct the RiPAMI database, ensuring that the selected articles are review papers relevant to the pattern recognition field. The process begins with searches using keywords derived from the scopes of relevant journals and conferences. These keywords are formatted into query strings for the arXiv API as follows: (ti:"review" OR ti:"survey") AND (ti:"{keyword.lower()}" OR abs:"{keyword.lower()}"). This query retrieves articles whose titles contain "review" or "survey" and whose titles or abstracts contain the specified keyword. Additionally, we apply regular expressions to verify that the abstracts of the returned articles contain the relevant keywords, so that only appropriate review articles enter the RiPAMI database.

To avoid potential copyright and licensing issues, papers are retrieved and downloaded through the arXiv API. Calls to the API are made via HTTP requests to a specific URL, and the responses are parsed and stored in an SQL-based database (a sketch of this pipeline is given below). Through this process, we collect a total of 4,106 samples.

To further refine the database, we employ a double-checking phase: a GPT-based filter preliminarily classifies the papers, and a subsequent manual validation step removes noisy or irrelevant entries. Lastly, we supplement the remaining 3,099 entries with additional metadata, including the publication date, citation counts, and reference details. As discussed in the data sources section of the accompanying paper, we suggest enriching the metadata of papers by combining disparate data sources. Considering the potential legal risks of crawling academic data from Google Scholar, the Semantic Scholar API was employed to obtain additional metadata, such as citation and reference details, which are not provided by the arXiv API. The final SQL-based snapshot encompasses a wide range of fields, including ID, title, publication date, and citation counts, facilitating efficient data retrieval and statistical analysis.
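The sketch below illustrates the retrieval-and-storage pipeline described above using only the Python standard library. The helper names (build_query, fetch_arxiv, enrich_with_semantic_scholar), the example keyword, and the selected Semantic Scholar fields are assumptions for illustration rather than the exact RiPAMI implementation; the sketch also reuses the hypothetical papers table shown under the abstract.

```python
import json
import sqlite3
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace used by arXiv API responses


def build_query(keyword: str) -> str:
    # Query string as described above: the title must contain "review" or
    # "survey", and the keyword must appear in the title or abstract.
    return (f'(ti:"review" OR ti:"survey") AND '
            f'(ti:"{keyword.lower()}" OR abs:"{keyword.lower()}")')


def fetch_arxiv(keyword: str, max_results: int = 100) -> list[dict]:
    # Call the arXiv API over HTTP and parse the returned Atom feed.
    params = urllib.parse.urlencode({
        "search_query": build_query(keyword),
        "start": 0,
        "max_results": max_results,
    })
    url = f"http://export.arxiv.org/api/query?{params}"
    with urllib.request.urlopen(url) as resp:
        feed = ET.fromstring(resp.read())
    papers = []
    for entry in feed.findall(f"{ATOM}entry"):
        papers.append({
            "arxiv_id": entry.findtext(f"{ATOM}id"),
            "title": entry.findtext(f"{ATOM}title"),
            "abstract": entry.findtext(f"{ATOM}summary"),
            "published_date": entry.findtext(f"{ATOM}published"),
        })
    return papers


def enrich_with_semantic_scholar(arxiv_url: str) -> dict:
    # Semantic Scholar Graph API lookup by arXiv ID; the requested fields are
    # assumptions based on its public documentation.
    short_id = arxiv_url.split("/abs/")[-1].rsplit("v", 1)[0]  # strip URL prefix and version
    url = ("https://api.semanticscholar.org/graph/v1/paper/"
           f"arXiv:{short_id}?fields=citationCount,referenceCount")
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())


if __name__ == "__main__":
    conn = sqlite3.connect("ripami.db")
    for paper in fetch_arxiv("image segmentation"):  # illustrative keyword
        extra = enrich_with_semantic_scholar(paper["arxiv_id"])
        conn.execute(
            "INSERT OR REPLACE INTO papers "
            "(arxiv_id, title, abstract, published_date, citation_count, reference_count) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            (paper["arxiv_id"], paper["title"], paper["abstract"],
             paper["published_date"], extra.get("citationCount"),
             extra.get("referenceCount")),
        )
    conn.commit()
    conn.close()
```

The keyword-based abstract check, the GPT-based filtering, and the manual validation steps are not shown here; a full pipeline should also throttle requests to both APIs in accordance with the providers' usage policies.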