Large-scale Statistical Keyword Co-occurrence Network

- Submitted by: Tianang Deng
- Last updated: Mon, 04/21/2025 - 09:35
- DOI: 10.21227/mwd0-h627
Abstract
We collect metadata, including publication year and keywords, for 84,725 papers published between 1992 and 2021 in 42 statistical journals indexed by the Web of Science (www.webofscience.com). After merging different expressions of the same keyword and filtering out low-frequency keywords, we obtain 5,037 keywords. A paper usually lists several keywords, and this co-occurrence relationship can be used to construct a keyword co-occurrence network. Specifically, the nodes represent keywords and the edges represent co-occurrence relationships between keywords, which yields an undirected, weighted network.
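The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the `papers` records are made up, the function name `cooccurrence_counts` is hypothetical, and the keyword merging and low-frequency filtering steps are not reproduced.

```python
from collections import Counter
from itertools import combinations

# Hypothetical records: (publication year, keywords listed by one paper).
papers = [
    (2020, ["lasso", "bootstrap", "variable selection"]),
    (2020, ["variable selection", "lasso"]),
    (2021, ["bootstrap", "lasso"]),
]


def cooccurrence_counts(papers):
    """Count, per year, how many papers contain each unordered keyword pair."""
    counts = Counter()
    for year, keywords in papers:
        # Deduplicate and sort so every unordered pair has one canonical form.
        for node1, node2 in combinations(sorted(set(keywords)), 2):
            counts[(node1, node2, year)] += 1
    return counts


counts = cooccurrence_counts(papers)
for (node1, node2, year), weight in sorted(counts.items()):
    print(node1, node2, year, weight)
```

Each key of `counts` corresponds to one edge in one year, and its value is the number of papers from that year containing both keywords.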
This dataset provides an edge list of the keyword co-occurrence network. It contains 343,919 rows and 4 columns: "node1", "node2", "year" and "weight". Since the network is undirected, the columns "node1" and "node2" store the two vertices of each edge, sorted in alphabetical order, which makes it easier to sum the edge weights for the same node pair across different years. The "weight" column gives the number of papers published in a given year that contain both keywords connected by the edge, and the "year" column gives the corresponding year.
Users can load this dataset in Python or R and use third-party libraries to construct the keyword co-occurrence network. To analyze the network for a particular year, filter the data on the "year" column. To analyze the network for a given time period, sum the "weight" values of rows with the same "node1" and "node2" within the specified years, and then build the network. Descriptive analysis and community detection on the resulting network can be used to explore the main research topics in statistics over time and the relationships between them.
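The filtering and aggregation steps could look roughly like the following sketch. It assumes the edge list is a comma-separated file with a header row matching the four documented column names; the actual file name and delimiter are not stated here, so a tiny made-up sample stands in for the real file. The helper names (`read_edges`, `aggregate`, `build_graph`) are hypothetical, and `build_graph` relies on the third-party networkx package, which is imported only when that function is called.

```python
import csv
import io
from collections import defaultdict

# Made-up sample in the documented column layout (not real dataset rows).
SAMPLE_CSV = """node1,node2,year,weight
bootstrap,lasso,2019,2
bootstrap,lasso,2020,3
lasso,variable selection,2020,5
bootstrap,variable selection,2021,1
"""


def read_edges(handle):
    """Parse edge list rows into (node1, node2, year, weight) tuples."""
    reader = csv.DictReader(handle)
    return [
        (row["node1"], row["node2"], int(row["year"]), int(row["weight"]))
        for row in reader
    ]


def aggregate(edges, start_year, end_year):
    """Sum weights per node pair over the inclusive range of years."""
    totals = defaultdict(int)
    for node1, node2, year, weight in edges:
        if start_year <= year <= end_year:
            totals[(node1, node2)] += weight
    return dict(totals)


def build_graph(totals):
    """Build an undirected weighted graph (requires networkx)."""
    import networkx as nx

    graph = nx.Graph()
    graph.add_weighted_edges_from((u, v, w) for (u, v), w in totals.items())
    return graph


# For the real data, replace the StringIO object with an opened file.
edges = read_edges(io.StringIO(SAMPLE_CSV))
single_year = aggregate(edges, 2020, 2020)  # network for one year
period = aggregate(edges, 2019, 2020)       # network for a time period
```

Because "node1" and "node2" are stored in alphabetical order, the pair `(node1, node2)` can serve directly as the aggregation key. With networkx 2.8 or later installed, one option for community detection is `nx.community.louvain_communities(build_graph(period), weight="weight")`; other algorithms may suit a given analysis better.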