Cora, Citeseer, CoAuthorCS, Polblogs and SBM

Citation Author(s): Wang Qian
Submitted by: Qian Wang
Last updated: Wed, 01/22/2025 - 05:10
DOI: 10.21227/dp41-4b96

Keywords: Citation network, Graph neural network, Sparse Bag-of-Words Features, Node Classification, Binary Feature Vectors, Liberal/Conservative Classification, Real-World Network

Abstract

1. Cora. This dataset is derived from a multi-class citation network; a two-class subgraph is selected for tasks such as node classification with graph neural networks. Node attributes are sparse bag-of-words feature vectors, and labels are the topic categories or fields of the academic papers. The subgraph is intended for studying how graph structure and node features influence model predictions, and it serves as an experimental benchmark for research on multi-step adversarial attacks and defense strategies.

Number of nodes (example): 652
Edges (example): 2350
Node feature dimension: 1433
Applicable tasks: node classification, adversarial attack, graph representation learning, etc.
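The selection of a two-class subgraph described above can be sketched as follows. This is only an illustration under assumptions: the function name `two_class_subgraph`, the edge-list layout, and the toy data are hypothetical and are not taken from the dataset files or the authors' scripts.

```python
import numpy as np

def two_class_subgraph(edges, labels, keep):
    """Return the subgraph induced by nodes whose label is in `keep`.

    edges  : (E, 2) integer array of node-index pairs (assumed layout)
    labels : (N,) integer array of class ids
    keep   : the class ids to retain, e.g. two classes
    """
    edges = np.asarray(edges)
    labels = np.asarray(labels)
    node_mask = np.isin(labels, keep)
    old_ids = np.flatnonzero(node_mask)
    # Map surviving node ids to a compact 0..n-1 range; dropped nodes map to -1.
    new_id = np.full(labels.shape[0], -1, dtype=np.int64)
    new_id[old_ids] = np.arange(old_ids.size)
    # Keep an edge only if both of its endpoints survive.
    edge_mask = node_mask[edges[:, 0]] & node_mask[edges[:, 1]]
    sub_edges = new_id[edges[edge_mask]]
    return sub_edges, labels[old_ids], old_ids

# Toy data (hypothetical): five nodes, three classes, keep classes 0 and 1.
edges = np.array([[0, 1], [1, 2], [2, 3], [3, 4], [0, 4]])
labels = np.array([0, 1, 2, 1, 0])
sub_edges, sub_labels, old_ids = two_class_subgraph(edges, labels, keep=[0, 1])
```

Returning `old_ids` keeps a link back to the original node indices, so feature rows can be sliced with the same index array.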

2. Citeseer. Like Cora, this dataset is derived from a multi-class citation network, but it differs in node distribution and feature dimension. In the two-class subgraph, node attributes are again sparse bag-of-words feature vectors, and labels are the research topics or directions of the academic papers. Because its graph structure is relatively complex, and its feature dimension and number of categories differ from Cora's, the dataset is often used to compare and verify the generalization and robustness of graph neural network models.

Number of nodes (example): 852
Edges (example): 3170
Node feature dimension: 3703
Applicable tasks: node classification, adversarial attack, citation network analysis, etc.

3. CoAuthorCS. This dataset is a two-class subgraph of a multi-class co-authorship network. Each node carries a binary feature vector that marks the presence or absence of keywords, which suits clustering or classification tasks based on attribute presence. The dataset highlights the association between node features and collaboration relationships in academic networks, and it provides an experimental setting with binary attributes for research on multi-step adversarial attacks.

Number of nodes (example): 836
Edges (example): 2270
Node feature form: binary keyword vector
Applicable tasks: node classification, collaboration relationship analysis, adversarial attack, etc.

4. Polblogs. This real-world dataset describes a network of political blogs, with node labels corresponding to each blog's political orientation (liberal or conservative). The network is relatively large, and its edges represent references or hyperlinks between blogs. It is often used to analyze community division, opinion diffusion, and adversarial attacks. Because the labels form a binary classification problem, researchers can use the dataset to test the effectiveness of adversarial attacks and defense mechanisms in complex community structures.

Number of nodes (example): 1222
Edges (example): 16714
Label type: Liberal/Conservative
Applicable tasks: node classification, community division, polarization research, adversarial attack, etc.

5. SBM. The Stochastic Block Model (SBM) is a commonly used random graph model for simulating network data with community (block) structure. The dataset is generated randomly from parameters such as the number of communities and the edge probabilities, and each node's label is determined by the community it belongs to. SBM graphs are often used to study community detection, group behavior simulation, and robustness under adversarial attacks, because the model is highly controllable and the network size and structure can be adjusted as required.

Number of nodes (example): 1490
Edges (example): 13790
Label type: synthetic community membership
Applicable tasks: community detection, random graph model research, adversarial attack simulation, etc.
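The generation mechanism described above can be sketched in a few lines of NumPy. This is a minimal example of a standard undirected SBM sampler, not the procedure used to produce the released graph; the function name `sample_sbm`, the block sizes, and the probabilities `p_in` and `p_out` are assumptions chosen for illustration.

```python
import numpy as np

def sample_sbm(block_sizes, p_in, p_out, seed=0):
    """Sample an undirected SBM graph without self-loops.

    block_sizes : number of nodes in each community
    p_in        : edge probability inside a community
    p_out       : edge probability between different communities
    """
    rng = np.random.default_rng(seed)
    # Node labels are the community ids, e.g. [0, 0, ..., 1, 1, ...].
    labels = np.repeat(np.arange(len(block_sizes)), block_sizes)
    n = labels.size
    same_block = labels[:, None] == labels[None, :]
    prob = np.where(same_block, p_in, p_out)
    # Draw each node pair once (strict upper triangle), then mirror it.
    upper = np.triu(rng.random((n, n)) < prob, k=1)
    adj = (upper | upper.T).astype(np.int8)
    return adj, labels

# Hypothetical parameters: two communities of 50 nodes each.
adj, labels = sample_sbm([50, 50], p_in=0.2, p_out=0.02)
```

Raising `p_out` toward `p_in` blurs the community structure, which is one way to control how hard the resulting classification or detection task is.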

Instructions:

Together, the Cora, Citeseer, CoAuthorCS, Polblogs and SBM datasets provide a unified benchmark for experiments such as multi-step adversarial attacks, GNN performance evaluation, and community detection.
The attached file provides detailed instructions on how to use the datasets.
