Information Diffusion Dataset on Twitter with User Tweets
- Submitted by: Zejian Wang
- Last updated: Sun, 12/03/2023 - 03:55
- DOI: 10.21227/99dz-f923
Abstract
We collected the tweets and the follower network of 10,269 Twitter users from April 2019 to October 2019. Tweets sharing the same hashtag were grouped into cascades, yielding 29,192 cascades. To find an active community, we first selected 500 popular seed users and then added the users who followed these seed users to the target group. After repeating this expansion for five rounds, we fixed the target group.
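The iterative expansion described above can be sketched as a breadth-first growth over a follower lookup. This is only an illustration under stated assumptions: `followers_of` is a hypothetical mapping from a user ID to the IDs of that user's followers, the toy IDs are made up, and any popularity or degree filtering the authors applied is not modeled.

```python
def expand_target_group(seeds, followers_of, rounds=5):
    """Grow a target group by repeatedly adding followers of its newest members.

    `followers_of` is a hypothetical dict: user id -> iterable of follower ids.
    """
    group = set(seeds)
    frontier = set(seeds)
    for _ in range(rounds):
        # Collect followers of the users added in the previous round.
        new_users = set()
        for user in frontier:
            new_users.update(followers_of.get(user, ()))
        new_users -= group
        if not new_users:
            break  # nothing left to add
        group |= new_users
        frontier = new_users
    return group


# Toy follower lookup (made-up ids, not taken from the dataset).
toy_followers = {1: [2, 3], 2: [4], 3: [4, 5], 4: [6], 5: [], 6: [1]}
toy_group = expand_target_group([1], toy_followers, rounds=2)
```

With two rounds the toy group stops at the followers of followers; raising `rounds` lets it reach the remaining user.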
Each cascade is keyed by its hashtag. We filter out cascades with fewer than 2 or more than 500 participants, as well as inactive users who appear in fewer than three different cascades. We also keep only cascades whose hashtags are longer than five characters, because long hashtags placed at the end of a tweet may be truncated by the length limit of the Twitter API. In addition, cascades whose hashtags contain special characters, such as emojis, are discarded. The Twitter dataset will be made available on IEEE DataPort after the publication of the manuscript.
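The filtering rules above can be illustrated with a short sketch. It is not the authors' preprocessing code: the order in which the filters are applied, the single pass, and the definition of "special characters" (anything outside ASCII letters, digits, and underscore) are assumptions, and `filter_cascades` and the toy data are hypothetical.

```python
import re
from collections import Counter

# Assumption: "special characters" means anything outside ASCII letters,
# digits and underscore; the exact rule is not given in the description.
PLAIN_TAG = re.compile(r"^[A-Za-z0-9_]+$")


def filter_cascades(cascades, min_size=2, max_size=500,
                    min_user_cascades=3, min_tag_len=6):
    """Apply the filters once, in one possible order.

    `cascades` maps a hashtag (without '#') to a list of participant ids.
    """
    # 1. Keep hashtags longer than five characters without special characters.
    kept = {tag: users for tag, users in cascades.items()
            if len(tag) >= min_tag_len and PLAIN_TAG.match(tag)}
    # 2. Drop inactive users (seen in fewer than three different cascades).
    counts = Counter(u for users in kept.values() for u in set(users))
    kept = {tag: [u for u in users if counts[u] >= min_user_cascades]
            for tag, users in kept.items()}
    # 3. Drop cascades with fewer than 2 or more than 500 participants.
    return {tag: users for tag, users in kept.items()
            if min_size <= len(set(users)) <= max_size}


# Toy cascades (made-up hashtags and ids).
toy = {
    "climate": [1, 2, 3],
    "election": [1, 2, 4],
    "football": [1, 2, 5],
    "nba": [1, 2],        # hashtag too short
    "café2019": [1, 2],   # contains a non-ASCII character
}
filtered = filter_cascades(toy)
```

In the toy data, users 3, 4, and 5 each appear in only one cascade and are removed, while the three remaining cascades keep users 1 and 2.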
### Description of the data
The dataset includes three parts:
1. Static follower network
2. Retweet behaviours without retweeting content
3. Retweet behaviours with retweeting content
Each part is described in detail below.
1. Static follower network
- Filename: deg_le483_subgraph.data
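- File format: the file contains the follower network, serialized with Python's `pickle` library. The loaded object exposes `vs`, `es`, and `get_eid()`, which matches the `Graph` interface of python-igraph, so that library most likely needs to be installed before unpickling.
- Usage: the following code reads the follower network from the file and iterates over its nodes and edges.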
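```
import os
import pickle  # unpickling also needs the library that defines the graph class

dirpath = ''  # directory containing the dataset files ('' = current directory)
graph_filename = 'deg_le483_subgraph.data'
with open(os.path.join(dirpath, graph_filename), 'rb') as f:
    graph = pickle.load(f)

# Iterate over the nodes and print each index and label
for node in graph.vs:
    print(node.index, node['label'])

# Iterate over the edges and look up their endpoints
for edge in graph.es:
    source_id, target_id = edge.source, edge.target
    source_vertex = graph.vs[source_id]
    target_vertex = graph.vs[target_id]
    assert source_vertex.index == source_id
    # using get_eid() you can do the opposite:
    same_edge_id = graph.get_eid(source_id, target_id)
    same_edge = graph.es[same_edge_id]
    assert edge.index == same_edge_id
```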
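For downstream processing it can be convenient to turn the edge list into a plain adjacency mapping. The sketch below works on any iterable of `(source, target)` pairs; with the loaded graph one could pass `((e.source, e.target) for e in graph.es)`. `build_adjacency` is a hypothetical helper, the toy edges are made up, and whether an edge points from follower to followee or the reverse is not stated in this description, so the interpretation of the direction is left to the reader.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Map each source vertex id to the sorted list of its target vertex ids."""
    neighbours = defaultdict(set)
    for source, target in edges:
        neighbours[source].add(target)  # a set ignores repeated edges
    return {node: sorted(targets) for node, targets in neighbours.items()}


# Toy edge list standing in for ((e.source, e.target) for e in graph.es).
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 0), (0, 1)]
toy_adjacency = build_adjacency(toy_edges)
```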
2. Retweet behaviours without retweeting content
- Filename: cascade_dict.data
- File format: the file contains the retweeting records, serialized with Python's `pickle` library. The records are organized as a dict whose keys are the cascade hashtags and whose values are the records of each cascade. Each record is itself a dict with two keys, `user` and `ts`, holding the user IDs and the activation times. Both values are lists with one entry per activation. Activation times are Unix timestamps.
```
#antifadomesticterrorists: {
    user: [32839, 16621, 44225, 421, 44397, 26952, 12685, 23522, 17801, 44134],
    ts: [1562135807, 1562495385, 1564994067, 1566269187, 1568032965, 1568321964, 1569850925, 1571147642, 1571572412, 1572695879],
}
```
- Usage: the following code reads the retweeting records from the file and iterates over them.
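```
import os
import pickle

dirpath = ''  # directory containing the dataset files ('' = current directory)
cascade_dict_filename = 'cascade_dict.data'
with open(os.path.join(dirpath, cascade_dict_filename), 'rb') as f:
    cascade_dict = pickle.load(f)

# Print the hashtag and the first ten activations of each cascade
for tag, cascade in cascade_dict.items():
    print(tag)
    print(cascade['user'][:10])
    print(cascade['ts'][:10])
```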
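Once a record is loaded, its Unix timestamps can be converted to readable datetimes and summarized. The following is a minimal sketch: `summarize_cascade` is a hypothetical helper, and it assumes the timestamps are in seconds and can be interpreted as UTC (the time zone is not stated in this description).

```python
from datetime import datetime, timezone


def summarize_cascade(cascade):
    """Return size, first/last activation (as UTC datetimes) and duration."""
    users, stamps = cascade["user"], cascade["ts"]
    if len(users) != len(stamps):
        raise ValueError("user and ts lists must have the same length")
    first, last = min(stamps), max(stamps)
    return {
        "size": len(users),
        "first": datetime.fromtimestamp(first, tz=timezone.utc),
        "last": datetime.fromtimestamp(last, tz=timezone.utc),
        "duration_s": last - first,  # seconds between first and last activation
    }


# Sample record copied from the file-format example above.
sample = {
    "user": [32839, 16621, 44225, 421, 44397, 26952, 12685, 23522, 17801, 44134],
    "ts": [1562135807, 1562495385, 1564994067, 1566269187, 1568032965,
           1568321964, 1569850925, 1571147642, 1571572412, 1572695879],
}
summary = summarize_cascade(sample)
```

The same function can be applied to every value of `cascade_dict` to compare cascade sizes and lifetimes.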
3. Retweet behaviours with retweeting content
- Filename: cascadewithcontent_dict.data
- File format: the file contains the retweeting records, serialized with Python's `pickle` library. Unlike the file above, it additionally contains the retweeting content. The records are again organized as a dict whose keys are the cascade hashtags and whose values are the records of each cascade. Each record is itself a dict with three keys, `user`, `ts`, and `content`, holding the user IDs, the activation times, and the tweet texts. All three values are lists with one entry per activation. Activation times are Unix timestamps.
```
#antifadomesticterrorists: {
    user: [32839, 16621, 44225, 421, 44397],
    ts: [1562135807, 1562495385, 1564994067, 1566269187, 1568032965],
    content: ['_retweet_ heres a good look at the #portland #antifascist #antifadomesticterrorists criminals help bring them to justice https:',
              '@user @user @user @user are you serious? #antifaterrorists #antifadomesticterrorists #antifaterroristorganization',
              '_retweet_ i really believe this is a #deepstate conspiracy ..the leftists #antifadomesticterrorists is orchestrating these tragedies, t',
              '_retweet_ @user antifa getting knocked on their lazy butts is better in slow motion and with music.#antifadomesticterrorists https:/',
              '_retweet_ this is terrific! good on #bostonpd enough of #antifadomesticterrorists _httpurl_'],
}
```
- Usage: the following code reads the retweeting records with content from the file and iterates over them.
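```
import os
import pickle

dirpath = ''  # directory containing the dataset files ('' = current directory)
cascadewithcontent_dict_filename = 'cascadewithcontent_dict.data'
# os.path.join picks the right path separator on every platform
with open(os.path.join(dirpath, cascadewithcontent_dict_filename), 'rb') as f:
    cascadewithcontent_dict = pickle.load(f)

# Print the hashtag and the first ten activations of each cascade
for tag, cascade in cascadewithcontent_dict.items():
    print(tag)
    print(cascade['user'][:10])
    print(cascade['ts'][:10])
    print(cascade['content'][:10])
```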
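In the sample record shown earlier, some tweet texts begin with the placeholder `_retweet_`. Assuming this prefix marks retweets throughout the file (an inference from the sample, not a documented guarantee), the activations of a cascade can be split into retweets and other tweets. `split_by_retweet` is a hypothetical helper.

```python
RETWEET_MARKER = "_retweet_"  # prefix seen in the sample content; assumed to mark retweets


def split_by_retweet(cascade):
    """Split (user, ts) activations by whether the text starts with the marker."""
    retweets, others = [], []
    for user, ts, text in zip(cascade["user"], cascade["ts"], cascade["content"]):
        bucket = retweets if text.startswith(RETWEET_MARKER) else others
        bucket.append((user, ts))
    return retweets, others


# Sample record copied from the file-format example for this file.
sample = {
    "user": [32839, 16621, 44225, 421, 44397],
    "ts": [1562135807, 1562495385, 1564994067, 1566269187, 1568032965],
    "content": [
        "_retweet_ heres a good look at the #portland #antifascist #antifadomesticterrorists criminals help bring them to justice https:",
        "@user @user @user @user are you serious? #antifaterrorists #antifadomesticterrorists #antifaterroristorganization",
        "_retweet_ i really believe this is a #deepstate conspiracy ..the leftists #antifadomesticterrorists is orchestrating these tragedies, t",
        "_retweet_ @user antifa getting knocked on their lazy butts is better in slow motion and with music.#antifadomesticterrorists https:/",
        "_retweet_ this is terrific! good on #bostonpd enough of #antifadomesticterrorists _httpurl_",
    ],
}
retweets, others = split_by_retweet(sample)
```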
## Online Repository link
* [TMGNN](https://github.com/william-wang-stu/TMGNN) - Link to the code repository.
## Authors
* **Huangxin Zhuang** - *Main Contributor*
* **Zejian Wang** - *Contributor*
* **Yichao Zhang** - *Contributor*
## Citations
TBD