MegaGeoCOV Extended
- Submitted by: Rabindra Lamsal
- Last updated: Thu, 02/23/2023 - 20:44
- DOI: 10.21227/42h1-ge40
Abstract
This dataset, MegaGeoCOV Extended, is an extended version of MegaGeoCOV. It was introduced in the paper "A Twitter narrative of the COVID-19 pandemic in Australia", which will appear in the proceedings of the 20th ISCRAM conference (Omaha, Nebraska, USA, May 2023). Please refer to the paper for more details, such as the keywords and hashtags used and descriptive statistics.
MegaGeoCOV Extended contains over 25.2 million multilingual, geotagged tweets specific to the COVID-19 pandemic. We also provide an English-only version containing 17.8 million tweets. The dataset was curated using Twitter's Full-archive search endpoint. A free IEEE account is sufficient to access the data files.
As per Twitter's content redistribution policy, we share only tweet identifiers; the identifiers need to be hydrated to recreate the dataset locally. Tools such as Hydrator and twarc can perform the hydration. To support filtering the identifiers before hydration, the dataset includes the following tweet objects alongside each identifier: created_at, id, author.verified, author_id, geo.country, and source. Note that the hydrated dataset may contain fewer tweets than the identifier list, because deleted or private tweets are not retrievable.
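As an illustration of filtering identifiers before hydration, here is a minimal sketch that streams one of the CSV files and writes the matching tweet IDs to a text file, one per line. It assumes the CSV header uses the field names listed above (`id`, `geo.country`) and that `geo.country` holds a plain country name such as "Australia"; neither has been verified against the actual files, and the function name `write_filtered_ids` is hypothetical.

```python
import csv


def write_filtered_ids(csv_path, out_path, country,
                       id_col="id", country_col="geo.country"):
    """Write the IDs of rows whose country matches, one per line.

    Column names are assumptions based on the field list in the
    dataset description; adjust them to the real CSV header.
    Returns the number of IDs written.
    """
    kept = 0
    wanted = country.strip().lower()
    with open(csv_path, newline="", encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        # DictReader streams row by row, so the multi-gigabyte
        # files never have to fit in memory.
        for row in csv.DictReader(src):
            value = (row.get(country_col) or "").strip().lower()
            if value == wanted:
                dst.write(row[id_col].strip() + "\n")
                kept += 1
    return kept
```

A call such as `write_filtered_ids("MegaGeoCov-EN.csv", "ids.txt", "Australia")` would produce a plain list of identifiers, which is the kind of input that hydration tools such as twarc typically accept (for example, `twarc2 hydrate ids.txt hydrated.jsonl`).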
Dataset usage terms
By using this dataset, you agree to: (i) use the content of this dataset and the data generated from the content of this dataset for non-commercial research only, (ii) remain in compliance with Twitter's Policy, and (iii) cite the following paper:
Rabindra Lamsal, Maria Rodriguez Read, Shanika Karunasekera. (2023). A Twitter narrative of the COVID-19 pandemic in Australia. arXiv preprint arXiv:2302.11136.
The dataset is in CSV format. Because additional tweet objects are provided alongside each identifier, the identifiers can be filtered as required before hydration. Consider using Hydrator or twarc to hydrate the tweet identifiers. For more details on tweet hydration, please refer to the paper "BillionCOV: An Enriched Billion-scale Collection of COVID-19 tweets for Efficient Hydration".
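Since deleted or private tweets cannot be retrieved, it can be useful to check how many identifiers actually came back after hydration. The sketch below compares a list of requested IDs with a hydrated output file. It assumes the hydrated file is JSON Lines with one tweet object per line, each carrying an `id` field; the real layout depends on the hydration tool and its settings, so treat this as an assumption. The function name `missing_ids` is hypothetical.

```python
import json


def missing_ids(ids_path, hydrated_jsonl_path):
    """Return the set of requested IDs absent from the hydrated file.

    Assumes one JSON tweet object per line with an "id" field
    (an assumption about the hydration tool's output format).
    """
    with open(ids_path, encoding="utf-8") as f:
        requested = {line.strip() for line in f if line.strip()}

    retrieved = set()
    with open(hydrated_jsonl_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                # Normalise to str so numeric and string IDs compare equal.
                retrieved.add(str(json.loads(line)["id"]))

    return requested - retrieved
```

The size of the returned set, relative to the number of requested IDs, gives a rough measure of how much of the dataset is no longer retrievable.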
Dataset Files
- MegaGeoCov-EN.csv (1.34 GB)
- MegaGeoCov-MultiLingual.csv (1.95 GB)
Open Access dataset files are accessible to all logged-in users. A free IEEE account is sufficient; IEEE Membership is not required.