MegaGeoCOV Extended

Name: MegaGeoCOV Extended
Creator: Rabindra Lamsal
License: https://creativecommons.org/licenses/by/4.0/

Citation Author(s): Rabindra Lamsal (University of Melbourne), Maria Rodriguez Read (University of Melbourne), Shanika Karunasekera (University of Melbourne)
Submitted by: Rabindra Lamsal
Last updated: Fri, 02/24/2023 - 01:44
DOI: 10.21227/42h1-ge40
Data Format: CSV
Research Article Link: A Twitter narrative of the COVID-19 pandemic in Australia

Keywords:

Corona Tweets Dataset

COVID-19 Tweets Dataset

Corona Tweets

COVID-19 Tweets

SARS-CoV-2 Tweets Dataset

Coronavirus English Tweets Dataset

COVID-19 English Tweets Dataset

Abstract

This dataset, MegaGeoCOV Extended, is an extended version of MegaGeoCOV and was introduced in the paper "A Twitter narrative of the COVID-19 pandemic in Australia" (to appear in the proceedings of the 20th ISCRAM conference, Omaha, Nebraska, USA, May 2023). Please refer to the paper for more details (e.g., the keywords and hashtags used and descriptive statistics).

MegaGeoCOV Extended contains over 25.2 million multilingual geotagged tweets specific to the COVID-19 pandemic. We also provide an English-only version containing 17.8 million tweets. We curated the dataset using Twitter's Full-archive search endpoint. A free IEEE account is sufficient to access the data files. In line with Twitter's content redistribution policy, we share tweet identifiers; the identifiers need to be hydrated to recreate the dataset locally. Hydration can be done with tools such as Hydrator and twarc. To support filtering of the tweet identifiers, the dataset includes the following tweet attributes: created_at, id, author.verified, author_id, geo.country, and source. Note that the number of tweets obtained after hydration can vary, as deleted or private tweets are not retrievable.
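As an illustration of filtering on the columns listed above, here is a minimal Python sketch. The column names come from the description; the value formats (full country names, "True"/"False" strings), the sample rows, and the helper name filter_tweet_ids are assumptions made for this example and should be checked against the actual data files.

```python
import csv
import io


def filter_tweet_ids(csv_file, country=None, verified_only=False):
    """Return the tweet IDs of rows that match the given filters.

    Column names follow the dataset description. The value formats
    (country names, "True"/"False") are assumptions for illustration.
    """
    reader = csv.DictReader(csv_file)
    ids = []
    for row in reader:
        if country is not None and row["geo.country"] != country:
            continue
        if verified_only and row["author.verified"].strip().lower() != "true":
            continue
        # Keep IDs as strings so large identifiers are never altered.
        ids.append(row["id"])
    return ids


# Tiny made-up sample standing in for a real data file.
sample = io.StringIO(
    "created_at,id,author.verified,author_id,geo.country,source\n"
    "2020-03-01T00:00:00Z,1001,True,11,Australia,Twitter for iPhone\n"
    "2020-03-02T00:00:00Z,1002,False,12,Australia,Twitter Web App\n"
    "2020-03-03T00:00:00Z,1003,False,13,India,Twitter for Android\n"
)
australia_ids = filter_tweet_ids(sample, country="Australia")
```

With a real file, the same function could be called on an open file handle, for example open(path, newline="", encoding="utf-8"), instead of the in-memory sample.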

Dataset usage terms

By using this dataset, you agree to: (i) use the content of this dataset and the data generated from it for non-commercial research only, (ii) remain in compliance with Twitter's Policy, and (iii) cite the following paper:

Rabindra Lamsal, Maria Rodriguez Read, Shanika Karunasekera. (2023). A Twitter narrative of the COVID-19 pandemic in Australia. arXiv preprint arXiv:2302.11136.

Instructions:

The dataset is in CSV format. Tweet identifiers can be filtered as required using the additional tweet attributes we provide. Consider using Hydrator or twarc to hydrate the tweet identifiers. For more details on tweet hydration, please refer to this paper: BillionCOV: An Enriched Billion-scale Collection of COVID-19 tweets for Efficient Hydration.
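Hydration tools generally expect a plain text file with one tweet identifier per line. The following Python sketch writes such a file from a list of identifiers, skipping blanks and duplicates. The helper name write_id_file, the file name ids.txt, and the sample identifiers are hypothetical and used only for illustration.

```python
import os
import tempfile


def write_id_file(tweet_ids, path):
    """Write unique tweet IDs to path, one per line, preserving order.

    Returns the number of IDs written.
    """
    seen = set()
    count = 0
    with open(path, "w", encoding="utf-8") as out:
        for tweet_id in tweet_ids:
            tweet_id = str(tweet_id).strip()
            # Skip empty entries and IDs that were already written.
            if not tweet_id or tweet_id in seen:
                continue
            seen.add(tweet_id)
            out.write(tweet_id + "\n")
            count += 1
    return count


# Made-up IDs; in practice these would come from the filtered CSV rows.
ids_path = os.path.join(tempfile.mkdtemp(), "ids.txt")
written = write_id_file(["1001", "1002", "1001", ""], ids_path)
```

The resulting file can then be passed to a hydration tool. With twarc's version 2 command-line client, for instance, the command has the form twarc2 hydrate ids.txt hydrated.jsonl, assuming Twitter API credentials have already been configured.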