Name: BillionCOV: An Enriched Billion-scale Collection of COVID-19 tweets for Efficient Hydration
Creator: Rabindra Lamsal
License: https://creativecommons.org/licenses/by/4.0/
Keywords: Machine Learning, Social Sciences, COVID-19

Abstract

BillionCOV is a global billion-scale English-language COVID-19 tweets dataset with more than 1.4 billion tweets originating from 240 countries and territories between October 2019 and April 2022. This dataset has been curated by hydrating the 2 billion tweets present in COV19Tweets.

We report that more than 500 million tweets in COV19Tweets are either deleted or protected. Skipping these deleted or protected tweets alone saves almost two months in a single hydration task. We provide five metadata fields for filtering tweet identifiers before hydration. For instance, some use cases might need only original tweets (out of originals, retweets, quotes, and replies) or only geotagged tweets. Tweet identifiers can be filtered based on the use case and hydrated using either the Hydrator app (a desktop application) or twarc (a command-line tool and Python library). Refer to this article for more details on hydrating tweet identifiers.
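As a rough sanity check on the "almost two months" figure, the sketch below estimates hydration time under an assumed rate limit of 900 lookup requests per 15-minute window, with up to 100 tweet identifiers per request. These limits and the function name `hydration_days` are assumptions made for illustration and are not stated on this page; the associated paper describes the actual setup.

```python
# Assumed limits (not stated on this page): 900 lookup requests per
# 15-minute window, each carrying up to 100 tweet identifiers.
REQUESTS_PER_WINDOW = 900
IDS_PER_REQUEST = 100
WINDOW_MINUTES = 15


def hydration_days(num_ids: int) -> float:
    """Estimate the days needed to hydrate num_ids identifiers non-stop."""
    ids_per_window = REQUESTS_PER_WINDOW * IDS_PER_REQUEST  # 90,000
    windows_per_day = (24 * 60) // WINDOW_MINUTES           # 96
    return num_ids / (ids_per_window * windows_per_day)


# Time spent on the roughly 500 million deleted or protected tweets.
unavailable_days = hydration_days(500_000_000)
print(f"{unavailable_days:.1f} days")  # prints "57.9 days"
```

Under these assumptions, requesting the unavailable identifiers alone would occupy a single hydration task for about 58 days, which is in line with the saving reported above.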

Associated paper: For more information (data curation, description, ethical considerations, etc.), refer to BillionCOV: An Enriched Billion-scale Collection of COVID-19 tweets for Efficient Hydration

The six columns in each CSV represent:

Tweet identifier
Is this a reply tweet (True/False)
Is this a retweet (True/False)
Is this a quote tweet (True/False)
Is the author of the tweet verified (True/False)
Country (e.g., US, AU)
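The filtering step described in the abstract can be sketched as follows. This is a minimal example and not the authors' tooling. It assumes the six columns appear in the order listed above, that boolean fields are stored as the strings True and False, and that a missing country is an empty field. The function names, the optional header row, and the sample identifiers are hypothetical.

```python
import csv
import io
from typing import Iterable, Iterator, List, Optional


def parse_bool(value: str) -> bool:
    """Interpret the assumed 'True'/'False' string encoding."""
    return value.strip().lower() == "true"


def filter_tweet_ids(
    rows: Iterable[List[str]],
    originals_only: bool = True,
    verified_only: bool = False,
    country: Optional[str] = None,
) -> Iterator[str]:
    """Yield tweet identifiers whose metadata matches the filters."""
    for row in rows:
        if len(row) < 6:
            continue  # skip blank or malformed lines
        tweet_id, reply, retweet, quote, verified, row_country = (
            cell.strip() for cell in row[:6]
        )
        if not tweet_id.isdigit():
            continue  # skips a header row, if the file has one
        is_original = not (
            parse_bool(reply) or parse_bool(retweet) or parse_bool(quote)
        )
        if originals_only and not is_original:
            continue
        if verified_only and not parse_bool(verified):
            continue
        if country is not None and row_country != country:
            continue
        yield tweet_id


# Hypothetical sample standing in for one of the dataset's CSV files.
sample_csv = """tweet_id,is_reply,is_retweet,is_quote,is_verified,country
1000000000000000001,False,False,False,True,US
1000000000000000002,False,True,False,False,AU
1000000000000000003,True,False,False,False,US
1000000000000000004,False,False,False,False,
"""

sample_rows = list(csv.reader(io.StringIO(sample_csv)))
originals = list(filter_tweet_ids(sample_rows))
us_originals = list(filter_tweet_ids(sample_rows, country="US"))
print(originals)     # ['1000000000000000001', '1000000000000000004']
print(us_originals)  # ['1000000000000000001']
```

For a real file, the rows can come from `csv.reader(open(path, newline=""))`, and the selected identifiers can be written one per line to a text file, the usual input format for hydration tools.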

Dataset usage terms:

By using this dataset, you agree to (i) use the content of this dataset for non-commercial research only, (ii) remain in compliance with Twitter's Policy, and (iii) cite the following articles:

(a) Lamsal, R., Read, M. R., & Karunasekera, S. (2023). BillionCOV: An enriched billion-scale collection of COVID-19 tweets for efficient hydration. Data in Brief, 48, 109229.

(b) Lamsal, R. (2021). Design and analysis of a large-scale COVID-19 tweets dataset. Applied Intelligence, 51, 2790-2804.

BibTeX entries:

@article{lamsal2023billioncov,
  title={BillionCOV: An Enriched Billion-scale Collection of COVID-19 tweets for Efficient Hydration},
  author={Lamsal, Rabindra and Read, Maria Rodriguez and Karunasekera, Shanika},
  journal={Data in Brief},
  volume={48},
  year={2023},
  pages={109229},
  publisher={Elsevier}
}

@article{lamsal2021design,
  title={Design and analysis of a large-scale COVID-19 tweets dataset},
  author={Lamsal, Rabindra},
  journal={Applied Intelligence},
  volume={51},
  number={5},
  pages={2790--2804},
  year={2021},
  publisher={Springer}
}

Related publications:

Rabindra Lamsal. (2021). Design and analysis of a large-scale COVID-19 tweets dataset. Applied Intelligence, 51(5), 2790-2804.
Rabindra Lamsal, Aaron Harwood, Maria Rodriguez Read. (2022). Socially Enhanced Situation Awareness from Microblogs using Artificial Intelligence: A Survey. ACM Computing Surveys, 55(4), 1-38. (arXiv)
Rabindra Lamsal, Aaron Harwood, Maria Rodriguez Read. (2022). Twitter conversations predict the daily confirmed COVID-19 cases. Applied Soft Computing, 129, 109603. (arXiv)
Rabindra Lamsal, Aaron Harwood, Maria Rodriguez Read. (2022). Addressing the location A/B problem on Twitter: the next generation location inference research. In 2022 ACM SIGSPATIAL LocalRec (pp. 1-4).
Rabindra Lamsal, Aaron Harwood, Maria Rodriguez Read. (2022). Where did you tweet from? Inferring the origin locations of tweets based on contextual information. In 2022 IEEE International Conference on Big Data (pp. 3935-3944). (arXiv)
Rabindra Lamsal, Maria Rodriguez Read, Shanika Karunasekera. (2023). BillionCOV: An Enriched Billion-scale Collection of COVID-19 tweets for Efficient Hydration. Data in Brief, 48, 109229. (arXiv)
Rabindra Lamsal, Maria Rodriguez Read, Shanika Karunasekera. (2023). A Twitter narrative of the COVID-19 pandemic in Australia. In 20th International ISCRAM Conference (pp. 353-370). (arXiv)
Rabindra Lamsal, Maria Rodriguez Read, Shanika Karunasekera. (2024). CrisisTransformers: Pre-trained language models and sentence encoders for crisis-related social media texts. Knowledge-Based Systems, 296, 111916. (arXiv)
Rabindra Lamsal, Maria Rodriguez Read, Shanika Karunasekera. (2024). Semantically Enriched Cross-Lingual Sentence Embeddings for Crisis-related Social Media Texts. In 21st International ISCRAM Conference (in press). (arXiv)

Instructions:

The instructions for using this dataset are provided in its associated paper: https://arxiv.org/abs/2301.11284

Comments

Hi,

Thank you so much for this dataset. Could you provide more information on which dates of Twitter data are included in each data subset? That would be very helpful.

Thank you!

Submitted by Praneetha Vissa... on Thu, 06/15/2023 - 13:24

Hi Praneetha,

Apologies for the late reply. For more information on the date/time and the respective files, please refer to the associated paper (https://www.sciencedirect.com/science/article/pii/S2352340923003487).

Submitted by Rabindra Lamsal on Thu, 08/17/2023 - 05:31