BillionCOV: An Enriched Billion-scale Collection of COVID-19 tweets for Efficient Hydration

Citation Author(s):
Rabindra Lamsal, University of Melbourne
Maria Rodriguez Read, University of Melbourne
Shanika Karunasekera, University of Melbourne
Submitted by: Rabindra Lamsal
Last updated: Sun, 05/19/2024 - 21:35
DOI: 10.21227/871g-yp65

Abstract 

BillionCOV is a global billion-scale English-language COVID-19 tweets dataset with more than 1.4 billion tweets originating from 240 countries and territories between October 2019 and April 2022. This dataset has been curated by hydrating the 2 billion tweets present in COV19Tweets.

We report that more than 500 million tweets in COV19Tweets are either deleted or protected. Skipping these deleted or protected tweets alone saves almost two months in a single hydration task. We provide five metadata fields for filtering tweet identifiers before hydration. For instance, some use cases might need only original tweets (as opposed to retweets, quotes, and replies) or only geotagged tweets. Tweet identifiers can be filtered based on the use case and then hydrated using either the Hydrator app (a desktop application) or twarc (a command-line tool and Python library). Refer to this article for more details on hydrating tweet identifiers.
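The "almost two months" figure can be sanity-checked with a rough calculation. The sketch below assumes a lookup rate limit of 900 requests per 15-minute window with up to 100 identifiers per request; these numbers are not stated on this page and are used here only as an illustrative assumption, and the function name is hypothetical.

```python
# Assumed rate limit (not stated on this page): 900 lookup requests per
# 15-minute window, each carrying up to 100 tweet identifiers.
REQUESTS_PER_WINDOW = 900
IDS_PER_REQUEST = 100
WINDOW_MINUTES = 15


def hydration_days(num_ids: int) -> float:
    """Estimate the days needed to hydrate num_ids identifiers at the assumed limit."""
    ids_per_window = REQUESTS_PER_WINDOW * IDS_PER_REQUEST
    windows_per_day = 24 * 60 / WINDOW_MINUTES
    return num_ids / (ids_per_window * windows_per_day)


if __name__ == "__main__":
    # Time spent on the 500 million deleted or protected tweets alone.
    print(f"{hydration_days(500_000_000):.1f} days")
```

Under these assumed limits, the 500 million unavailable tweets would account for a little under 58 days of hydration time, which is in line with the estimate above.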

Associated paper: Refer to this paper for more information (data curation, description, ethical considerations, etc.): BillionCOV: An Enriched Billion-scale Collection of COVID-19 tweets for Efficient Hydration

The six columns in each CSV represent:

  • Tweet identifier
  • Is this a reply tweet (True/False)
  • Is this a retweet (True/False)
  • Is this a quote tweet (True/False)
  • Is the author of the tweet verified (True/False)
  • Country (e.g., US, AU)
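As a minimal sketch of filtering identifiers before hydration, the Python below keeps only original tweets, optionally restricted to one country. It assumes the columns appear in the order listed above, that the boolean fields are stored as the strings True/False, and that the function and column names are hypothetical labels chosen for this example; check them against the actual files before use.

```python
import csv
from typing import Iterable, Iterator, List, Optional

# Assumed column order, following the list above. These names are
# illustrative labels, not headers taken from the dataset files.
COLUMNS = ["tweet_id", "is_reply", "is_retweet", "is_quote", "is_verified", "country"]


def _flag(value: str) -> bool:
    """Interpret a True/False string field as a boolean."""
    return value.strip().lower() == "true"


def filter_ids(
    rows: Iterable[List[str]],
    originals_only: bool = True,
    country: Optional[str] = None,
) -> Iterator[str]:
    """Yield tweet identifiers from rows that match the requested filters."""
    for row in rows:
        if len(row) != len(COLUMNS):
            continue  # skip malformed rows
        record = dict(zip(COLUMNS, row))
        tweet_id = record["tweet_id"].strip()
        if not tweet_id.isdigit():
            continue  # skips a header row, if one is present
        is_derived = (
            _flag(record["is_reply"])
            or _flag(record["is_retweet"])
            or _flag(record["is_quote"])
        )
        if originals_only and is_derived:
            continue
        if country is not None and record["country"].strip() != country:
            continue
        yield tweet_id


def filter_file(in_path: str, out_path: str, **filters) -> int:
    """Write matching identifiers to out_path, one per line; return the count."""
    count = 0
    with open(in_path, newline="") as src, open(out_path, "w") as dst:
        for tweet_id in filter_ids(csv.reader(src), **filters):
            dst.write(tweet_id + "\n")
            count += 1
    return count
```

For example, `filter_file("input.csv", "ids.txt", country="AU")` (with hypothetical file names) would write the original tweets tagged AU to a plain-text file with one identifier per line, which can then be passed to a hydration tool.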

Dataset usage terms:

By using this dataset, you agree to (i) use the content of this dataset for non-commercial research only, (ii) remain in compliance with Twitter's Policy, and (iii) cite the following articles:

(a) Lamsal, R., Read, M. R., & Karunasekera, S. (2023). BillionCOV: An enriched billion-scale collection of COVID-19 tweets for efficient hydration. Data in Brief, 48, 109229.

(b) Lamsal, R. (2021). Design and analysis of a large-scale COVID-19 tweets dataset. Applied Intelligence, 51, 2790-2804.

BibTeX entries:

@article{lamsal2023billioncov,
  title={BillionCOV: An Enriched Billion-scale Collection of COVID-19 tweets for Efficient Hydration},
  author={Lamsal, Rabindra and Read, Maria Rodriguez and Karunasekera, Shanika},
  journal={Data in Brief},
  volume={48},
  year={2023},
  pages={109229},
  publisher={Elsevier}
}
@article{lamsal2021design,
  title={Design and analysis of a large-scale COVID-19 tweets dataset},
  author={Lamsal, Rabindra},
  journal={Applied Intelligence},
  volume={51},
  number={5},
  pages={2790--2804},
  year={2021},
  publisher={Springer}
}

Instructions: 

Instructions for using this dataset are provided in the associated paper: https://arxiv.org/abs/2301.11284

Comments

Hi,

Thank you so much for this dataset. Could you provide more information on which dates of Twitter data are included in each data subset? That would be very helpful.

Thank you!

Submitted by Praneetha Vissa... on Thu, 06/15/2023 - 13:24

Hi Praneetha,

Apologies for the late reply. For more information on the date/time and the respective files, please refer to the associated paper (https://www.sciencedirect.com/science/article/pii/S2352340923003487).

Submitted by Rabindra Lamsal on Thu, 08/17/2023 - 05:31