Stock Market Tweets Data

Citation Author(s):
Bruno Taborda
Instituto Universitário de Lisboa (ISCTE-IUL), Lisbon, Portugal & Instituto Universitário de Lisboa (ISCTE-IUL), ISTAR, Lisbon, Portugal & CISUC - Center for Informatics and Systems of the University of Coimbra, Coimbra, Portugal
Ana de Almeida
Instituto Universitário de Lisboa (ISCTE-IUL), Lisbon, Portugal & Instituto Universitário de Lisboa (ISCTE-IUL), ISTAR, Lisbon, Portugal & CISUC - Center for Informatics and Systems of the University of Coimbra, Coimbra, Portugal
José Carlos Dias
Instituto Universitário de Lisboa (ISCTE-IUL), Lisbon, Portugal & Business Research Unit (BRU-IUL), Lisbon, Portugal
Fernando Batista
Instituto Universitário de Lisboa (ISCTE-IUL), Lisbon, Portugal & INESC-ID, Lisbon, Portugal
Ricardo Ribeiro
Instituto Universitário de Lisboa (ISCTE-IUL), Lisbon, Portugal & INESC-ID, Lisbon, Portugal
Submitted by:
Bruno Taborda
Last updated:
Thu, 05/13/2021 - 10:27
DOI:
10.21227/g8vy-5w61

Abstract 

Twitter is one of the most popular social networks for sentiment analysis. This data set consists of tweets related to the stock market. We collected 943,672 tweets between April 9 and July 16, 2020, using the S&P 500 tag (#SPX500), the references to the top 25 companies in the S&P 500 index, and the Bloomberg tag (#stocks). Of the 943,672 tweets, 1,300 were manually annotated as positive, neutral, or negative. A second independent annotator reviewed the manually annotated tweets. This annotated data set can contribute to creating new domain-specific lexicons or enriching existing dictionaries. Researchers can train their supervised models using the annotated data set. Additionally, the full data set can be used for text mining and sentiment analysis related to the stock market.

Instructions: 

The raw Twitter data was downloaded through the Twitter REST API search using the Tweepy Python package (version 3.8.0), which was created to make interaction between developers and the REST API easier. The Twitter REST API only retrieves data from the past seven days and allows tweets to be filtered by language. Only tweets in English (en) were retrieved. Data collection was performed from April 9 to July 16, 2020, using the following Twitter tags as search parameters: #SPX500, #SP500, SPX500, SP500, $SPX, #stocks, $MSFT, $AAPL, $AMZN, $FB, $BBRK.B, $GOOG, $JNJ, $JPM, $V, $PG, $MA, $INTC, $UNH, $BAC, $T, $HD, $XOM, $DIS, $VZ, $KO, $MRK, $CMCSA, $CVX, $PEP, $PFE. Because of the large volume of data retrieved in the raw files, only each tweet's content and creation date were stored.
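
The collection step described above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' actual script: the function names `to_row` and `fetch_english_tweets`, the `limit` parameter, and the choice of one search per term are hypothetical. The `api` argument is assumed to be an authenticated `tweepy.API` object. The `api.search` method belongs to Tweepy 3.x (later major versions renamed it `search_tweets`).

```python
# Search terms copied from the description above (subset shown for brevity).
SEARCH_TERMS = ["#SPX500", "#SP500", "$SPX", "#stocks", "$MSFT", "$AAPL"]


def to_row(text, created_at):
    """Keep only the two stored fields: tweet content and creation date."""
    # Collapse newlines and repeated spaces so each tweet fits on one CSV row.
    return {"created_at": str(created_at), "text": " ".join(text.split())}


def fetch_english_tweets(api, term, limit=100):
    """Yield rows for English tweets matching `term` (hypothetical helper)."""
    import tweepy  # imported lazily; assumes Tweepy 3.8.0 is installed

    # lang="en" asks the search API for English tweets only;
    # tweet_mode="extended" exposes the untruncated text as `full_text`.
    cursor = tweepy.Cursor(api.search, q=term, lang="en", tweet_mode="extended")
    for status in cursor.items(limit):
        yield to_row(status.full_text, status.created_at)
```

Because the search API only reaches back seven days, a script like this would have to be run repeatedly over the collection period to cover April 9 to July 16, 2020.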

The file tweets_labelled_09042020_16072020.csv consists of 5,000 tweets selected by random sampling from the 943,672 collected. Of those 5,000 tweets, 1,300 were manually annotated and reviewed by a second independent annotator. The file tweets_remaining_09042020_16072020.csv contains the remaining 938,672 tweets.
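
Since only 1,300 of the 5,000 rows in the labelled file carry an annotation, a first step when using it is to separate annotated from unannotated rows. The sketch below shows one possible way to do this with the standard library. The column name `sentiment`, the comma delimiter, and the helper name `split_annotated` are assumptions, not confirmed details of the file; check the header of the CSV and adjust them. The sample rows are made up for illustration and are not taken from the data set.

```python
import csv
import io


def split_annotated(handle, label_column="sentiment", delimiter=","):
    """Split rows into (annotated, unannotated) by whether the label is filled in."""
    annotated, unannotated = [], []
    for row in csv.DictReader(handle, delimiter=delimiter):
        # Treat a missing or whitespace-only label as "not annotated".
        label = (row.get(label_column) or "").strip()
        (annotated if label else unannotated).append(row)
    return annotated, unannotated


# Tiny in-memory stand-in for the labelled file (made-up rows, not real data).
sample = io.StringIO(
    "created_at,text,sentiment\n"
    "2020-04-09,Example tweet one,positive\n"
    "2020-04-10,Example tweet two,\n"
)
annotated, unannotated = split_annotated(sample)
print(len(annotated), len(unannotated))
```

To run this on the real file, pass an open file handle, for example `open("tweets_labelled_09042020_16072020.csv", newline="", encoding="utf-8")`, in place of `sample`.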