This dataset includes 24,201,654 tweets related to the US Presidential Election on November 3, 2020, collected between July 1, 2020, and November 11, 2020. The related party name and the sentiment score of each tweet, as well as the words that affect the score, were added to the dataset.


The dataset contains more than 20 million tweets, each with 11 attributes. The data file is in comma-separated values (CSV) format and its size is 3.48 GB. It is compressed with WinRAR for easier upload and download; the compressed file size is 766 MB. The data file contains the following information (11 columns) for each tweet:

Created-At: Exact creation time of the tweet [Jul 1, 2020 7:44:48 PM to Nov 12, 2020 5:47:59 PM]
From-User-Id: Unique ID of the user that sent the tweet
To-User-Id: Unique ID of the user the tweet was sent to
Language: Language of the tweet, coded in ISO 639-1 [90% of tweets en: English; 3.8% und: Unidentified; 2.5% es: Spanish].
Retweet-Count: Number of retweets
PartyName: Label showing which party the tweet is about. [Democrats] or [Republicans] if the tweet contains any keyword (given above) related to the Democratic or Republican party. If it contains keywords about both parties, the label is [Both]. If it contains no keyword about either major party (Democratic or Republican), the label is [Neither].
Id: Unique ID of the tweet
Score: The sentiment score of the tweets. A positive (negative) score means positive (negative) emotion.
Scoring String: Nominal attribute with all words taking part in the scoring
Negativity: The sum of negative components
Positivity: The sum of positive components
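
The PartyName rule described above can be sketched roughly as follows. This is only an illustration: the keyword sets and the function name party_label are hypothetical, since the actual keyword lists are not reproduced in this description, and the tokenization used for the real labels may differ.

```python
import re

# Hypothetical keyword sets, for illustration only; the dataset's real
# keyword lists are not given in this description.
DEMOCRAT_KEYWORDS = {"biden", "democrats"}
REPUBLICAN_KEYWORDS = {"trump", "republicans"}


def party_label(text: str) -> str:
    """Return Democrats, Republicans, Both, or Neither for a tweet text."""
    # Lowercase and split into simple word tokens (an assumed tokenization).
    tokens = set(re.findall(r"[a-z0-9_]+", text.lower()))
    has_dem = bool(tokens & DEMOCRAT_KEYWORDS)
    has_rep = bool(tokens & REPUBLICAN_KEYWORDS)
    if has_dem and has_rep:
        return "Both"
    if has_dem:
        return "Democrats"
    if has_rep:
        return "Republicans"
    return "Neither"


if __name__ == "__main__":
    print(party_label("Biden and Trump debate tonight"))
    print(party_label("I voted today"))
```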

The VADER algorithm is used for sentiment analysis of the tweets. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment algorithm for scoring text. It is specifically attuned to sentiments expressed in social media and produces scores based on a dictionary of words. This operator calculates and then exposes the sum of all sentiment word scores in the text. For more details about this algorithm:
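
To make the relationship between Score, Scoring String, Positivity and Negativity concrete, here is a minimal sketch of a lexicon-sum scorer. It is not the VADER implementation: the lexicon values are made up, the rule-based adjustments of VADER are omitted, and both the Scoring String layout and the sign convention for Negativity are assumptions.

```python
import re

# Toy lexicon with made-up valence values; NOT the real VADER lexicon.
TOY_LEXICON = {"great": 3.0, "win": 2.5, "bad": -2.5, "fraud": -2.0}


def score_text(text, lexicon=TOY_LEXICON):
    """Sum the lexicon values of all scoring words found in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    # Keep every scoring word together with its value, in order of appearance.
    hits = [(tok, lexicon[tok]) for tok in tokens if tok in lexicon]
    positivity = sum(value for _, value in hits if value > 0)
    negativity = sum(value for _, value in hits if value < 0)
    return {
        "Score": positivity + negativity,
        # Assumed layout: each scoring word followed by its value.
        "Scoring String": " ".join(f"{tok} ({value})" for tok, value in hits),
        "Positivity": positivity,
        "Negativity": negativity,
    }


if __name__ == "__main__":
    print(score_text("A great win, but a bad count"))
```

In this sketch a tweet with no lexicon words gets a Score of 0 and an empty Scoring String.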

This data can be used to develop methods for predicting election results from social media. It can also be used in text mining studies, such as understanding how feelings in tweets about the parties change over time, determining the topics that cause positive or negative feelings about the candidates, and understanding the main issues that concern Twitter users about the US election.


Modern science is built on systematic experimentation and observation. The reproducibility and replicability of experiments and observations are central to science. However, reproducibility and replicability are not always guaranteed, a situation sometimes referred to as the 'crisis of reproducibility'. To analyze the extent of the crisis, we conducted an online survey on the state of reproducibility in remote sensing. The respondents' answers are saved in this dataset in full-text CSV format.


The file contains the answers to our online survey on reproducibility in remote sensing. The format is comma-separated values (CSV) in full text, i.e. the answers are saved as full text instead of numbers, which makes them easy to understand and analyse.


The dataset also includes the report provided by the website on which the survey was hosted. It can be used for a quick overview of the results, and also to see the original questions and the possible answers.


Reddit is one of the largest social media websites in the world and it contains valuable data about its users and their perspectives organized into virtual communities or subreddits, based on common areas of interest.  Substance use issues are particularly salient within this online community due to the burgeoning substance use (opioid) crisis within the United States among other countries.  A particularly important location for understanding user perceptions of opioids is the Philadelphia, Pennsylvania, USA region, due to the prevalence associated with overdose deaths.  To collect user sen


Included is the dataset in a CSV file, data dictionary for all variables (column key) in a text file, keyword list used to query the Reddit API in a text file, and the targeted subreddit list in a text file. The dataset comprises entries (submissions, comments) that had keyword query results within targeted subreddits.  The dataset includes designations for submissions and comments within the data dictionary; submission denotes the first order entry within a subreddit, comment denotes entries that are posted in response to submissions or other comments. Rows include all potential entries within the targeted subreddits from January 1, 2005 – May 14, 2020.  


There are 56,979 rows of data in the CSV file.


This dataset is offered as .csv and is part of 3 files which are:

- File 1: has all 1699 Arabic news headlines collected, with the corresponding emotion classification that 3 annotators agreed on with no bias

- File 2: has the dataset with BOW features extracted

- File 3: has the dataset with n-gram features extracted
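
For readers unfamiliar with the two feature types, the sketch below shows one common way to derive bag-of-words (BOW) counts and n-grams from a tokenized headline. It is an illustration only: the example headline, the whitespace tokenization and the function names are assumptions, not the procedure used to build Files 2 and 3.

```python
from collections import Counter


def bag_of_words(tokens):
    """Count how often each token occurs, ignoring word order."""
    return Counter(tokens)


def ngrams(tokens, n):
    """Return all contiguous runs of n tokens, preserving order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    # Hypothetical English headline; the dataset itself holds Arabic headlines.
    tokens = "markets rise as oil prices rise".split()
    print(bag_of_words(tokens))
    print(Counter(ngrams(tokens, 2)))
```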


While social media has proven to be an exceptionally useful tool for interacting with other people and for spreading helpful information massively and quickly, its great potential has also been ill-intentionally leveraged to distort political elections and manipulate constituents. In the paper at hand, we analyzed the presence and behavior of social bots on Twitter in the context of the November 2019 Spanish general election.


Data have been exported in three formats to provide the maximum flexibility:

  • MongoDB Dump BSONs
    • To import these data, please refer to the official MongoDB documentation.
  • JSON Exports
    • Both the users and the tweets collections have been exported as canonical JSON files. 
  • CSV Exports (only tweets)
    • The tweet collection has been exported as plain CSV file with comma separators.
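
Canonical JSON exports wrap numbers, dates and object IDs in typed objects such as {"$numberLong": "..."}. The sketch below shows one way to unwrap the most common wrappers using only the Python standard library. The field names in the sample document are hypothetical, only a few wrapper types are handled, and whether the export files hold one document per line or a single JSON array is not stated here, so the example parses a single document string.

```python
import json
from datetime import datetime, timezone


def unwrap_extended_json(obj):
    """object_hook that turns common canonical Extended JSON wrappers into Python values."""
    if len(obj) == 1:
        (key, value), = obj.items()
        if key in ("$numberInt", "$numberLong"):
            return int(value)
        if key == "$numberDouble":
            return float(value)
        if key == "$oid":
            return value  # keep the ObjectId as its hex string
        if key == "$date" and isinstance(value, int):
            # The inner {"$numberLong": ...} was already unwrapped to
            # milliseconds since the Unix epoch.
            return datetime.fromtimestamp(value / 1000, tz=timezone.utc)
    return obj


# Hypothetical document; real field names may differ.
SAMPLE = (
    '{"_id": {"$oid": "5dc8a1f2e4b0c1a2b3c4d5e6"}, '
    '"retweet_count": {"$numberInt": "7"}, '
    '"user_id": {"$numberLong": "1190000000000000000"}, '
    '"created_at": {"$date": {"$numberLong": "1573344000000"}}}'
)

if __name__ == "__main__":
    doc = json.loads(SAMPLE, object_hook=unwrap_extended_json)
    print(doc)
```

If pymongo is installed, its bson.json_util module is a more complete option for parsing such files.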

Socio-economic, demographic and protest data of South Africa's 345 Local Municipalities.





Muni - - Name of South African municipality in 2011


Change in Population - - % change in population between 2001 and 2011


% Unemployed  - - Level of unemployment using official definition (2011)


% Poor - - Poverty percentage at household level (2011)


Voter Turnout - - % of eligible voters voting in the 2009 National elections (2011)


Mean Age - - Average age of people living in municipality (2011)


Small Households - - Percentage of households with 1 or 2 people (2011)


Large Households - - Percentage of households with 4 or more people (2011)


Percentage Youth - - Percentage of population between 16 and 35 (inclusive) (2011)


Dep Ratio - - Non working population (by ages, not by employment status) as percentage of total population. Here, "non working" = below 18 or over 60. (2011)


% White - - Percentage of population self-identifying as white (2011)


% HWW  - - Percentage of households without (reticulated/municipally piped) water (2011)


R-P Ratio - - Ratio of richest people to poorest people  (2011)


Gini - - Municipal Gini coefficient (2011)


% PWM - - Percentage of population with a "Matric" (Grade 12) certificate (2011)


% NSL - - Percentage of the population self-reporting their home language to be other than one of South Africa's 11 National Languages  (2011)


% Tribal - - Percentage of households living in areas classified as "tribal" by official "geographic type"  (2011)


% Male - - Percentage of the population that self-identifies as male (2011)


% Informal - - Percentage of households that are classified as informal (structure types) (2011)


% Rural - - Percentage of households that are classified as rural (2011)


Urban =1 - - One-hot/dummy variable - 1 = municipality is classified as either "Metropolitan" or "Secondary City" (2011)


Urban - - Municipal classification


COGTA Score - - 4-level score of municipal efficiency (running good to bad) (2011)


Percentage_Poor - - Individual level poverty (best avoided)


AG Rating - - 5-level score of municipal governance by the South African Auditor-General (running good to bad) (2011)


Prov - - 9 provinces, numeric


Prov2 - - 9 provinces, names


Pop - - municipal population (2011)


Target 1 - -  Count of protests 1997-2013


TargetLog - - Log10 of Target1


TagetNatLog - - LogE of Target1


X Values - - Safely ignore


Y Values - - Safely Ignore


Target 2 - - Number of protests per capita (2009-2013)


2NatLog - - LogE of Target 2


NatLog7 - - LogE of count of protests*crowd-size of protests / capita (2009-2013)


Target10 - -  Turmoil of protests (0 - 1, where 1 = 100% violent) (2009-2013)


13NatLog - - LogE of count of protests*crowd-size of protests / capita (2009-2013) only considering community protests


15NatLog - - LogE of count of protests*crowd-size of protests / capita (2009-2013) only considering Labour-related protests
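
The logarithmic target columns above follow the usual definitions of the base-10 and natural logarithm. As a rough sketch (the function names are hypothetical, and how the dataset treats municipalities with zero protests is an assumption, since the logarithm of zero is undefined):

```python
import math


def log_targets(protest_count):
    """Return (log10, natural log) of a protest count, as in TargetLog and TagetNatLog."""
    if protest_count <= 0:
        # Assumption: zero counts are left undefined rather than shifted or clipped.
        return None, None
    return math.log10(protest_count), math.log(protest_count)


def protests_per_capita(protest_count, population):
    """Protest count divided by municipal population, as in Target 2."""
    return protest_count / population


if __name__ == "__main__":
    print(log_targets(100))
    print(protests_per_capita(25, 50_000))
```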



This simulated combat reports dataset combines fictional headings, reporting units, and attack times with real data from 551 records of terrorist attacks in Afghanistan (2009–2010) [1]. The dataset combines selected attributes from the DA Form 1594 [2] and U.S. Army Spot Report [3]. The dataset also includes additional attributes for tactical context.


Along with the increasing use of unmanned aerial vehicles (UAVs), large volumes of aerial videos have been produced. It is unrealistic for humans to screen such big data and understand their contents. Hence methodological research on the automatic understanding of UAV videos is of paramount importance.


=================  Authors  ===========================

Lichao Mou,

Yuansheng Hua,

Pu Jin,

Xiao Xiang Zhu,


=================  Citation  ===========================

If you use this dataset for your work, please use the following citation:


  title= {{ERA: A dataset and deep learning benchmark for event recognition in aerial videos}},

  author= {Mou, L. and Hua, Y. and Jin, P. and Zhu, X. X.},

  journal= {IEEE Geoscience and Remote Sensing Magazine},

  year= {in press}



==================  Notice!  ===========================

This dataset is ONLY released for academic uses. Please do not further distribute the dataset on other public websites.


Behavioral traits for 115 employees in public buildings, namely sensitivity to personal comfort loss, desire for conformance to social norms, desire for teaming, desire for rewards. Data has been collected in the pilot studies of H2020 ChArGED for gamified energy conservation in public buildings (grant agreement no. 696170). 


The dataset comprises raw data for validating methods for reliable data collection. We proposed the data collection methods as a step toward assessing digital healthcare apps. To validate the methods, we conducted experiments on Amazon Mechanical Turk (MTurk) and showed, based on statistical tests, that the methods are meaningful.