Free dataset from news/message boards/blogs about CoronaVirus (4 months of data - 5.2M posts). The time frame of the data is Dec/2019 - March/2020. The posts are in English and mention at least one of the following: "Covid" OR CoronaVirus OR "Corona Virus".

 

Instructions: 

The data is stored inside a zip file that contains a JSON file. Here is an example of a JSON post:

 

 

{
"organizations":[],
"uuid":"2b50b3f00e04fc17912154a7b88f3359db2b1ae8",
"thread":{
"social":{
"gplus":{
"shares":0
},
"pinterest":{
"shares":1
},
"vk":{
"shares":0
},
"linkedin":{
"shares":0
},
"facebook":{
"likes":19,
"shares":63,
"comments":7
},
"stumbledupon":{
"shares":0
}
},
"site_full":"www.foxnews.com",
"main_image":"https://cf-images.us-east-1.prod.boltdns.net/v1/static/694940094001/abd7...",
"site_section":"http://feeds.foxnews.com/foxnews/latest",
"section_title":"FOX News",
"url":"https://www.foxnews.com/media/dr-siegel-on-coronavirus-i-think-is-a-whop...",
"country":"US",
"domain_rank":185,
"title":"Dr. Marc Siegel on coronavirus: 'I think it is a whopping amount of cases undiagnosed'",
"performance_score":0,
"site":"foxnews.com",
"participants_count":1,
"title_full":"",
"spam_score":0.0,
"site_type":"news",
"published":"2020-03-14T04:20:00.000+02:00",
"replies_count":0,
"uuid":"2b50b3f00e04fc17912154a7b88f3359db2b1ae8"
},
"author":"Victor Garcia",
"url":"https://www.foxnews.com/media/dr-siegel-on-coronavirus-i-think-is-a-whop...",
"ord_in_thread":0,
"title":"Dr. Marc Siegel on coronavirus: 'I think it is a whopping amount of cases undiagnosed'",
"locations":[],
"entities":{
"persons":[{
"name":"marc siegel",
"sentiment":"negative"
},{
"name":"siegel",
"sentiment":"none"
},{
"name":"tucker carlson",
"sentiment":"none"
},{
"name":"trump",
"sentiment":"none"
},{
"name":"trump",
"sentiment":"none"
}],
"locations":[{
"name":"us",
"sentiment":"none"
}],
"organizations":[{
"name":"fox news",
"sentiment":"negative"
}]
},
"highlightText":"",
"language":"english",
"persons":[],
"text":"US doctors report inability to get tests for coronavirus patients Reaction from Fox News medical correspondent Dr. Marc Siegel. Dr. Marc Siegel appeared on \" Tucker Carlson Tonight \" on Friday where he gave his assessment of the coronavirus pandemic in the aftermath of President Trump declaring a national emergency. \"There's many thousands of cases that have not been diagnosed, possibly because they're mild, but it's not too late to test because we don't have another system we can work with,\" Siegel said.\nTRUMP DECLARES NATIONAL EMERGENCY OVER CORONAVIRUS, ENLISTS PRIVATE SECTOR\nSiegel also spoke about the problems hospitals and labs are having, fearing they are exposing their workers to the virus.\n\"Doctors are being told you don't see these patients. Well, we don't know what to do with them then. And the only thing we have is a test, except you can't actually do the test because the lab, and I just found out this today... they're not going to do the tests,\" Siegel said. \"Tucker, even if they have the equipment, they don't want to put their lab technicians, in my opinion, in the line of fire and be subjected to possible coronavirus.\"\nThe solution, according to the Fox News medical contributor, is to do what South Korea and facilities in Nebraska are doing -- drive-thru testing facilities.\n\"You have to have people dressed up in personal protective equipment like we showed in Nebraska. They have to be doing [it] very carefully. And it's got to be done on a high volume basis anyway,\" Siegel said. \"It can't be contained anymore. But I'll tell you why I want it done.\"\nCLICK HERE TO GET THE FOX NEWS APP\nSiegel said although the virus has already spread, testing is vital to \"reassure people who don't have it\" and \"decrease the panic.\"\n\"We have to know who has this so we can protect the people most at risk, even if it's sustained throughout all communities,\" Siegel said. \"I think it is a whopping amount of cases undiagnosed. We still need to know who has it.\" Get all the stories you need-to-know from the most powerful name in news delivered first thing every morning to your inbox Arrives Weekdays",
"external_links":["https://www.google.com/url","https://google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=12&cad=rja&uac...","https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=12&cad=rja..."],
"published":"2020-03-14T04:20:00.000+02:00",
"crawled":"2020-03-14T04:31:41.175+02:00",
"highlightTitle":""
}
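
As a rough illustration of how a post like the one above might be consumed, the following Python sketch pulls a few fields out of a single post object. The function name summarize_post and the inline sample post (trimmed to a handful of the fields shown above) are assumptions made for illustration. How posts are laid out inside the archive's JSON file (one array, or one object per line) is not stated here, so loading the file itself is left out.

```python
import json
from datetime import datetime


def summarize_post(post):
    """Collect a few headline fields from one post dictionary."""
    thread = post.get("thread", {})
    facebook = thread.get("social", {}).get("facebook", {})
    entities = post.get("entities", {})

    # Names of entities (persons, locations, organizations) tagged "negative".
    negative = sorted({
        entity["name"]
        for group in entities.values()
        for entity in group
        if entity.get("sentiment") == "negative"
    })

    return {
        "site": thread.get("site"),
        "title": post.get("title"),
        # Timestamps carry a UTC offset, e.g. 2020-03-14T04:20:00.000+02:00.
        "published": datetime.fromisoformat(post["published"]),
        "facebook_shares": facebook.get("shares", 0),
        "negative_entities": negative,
    }


# Trimmed copy of the example post (assumption: only the fields used here).
sample = json.loads("""
{
  "thread": {
    "social": {"facebook": {"likes": 19, "shares": 63, "comments": 7}},
    "site": "foxnews.com"
  },
  "title": "Dr. Marc Siegel on coronavirus: 'I think it is a whopping amount of cases undiagnosed'",
  "entities": {
    "persons": [{"name": "marc siegel", "sentiment": "negative"},
                {"name": "trump", "sentiment": "none"}],
    "locations": [{"name": "us", "sentiment": "none"}],
    "organizations": [{"name": "fox news", "sentiment": "negative"}]
  },
  "published": "2020-03-14T04:20:00.000+02:00"
}
""")

summary = summarize_post(sample)
print(summary["site"], summary["facebook_shares"], summary["negative_entities"])
```

Using .get() with defaults keeps the sketch tolerant of posts that lack a social or entities section, which may or may not occur in the real data.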

 


Network traffic analysis, i.e. the umbrella of procedures for distilling information from network traffic, enables highly valuable profiling information and is the workhorse for several key network management tasks. While its nature is currently being revolutionized by the rising share of traffic generated by mobile and hand-held devices, existing design solutions are mainly evaluated on private traffic traces, and only a few public datasets are available, which limits repeatability and further advances on the topic.

Instructions: 

MIRAGE-2019 is a human-generated dataset for mobile traffic analysis with associated ground-truth, having the goal of advancing the state-of-the-art in mobile app traffic analysis.

MIRAGE-2019 takes into consideration the traffic generated by more than 280 experimenters using 40 mobile apps via 3 devices.

APP LIST reports the details on the apps contained in the two versions of the dataset.

If you are using the MIRAGE-2019 human-generated dataset for scientific papers, academic lectures, project reports, or technical documents, please help us increase its impact by citing the following reference:

Giuseppe Aceto, Domenico Ciuonzo, Antonio Montieri, Valerio Persico and Antonio Pescapè, "MIRAGE: Mobile-app Traffic Capture and Ground-truth Creation", 4th IEEE International Conference on Computing, Communications and Security (ICCCS 2019), October 2019, Rome (Italy).



WiFi measurements dataset for WiFi fingerprint indoor localization, compiled on the first and ground floors of the Escuela Técnica Superior de Ingeniería Informática, in Seville, Spain. The facility covers approximately 24,000 m², although only accessible areas were surveyed.

Instructions: 

The training dataset consists of 7175 fingerprints collected from 489 different locations. Each fingerprint is stored as a JSON object corresponding to a unique scan with the following values:

  • _id: a unique identifier for the fingerprint, used to differentiate one fingerprint from another.

  • avgMagneticMagnitude: average magnetic magnitude measured by the mobile phone sensor during scanning. Although this value is not used, it is provided in case it proves useful.

  • location: object with the real-world coordinates at which the sample was captured.

    • floor: number indicating the floor on which the sample was captured.

    • lat: latitude of the coordinate at which the sample was captured.

    • lon: longitude of the coordinate at which the sample was captured.

  • timestamp: UNIX timestamp at which the sample was captured.

  • userId: identifier of the user who captured the sample. This value is anonymized so that it is not directly identifiable but remains unique.

  • wifiDevices: list of APs appearing in the sample.

    • bssid: unique AP identifier. This value is anonymized so that it is not directly identifiable but remains unique.

    • frequency: AP WiFi frequency.

    • level: AP WiFi signal strength (RSSI).

    • ssid: AP name. This value is anonymized so that it is not directly identifiable but can be used to compare APs with the same name.

The training dataset was compiled by taking samples every 3 meters on average, with 15 samples per location. Approximately 40 seconds were spent at each location performing consecutive scans with a bq Aquaris E5 4G device using Android stock 6.0.1, without making any movements during the process. The following is an example of a fingerprint; the list of WiFi devices has been shortened to two APs because it was too long.

{
"_id":"5cc81e8ac28d6d2533709425",
"avgMagneticMagnitude":40.615368,
"location":{
"floor":1,
"lat": 37.357746,
"lon": -5.9878354
},
"timestamp":1556618890,
"userId":"USER-0",
"wifiDevices":[
{
"bssid":"AP-BSSID-0",
"frequency":2457,
"level":-75,
"ssid":"AP-SSID-0"
},
...
{
"bssid":"AP-BSSID-23",
"frequency":2437,
"level":-64,
"ssid":"AP-SSID-6"
}
]
}
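
To make the fingerprint structure concrete, here is a minimal Python sketch that turns a fingerprint into a bssid-to-RSSI map and looks up the closest training fingerprint in signal space. The 1-nearest-neighbour rule, the MISSING_RSSI fill value, the function names and the second toy fingerprint are all assumptions for illustration; nothing here is claimed to be the method used to produce the dataset.

```python
import math

# Assumption: an AP absent from a scan is treated as a very weak signal.
MISSING_RSSI = -100


def rssi_by_bssid(fingerprint):
    """Map each AP's bssid to its signal level (RSSI) for one fingerprint."""
    return {ap["bssid"]: ap["level"] for ap in fingerprint["wifiDevices"]}


def rssi_distance(fp_a, fp_b):
    """Euclidean distance between two fingerprints in RSSI space."""
    a, b = rssi_by_bssid(fp_a), rssi_by_bssid(fp_b)
    return math.sqrt(sum(
        (a.get(bssid, MISSING_RSSI) - b.get(bssid, MISSING_RSSI)) ** 2
        for bssid in a.keys() | b.keys()
    ))


def nearest_location(scan, training):
    """Return the location of the training fingerprint closest to the scan."""
    best = min(training, key=lambda fp: rssi_distance(scan, fp))
    return best["location"]


# Toy training set: the first entry reuses values from the example above,
# the second entry and the scan are made up for this sketch.
training = [
    {"location": {"floor": 1, "lat": 37.357746, "lon": -5.9878354},
     "wifiDevices": [{"bssid": "AP-BSSID-0", "level": -75},
                     {"bssid": "AP-BSSID-23", "level": -64}]},
    {"location": {"floor": 0, "lat": 37.358547, "lon": -5.9867215},
     "wifiDevices": [{"bssid": "AP-BSSID-0", "level": -50}]},
]
scan = {"wifiDevices": [{"bssid": "AP-BSSID-0", "level": -72},
                        {"bssid": "AP-BSSID-23", "level": -66}]}

print(nearest_location(scan, training))
```

Keying on bssid rather than ssid follows from the field descriptions: bssid is unique per AP, while several APs may share one ssid.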

The testing dataset consists of two tests with a total of 390 samples, taken at random locations within the areas covered by the training dataset and with different devices. The dataset is grouped by test, and each test contains its captured samples, so both the individual error and the average error can be obtained, and the error can be recalculated to test different algorithms. Each test is stored as a JSON object with the following values:

  • _id: a unique identifier for the test, used to differentiate one test from another.

  • userId: identifier of the user who performed the test. This value is anonymized so that it is not directly identifiable but remains unique.

  • startTimestamp: UNIX timestamp that indicates when the test started.

  • endTimestamp: UNIX timestamp that indicates when the test ended.

  • samples: list of samples taken during testing.

    • timestamp: UNIX timestamp that indicates when the sample was collected.

    • real: object with the real-world coordinates at which the sample was captured.

      • floor: number indicating the floor on which the sample was captured.

      • lat: latitude of the coordinate at which the sample was captured.

      • lon: longitude of the coordinate at which the sample was captured.

    • predicted: object with the predicted real-world coordinates.

      • floor: number indicating the predicted floor.

      • lat: latitude of the predicted coordinate.

      • lon: longitude of the predicted coordinate.

    • wifiDevices: list of APs appearing in the sample.

      • bssid: unique AP identifier. This value is anonymized so that it is not directly identifiable but remains unique.

      • frequency: AP WiFi frequency.

      • level: AP WiFi signal strength (RSSI).

      • ssid: AP name. This value is anonymized so that it is not directly identifiable but can be used to compare APs with the same name.

    • error: approximate distance between the actual location and the predicted location.

  • error: average distance between the actual locations and the predicted locations.

The testing dataset was compiled two days after the training phase by taking samples at random locations roughly 3 meters apart on average, performing a single scan per location. The samples were taken with two devices, one for each test: a bq Aquaris E5 4G device using Android stock 6.0.1 and a Xiaomi Redmi 4X using Android 7.1.2 with MIUI 10 Global 9.5.16. Before each sample was taken, the device was held still for 5 seconds. The following is an example of a test entry; the list of samples has been shortened to one sample and the list of WiFi devices to two APs because they were too long.

{
"_id":"5d13245e279a550b548e3bfe",
"userId":"USER-0",
"startTimestamp": 1557212799.6555429,
"endTimestamp": 1557222705.0710876,
"samples":[
{
"timestamp":1557212799.6552203,
"real":{
"floor":0,
"lat":37.358547,
"lon":-5.9867215
},
"predicted":{
"floor":0,
"lat":37.358547,
"lon":-5.9868493
},
"wifiDevices":[
{
"bssid":"AP-BSSID-156",
"frequency":2412,
"level":-80,
"ssid":"AP-SSID-5"
},
...
{
"bssid":"AP-BSSID-146",
"frequency":2462,
"level":-36,
"ssid":"AP-SSID-6"
}
],
"error":5.233510868645419
},
...
],
"error":3.975672826048607
}
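
Since the real and predicted coordinates are both stored, the error can be recomputed for other algorithms. The Python sketch below does this with the haversine great-circle distance. The choice of haversine, the Earth radius constant and the function names are assumptions of this sketch; how the stored error values were computed is not specified here, so a recomputed distance need not match the error field, and the floor is ignored.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters (assumption)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def sample_error(sample):
    """Horizontal distance between a sample's real and predicted positions."""
    real, pred = sample["real"], sample["predicted"]
    return haversine_m(real["lat"], real["lon"], pred["lat"], pred["lon"])


def mean_error(test):
    """Average of the per-sample distances for one test entry."""
    errors = [sample_error(s) for s in test["samples"]]
    return sum(errors) / len(errors)


# Coordinates copied from the example test entry above; other fields omitted.
test_entry = {
    "samples": [
        {"real": {"floor": 0, "lat": 37.358547, "lon": -5.9867215},
         "predicted": {"floor": 0, "lat": 37.358547, "lon": -5.9868493}},
    ]
}

print(round(mean_error(test_entry), 2))
```

To evaluate a different localization algorithm, one would replace each sample's predicted object with the new prediction and call mean_error again.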

In order to provide more information about the device used in each fingerprint of the dataset, the following relationship between users and devices is given:

USER-0: Xiaomi Redmi 4X (Android 7.1.2 with MIUI 10 Global 9.5.16)

USER-1: BQ Aquaris E5 4G (Android stock 6.0.1)


Code duplicates in large code corpora have adverse effects on the evaluation and use of machine learning models that rely on them. Most existing corpora suffer from this problem to some extent. This dataset contains a "duplication" index for some of the existing corpora in Big Code research. The method for collecting this dataset is described in "The Adverse Effects of Code Duplication in Machine Learning Models of Code" by Allamanis [arXiv, to appear in SPLASH 2019].

 

Instructions: 

For each of the existing datasets, a single .json file is provided. Each JSON file has the following format:

 

[ duplicate_group_1, duplicate_group_2, ...]

 

where each duplicate group is a list of filenames of that dataset that are near duplicates.

 

For the corpora that were given as a single file (e.g. Hashimoto et al.), the line number of the original record is given instead of a filename.
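
One way such an index might be used is to keep a single representative per duplicate group and drop the rest before training or evaluation. The Python sketch below shows that idea. The function names and file names are hypothetical, keeping the first member of each group is an arbitrary choice of this sketch, and a real index would be read from its .json file with json.load.

```python
import json


def files_to_drop(duplicate_groups):
    """Keep the first file of each group; return the set of all the others."""
    drop = set()
    for group in duplicate_groups:
        drop.update(group[1:])
    return drop


def deduplicate(filenames, duplicate_groups):
    """Return filenames with all but one member of each group removed."""
    drop = files_to_drop(duplicate_groups)
    return [name for name in filenames if name not in drop]


# Hypothetical index in the [group_1, group_2, ...] format described above.
groups = json.loads(
    '[["a/Foo.java", "b/Foo.java", "c/FooCopy.java"],'
    ' ["x/Bar.java", "y/Bar.java"]]'
)
corpus = ["a/Foo.java", "b/Foo.java", "c/FooCopy.java",
          "x/Bar.java", "y/Bar.java", "z/Baz.java"]

print(deduplicate(corpus, groups))
```

For corpora indexed by line number, the same functions apply with record line numbers in place of file names.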
