Artificial Intelligence

ChnSentiCorp

ChnSentiCorp is a large-scale Chinese hotel review dataset collected by Tan Songbo. The corpus contains 10,000 reviews, automatically collected and organized from Trip.com.

Categories: Artificial Intelligence
Machine Learning

1178 Views

A Large-Scale Dataset for Active Fire Detection/Segmentation (Landsat-8)

This dataset was created from all Landsat-8 images of South America from the year 2018. More than 31 thousand images were processed (15 TB of data), and active fire pixels were found in approximately half of them. The Landsat-8 sensor has a spatial resolution of 30 meters (plus one panchromatic band at 15 m), a radiometric resolution of 16 bits, and a temporal resolution (revisit time) of 16 days. The images in our dataset are in GeoTIFF format with 10 bands (excluding the 15 m panchromatic band).

Categories: Artificial Intelligence
Computer Vision
Image Processing
Machine Learning
Remote Sensing
Geoscience and Remote Sensing
Climate Change/Environmental

6329 Views

Spoken Indian Language Identification Database

(9 languages, 8 different utterance lengths)

Languages

Assamese
Bengali
Gujarati
Hindi
Kannada
Malayalam
Marathi
Tamil
Telugu

Durations

30 sec
10 sec
5 sec
3 sec
1 sec
0.5 sec
0.2 sec
0.1 sec
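The eight utterance lengths above can be turned into sample counts for cropping a waveform. The following is a minimal sketch only: the 8000 Hz sample rate used in the usage note and the idea of cropping from the start of a recording are assumptions for illustration, not details given for this database.

```python
# Utterance lengths listed for the database, in seconds.
DURATIONS_SEC = (30, 10, 5, 3, 1, 0.5, 0.2, 0.1)


def segment_lengths(sample_rate_hz):
    """Map each utterance length to a number of samples at the given rate."""
    return {d: round(d * sample_rate_hz) for d in DURATIONS_SEC}


def crop(samples, duration_sec, sample_rate_hz):
    """Return the first duration_sec seconds of a sequence of samples.

    Cropping from the start is an assumption; the database may have
    produced its segments differently.
    """
    n = round(duration_sec * sample_rate_hz)
    if len(samples) < n:
        raise ValueError("recording is shorter than the requested duration")
    return samples[:n]
```

For example, at an assumed rate of 8000 Hz, `segment_lengths(8000)` maps the 0.1 sec length to 800 samples and the 30 sec length to 240000 samples.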

Categories: Artificial Intelligence
Digital signal processing
Machine Learning

1131 Views

GeoCoV19: A Dataset of Hundreds of Millions of Multilingual COVID-19 Tweets with Location Information

We present GeoCoV19, a large-scale Twitter dataset related to the ongoing COVID-19 pandemic. The dataset was collected over a period of 90 days, from February 1 to May 1, 2020, and consists of more than 524 million multilingual tweets. Because geolocation information is essential for many tasks such as disease tracking and surveillance, we employed a gazetteer-based approach to extract toponyms from user location and tweet content, and derived their geolocation information from Nominatim (OpenStreetMap) data at different geolocation granularity levels. In terms of geographical coverage, the dataset spans 218 countries and 47K cities worldwide. The tweets in the dataset come from more than 43 million Twitter users, including around 209K verified accounts. These users posted tweets in 62 different languages.
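A gazetteer-based toponym lookup of the kind described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the three-entry `GAZETTEER`, its country codes and granularity labels, and the `extract_toponyms` helper are all hypothetical, and a real gazetteer built from Nominatim data would be far larger.

```python
import re

# Hypothetical miniature gazetteer: place name -> (country code, granularity).
GAZETTEER = {
    "new york": ("US", "city"),
    "paris": ("FR", "city"),
    "italy": ("IT", "country"),
}

# Longest names first, so a multi-word name wins over a shorter overlap.
_NAMES = sorted(GAZETTEER, key=len, reverse=True)
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in _NAMES) + r")\b",
    re.IGNORECASE,
)


def extract_toponyms(text):
    """Return (name, country, granularity) for each gazetteer hit in text."""
    hits = []
    for match in _PATTERN.finditer(text):
        name = match.group(1).lower()
        country, level = GAZETTEER[name]
        hits.append((name, country, level))
    return hits
```

The word-boundary anchors keep a place name from matching inside a longer word, so "comparison" does not yield a hit for "paris". The same function could be applied to both the user location field and the tweet text.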

Categories: Artificial Intelligence
COVID-19
Machine Learning

5584 Views

Defective photonic bandgap crystals using Finite Difference Time Domain (FDTD)

This dataset contains Finite Difference Time Domain (FDTD) simulation results for 13 defective crystals and one non-defective crystal. It has four fields: Real, Img, Int, and Attribute. Real holds the real part of the simulated result, Img the imaginary part, and Int the intensity, all in superimposed form. Attribute is the label of the simulated crystal: label 0 denotes the non-defective crystal, and labels 1 to 13 denote the 13 defective crystals whose simulations are studied.
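The four fields can be read and grouped by crystal label roughly as follows. This is a sketch only: the CSV layout, the exact header spelling, and the two sample rows are assumptions invented for illustration, and `load_by_crystal` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Hypothetical two-row sample; the real file layout and values may differ.
SAMPLE = """Real,Img,Int,Attribute
0.6,0.8,1.0,0
0.3,-0.4,0.25,7
"""


def load_by_crystal(handle):
    """Group (complex field value, intensity) pairs by crystal label.

    Label 0 is the non-defective crystal; labels 1 to 13 are defective.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(handle):
        value = complex(float(row["Real"]), float(row["Img"]))
        groups[int(row["Attribute"])].append((value, float(row["Int"])))
    return groups


groups = load_by_crystal(io.StringIO(SAMPLE))
defective_labels = sorted(label for label in groups if label != 0)
```

Combining Real and Img into one complex number keeps each sample together; the Int column is carried along as stored rather than recomputed.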

Categories: Artificial Intelligence
Signal Processing

324 Views

English language tweets dataset for COVID-19

This large dataset contains tweets related to COVID-19. There are 226668 unique tweet IDs in the dataset, covering December 2019 through May 2020. The keywords used to crawl the tweets are 'corona', 'covid', 'sarscov2', 'covid19', and 'coronavirus'. To obtain the other 33 fields of data, send an email to "avishekgarain@gmail.com". Twitter does not allow public sharing of other tweet details (text, etc.), so they cannot be uploaded here.
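Keyword-based selection with de-duplication of tweet IDs can be sketched as below. The keyword list comes from the description; the case-insensitive substring rule and the `(tweet_id, text)` input shape are assumptions, since the exact matching used by the crawler is not stated.

```python
# Keywords listed in the dataset description.
KEYWORDS = ("corona", "covid", "sarscov2", "covid19", "coronavirus")


def matches_keywords(text, keywords=KEYWORDS):
    """True if any keyword occurs in text (case-insensitive substring match)."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


def unique_matching_ids(tweets):
    """Return IDs of matching tweets in first-seen order, without repeats.

    tweets is an iterable of (tweet_id, text) pairs (a hypothetical shape).
    """
    seen = set()
    kept = []
    for tweet_id, text in tweets:
        if tweet_id not in seen and matches_keywords(text):
            seen.add(tweet_id)
            kept.append(tweet_id)
    return kept
```

With substring matching, 'covid' already covers 'covid19' and 'corona' covers 'coronavirus'; the longer keywords are kept only to mirror the list in the description.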

Categories: Artificial Intelligence
COVID-19
Machine Learning
Other

3383 Views

COVID-19 tweets dataset for Bengali language

This large dataset contains Bengali tweets related to COVID-19. There are 36117 unique tweet IDs in the dataset, covering December 2019 through May 2020. The keywords used to crawl the tweets are 'corona', 'covid', 'sarscov2', 'covid19', and 'coronavirus'. To obtain the other 33 fields of data, send an email to "avishekgarain@gmail.com". A code snippet is given in the Documentation file. Publicly sharing Twitter data other than tweet IDs violates Twitter's policies.

Categories: Artificial Intelligence
COVID-19
Machine Learning
Biomedical and Health Sciences
Other

1481 Views

COVID-19 tweets dataset for Spanish language

This large dataset contains Spanish tweets related to COVID-19. There are 18958 unique tweet IDs in the dataset, covering December 2019 through May 2020. The keywords used to crawl the tweets are 'corona', 'covid', 'sarscov2', 'covid19', and 'coronavirus'. To obtain the other 33 fields of data, send an email to "avishekgarain@gmail.com". A code snippet is given in the Documentation file. Publicly sharing Twitter data other than tweet IDs violates Twitter's policies.

Categories: Artificial Intelligence
COVID-19
Machine Learning
Biomedical and Health Sciences
Other

1184 Views

Speech Dataset in Hindi Language

The dataset covers 100 speakers, each with 5 voice samples for training and 1 voice sample for testing, for a total of 600 voice samples. The samples were collected in different audio formats such as mpeg, mp4, mp3, and ogg, then preprocessed and converted to .wav format. Each voice sample lasts 5-10 seconds; because the lengths differ, parameters should be tuned before use. The whole dataset is 600 MB in size and 1 hour 40 minutes in duration. It can be used for speech synthesis, speaker identification, speaker recognition, speech recognition, etc.
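The figures above can be checked with a little arithmetic, and the per-speaker split sketched in a few lines. The counts and total duration come from the description; the `split_speaker` helper and the rule of taking the last file in sorted order as the test sample are assumptions for illustration, since the dataset's actual file naming is not given.

```python
SPEAKERS = 100
TRAIN_PER_SPEAKER = 5
TEST_PER_SPEAKER = 1

# 100 speakers x (5 + 1) samples each.
total_samples = SPEAKERS * (TRAIN_PER_SPEAKER + TEST_PER_SPEAKER)

# 1 hour 40 minutes, expressed in seconds.
total_seconds = 1 * 3600 + 40 * 60

# Mean length of one voice sample implied by the stated totals.
mean_seconds = total_seconds / total_samples


def split_speaker(files):
    """Split one speaker's files into (train, test) lists.

    Assumes the first five files in sorted order are for training and the
    next one is for testing; the real dataset may mark the split differently.
    """
    ordered = sorted(files)
    train = ordered[:TRAIN_PER_SPEAKER]
    test = ordered[TRAIN_PER_SPEAKER:TRAIN_PER_SPEAKER + TEST_PER_SPEAKER]
    return train, test
```

The stated totals work out to 600 samples and a mean of 10 seconds per sample.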

Categories: Artificial Intelligence
Machine Learning

5521 Views

Speech Dataset in Hindi Language

Categories: Artificial Intelligence
Machine Learning

2366 Views