Avishek Garain's picture
Congratulations!  You have been automatically subscribed to IEEE DataPort and can access all datasets on IEEE DataPort!
First Name: 
Avishek
Last Name: 
Garain
Affiliation: 
Department of Computer Science and Engineering, Jadavpur University
Job Title: 
Undergraduate Research Assistant
Expertise: 
Natural Language Processing and Deep Learning

Datasets & Analysis

Optical Character Recognition (OCR) system is used to convert the document images, either printed or handwritten, into its electronic counterpart. But dealing with handwritten texts is much more challenging than printed ones due to erratic writing style of the individuals. Problem becomes more severe when the input image is doctor's prescription. Before feeding such image to the OCR engine, the classification of printed and handwritten texts is a necessity as doctor's prescription contains both handwritten and printed texts which are to be processed separately.

Categories:
470 Views

Most text-simplification systems require an indicator of the complexity of the words. The prevalent approaches to word difficulty prediction are based on manual feature engineering. Using deep learning based models are largely left unexplored due to their comparatively poor performance. We have explored the use of one of such in predicting the difficulty of words. We have treated the problem as a binary classification problem. We have trained traditional machine learning models and evaluated their performance on the task.

Instructions: 

The data is in CSV format. Please check the research paper for obtaining the difficulty label from the I_Z score.

Categories:
1754 Views

This dataset contains 1.65 lakhs tweet ids related to death of Sushant Singh Rajput in English language. For whole dataset with all other fields drop a mail at avishekgarain@gmail.com.

Instructions: 

The file contains tweet ids. If anyone wants to hydrate the ids he/she may do so or you can mail at avishekgarain@gmail.com for drive link.

Categories:
243 Views

This dataset is very vast and contains tweets related to COVID-19. There are 226668 unique tweet-ids in the whole dataset that ranges from December 2019 till May 2020 . The keywords that have been used to crawl the tweets are 'corona',  ,  'covid ' , 'sarscov2 ',  'covid19', 'coronavirus '.  For getting the other 33 fields of data drop a mail at "avishekgarain@gmail.com". Twitter doesn't allow public sharing of other details related to tweet data( texts,etc.) so can't upload here.

Instructions: 

Read the documentation properly and use the code snippet written in python to load data.

Categories:
1720 Views

This dataset is very vast and contains Bengali tweets related to COVID-19. There are 36117 unique tweet-ids in the whole dataset that ranges from December 2019 till May 2020 . The keywords that have been used to crawl the tweets are 'corona',  ,  'covid ' , 'sarscov2 ',  'covid19', 'coronavirus '.  For getting the other 33 fields of data drop a mail at "avishekgarain@gmail.com". Code snippet is given in Documentation file. Sharing Twitter data other than Tweet ids publicly violates Twitter regulation policies.    

Instructions: 

The script to load data is written in documentation.

Categories:
595 Views

This dataset is very vast and contains Spanish tweets related to COVID-19. There are 18958 unique tweet-ids in the whole dataset that ranges from December 2019 till May 2020 . The keywords that have been used to crawl the tweets are 'corona',  ,  'covid ' , 'sarscov2 ',  'covid19', 'coronavirus '.  For getting the other 33 fields of data drop a mail at "avishekgarain@gmail.com". Code snippet is given in Documentation file. Sharing Twitter data other than Tweet ids publicly violates Twitter regulation policies.    

Instructions: 

Use the code snippet provided written in python to load data.

Categories:
333 Views

Each voice sample is stored as a .WAV file, which is then pre-processed for acoustic analysis using the specan function from the WarbleR R package. Specan measures 22 acoustic parameters on acoustic signals for which the start and end times are provided.

The output from the pre-processed WAV files were saved into a CSV file, containing 3168 rows and 21 columns (20 columns for each feature and one label column for the classification of male or female).

Categories:
276 Views

The dataset consists of reviews for various hotels throughout the world and data columns range from Location, Trip Type to various parameters of reviewing with individual review score. The data can be preprocessed and used for various purposes ranging from review categorization, topic extraction, sentiment analysis, location based quality calculation etc. Trustworthy real world data comes handy now-a-days and is tough to get a grasp on. So this dataset will be a good contribution for the researcher community as well as professionals. 

 

Categories:
5811 Views