Avishek Garain

First Name

Avishek

Last Name

Garain

Affiliation

Department of Computer Science and Engineering, Jadavpur University

Job Title

Undergraduate Research Assistant

Expertise

Natural Language Processing and Deep Learning

Dataset Entries from this Author

Dataset for classification of handwritten and printed text in a Doctor's prescription

An Optical Character Recognition (OCR) system converts document images, whether printed or handwritten, into their electronic counterparts. Handwritten text is much more challenging to handle than printed text because individual writing styles are erratic, and the problem becomes more severe when the input image is a doctor's prescription. Because a prescription contains both handwritten and printed text, which must be processed separately, the two kinds of text have to be classified before the image is fed to the OCR engine.

Gender Recognition from Voice

Each voice sample is stored as a .WAV file, which is then pre-processed for acoustic analysis using the specan function from the warbleR R package. The specan function measures 22 acoustic parameters on acoustic signals for which the start and end times are provided.

The output from the pre-processed WAV files was saved to a CSV file containing 3168 rows and 21 columns (20 feature columns and one label column giving the classification as male or female).
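A CSV laid out this way (feature columns plus one label column) can be split into feature vectors and labels with the Python standard library. The following is only a minimal sketch: the column names `meanfreq`, `sd` and `label`, the helper `split_features_and_labels`, and the two sample rows are illustrative assumptions, not taken from the dataset itself.

```python
import csv
import io


def split_features_and_labels(csv_text, label_column="label"):
    """Split CSV text into feature names, numeric feature rows, and labels.

    `label_column` is a hypothetical name; adjust it to the real header.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    # Every column except the label column is treated as a numeric feature.
    feature_names = [name for name in reader.fieldnames if name != label_column]
    features, labels = [], []
    for row in reader:
        features.append([float(row[name]) for name in feature_names])
        labels.append(row[label_column])
    return feature_names, features, labels


# Made-up sample with two feature columns; the real file has 20.
sample = "meanfreq,sd,label\n0.12,0.05,male\n0.19,0.03,female\n"
names, X, y = split_features_and_labels(sample)
print(names, len(X), y)
```

For the full file, the same function could be fed the file's contents, after which `len(X)` and `len(names)` give a quick check against the row and feature counts stated above.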

Hotel Reviews from around the world with Sentiment Values and Review Ratings in different Categories for Natural Language Processing

The dataset consists of reviews of hotels throughout the world. Its data columns range from Location and Trip Type to various review parameters, each with an individual review score. The data can be preprocessed and used for purposes such as review categorization, topic extraction, sentiment analysis, and location-based quality calculation. Trustworthy real-world data is valuable and hard to obtain, so this dataset should be a useful contribution for the research community as well as for professionals.

Open Access Entries from this Author

Dataset for Word Difficulty Prediction

Most text-simplification systems require an indicator of the complexity of words. The prevalent approaches to word difficulty prediction are based on manual feature engineering. Deep learning based models have been left largely unexplored because of their comparatively poor performance. We have explored the use of one such model in predicting the difficulty of words, treating the problem as binary classification. We have also trained traditional machine learning models and evaluated their performance on the task.

Tweets related to Death of Sushant Singh Rajput

This dataset contains 1.65 lakh (about 165,000) tweet IDs for English-language tweets related to the death of Sushant Singh Rajput. For the whole dataset with all other fields, send an email to avishekgarain@gmail.com.

English language tweets dataset for COVID-19

This large dataset contains tweets related to COVID-19. There are 226668 unique tweet IDs in the whole dataset, which ranges from December 2019 to May 2020. The keywords used to crawl the tweets are 'corona', 'covid', 'sarscov2', 'covid19', and 'coronavirus'. To get the other 33 fields of data, send an email to avishekgarain@gmail.com. Twitter does not allow public sharing of other details of tweet data (texts, etc.), so they cannot be uploaded here.
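The description does not say how the keywords were matched during crawling. As a rough sketch only, assuming case-insensitive substring matching and a hypothetical input of `(tweet_id, text)` pairs, keyword filtering with de-duplication of tweet IDs might look like this:

```python
# Keywords as listed in the dataset description.
KEYWORDS = ("corona", "covid", "sarscov2", "covid19", "coronavirus")


def matches_keywords(text, keywords=KEYWORDS):
    """Return True if any keyword occurs in the text (case-insensitive).

    Substring matching is an assumption, not the author's documented method.
    """
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


def unique_matching_ids(tweets):
    """Return unique IDs of matching tweets, in first-seen order.

    `tweets` is a hypothetical iterable of (tweet_id, text) pairs.
    """
    seen = set()
    ids = []
    for tweet_id, text in tweets:
        if tweet_id not in seen and matches_keywords(text):
            seen.add(tweet_id)
            ids.append(tweet_id)
    return ids


# Made-up examples for illustration only.
sample_tweets = [
    (1, "Stay safe during the COVID19 outbreak"),
    (2, "Nice weather today"),
    (1, "Stay safe during the COVID19 outbreak"),
    (3, "New coronavirus guidance released"),
]
print(unique_matching_ids(sample_tweets))
```

Because only tweet IDs are shared publicly, a consumer of the dataset would first need to retrieve the tweet texts through Twitter's own API before applying any such filter.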

COVID-19 tweets dataset for Bengali language

This large dataset contains Bengali tweets related to COVID-19. There are 36117 unique tweet IDs in the whole dataset, which ranges from December 2019 to May 2020. The keywords used to crawl the tweets are 'corona', 'covid', 'sarscov2', 'covid19', and 'coronavirus'. To get the other 33 fields of data, send an email to avishekgarain@gmail.com. A code snippet is given in the Documentation file.

COVID-19 tweets dataset for Spanish language

This large dataset contains Spanish tweets related to COVID-19. There are 18958 unique tweet IDs in the whole dataset, which ranges from December 2019 to May 2020. The keywords used to crawl the tweets are 'corona', 'covid', 'sarscov2', 'covid19', and 'coronavirus'. To get the other 33 fields of data, send an email to avishekgarain@gmail.com. A code snippet is given in the Documentation file.
