Natural Language Processing
To download this dataset without purchasing an IEEE Dataport subscription, please visit: https://zenodo.org/records/13896353
Please cite the following paper when using this dataset:
- Categories:
To download the dataset without purchasing an IEEE Dataport subscription, please visit: https://zenodo.org/records/13738598
Please cite the following paper when using this dataset:
- Categories:
This dataset contains audio recordings and transcriptions of toxic speech derived from Indonesian conversations during YouTube videos where scammers are confronted. The dataset captures two separate interactions that escalate into toxic exchanges. Each interaction has been verified by native Indonesian speakers and labeled into two classes: toxic and non-toxic. The dataset includes both the original and preprocessed versions of the speech and text data. The original speech files total 136MB, while the preprocessed speech files are 111,7MB.
- Categories:
Data were collected through the Twitter API, focusing on specific vocabulary related to wildfires, hashtags commonly used during the Tubbs Fire, and terms and hashtags related to mental health, well-being, and physical symptoms associated with smoke and wildfire exposure. We focused exclusively on the period from October 8 to October 31, aligning precisely with the duration of the Tubbs Fire. The final dataset available for analysis consists of 90,759 tweets.
- Categories:
The datasets are used to test an embedding imputation model. There are two different experiments: finance and mobile applications.
- Categories:
Within the Natural Language Processing (NLP) framework, Named Entity Recognition (NER) is regarded as the basis for extracting key information to understand texts in any language. As Bangla is a highly inflectional, morphologically rich, and resource-scarce language, building a balanced NER corpus with large and diverse entities is a demanding task. However, previously developed Bangla NER systems are limited to recognizing only three familiar entities: person, location, and organization.
- Categories:
This benchmark dataset accompanies an article paper titled ``Learning to Reuse Distractors to support Multiple Choice Question Generation in Education''. It contains a test of 298 educational questions covering multiple subjects & languages and a 77K multilingual pool of distractor vocabulary. The goal is for a given question to propose a list of relevant candidate distractors from the pool of distractors.
- Categories:
Chinese electric power audit text dataset
- Categories:
The dialogue corpus is described in the paper "Anticipating User Intentions in Customer Care Dialogue Systems" and contains a selection of human-chatbot Italian dialogues concerning customer-care requests.
In order to preserve the privacy and company data property, we removed the actual sentences and we present only the annotation described in the paper.
- Categories: