Natural Language Processing

This dataset contains audio recordings and transcriptions of toxic speech drawn from Indonesian conversations in YouTube videos in which scammers are confronted. The dataset captures two separate interactions that escalate into toxic exchanges. Each interaction has been verified by native Indonesian speakers and labeled as one of two classes: toxic or non-toxic. The dataset includes both the original and preprocessed versions of the speech and text data. The original speech files total 136 MB, while the preprocessed speech files total 111.7 MB.


Data were collected through the Twitter API using specific vocabulary related to wildfires, hashtags commonly used during the Tubbs Fire, and terms and hashtags related to mental health, well-being, and physical symptoms associated with smoke and wildfire exposure. Collection was restricted to the period from October 8 to October 31, 2017, which matches the duration of the Tubbs Fire. The final dataset available for analysis consists of 90,759 tweets.


The datasets are used to test an embedding imputation model in two experiments: one on finance and one on mobile applications.


Within the Natural Language Processing (NLP) framework, Named Entity Recognition (NER) is regarded as the basis for extracting key information to understand texts in any language. As Bangla is a highly inflectional, morphologically rich, and resource-scarce language, building a balanced NER corpus with large and diverse entities is a demanding task. However, previously developed Bangla NER systems are limited to recognizing only three familiar entities: person, location, and organization.


Food computing is currently a fast-growing field of research. Web mining and content analysis are also increasingly essential in this field, especially for recognising food entities.


This benchmark dataset accompanies an article titled "Learning to Reuse Distractors to support Multiple Choice Question Generation in Education". It contains a test set of 298 educational questions covering multiple subjects and languages, and a multilingual pool of 77K distractors. The task is to propose, for a given question, a list of relevant candidate distractors drawn from that pool.


Chinese electric power audit text dataset 


The dialogue corpus is described in the paper "Anticipating User Intentions in Customer Care Dialogue Systems" and contains a selection of Italian human-chatbot dialogues concerning customer-care requests.

To preserve privacy and the company's ownership of its data, we removed the actual sentences and present only the annotations described in the paper.

