MarathiSarc

Citation Author(s):
Pravin Patil, Kavayitri Bahinabai Chaudhari North Maharashtra University Jalgaon
Satish Kolhe, Kavayitri Bahinabai Chaudhari North Maharashtra University Jalgaon
Submitted by:
Pravin Patil
Last updated:
Mon, 12/16/2024 - 01:15
DOI:
10.21227/1d55-2f63
License:

Abstract 

Sarcasm detection involves predicting whether a given text is sarcastic, a challenging task in sentiment analysis. While significant research has been conducted for languages like English, Czech, and Italian, limited work exists for Indian languages such as Hindi, Tamil, and Bengali. Marathi, being the third most spoken language in India, has seen little progress in sarcasm detection, mainly due to the lack of suitable datasets. To address this gap, we introduce MarathiSarc, a labeled dataset of Marathi tweets specifically designed for sarcasm detection, aimed at advancing research in this underexplored area of Natural Language Processing.

Instructions: 

Given the limitations of the Twitter API, we used the open-source Twint library to collect the tweets, gathering 2361 tweets in Marathi. In the first stage, using hashtag-based supervision, we collected Marathi tweets containing hashtags such as #sarcasm, #sarcastic, #sarcasmic, #irony, #ironic, etc. The corpus covers the period from December 2011 to September 2023. We manually labelled the entire dataset into three classes, as follows:

• Tweets that contain hashtags such as #sarcasm, #sarcastic, #sarcasmic, #irony, #ironic, or #व्यंग and were found to be genuinely sarcastic are labelled sarcastic (1).

• Tweets that contain hashtags such as #sarcasm, #sarcastic, #sarcasmic, #व्यंग, #irony, or #ironic but were found to be non-sarcastic are labelled non-sarcastic (-1).

• Tweets that may be sarcastic depending on the conversational history and context are labelled possibly sarcastic (2).
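
The labelling scheme above can be sketched in Python as follows. This is a minimal illustration, not the authors' collection or annotation code: the label codes (1, -1, 2) and the hashtags come from the description, while the helper names (`extract_hashtags`, `has_seed_hashtag`, `label_distribution`) and the hashtag pattern are assumptions made for this example. The hashtag list in the description ends with "etc.", so the seed set here is not exhaustive.

```python
import re
from collections import Counter

# Label codes as stated in the dataset description.
LABEL_NAMES = {1: "sarcastic", -1: "non-sarcastic", 2: "possibly sarcastic"}

# Hashtags named in the description (not exhaustive; the source list ends with "etc.").
SEED_HASHTAGS = {"#sarcasm", "#sarcastic", "#sarcasmic", "#irony", "#ironic", "#व्यंग"}

# Assumed pattern: '#' followed by anything except whitespace, '#', or common
# punctuation. A negated class is used instead of \w because Python's \w does
# not match Devanagari combining marks, which would truncate tags like #व्यंग.
HASHTAG_RE = re.compile(r"#[^\s#,.!?;:]+")


def extract_hashtags(text):
    """Return the hashtags in a tweet, lowercased, in order of appearance."""
    return [tag.lower() for tag in HASHTAG_RE.findall(text)]


def has_seed_hashtag(text):
    """True if the tweet carries at least one of the seed hashtags."""
    return any(tag in SEED_HASHTAGS for tag in extract_hashtags(text))


def label_distribution(labels):
    """Count tweets per class name, given an iterable of label codes."""
    return Counter(LABEL_NAMES[code] for code in labels)
```

A seed hashtag only makes a tweet a candidate: per the description, the final label is assigned manually, so a tweet tagged #sarcasm may still end up in the non-sarcastic (-1) class. When training a binary classifier, the possibly sarcastic (2) class has to be handled explicitly, for example by excluding it or treating it as a third class.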