Bilingual question pair

Citation Author(s):
Seema Rani
Submitted by:
Seema Rani
Last updated:
Fri, 11/18/2022 - 06:04
DOI:
10.21227/s70t-0s06
Link to Paper:
License:
Categories:

Abstract 

Although asking and answering questions in mixed language on social media platforms is very common these days, there is a lack of precise corpora for analyzing such code-mixed language. Datasets released by various community question answering (CQA) sites are monolingual, i.e., in English only. For our task, we needed an annotated bilingual dataset that includes question pairs in code-mixed language. Given this scarcity, we created a dataset by scraping pairs of questions from several social media platforms, for example Yahoo! Answers, Quora, and TripAdvisor. The collected dataset consists of questions from diverse fields such as education, entertainment, health, philosophy, and sports. Each pair contains one English question and one Hinglish (code-mixed Hindi-English) question; the second question may or may not be equivalent to the first. A label, “Is_Duplicate”, indicates whether the two questions in a pair are semantic duplicates of each other.
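The pair structure described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the page does not specify the file format or the column names, so the CSV layout and the names `question_en` and `question_hinglish` are hypothetical, as is the encoding of the label as "1"/"0". Only the label name “Is_Duplicate” comes from the description. The sample rows are invented for the sketch and are not taken from the dataset.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class QuestionPair:
    english: str        # the English question of the pair
    hinglish: str       # the Hinglish (code-mixed) question of the pair
    is_duplicate: bool  # True if the two questions are semantic duplicates


def load_pairs(handle):
    """Read question pairs from a CSV file-like object.

    Assumes hypothetical columns 'question_en' and 'question_hinglish',
    plus an 'Is_Duplicate' column encoded as "1" or "0".
    """
    pairs = []
    for row in csv.DictReader(handle):
        pairs.append(
            QuestionPair(
                english=row["question_en"].strip(),
                hinglish=row["question_hinglish"].strip(),
                is_duplicate=row["Is_Duplicate"].strip() == "1",
            )
        )
    return pairs


# Illustrative rows invented for this sketch; not taken from the dataset.
SAMPLE_CSV = """question_en,question_hinglish,Is_Duplicate
How can I learn Python quickly?,Python jaldi kaise seekh sakte hain?,1
What is the best time to visit Goa?,Cricket ke rules kya hain?,0
"""

pairs = load_pairs(io.StringIO(SAMPLE_CSV))
duplicates = [p for p in pairs if p.is_duplicate]
print(f"{len(duplicates)} of {len(pairs)} pairs are labelled duplicate")
```

To use the sketch with the downloaded data, replace `io.StringIO(SAMPLE_CSV)` with an opened file and adjust the column names to match the actual header.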
