Standard Dataset

Bilingual question pair

Citation Author(s):
Seema Rani
Submitted by:
Seema Rani
DOI:
10.21227/s70t-0s06

Abstract

Although asking and answering questions in mixed language on social media platforms is now very common, there is a lack of precise corpora for analyzing such code-mixed language. The datasets released by various community question answering (CQA) sites are monolingual, i.e., in English only. Our task required an annotated bilingual dataset that includes question pairs in code-mixed language. To address this scarcity, we created a dataset by scraping pairs of questions from several social media networks, for example Yahoo! Answers, Quora, and TripAdvisor. The collected dataset therefore covers questions from diverse fields such as education, entertainment, health, philosophy, and sports. Each pair contains one English question and one Hinglish question, and the second question may or may not be equivalent to the first. A label, “Is_Duplicate”, indicates whether the two questions in a pair are semantically duplicates of each other.
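
To make the pair structure concrete, here is a minimal Python sketch of how such records might be read. The file layout is not documented on this page, so the CSV format, the column names `question_en` and `question_hinglish`, and the 0/1 encoding of the label are assumptions. Only the label name “Is_Duplicate” comes from the description above. The two sample rows are invented for illustration and are not taken from the dataset.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class QuestionPair:
    english: str        # the English question of the pair
    hinglish: str       # the Hinglish question of the pair
    is_duplicate: bool  # True if the two questions are semantically equivalent


# Invented sample rows; column names other than Is_Duplicate are hypothetical.
SAMPLE_CSV = """question_en,question_hinglish,Is_Duplicate
What is the best way to lose weight?,Vajan kam karne ka sabse accha tarika kya hai?,1
How can I learn Python quickly?,Delhi me ghumne ki best jagah kaun si hai?,0
"""


def load_pairs(handle):
    """Read question pairs from a CSV file-like object (assumed layout)."""
    reader = csv.DictReader(handle)
    return [
        QuestionPair(
            english=row["question_en"],
            hinglish=row["question_hinglish"],
            # Assumes the label is stored as the string "1" or "0".
            is_duplicate=row["Is_Duplicate"].strip() == "1",
        )
        for row in reader
    ]


pairs = load_pairs(io.StringIO(SAMPLE_CSV))
duplicates = [p for p in pairs if p.is_duplicate]
print(f"{len(duplicates)} of {len(pairs)} pairs are labelled as duplicates")
```

If the actual files use a different format or header, only `load_pairs` would need to change.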
