Standard Dataset
Bilingual question pair
- Submitted by: Seema Rani
- Last updated: Fri, 11/18/2022 - 06:04
- DOI: 10.21227/s70t-0s06
Abstract
Although asking and answering questions on social media platforms in mixed language is very common these days, there is a lack of precise corpora for analyzing such code-mixed language. Datasets released by various community question answering (CQA) sites are monolingual, i.e., only in English. To perform our task, we needed an annotated bilingual dataset that includes question pairs in code-mixed language. In view of this scarcity, we created a dataset by scraping pairs of questions from distinct social media networks, for example Yahoo! Answers, Quora, and TripAdvisor. The collected dataset therefore consists of questions from diverse fields such as education, entertainment, health, philosophy, and sports. Each pair contains one English question and one Hinglish question; the second question may or may not be equivalent to the first. A label, “Is_Duplicate”, indicates whether the two questions in a pair are semantically duplicates of each other.
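The pair structure described in the abstract can be pictured with a small Python sketch. This is only an illustration under stated assumptions: the field names `english_question` and `hinglish_question` are hypothetical, only the “Is_Duplicate” label comes from the description above, and the two example rows are invented for illustration and are not taken from the dataset. The attached file is a .docx document, so its actual layout may differ.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class QuestionPair:
    english_question: str   # hypothetical field name (assumption)
    hinglish_question: str  # hypothetical field name (assumption)
    is_duplicate: bool      # mirrors the "Is_Duplicate" label from the abstract


def duplicate_ratio(pairs: Iterable[QuestionPair]) -> float:
    """Return the fraction of pairs labelled as semantic duplicates."""
    items = list(pairs)
    if not items:
        return 0.0
    return sum(1 for p in items if p.is_duplicate) / len(items)


# Invented rows for illustration only; they do not come from the dataset.
EXAMPLE_PAIRS = [
    QuestionPair("What is the capital of India?", "India ki capital kya hai?", True),
    QuestionPair("How can I learn to swim?", "Best cricket player kaun hai?", False),
]

if __name__ == "__main__":
    print(duplicate_ratio(EXAMPLE_PAIRS))
```

A record type like this keeps the English question, the Hinglish question, and the label together, which may be convenient when computing simple statistics such as the share of duplicate pairs.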
Documentation
Attachment | Size |
---|---|
bilingual question pair.docx | 11.66 KB |