Standard Dataset
Shanghai Dialect and Mandarin
- Submitted by: Yida Bao
- Last updated: Tue, 04/22/2025 - 01:58
- DOI: 10.21227/ndgv-4655
Abstract
This dataset is designed for the classification of textual transcriptions of spoken conversations in Shanghai dialect and Mandarin Chinese. It consists of high-quality, manually transcribed texts from natural dialogues, annotated with corresponding language labels (Shanghai dialect: 1, Mandarin: 0). The dataset aims to facilitate research in text-based dialect classification, natural language processing (NLP), and linguistic variation analysis.
Each text sample is derived from real-world spoken conversations, ensuring authentic sentence structures, colloquial expressions, and dialectal differences between Shanghai dialect and Mandarin. The dataset includes metadata such as speaker demographics, dialogue context, and sentence length, allowing for a more comprehensive analysis of dialectal variations.
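As a rough illustration of how one labeled sample and its metadata might be represented in code, here is a minimal sketch. The label convention (Shanghai dialect: 1, Mandarin: 0) comes from the abstract; the class name `Sample` and the field names `speaker_age_group` and `dialogue_context` are assumptions made for this example and are not the dataset's documented schema.

```python
from dataclasses import dataclass
from typing import Optional

# Label convention stated in the abstract.
SHANGHAI, MANDARIN = 1, 0


@dataclass
class Sample:
    """One transcribed utterance. Field names are hypothetical."""
    text: str
    label: int
    speaker_age_group: Optional[str] = None  # assumed demographic field
    dialogue_context: Optional[str] = None   # assumed context field

    def __post_init__(self) -> None:
        # Reject anything outside the binary label scheme.
        if self.label not in (SHANGHAI, MANDARIN):
            raise ValueError(f"label must be 0 or 1, got {self.label!r}")

    @property
    def sentence_length(self) -> int:
        # Length in characters, a natural unit for Chinese text.
        return len(self.text)


# Hand-written toy utterance, not taken from the dataset.
sample = Sample(text="侬好", label=SHANGHAI, dialogue_context="greeting")
print(sample.sentence_length)
```

Computing sentence length from the text, rather than storing it, keeps the two from drifting apart if a transcription is later corrected.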
This dataset is particularly useful for:
- Dialect classification using machine learning and deep learning models
- Text-based language identification in NLP applications
- Linguistic analysis of Shanghai dialect vs. Mandarin Chinese
- Improving text preprocessing for speech-to-text models
By providing a well-annotated corpus of transcribed spoken text, this dataset enables researchers to train and evaluate classification models while advancing studies in Chinese dialect processing and computational linguistics.
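To make the classification use case concrete, the sketch below trains a small character n-gram Naive Bayes classifier using only the Python standard library. It is one possible baseline, not a method described by the dataset authors. The training sentences are hand-written toy examples rather than dataset samples, and the names `char_ngrams` and `NaiveBayesDialectClassifier` are invented for this illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Return single characters plus character n-grams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    grams = list(chars)
    grams += ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]
    return grams


class NaiveBayesDialectClassifier:
    """Multinomial Naive Bayes with add-alpha smoothing (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.counts = {}      # label -> Counter of n-gram frequencies
        self.totals = {}      # label -> total n-gram count
        self.log_priors = {}  # label -> log P(label)
        self.vocab = set()

    def fit(self, texts, labels):
        label_counts = Counter(labels)
        self.counts = {y: Counter() for y in label_counts}
        for text, y in zip(texts, labels):
            self.counts[y].update(char_ngrams(text))
        self.vocab = set().union(*self.counts.values())
        self.totals = {y: sum(c.values()) for y, c in self.counts.items()}
        n = len(labels)
        self.log_priors = {y: math.log(k / n) for y, k in label_counts.items()}
        return self

    def predict(self, text):
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for y, counter in self.counts.items():
            score = self.log_priors[y]
            denom = self.totals[y] + self.alpha * vocab_size
            for gram in char_ngrams(text):
                if gram in self.vocab:  # n-grams unseen in training are skipped
                    score += math.log((counter[gram] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = y, score
        return best_label


# Toy examples written for this sketch (1 = Shanghai dialect, 0 = Mandarin).
train_texts = ["侬好", "阿拉一道去", "侬今朝做啥",
               "你好", "我们一起去", "你今天做什么"]
train_labels = [1, 1, 1, 0, 0, 0]

clf = NaiveBayesDialectClassifier().fit(train_texts, train_labels)
print(clf.predict("侬去做啥"))        # dialect markers such as 侬 and 啥
print(clf.predict("你们今天去什么"))  # Mandarin forms such as 你 and 什么
```

Character-level features avoid the need for a word segmenter, which may matter here because segmenters trained on Mandarin are not guaranteed to handle Shanghai dialect vocabulary well. On real data, a held-out test split would be needed before drawing any conclusions about accuracy.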
This dataset is designed for binary classification of spoken conversations in Shanghai dialect (label: 1) and Mandarin (label: 0). It consists of high-quality audio recordings collected from natural conversations, annotated with corresponding language labels. The dataset supports research in dialect classification, speech recognition, and NLP-based language identification.
Documentation
Attachment | Size |
---|---|
596 bytes |