Shanghai Dialect and Mandarin

Citation Author(s):
Yida Bao, University of Wisconsin-Stout
Submitted by:
Yida Bao
Last updated:
Tue, 04/22/2025 - 01:58
DOI:
10.21227/ndgv-4655

Abstract 

This dataset is designed for the classification of textual transcriptions of spoken conversations in Shanghai dialect and Mandarin Chinese. It consists of high-quality, manually transcribed texts from natural dialogues, annotated with corresponding language labels (Shanghai dialect: 1, Mandarin: 0). The dataset aims to facilitate research in text-based dialect classification, natural language processing (NLP), and linguistic variation analysis.

Each text sample is derived from real-world spoken conversations, ensuring authentic sentence structures, colloquial expressions, and dialectal differences between Shanghai dialect and Mandarin. The dataset includes metadata such as speaker demographics, dialogue context, and sentence length, allowing for a more comprehensive analysis of dialectal variations.

This dataset is particularly useful for:

  • Dialect classification using machine learning and deep learning models
  • Text-based language identification in NLP applications
  • Linguistic analysis of Shanghai dialect vs. Mandarin Chinese
  • Improving text preprocessing for speech-to-text models

By providing a well-annotated corpus of transcribed spoken text, this dataset enables researchers to train and evaluate classification models while advancing studies in Chinese dialect processing and computational linguistics.
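
As a rough illustration of the binary classification task described above, the following is a minimal sketch of a character n-gram Naive Bayes classifier in plain Python. It is not the author's method, only one simple baseline chosen for illustration. The label values (Shanghai dialect: 1, Mandarin: 0) come from the description above; the toy sentences, the class name NaiveBayesDialectClassifier, and the helper functions are hypothetical and are not drawn from the dataset or its files.

```python
import math
from collections import Counter

SHANGHAI = 1  # label convention taken from the dataset description
MANDARIN = 0


def char_ngrams(text, n):
    """Return the list of character n-grams in text."""
    text = text.strip()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def features(text):
    """Character unigrams plus bigrams; no word segmentation is needed."""
    return char_ngrams(text, 1) + char_ngrams(text, 2)


class NaiveBayesDialectClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.counts[label].update(features(text))
        self.vocab = set()
        for counter in self.counts.values():
            self.vocab.update(counter)
        self.totals = {c: sum(self.counts[c].values()) for c in self.classes}
        return self

    def predict(self, text):
        n_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for c in self.classes:
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[c] / n_docs)
            for f in features(text):
                score += math.log(
                    (self.counts[c][f] + 1) / (self.totals[c] + vocab_size)
                )
            if score > best_score:
                best_label, best_score = c, score
        return best_label


# Hypothetical toy examples for illustration only (NOT from the dataset).
train_texts = [
    "侬好", "阿拉勿晓得", "侬饭吃过了伐", "今朝天气老好额",
    "你好", "我们不知道", "你吃过饭了吗", "今天天气很好",
]
train_labels = [SHANGHAI] * 4 + [MANDARIN] * 4

model = NaiveBayesDialectClassifier().fit(train_texts, train_labels)
print(model.predict("阿拉今朝勿去"))
print(model.predict("我们今天不去"))
```

Character n-grams are used here because they avoid a word segmentation step, which can be unreliable on dialectal text. With the real corpus, the toy lists would be replaced by the transcriptions and labels, and a held-out split would be needed to evaluate the model.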

Instructions:

This dataset is designed for binary classification of spoken conversations in Shanghai dialect (label: 1) and Mandarin (label: 0). It consists of high-quality audio recordings collected from natural conversations, annotated with corresponding language labels. The dataset supports research in dialect classification, speech recognition, and NLP-based language identification.

Documentation

Attachment    Size
readme.txt    596 bytes