Shanghai Dialect and Mandarin

Citation Author(s): Yida Bao (University of Wisconsin-Stout)
Submitted by: Yida Bao
Last updated: Tue, 04/22/2025 - 05:58
DOI: 10.21227/ndgv-4655
Data Format: *.csv


Categories:

Keywords: NLP; artificial intelligence; deep learning


Abstract

This dataset is designed for the classification of textual transcriptions of spoken conversations in Shanghai dialect and Mandarin Chinese. It consists of high-quality, manually transcribed texts from natural dialogues, annotated with corresponding language labels (Shanghai dialect: 1, Mandarin: 0). The dataset aims to facilitate research in text-based dialect classification, natural language processing (NLP), and linguistic variation analysis.
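As a rough illustration of the label scheme above, the following sketch reads text/label pairs from a CSV file and checks that every label is 0 or 1. The column names `text` and `label` are assumptions, since this page does not document the file's header, and the four rows are made-up stand-ins rather than samples from the dataset.

```python
import csv
import io
from collections import Counter

# Label scheme stated in the abstract.
LABEL_NAMES = {0: "Mandarin", 1: "Shanghai dialect"}


def load_samples(handle, text_col="text", label_col="label"):
    """Read (text, label) pairs from a CSV handle.

    The column names are hypothetical; adjust them to the real header.
    """
    samples = []
    for row in csv.DictReader(handle):
        label = int(row[label_col])
        if label not in LABEL_NAMES:
            raise ValueError(f"unexpected label: {label}")
        samples.append((row[text_col].strip(), label))
    return samples


# Made-up rows for illustration only (not taken from the dataset).
demo_csv = io.StringIO(
    "text,label\n"
    "侬好,1\n"
    "你好,0\n"
    "阿拉一道去,1\n"
    "我们一起去,0\n"
)

samples = load_samples(demo_csv)
label_counts = Counter(LABEL_NAMES[label] for _, label in samples)
print(label_counts)
```

For the real file, replace `demo_csv` with `open(path, newline="", encoding="utf-8")`, where `path` is wherever the downloaded CSV is stored.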

Each text sample is derived from real-world spoken conversations, ensuring authentic sentence structures, colloquial expressions, and dialectal differences between Shanghai dialect and Mandarin. The dataset includes metadata such as speaker demographics, dialogue context, and sentence length, allowing for a more comprehensive analysis of dialectal variations.

This dataset is particularly useful for:

Dialect classification using machine learning and deep learning models
Text-based language identification in NLP applications
Linguistic analysis of Shanghai dialect vs. Mandarin Chinese
Improving text preprocessing for speech-to-text models

By providing a well-annotated corpus of transcribed spoken text, this dataset enables researchers to train and evaluate classification models while advancing studies in Chinese dialect processing and computational linguistics.
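One possible baseline for this classification task, sketched here under stated assumptions, is a character n-gram Naive Bayes classifier with add-one smoothing. It is not the author's method. The class name `NaiveBayes`, the helper `char_ngrams`, and the six training strings are all invented for illustration; a real experiment would train on the dataset's own rows and evaluate on a held-out split.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_max=2):
    """Return all character n-grams of length 1..n_max."""
    return [text[i:i + n]
            for n in range(1, n_max + 1)
            for i in range(len(text) - n + 1)]


class NaiveBayes:
    """Multinomial Naive Bayes over character n-grams (add-one smoothing)."""

    def fit(self, samples):
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram counts
        self.doc_counts = Counter()               # label -> number of texts
        for text, label in samples:
            self.doc_counts[label] += 1
            self.ngram_counts[label].update(char_ngrams(text))
        self.vocab = {g for c in self.ngram_counts.values() for g in c}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.ngram_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each n-gram.
            score = math.log(n_docs / total_docs)
            for gram in char_ngrams(text):
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training strings (1 = Shanghai dialect, 0 = Mandarin).
train = [
    ("侬好", 1), ("阿拉一道去", 1), ("伊勿晓得", 1),
    ("你好", 0), ("我们一起去", 0), ("他不知道", 0),
]
model = NaiveBayes().fit(train)
print(model.predict("阿拉勿去"), model.predict("他们不去"))
```

Character n-grams are used because Chinese text has no whitespace word boundaries, so this avoids a separate word segmentation step. Distinctive dialect characters such as 侬, 阿拉, and 勿 then act as strong cues on their own.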

Instructions:

This dataset is designed for binary classification of transcribed spoken conversations in Shanghai dialect (label: 1) and Mandarin (label: 0). It consists of manually transcribed text from natural conversations, distributed as *.csv files and annotated with the corresponding language labels. The dataset supports research in dialect classification, NLP-based language identification, and text preprocessing for speech recognition.