LATIC: A Non-native Pre-labelled Mandarin Chinese Validation Corpus for Automatic Speech Scoring and Evaluation Task

Citation Author(s):
Xiao Zhang
School of Foreign Languages, Hunan University of Technology
Submitted by:
Xiao Zhang
Last updated:
Tue, 05/25/2021 - 07:58
DOI:
10.21227/mqtj-qh10
Links:
License:

Abstract 

LATIC focuses on non-native learners of Mandarin Chinese. It is an annotated non-native speech database for Chinese that is fully open source and available online for any purpose. Application areas include automatic speech scoring and evaluation, as well as derived uses such as L2 teaching and the teaching of Chinese as a foreign language. We aim to provide a relatively small-scale but highly efficient dataset for training and validation. To this end, four non-native Chinese speakers participated in the project; their mother tongues (L1s) are Russian, Korean, French, and Arabic. Each speaker contributed 1 hour of valid recordings, for 4 hours of material in total. We intend to keep expanding the database in the future.

Instructions: 

 

Motivation

Since 1997, non-native speech corpora for speech scoring have made promising progress in both recording duration and speaker diversity, but the primary target language has remained English. After Chen et al. (2015) released the iCall corpus, the shortage of non-native Chinese speech corpora eased, but to the best of our knowledge no fully open-source dataset has appeared yet. This is a heavy burden for newcomers and researchers who want to pursue further work. We strongly believe that open datasets targeting non-native Chinese speakers will grow significantly after the release of LATIC.

 

 Description of LATIC

This section describes our validation corpus in two parts: speaker statistics and annotation.

Speaker statistics: first, the default size of the output pinyin representation is 1424, that is, 1423 pinyin plus 1 blank symbol. All 1423 pinyin are defined in dict.txt.
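The relation between the 1423 entries and the output size of 1424 can be illustrated with a minimal sketch. The layout of dict.txt is not described here, so the code assumes one pinyin entry per line (first whitespace-separated token); the `<blank>` name, its position at the end of the vocabulary, and the `build_vocab` helper are hypothetical choices made only for this illustration.

```python
from typing import Dict, Iterable

BLANK = "<blank>"  # hypothetical name for the extra blank symbol


def build_vocab(lines: Iterable[str]) -> Dict[str, int]:
    """Map each pinyin entry to an index and reserve one extra slot for the blank.

    Assumes one entry per line; only the first whitespace-separated token is used.
    """
    vocab: Dict[str, int] = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip empty lines
        token = parts[0]
        if token not in vocab:
            vocab[token] = len(vocab)
    # Placing the blank last is an assumption, not something the corpus specifies.
    vocab[BLANK] = len(vocab)
    return vocab


# Tiny made-up sample standing in for dict.txt (3 distinct entries + 1 blank).
sample = ["ni3", "hao3", "", "ma5", "ni3"]
vocab = build_vocab(sample)
print(len(vocab))
```

With a local copy of the corpus, passing an open file handle for dict.txt to `build_vocab` should yield 1424 entries if the file holds 1423 distinct pinyin in the assumed layout.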

Currently, the dataset has 4 participants, two male and two female, aged 19 to 30, with an average age of about 24. It contains 4 hours of speech in 2,579 audio samples, with an average length of about 9-10 seconds.
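Statistics of this kind can be recomputed from a local copy of the audio. The sketch below assumes the samples are uncompressed PCM WAV files readable by Python's standard `wave` module; the helper names are hypothetical, and the demo uses a generated one-second silent file rather than any real LATIC recording.

```python
import tempfile
import wave
from pathlib import Path
from typing import Iterable, Tuple


def wav_duration_seconds(path: Path) -> float:
    """Duration of one PCM WAV file: frame count divided by sample rate."""
    with wave.open(str(path), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def corpus_stats(paths: Iterable[Path]) -> Tuple[int, float, float]:
    """Return (number of files, total hours, mean length in seconds)."""
    durations = [wav_duration_seconds(p) for p in paths]
    count = len(durations)
    total = sum(durations)
    mean = total / count if count else 0.0
    return count, total / 3600.0, mean


# Demo on a synthetic file: 16000 frames of 16-bit mono silence at 16 kHz.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.wav"
    with wave.open(str(demo), "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(16000)
        wf.writeframes(b"\x00\x00" * 16000)
    count, hours, mean_sec = corpus_stats([demo])

print(count, round(mean_sec, 2))
```

For a real copy of the corpus, something like `corpus_stats(Path("LATIC").rglob("*.wav"))` could be used, where the `LATIC` directory name is an assumption about where the files were unpacked.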

Annotation: each waveform file carries three transcriptions, one at each of three levels. Two students majoring in Chinese served as annotators; both are native Mandarin speakers and well versed in Chinese as a second language (L2). After listening to the recording in each file, they wrote down the "closest" transcript, following modern Mandarin annotation conventions.