MTC-VC: A Multi-Task Contrastive Learning Method for Efficient and Controllable Voice Cloning

Citation Author(s): Rui Zhou (School of Design and Art, Shanghai Dianji University)
Submitted by: Rui Zhou
Last updated: Mon, 04/07/2025 - 11:29
DOI: 10.21227/wpxz-3c67

Categories: Signal Processing

Keywords: Audio

Abstract

The data are drawn from the LibriSpeech corpus, a publicly available English speech dataset derived from audiobook recordings. The corpus contains approximately 1,000 hours of read speech sampled at 16 kHz from over 2,400 speakers, encompassing diverse speaking styles, rates, and regional accents. For contrastive learning, a subset of 100 speakers was sampled, with 20 utterances per speaker, each 3 to 10 seconds long. The dataset provides clean, labeled speech suitable for tasks involving speaker representation, acoustic modeling, and multi-style synthesis.
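
The subset selection described above (100 speakers, 20 utterances each, 3 to 10 seconds per utterance) can be sketched as follows. This is a minimal illustration, not the author's procedure: the record does not say how speakers or utterances were chosen, so uniform random sampling with a fixed seed is an assumption, and the function name `sample_subset` and the `(speaker_id, utterance_id, duration_sec)` tuple layout are hypothetical.

```python
import random
from collections import defaultdict


def sample_subset(utterances, n_speakers=100, n_per_speaker=20,
                  min_sec=3.0, max_sec=10.0, seed=0):
    """Pick n_speakers speakers and n_per_speaker utterances for each.

    utterances: iterable of (speaker_id, utterance_id, duration_sec) tuples.
    This layout is hypothetical; adapt it to your own metadata.
    Returns a dict mapping speaker_id -> list of utterance_ids.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible

    # Keep only utterances whose duration falls inside the 3-10 s window.
    by_speaker = defaultdict(list)
    for speaker_id, utt_id, duration in utterances:
        if min_sec <= duration <= max_sec:
            by_speaker[speaker_id].append(utt_id)

    # A speaker is eligible only with enough in-range utterances.
    eligible = sorted(s for s, utts in by_speaker.items()
                      if len(utts) >= n_per_speaker)
    if len(eligible) < n_speakers:
        raise ValueError(
            f"only {len(eligible)} eligible speakers, need {n_speakers}")

    # Assumption: uniform sampling without replacement at both levels.
    chosen = rng.sample(eligible, n_speakers)
    return {s: rng.sample(sorted(by_speaker[s]), n_per_speaker)
            for s in chosen}
```

With these defaults the result holds 100 × 20 = 2,000 utterances. Sorting before sampling keeps the output independent of input order, so the same seed yields the same subset.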