MTC-VC: A Multi-Task Contrastive Learning Method for Efficient and Controllable Voice Cloning

Citation Author(s):
Rui Zhou
School of Design and Art, Shanghai Dianji University
Submitted by:
Rui Zhou
Last updated:
Mon, 04/07/2025 - 07:29
DOI:
10.21227/wpxz-3c67
License:
Categories:
Keywords:

Abstract 

LibriSpeech is a publicly available English speech corpus derived from audiobook recordings. It contains approximately 1,000 hours of read speech sampled at 16 kHz from over 2,400 speakers, covering diverse speaking styles, speaking rates, and regional accents. For contrastive learning, a subset of 100 speakers was sampled, with 20 utterances per speaker, each 3 to 10 seconds long. The resulting data are clean, labeled speech suitable for speaker representation learning, acoustic modeling, and multi-style synthesis.
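
The subset construction described above (100 speakers, 20 utterances each, 3 to 10 seconds per utterance) can be sketched as follows. This is only an illustration under stated assumptions: the page does not say how speakers or utterances were chosen, so uniform random sampling with a fixed seed is assumed, and the function name `sample_subset` and the `(speaker_id, utterance_id, duration_seconds)` input format are hypothetical.

```python
import random
from collections import defaultdict


def sample_subset(utterances, n_speakers=100, n_utts=20,
                  min_dur=3.0, max_dur=10.0, seed=0):
    """Pick n_speakers speakers and n_utts utterances for each.

    `utterances` is an iterable of (speaker_id, utterance_id, duration_seconds)
    tuples. This input format is an assumption made for the sketch.
    """
    # Keep only utterances whose duration lies in [min_dur, max_dur].
    by_speaker = defaultdict(list)
    for speaker_id, utt_id, duration in utterances:
        if min_dur <= duration <= max_dur:
            by_speaker[speaker_id].append(utt_id)

    # A speaker is eligible only with enough in-range utterances.
    eligible = sorted(s for s, utts in by_speaker.items() if len(utts) >= n_utts)
    if len(eligible) < n_speakers:
        raise ValueError(
            f"only {len(eligible)} eligible speakers, need {n_speakers}"
        )

    # Assumed selection rule: uniform random sampling, seeded for repeatability.
    rng = random.Random(seed)
    chosen = rng.sample(eligible, n_speakers)
    return {s: rng.sample(sorted(by_speaker[s]), n_utts) for s in chosen}
```

Given a list of such tuples built from the corpus metadata, `sample_subset(rows)` returns a dictionary mapping each selected speaker to its 20 selected utterance IDs, and the same seed reproduces the same subset.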
