SLCeleb for Speaker Verification
- Submitted by: S.P. Dimuthu Anuraj
- Last updated: Tue, 03/21/2023 - 07:10
- DOI: 10.21227/smmf-e298
Abstract
SLCeleb
We collected data from social media such as YouTube, because a freely available source is the best way to obtain recordings from a wide variety of unconstrained ("wild") acoustic environments. Creating such variability manually would take a long time, and even then we would not be able to share the collected data with other researchers.
We propose a hybrid (semi-automatic) pipeline for collecting audio utterances in Sri Lankan local languages. The steps of the collection process are summarized as follows.
Step 1. Compile a list of popular Sri Lankan persons. We manually select 100 Tamil-speaking and 100 Sinhala-speaking well-known people as our target speakers.
Step 2. Download pictures and videos. Pictures and videos of the 200 selected people are downloaded from several media sources by searching for their names. Videos are collected mainly from YouTube, and we propose to extract frames from each video and crop the images of the person from those frames. This will also lead to the creation of a large face database for the research community.
Step 3. Face detection and tracking. For each person, we first obtain a portrait by detecting and cropping the face from all pictures of that person. We then apply face detection and tracking to the videos to determine the time frames in which the person appears.
Step 4. Person-speaking verification by mouth-speech synchronization. We implement a mouth-speech synchronization detection system to verify that the target person is speaking in a particular time frame, by testing whether the mouth movement of the target person is synchronized with the speech signal.
Step 6. Manual validation. The person-of-interest (POI) segments produced by the automated pipeline above are finally checked manually to ensure quality.
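The hand-off between Step 3 and Step 4 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the source does not name the face detector, tracker, or synchronization model, so the function names, the `min_duration` threshold, and the use of a plain Pearson correlation as a synchronization score are all assumptions made for illustration. The per-frame inputs (a flag saying whether the target face was found, a mouth-opening measure, and an audio energy value) are assumed to come from upstream tools.

```python
from math import sqrt
from typing import List, Sequence, Tuple


def frames_to_segments(present: Sequence[bool], fps: float,
                       min_duration: float = 1.0) -> List[Tuple[float, float]]:
    """Merge per-frame 'target face present' flags into (start, end) times in seconds.

    Runs of consecutive detections shorter than min_duration (an assumed
    threshold) are dropped.
    """
    runs = []
    start = None
    for i, flag in enumerate(present):
        if flag and start is None:
            start = i                    # a run of detections begins here
        elif not flag and start is not None:
            runs.append((start, i))      # run ended; frame i is exclusive
            start = None
    if start is not None:                # run continues to the last frame
        runs.append((start, len(present)))
    return [(s / fps, e / fps) for s, e in runs if (e - s) / fps >= min_duration]


def sync_score(mouth_opening: Sequence[float],
               audio_energy: Sequence[float]) -> float:
    """Pearson correlation between two equally long per-frame signals.

    A simple stand-in for a mouth-speech synchronization detector.
    Returns 0.0 when either signal is constant.
    """
    n = len(mouth_opening)
    if n != len(audio_energy) or n < 2:
        raise ValueError("signals must have the same length (at least 2)")
    mean_m = sum(mouth_opening) / n
    mean_a = sum(audio_energy) / n
    cov = sum((m - mean_m) * (a - mean_a)
              for m, a in zip(mouth_opening, audio_energy))
    var_m = sum((m - mean_m) ** 2 for m in mouth_opening)
    var_a = sum((a - mean_a) ** 2 for a in audio_energy)
    if var_m == 0 or var_a == 0:
        return 0.0
    return cov / sqrt(var_m * var_a)
```

In such a setup, a segment returned by `frames_to_segments` would be kept as a candidate only if its `sync_score` exceeds a chosen threshold, and the surviving segments would then go to manual validation.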
The SLCeleb speaker verification database is a collection of audio recordings from 280 Sinhala and Tamil celebrities.
Each speaker has more than 100 utterances, and the database contains utterances from four different genres. The test list contains 80 speakers across Tamil and Sinhala, and the development set contains 200 speakers across both languages.
The original data files are available at the link below. Only sample data is uploaded here because of bandwidth limitations.
Folder structure:
SLCeleb Dataset
- Sinhala
  - Test set
    - <speakerid>
      - <genre type>
        - <utterance>
  - Development set
    - <genre type>
- Tamil
  - Test set
  - Development set
    - <speakerid>
  - Test set
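As a rough illustration of how this folder structure might be consumed, the sketch below splits an utterance path into its components and counts utterances per speaker. It assumes the layout `<language>/<set>/<speakerid>/<genre type>/<utterance>` and forward-slash paths; the class and function names, and the example speaker ID, genre name, and file name, are hypothetical and do not come from the dataset.

```python
from collections import Counter
from pathlib import PurePosixPath
from typing import Iterable, NamedTuple


class Utterance(NamedTuple):
    language: str     # e.g. "Sinhala" or "Tamil"
    subset: str       # e.g. "Test set" or "Development set"
    speaker_id: str
    genre: str
    filename: str


def parse_utterance_path(path: str) -> Utterance:
    """Split a path using the assumed layout
    <language>/<set>/<speakerid>/<genre type>/<utterance>.

    Only the last five components are used, so any leading root folder
    (such as "SLCeleb Dataset") is ignored.
    """
    parts = PurePosixPath(path).parts
    if len(parts) < 5:
        raise ValueError(f"path has fewer than 5 components: {path!r}")
    return Utterance(*parts[-5:])


def utterances_per_speaker(paths: Iterable[str]) -> Counter:
    """Count utterances for each (language, set, speaker) combination."""
    counts = Counter()
    for path in paths:
        utt = parse_utterance_path(path)
        counts[(utt.language, utt.subset, utt.speaker_id)] += 1
    return counts
```

A count like this could be used to check the stated property that each speaker has more than 100 utterances, for example by listing the speakers whose count is 100 or lower.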
Full database link:
https://drive.google.com/drive/folders/1A_INdaAl-16mMscOpzO-Qcj37rOgfOKE...
Dataset Files
- test_list_sinhala.txt (2.84 MB)
- test_list_tamil.txt (2.56 MB)
- Common_files and Scripts.zip (290.15 kB)