SLCeleb for Speaker Verification

Citation Author(s):
Dimuthu Anuraj (University of Jaffna)
Jarashanth Selvarajah (University of Jaffna)
Kanagasundaram Ahilan (University of Jaffna)
Ragupathyraj Valluvan (University of Jaffna)
Thiruvaran Tharmarajah (University of Jaffna)
Anantharajah Kaneswaran (University of Jaffna)
Submitted by: S.P. Dimuthu Anuraj
Last updated: Tue, 03/21/2023 - 11:10
DOI: 10.21227/smmf-e298
Data Format: *.wav, *.py, *.zip
Research Article Link: Evaluating Deep Neural Network-based Speaker Verification Systems on Sinhala an…

Keywords: Sinhala, Tamil, Speaker Verification


Abstract

SLCeleb

We collected the data from social media platforms such as YouTube, because a freely available source is the best way to obtain recordings from a wide variety of uncontrolled, "in the wild" acoustic environments. Creating such variability manually would take a long time, and even then we would not be able to share the collected data with other researchers.

We propose a hybrid (semi-automatic) pipeline for collecting audio utterances in Sri Lankan local languages. The steps of the collection process are summarized as follows.

Step 1. Compile a list of popular Sri Lankan persons. We manually selected 100 Tamil and 100 Sinhala famous people as our target speakers.

Step 2. Picture and video download. Pictures and videos of the 200 famous people were downloaded from several media sources by searching for the names of the persons. The videos were collected mainly from YouTube. We also propose extracting the frames of each video and cropping the person images from those frames, which will lead to the creation of a large face database for the research community.

Step 3. Face detection and tracking. For each person, we first obtained the portrait of the person by detecting and clipping the face images from all pictures of that person. We then applied face detection and tracking methods to determine the time frames in which the person appears.

Step 4. Person-speaking verification by mouth-speech synchronization. We implemented a mouth-speech synchronization detection system to verify that the target person is speaking in a particular time frame, by testing whether the mouth movement of the target person is synchronized with the speech signal.

Step 5. Manual validation. The person-of-interest (POI) segments produced by the automated pipeline above were finally checked manually to ensure quality.
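
The hand-off between face tracking, synchronization checking and manual validation can be pictured as turning per-frame decisions into candidate time segments. The following is a minimal sketch of that idea only, not the authors' implementation. The function name, the per-frame inputs (a face-present flag and a synchronization confidence), the frame rate, the threshold and the minimum duration are all hypothetical values chosen for illustration.

```python
from typing import List, Tuple


def speaking_segments(
    face_present: List[bool],
    sync_conf: List[float],
    fps: float = 25.0,          # assumed video frame rate
    threshold: float = 0.5,     # assumed sync-confidence cut-off
    min_duration: float = 1.0,  # assumed minimum segment length in seconds
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds where the target face is visible
    and the mouth/speech sync confidence is at or above the threshold."""
    if len(face_present) != len(sync_conf):
        raise ValueError("inputs must have one entry per frame")

    segments = []
    start = None  # index of the first frame of the currently open segment
    for i, (face, conf) in enumerate(zip(face_present, sync_conf)):
        active = face and conf >= threshold
        if active and start is None:
            start = i
        elif not active and start is not None:
            segments.append((start / fps, i / fps))
            start = None
    if start is not None:  # close a segment that runs to the last frame
        segments.append((start / fps, len(face_present) / fps))

    # Drop segments that are too short to be useful utterances.
    return [(s, e) for s, e in segments if e - s >= min_duration]
```

Segments produced this way would still go through the manual validation step before being accepted.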

The SLCeleb speaker verification database is a collection of audio recordings from 280 Sinhala and Tamil celebrities.

Each speaker has 100+ utterances, and the database contains utterances from four different genres. The test list comprises 80 speakers in total across Tamil and Sinhala, and 200 speakers across both languages are used for development.

The original data files are available at the full database link below. Only sample data is uploaded here because of bandwidth limitations.

Folder structure:

SLCeleb Dataset
  1. Sinhala
    1. Test set
      1. <speakerid>
        1. <genre type>
          - <utterance>
    2. Development set
  2. Tamil
    1. Test set
    2. Development set
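
A tree laid out like this can be indexed with a few lines of Python. This is a minimal sketch under two assumptions: that the on-disk folder names match the structure shown above, and that the Development set folders follow the same speaker/genre layout as the Test set folders. The function name `index_utterances` is hypothetical.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, List, Tuple

Key = Tuple[str, str, str, str]  # (language, split, speaker_id, genre)


def index_utterances(root: str) -> Dict[Key, List[Path]]:
    """Map (language, split, speaker_id, genre) to a sorted list of .wav paths."""
    index = defaultdict(list)
    root_path = Path(root)
    # Assumed depth below root: language/split/speaker/genre/utterance.wav
    for wav in root_path.glob("*/*/*/*/*.wav"):
        language, split, speaker, genre = wav.relative_to(root_path).parts[:4]
        index[(language, split, speaker, genre)].append(wav)
    return {key: sorted(paths) for key, paths in index.items()}
```

For example, `index_utterances("SLCeleb Dataset")` would return one entry per speaker and genre, which can then be filtered by language or split.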

Full database link:

https://drive.google.com/drive/folders/1A_INdaAl-16mMscOpzO-Qcj37rOgfOKE?usp=share_link

Instructions:

SLCeleb Dataset

The SLCeleb dataset is a large-scale speaker dataset that contains tens of thousands of utterances from 280 Sinhala and Tamil celebrities from various fields such as politics, sports, and entertainment. The dataset is designed to enable research in speaker verification.

Dataset Overview

SLCeleb: Contains two languages (Sinhala & Tamil)
The dataset includes a wide range of speakers from different backgrounds, genders, ages, and accents. The audio recordings are in Sinhala and Tamil and vary in length from a few seconds to several minutes.

Development set    Sinhala    Tamil
# of POIs          110        100
# of Videos        1025       1100
# of Utterances    12650      12100

Test set           Sinhala    Tamil
# of POIs          40         40
# of Videos        324        350
# of Utterances    4620       4730
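
The figures in the two tables above can be cross-checked against the "100+ utterances per speaker" statement with simple arithmetic. The snippet below is an illustrative sketch. The counts are copied from the tables, while the variable and function names are hypothetical.

```python
# Counts copied from the two tables above: (POIs, videos, utterances).
STATS = {
    ("Development", "Sinhala"): (110, 1025, 12650),
    ("Development", "Tamil"): (100, 1100, 12100),
    ("Test", "Sinhala"): (40, 324, 4620),
    ("Test", "Tamil"): (40, 350, 4730),
}


def utterances_per_poi(stats):
    """Average number of utterances per POI for each (split, language) pair."""
    return {key: utts / pois for key, (pois, _videos, utts) in stats.items()}


averages = utterances_per_poi(STATS)
total_utterances = sum(utts for _pois, _videos, utts in STATS.values())

for (split, language), avg in averages.items():
    print(f"{split:12s} {language:8s} {avg:.2f} utterances per POI")
print("Total utterances:", total_utterances)  # 34100
```

Every split averages more than 100 utterances per POI (115.00, 121.00, 115.50 and 118.25 respectively).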

File Format

The dataset is provided in compressed form as a set of 7-zip archives. Each archive contains a set of subdirectories, where each subdirectory corresponds to a single speaker. Within each speaker directory, there is a set of audio files, named using the following convention:

<speaker_id>/<utterance_id>.<extension>

where <speaker_id> is a unique identifier for each speaker, <utterance_id> is a unique identifier for each utterance, and <extension> represents the file format (e.g., wav).
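
A relative path that follows this convention can be split into its three parts as shown below. This is a small sketch for illustration. The helper `parse_utterance_path`, the `UtteranceRef` type and the example identifiers are hypothetical and are not taken from the dataset.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class UtteranceRef(NamedTuple):
    speaker_id: str
    utterance_id: str
    extension: str


def parse_utterance_path(rel_path: str) -> UtteranceRef:
    """Split '<speaker_id>/<utterance_id>.<extension>' into its parts."""
    path = PurePosixPath(rel_path)
    if len(path.parts) != 2 or not path.suffix:
        raise ValueError(f"expected '<speaker_id>/<utterance_id>.<extension>', got {rel_path!r}")
    return UtteranceRef(
        speaker_id=path.parts[0],
        utterance_id=path.stem,
        extension=path.suffix.lstrip("."),
    )


# Hypothetical identifiers, for illustration only.
ref = parse_utterance_path("spk0001/utt0001.wav")
print(ref.speaker_id, ref.utterance_id, ref.extension)
```

Paths with extra directory levels are rejected on purpose, so a caller notices when an archive does not follow the two-level convention.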

Usage

Researchers can use the SLCeleb dataset to develop and evaluate speaker recognition algorithms, in particular speaker verification systems. The dataset can be downloaded from the full database link given above.
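
Speaker verification is usually evaluated on trials, that is, pairs of utterances labelled as same-speaker or different-speaker. The sketch below shows one generic way to build such pairs from a speaker-to-utterances mapping. It is an assumption-laden illustration and does not reproduce the test list distributed with SLCeleb. The function name `make_trials`, its parameters and the sampling strategy are hypothetical.

```python
import itertools
import random
from typing import Dict, List, Tuple

Trial = Tuple[int, str, str]  # (label, utterance_a, utterance_b); label 1 = same speaker


def make_trials(
    utts_by_speaker: Dict[str, List[str]],
    n_nontarget_per_target: int = 1,  # assumed ratio of different- to same-speaker trials
    seed: int = 0,
) -> List[Trial]:
    """Build same-speaker (1) and different-speaker (0) verification trials."""
    rng = random.Random(seed)  # fixed seed keeps the trial list reproducible
    speakers = sorted(utts_by_speaker)
    trials: List[Trial] = []
    for spk in speakers:
        # Candidate impostor speakers: everyone else with at least one utterance.
        others = [s for s in speakers if s != spk and utts_by_speaker[s]]
        for a, b in itertools.combinations(sorted(utts_by_speaker[spk]), 2):
            trials.append((1, a, b))
            if not others:
                continue
            for _ in range(n_nontarget_per_target):
                other = rng.choice(others)
                trials.append((0, a, rng.choice(sorted(utts_by_speaker[other]))))
    return trials
```

Enumerating all same-speaker pairs grows quadratically with the number of utterances per speaker, so with 100+ utterances per speaker a real trial list would normally subsample the pairs.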

Citation

If you use the SLCeleb dataset in your research, please cite the following paper:

S. P. D. Anuraj, S. T. Jarashanth, K. Ahilan, R. Valluvan, T. Thiruvaran and A. Kaneswaran, "Evaluating Deep Neural Network-based Speaker Verification Systems on Sinhala and Tamil Datasets," 2022 6th SLAAI International Conference on Artificial Intelligence (SLAAI-ICAI), Colombo, Sri Lanka, 2022, pp. 1-5, doi: 10.1109/SLAAI-ICAI56923.2022.10002663.

License

The SLCeleb dataset is released under the Creative Commons Attribution 4.0 International License. For more information, please see the LICENSE file in the dataset download.

Ownership of the database and all data contained therein shall be retained by University of Jaffna. However, University of Jaffna licenses the database under the terms of the Creative Commons 4.0 license. By accessing or using the database, you agree to be bound by the terms of the Creative Commons license. Any use of the database or its contents by any party other than University of Jaffna must comply with the terms of the Creative Commons license. University of Jaffna reserves the right to revoke the Creative Commons license at any time and for any reason.

Funding Agency

Accelerating Higher Education Expansion and Development (AHEAD) Operation and University of Jaffna

Grant Number

6026-LK/8743-LK