The following dataset consists of utterances recorded from 24 volunteers raised in the Province of Manitoba, Canada. To provide a repeatable set of test words covering all of the phonemes of the Edinburgh Machine Readable Phonetic Alphabet (MRPA) [KiGr08], a set of 44 words is used. Each recording consists of one word uttered by the volunteer, recorded in one continuous session.


The "Thaat and Raga Forest (TRF) Dataset" represents a significant advancement in computational musicology, focusing specifically on Indian Classical Music (ICM). While Western music has seen substantial attention in this field, ICM remains relatively underexplored. This manuscript applies Deep Learning models to the analysis of ICM, with a primary focus on identifying Thaats and Ragas within musical compositions. Thaat and Raga identification holds pivotal importance for various applications, including sentiment-based recommendation systems and music categorization.


AIR-RS-DB: A dataset for classifying Spontaneous and Read Speech


A set of 1028 audio files generated from 7 MP3 files downloaded from All India Radio (https://newsonair.gov.in/). The files were converted to WAV and then speaker-diarized using https://huggingface.co/pyannote/speaker-diarization (pyannote/speaker-diarization@2022072 model), yielding the 1028 audio files.


The dataset consists of three parts. The first part contains single-note and playing-technique samples. The second includes the triple-view video, stereo-microphone recordings, and 4-track optical vibration recordings in raw format for the famous Chinese folk piece 'Jasmine Flower' and the first section of 'Ambush from Ten Sides'. The third part contains the source-separated tracks from the optical recordings, together with expressive annotation files.


Dataset associated with a paper in the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS):

"Talk the talk and walk the walk: Dialogue-driven navigation in unknown indoor environments"

If you use this code or data, please cite the above paper.



Most existing audio fingerprinting systems have limitations when used for highly specific audio retrieval at scale. In this work, we generate a low-dimensional representation from a short unit segment of audio and couple this fingerprint with a fast maximum inner-product search. To this end, we present a contrastive learning framework derived from the segment-level search objective. Each training update uses a batch consisting of a set of pseudo-labels, randomly selected original samples, and their augmented replicas.
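As a concrete illustration of the retrieval step described above, the following is a minimal sketch (not the paper's implementation) of maximum inner-product search over L2-normalized segment fingerprints. The function names and the brute-force matrix-vector product are illustrative assumptions; a production system would use an approximate index instead.

```python
import numpy as np

def build_index(fingerprints):
    """Stack fingerprints into an (N, d) matrix and L2-normalize each row."""
    F = np.asarray(fingerprints, dtype=np.float32)
    return F / np.linalg.norm(F, axis=1, keepdims=True)

def search(index, query, top_k=3):
    """Return indices of the top_k fingerprints with the largest inner
    product (equal to cosine similarity, since all vectors are unit-norm)."""
    q = query / np.linalg.norm(query)
    scores = index @ q                     # one matrix-vector product
    return np.argsort(scores)[::-1][:top_k]
```

For large databases, the same inner-product objective is typically served by an approximate nearest-neighbor library rather than this exhaustive scan.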


This dataset was generated using GNU Radio.


The steganography and steganalysis of audio, especially compressed audio, have drawn increasing attention in recent years, and various algorithms have been proposed. However, there is no standard public dataset with which to verify the efficiency of each proposed algorithm. Therefore, to promote this field of study, we construct a dataset of 33,038 stereo WAV audio clips with a sampling rate of 44.1 kHz and a duration of 10 s each. All audio files were collected from the Internet through data crawling, to better simulate a real detection environment.


This task evaluates the performance of sound event detection systems in multisource conditions similar to our everyday life, where sound sources are rarely heard in isolation. Contrary to Task 2, there is no control over the number of overlapping sound events at any given time, in either the training or the testing audio data.
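The task description above does not spell out the scoring procedure; as a hedged illustration, the following sketches the segment-based F-score commonly used for polyphonic sound event detection, assuming reference annotations and system output are given as binary (segment × event-class) activity matrices. The function name and matrix layout are assumptions, not part of the task specification.

```python
import numpy as np

def segment_f1(reference, estimate):
    """Segment-based F-score over binary (segment x class) activity matrices."""
    ref = np.asarray(reference, dtype=bool)
    est = np.asarray(estimate, dtype=bool)
    tp = np.sum(ref & est)    # events active in both reference and output
    fp = np.sum(~ref & est)   # system activations with no reference event
    fn = np.sum(ref & ~est)   # reference events the system missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because several classes can be active in the same segment, the counts are accumulated over all (segment, class) cells rather than per segment, which is what makes the metric suitable for overlapping events.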

Last Updated On: Tue, 01/10/2017 - 15:56
Citation Author(s): Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen

Several established parameters and metrics have been used to characterize the acoustics of a room. The most important are the Direct-To-Reverberant Ratio (DRR), the Reverberation Time (T60) and the reflection coefficient. The acoustic characteristics of a room based on such parameters can be used to predict the quality and intelligibility of speech signals in that room.
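As an illustrative sketch of how two of these parameters can be estimated from a measured room impulse response, the following computes the DRR by splitting the energy around the main peak and the T60 via Schroeder backward integration. The 2.5 ms direct-path window and the -5 dB to -25 dB fitting range are illustrative choices, not fixed standards.

```python
import numpy as np

def drr_db(h, fs, direct_ms=2.5):
    """DRR in dB: energy up to direct_ms after the main peak vs. the tail."""
    h = np.asarray(h, dtype=float)
    peak = int(np.argmax(np.abs(h)))
    edge = peak + int(direct_ms * 1e-3 * fs)
    direct = np.sum(h[:edge] ** 2)          # direct path (and pre-peak samples)
    reverberant = np.sum(h[edge:] ** 2)     # reverberant tail
    return 10 * np.log10(direct / reverberant)

def t60_schroeder(h, fs):
    """T60 from the Schroeder energy decay curve of an impulse response."""
    h = np.asarray(h, dtype=float)
    edc = np.cumsum(h[::-1] ** 2)[::-1]     # backward-integrated energy
    edc_db = 10 * np.log10(edc / edc[0])
    # fit the decay slope between -5 dB and -25 dB, extrapolate to -60 dB
    i0 = np.argmax(edc_db <= -5)
    i1 = np.argmax(edc_db <= -25)
    t = np.arange(len(h)) / fs
    slope, _ = np.polyfit(t[i0:i1], edc_db[i0:i1], 1)
    return -60.0 / slope
```

On a synthetic exponentially decaying impulse response, the Schroeder curve is linear in dB, so the fitted slope recovers the decay rate directly; measured responses additionally require truncating the noise floor before integration.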