Most existing audio fingerprinting systems are limited in their use for high-specific audio retrieval at scale. In this work, we generate a low-dimensional representation from a short unit segment of audio and couple this fingerprint with a fast maximum inner-product search. To this end, we present a contrastive learning framework derived from the segment-level search objective. Each training update uses a batch consisting of a set of pseudo labels, randomly selected original samples, and their augmented replicas.
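The retrieval step can be illustrated with a minimal sketch of maximum inner-product search. This is a brute-force version for illustration only, not the fast search used in the work; the L2 normalization, the function names and the toy 3-dimensional fingerprints are all assumptions introduced here.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so that inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def max_inner_product_search(query, database, top_k=1):
    """Return (index, score) pairs for the top_k database fingerprints
    with the largest inner product against the query (exhaustive scan)."""
    scores = [
        (idx, sum(q * d for q, d in zip(query, fingerprint)))
        for idx, fingerprint in enumerate(database)
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]

# Toy fingerprints with made-up values, for illustration only.
database = [l2_normalize(v) for v in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0])]
query = l2_normalize([0.5, 0.9, 0.1])
best_index, best_score = max_inner_product_search(query, database, top_k=1)[0]
```

An exhaustive scan like this costs one inner product per database entry, which is why a large database would normally call for an approximate index instead.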


Neural Audio Fingerprint Dataset

(c) 2021 by Sungkyun Chang


This dataset includes all music sources, background noises and impulse-response (IR) samples that were used in the work "Neural Audio Fingerprint for High-specific Audio Retrieval based on Contrastive Learning".

This dataset was generated by processing several external datasets, such as the Free Music Archive (FMA), AudioSet, Common Voice, Aachen IR, OpenAIR, Vintage MIC and the internal data set from See for details.

Dataset-mini vs. Dataset-full: the only difference between these two datasets is the size of 'test-dummy-db'. You can therefore train and test with `Dataset-mini` first; `Dataset-full` is for testing at a 100x larger scale.



Audio components are built into many products.

It is used not only in audio systems, but also in a wide range of industries such as home theaters, broadcast amplifier systems, TVs, computers, AI speakers, and game consoles.

Many companies continue working to improve the sound quality of their audio components.

In the future, high sound quality, and the technology to deliver it, will be required across a wide range of industries.



This material contains registered patented technology.


We introduce HUMAN4D, a large and multimodal 4D dataset that contains a variety of human activities simultaneously captured by a professional marker-based MoCap, a volumetric capture and an audio recording system. By capturing 2 female and 2 male professional actors performing various full-body movements and expressions, HUMAN4D provides a diverse set of motions and poses encountered as part of single- and multi-person daily, physical and social activities (jumping, dancing, etc.), along with multi-RGBD (mRGBD), volumetric and audio data. Despite the existence of multi-view color datasets c


* The paper describing this dataset is currently under review. The dataset will be fully published along with the paper; in the meantime, more parts of the dataset will be uploaded.

The dataset includes multi-view RGBD, 3D/2D pose, volumetric (mesh/point-cloud/3D character) and audio data along with metadata for spatiotemporal alignment.

The full dataset is split per subject, per activity and per modality.

There are also two benchmarking subsets, H4D1 for single-person and H4D2 for two-person sequences, respectively.

The formats are:

  • mRGBD: *.png
  • 3D/2D poses: *.npy
  • volumetric (mesh/point-cloud): *.ply
  • 3D character: *.fbx
  • metadata: *.txt, *.json
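As a small illustration of the format list above, the sketch below maps a file extension to its modality. The function name and the example file names are hypothetical; the actual folder layout of the dataset is not described here.

```python
from pathlib import Path

# Extension-to-modality map taken from the format list above.
# *.ply covers both meshes and point clouds, so the extension alone
# cannot tell those two apart.
MODALITY_BY_EXTENSION = {
    ".png": "mRGBD",
    ".npy": "3D/2D poses",
    ".ply": "volumetric (mesh/point-cloud)",
    ".fbx": "3D character",
    ".txt": "metadata",
    ".json": "metadata",
}

def modality_of(filename):
    """Return the modality for a file name, or 'unknown' for unlisted extensions."""
    return MODALITY_BY_EXTENSION.get(Path(filename).suffix.lower(), "unknown")
```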



Time Scale Modification (TSM) is a well-researched field; however, no effective objective measure of quality exists.  This paper details the creation, subjective evaluation, and analysis of a dataset for use in the development of an objective measure of quality for TSM. Comprised of two parts, the training component contains 88 source files processed using six TSM methods at 10 time scales, while the testing component contains 20 source files processed using three additional methods at four time scales.


When using this dataset, please use the following citation:

author = {Roberts, Timothy and Paliwal, Kuldip K.},
title = {A time-scale modification dataset with subjective quality labels},
journal = {The Journal of the Acoustical Society of America},
volume = {148},
number = {1},
pages = {201--210},
year = {2020},
doi = {10.1121/10.0001567}


Audio files are named using the following structure: SourceName_TSMmethod_TSMratio_per.wav and are split into multiple zip files.

For 'TSMmethod', the abbreviations are:

  • PV: Phase Vocoder algorithm
  • PV_IPL: Identity Phase Locking Phase Vocoder algorithm
  • WSOLA: Waveform Similarity Overlap-Add algorithm
  • FESOLA: Fuzzy Epoch Synchronous Overlap-Add algorithm
  • HPTSM: Harmonic-Percussive Separation Time-Scale Modification algorithm
  • uTVS: Mel-Scale Sub-Band Modelling Filterbank algorithm
  • Elastique: z-Plane Elastique algorithm
  • NMF: Non-Negative Matrix Factorization algorithm
  • FuzzyPV: Phase Vocoder algorithm using Fuzzy Classification of Spectral Bins

TSM ratios range from 33% to 192% for training files, 20% to 200% for testing files and 22% to 220% for evaluation files.
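The naming structure can be parsed with a short sketch like the one below. It assumes that the ratio is written as a plain number, that the file name ends in `_per`, and that only the method tags named in the description occur; the example file name is hypothetical. Because both source names and method tags may contain underscores, the method is matched from the right, longest tag first.

```python
from pathlib import Path

# Method tags named in the file-naming description. Tags for other
# methods are not spelled out there and are not covered (assumption).
KNOWN_METHODS = ["PV", "PV_IPL", "WSOLA", "FESOLA", "HPTSM", "uTVS",
                 "Elastique", "NMF", "FuzzyPV"]

def parse_tsm_filename(filename):
    """Split SourceName_TSMmethod_TSMratio_per.wav into its three parts."""
    stem = Path(filename).stem
    if not stem.endswith("_per"):
        raise ValueError(f"missing '_per' suffix: {filename}")
    body = stem[: -len("_per")]
    # The ratio is the last underscore-separated token before '_per'.
    prefix, ratio_text = body.rsplit("_", 1)
    ratio = float(ratio_text)
    # Try longer tags first so that 'PV_IPL' is not mistaken for 'IPL' or 'PV'.
    for method in sorted(KNOWN_METHODS, key=len, reverse=True):
        if prefix.endswith("_" + method):
            source = prefix[: -(len(method) + 1)]
            return {"source": source, "method": method, "ratio_percent": ratio}
    raise ValueError(f"unrecognised method tag in: {filename}")
```

For example, `parse_tsm_filename("Jazz_1_PV_IPL_76_per.wav")` separates the source `Jazz_1` from the method `PV_IPL` even though both contain underscores.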

  • Train: Contains 5280 processed files for training neural networks
  • Test: Contains 240 processed files for testing neural networks
  • Ref_Train: Contains the 88 reference files for the processed training files
  • Ref_Test: Contains the 20 reference files for the processed testing files
  • Eval: Contains 6000 processed files for evaluating TSM methods.  The 20 reference test files were processed at 20 time-scales using the following methods:
    • Phase Vocoder (PV)
    • Identity Phase-Locking Phase Vocoder (IPL)
    • Scaled Phase-Locking Phase Vocoder (SPL)
    • Phavorit IPL and SPL
    • Phase Vocoder with Fuzzy Classification of Spectral Bins (FuzzyPV)
    • Waveform Similarity Overlap-Add (WSOLA)
    • Epoch Synchronous Overlap-Add (ESOLA)
    • Fuzzy Epoch Synchronous Overlap-Add (FESOLA)
    • Driedger's Identity Phase-Locking Phase Vocoder (DrIPL)
    • Harmonic Percussive Separation Time-Scale Modification (HPTSM)
    • uTVS used in Subjective testing (uTVS_Subj)
    • updated uTVS (uTVS)
    • Non-Negative Matrix Factorization Time-Scale Modification (NMFTSM)
    • Elastique.
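The file counts listed above follow from multiplying sources, methods and time scales. The short check below reproduces them; it assumes that "Phavorit IPL and SPL" counts as two methods, which brings the Eval list to 15 methods and matches the stated 6000 files.

```python
def processed_file_count(n_sources, n_methods, n_time_scales):
    """Each source file is processed once per method and per time scale."""
    return n_sources * n_methods * n_time_scales

# Training: 88 sources, six methods, 10 time scales.
train_count = processed_file_count(88, 6, 10)
# Testing: 20 sources, three additional methods, four time scales.
test_count = processed_file_count(20, 3, 4)
# Evaluation: 20 reference test files, 15 methods (assumption: the
# "Phavorit IPL and SPL" bullet names two methods), 20 time scales.
eval_count = processed_file_count(20, 15, 20)
```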


TSM_MOS_Scores.mat is a version 7 MATLAB save file and contains a struct called data that has the following fields:

  • test_loc: Legacy folder location of the test file.
  • test_name: Name of the test file.
  • ref_loc: Legacy folder location of reference file.
  • ref_name: Name of the reference file.
  • method: The method used for processing the file.
  • TSM: The time-scale ratio (in percent) used for processing the file. 100(%) is unity processing. 50(%) is half speed, 200(%) is double speed.
  • MeanOS: Normalized Mean Opinion Score.
  • MedianOS: Normalized Median Opinion Score.
  • std: Standard Deviation of MeanOS.
  • MeanOS_RAW: Mean Opinion Score before normalization.
  • MedianOS_RAW: Median Opinion Scores before normalization.
  • std_RAW: Standard Deviation of MeanOS before normalization.


TSM_MOS_Scores.csv is a csv containing the same fields as columns.
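A minimal sketch of reading the scores with the Python standard library is shown below. It assumes the CSV has a header row whose column names match the field names listed above; the three sample rows are synthetic values for illustration, not data from the file.

```python
import csv
import io
from collections import defaultdict

def mean_score_by_method(csv_file, score_column="MeanOS"):
    """Average one score column per processing method."""
    totals = defaultdict(lambda: [0.0, 0])  # method -> [sum, count]
    for row in csv.DictReader(csv_file):
        entry = totals[row["method"]]
        entry[0] += float(row[score_column])
        entry[1] += 1
    return {method: total / count for method, (total, count) in totals.items()}

# Synthetic rows standing in for the real file (made-up values).
sample = io.StringIO(
    "test_name,method,TSM,MeanOS\n"
    "a.wav,PV,50,2.0\n"
    "b.wav,PV,200,3.0\n"
    "c.wav,WSOLA,100,4.0\n"
)
averages = mean_score_by_method(sample)
```

With the real file, the same function could be called as `mean_score_by_method(open("TSM_MOS_Scores.csv", newline=""))`, provided the header assumption holds.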

Source Code and method implementations are available at

Please Note: Labels for the files will be uploaded after paper publication.


The Heidelberg Spiking Datasets comprise two spike-based classification datasets: The Spiking Heidelberg Digits (SHD) dataset and the Spiking Speech Command (SSC) dataset. The latter is derived from Pete Warden's Speech Commands dataset, whereas the former is based on a spoken digit dataset recorded in-house and included in this repository. Both datasets were generated by applying a detailed inner ear model to audio recordings. We distribute the input spikes and target labels in HDF5 format.


We provide two distinct classification datasets for spiking neural networks.

| Name | Classes | Samples (train/valid/test) | Parent dataset |
| ---- | ------- | -------------------------- | -------------- |
| SHD | 20 | 8332/-/2088 | Heidelberg Digits (HD) |
| SSC | 35 | 75466/9981/20382 | Speech Commands v0.2 |

Both datasets are derived from audio datasets. Spikes in 700 input channels were generated using an artificial cochlea model. The SHD consists of approximately 10000 high-quality, aligned studio recordings of spoken digits from 0 to 9 in both German and English. The recordings come from 12 distinct speakers, two of whom appear only in the test set. The SSC is based on the Speech Commands release by Google, which consists of utterances in 35 word categories recorded from a larger number of speakers under less controlled conditions.
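Spike data of this kind is often converted to a dense, time-binned array before being fed to a network. The sketch below shows one way to do that for the 700 input channels. It assumes each sample is available as two parallel lists, spike times in seconds and channel indices; the bin width and duration are arbitrary illustrative choices, and loading from the HDF5 files is not shown.

```python
def bin_spikes(times, units, n_channels=700, bin_width=0.01, duration=1.0):
    """Count spikes per (time bin, channel).

    times: spike times in seconds (assumed representation).
    units: channel index of each spike, 0 <= unit < n_channels.
    Returns a list of n_bins rows, each with n_channels spike counts.
    """
    n_bins = int(round(duration / bin_width))
    raster = [[0] * n_channels for _ in range(n_bins)]
    for t, u in zip(times, units):
        b = int(t / bin_width)
        # Spikes outside the time window or channel range are ignored.
        if 0 <= b < n_bins and 0 <= u < n_channels:
            raster[b][u] += 1
    return raster

# Four made-up spikes: two in the first bin on channel 0, one in the
# second bin on the last channel, one in the final bin on channel 5.
raster = bin_spikes([0.001, 0.004, 0.015, 0.999], [0, 0, 699, 5])
```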


The dataset consists of EEG recordings obtained while subjects listened to different utterances: a, i, u, bed, please, sad. A limited number of EEG recordings were also obtained when the three vowels were corrupted by white and babble noise at an SNR of 0 dB. Recordings were performed on 8 healthy subjects.


Recordings were performed at the Centre de recherche du Centre hospitalier universitaire de Sherbrooke (CRCHUS), Sherbrooke (Quebec), Canada. The EEG recordings were performed using an actiCAP active electrode system Version I and II (Brain Products GmbH, Germany) that includes 64 Ag/AgCl electrodes. The signal was amplified with BrainAmp MR amplifiers and recorded using the Vision Recorder software. The electrodes were positioned using a standard 10-20 layout. Experiments were performed on 8 healthy subjects without any declared hearing impairment.

Each session lasted approximately 90 minutes and was divided into 2 parts. The first part, lasting 30 minutes, consisted of installing the cap on the subject; an electroconductive gel was placed under each electrode to ensure proper contact between the electrode and the scalp. The second part, the listening and EEG acquisition, lasted approximately 60 minutes. During this part, the subjects had to stay still with their eyes closed while avoiding any facial movement or swallowing, and had to remain concentrated on the audio signals for the full length of the experiment. Audio signals were presented to the subjects through earphones while EEGs were recorded. During the experiment, each trial was repeated randomly at least 80 times. A stimulus was presented randomly within each trial, which lasted approximately 9 seconds. A 2-minute pause was given after every 5 minutes of trials so that the subjects could relax and stretch.

Once the EEG signals were acquired, they were resampled at 500 Hz and band-pass filtered between 0.1 Hz and 45 Hz in order to extract the frequency bands of interest for this study. The EEG signals were then separated into 2-second intervals, with the stimulus presented at 0.5 second within each interval. If the signal amplitude exceeded a pre-defined 75 µV limit, the trial was marked for rejection.

Sample code is provided to read the dataset and generate ERPs. First run epoch_data.m for the specific subject, then run the mean_data.m file in the ERP folder. EEGLAB for MATLAB is required.
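The epoching and rejection steps described above can be sketched as follows. This is not the provided MATLAB code; it is a single-channel Python illustration that assumes the signal is a list of amplitudes in microvolts, that stimulus onsets are given as sample indices, and that the amplitude limit of 75 is expressed in microvolts. All function and parameter names are hypothetical.

```python
def extract_epochs(signal_uv, stimulus_samples, fs=500,
                   epoch_s=2.0, pre_s=0.5, limit_uv=75.0):
    """Cut fixed-length epochs around stimulus onsets and drop noisy ones.

    Each epoch lasts epoch_s seconds and starts pre_s seconds before the
    stimulus, so the stimulus falls at 0.5 s within a 2 s interval.
    """
    pre = int(round(pre_s * fs))
    length = int(round(epoch_s * fs))
    kept = []
    for onset in stimulus_samples:
        start = onset - pre
        stop = start + length
        if start < 0 or stop > len(signal_uv):
            continue  # epoch would run past the recording
        epoch = signal_uv[start:stop]
        if max(abs(v) for v in epoch) > limit_uv:
            continue  # amplitude limit exceeded: trial rejected
        kept.append(epoch)
    return kept

def average_erp(epochs):
    """Average the kept epochs sample by sample to obtain an ERP."""
    if not epochs:
        return []
    n = len(epochs)
    return [sum(column) / n for column in zip(*epochs)]
```

At the stated 500 Hz sampling rate, each epoch holds 1000 samples and the stimulus sits at sample 250 of the epoch.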


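This group aims to improve the acoustic characteristics of the output signal, from an audio standpoint, when silicon transistors are used in the audio amplifier circuits of audio equipment. When silicon transistors are used in audio amplifier circuits, the sound output is known to be cold, harsh, sharp, and lacking in richness. For this reason, audiophiles still prefer amplifiers that use vacuum tubes. Audiophiles have an excellent ability to judge sound quality because of their listening ability. The group therefore shows a method for removing the deficient sound characteristics of silicon transistors and improving them toward the sound of tubes. We discuss the process of building this part by applying the improvement method, and finally build it as an actual part.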

The original, written in Korean, is included.

Among the multiple files, use only the one with the highest revision number. The rest are data for comparison.


We propose a new-concept audio system: a single main body with slots for inserting function units. This group exists to produce the first product for standardization (network audio player, DDC, DAC, phono equalizer, preamplifier, power amplifier, power supply, etc.). The internal main board has slots into which units are inserted, and an installed unit can be replaced with another compatible unit. Function units are made in card format and can be upgraded or replaced with products from other brands in the future.


The steganography and steganalysis of audio, especially compressed audio, have drawn increasing attention in recent years, and various algorithms have been proposed. However, there is no standard public dataset with which to verify the efficiency of each proposed algorithm. To promote this field of study, we construct a dataset of 33038 stereo WAV audio clips with a sampling rate of 44.1 kHz and a duration of 10 s. All audio files were collected from the Internet by data crawling, to better simulate a real detection environment.