Audio

The LibriSpeech corpus is a publicly available English speech dataset derived from audiobook recordings. The corpus contains approximately 1,000 hours of 16 kHz read speech from over 2,400 speakers, encompassing diverse speaking styles, rates, and regional accents. For contrastive learning, a subset of 100 speakers was sampled, with 20 utterances per speaker ranging from 3 to 10 seconds. The dataset provides clean, labeled speech suitable for speaker representation, acoustic modeling, and multi-style synthesis.
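The subsampling protocol above is straightforward to reproduce. Below is a minimal sketch using torchaudio's LIBRISPEECH loader; the 100-speaker / 20-utterance counts and the 3–10 s filter come from the description, while the split name, root path, and random seed are assumptions.

```python
import random
from collections import defaultdict

import torchaudio

# Load one LibriSpeech split (split name and root path are assumptions).
dataset = torchaudio.datasets.LIBRISPEECH("./data", url="train-clean-100", download=True)

# Group utterance indices by speaker, keeping only 3-10 s clips.
# This pass decodes each waveform just to measure its duration.
by_speaker = defaultdict(list)
for idx in range(len(dataset)):
    waveform, sample_rate, _, speaker_id, _, _ = dataset[idx]
    duration = waveform.shape[1] / sample_rate
    if 3.0 <= duration <= 10.0:
        by_speaker[speaker_id].append(idx)

# Sample 100 speakers that have at least 20 qualifying utterances each.
rng = random.Random(0)
eligible = [s for s, utts in by_speaker.items() if len(utts) >= 20]
speakers = rng.sample(eligible, 100)
subset = {s: rng.sample(by_speaker[s], 20) for s in speakers}
```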

This dataset contains a diverse range of file types, including text, images, and audio, designed for multi-modal analysis and research. It includes text files (txt) with both structured and unstructured data, suitable for natural language processing tasks such as sentiment analysis and text classification. The image files cover various subjects and are intended for computer vision tasks like object detection and classification. Additionally, the dataset includes audio files in formats like MP3 and WAV, supporting speech recognition and sound analysis.
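For a mixed-format corpus like this, a common first step is to partition files by modality before handing each group to the appropriate pipeline. A minimal sketch, where the directory layout and extension list are assumptions:

```python
from pathlib import Path

# Map file extensions to modalities (the extension list is an assumption).
MODALITIES = {
    ".txt": "text",
    ".jpg": "image", ".png": "image",
    ".mp3": "audio", ".wav": "audio",
}

def partition_by_modality(root):
    """Walk the dataset root and group file paths by modality."""
    groups = {"text": [], "image": [], "audio": []}
    for path in Path(root).rglob("*"):
        modality = MODALITIES.get(path.suffix.lower())
        if modality is not None:
            groups[modality].append(path)
    return groups

groups = partition_by_modality("./multimodal_dataset")  # hypothetical path
```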

This dataset consists of carefully curated audio recordings that capture the distinct sounds produced by multiple individuals walking in various environments. Designed to support research in sound recognition, activity analysis, and the study of human behaviour, it provides a rich resource for understanding how group dynamics influence acoustic patterns. Each recording is accompanied by detailed metadata, including the number of participants, environmental context, and surface characteristics.
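The metadata fields named above map naturally onto a small record type. A sketch, where the field names and types are assumptions based on the description:

```python
from dataclasses import dataclass

@dataclass
class WalkingRecording:
    """Metadata accompanying one recording (field names are assumptions)."""
    audio_path: str
    num_participants: int   # number of people walking in the clip
    environment: str        # e.g. "indoor corridor", "park"
    surface: str            # e.g. "gravel", "concrete", "wood"

example = WalkingRecording("rec_001.wav", num_participants=3,
                           environment="indoor corridor", surface="concrete")
```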

The growing availability of large and diverse data sources in healthcare has boosted the application of novel computational techniques that can extract meaningful information to improve patients' prognoses and support other important medical uses. However, current systems require professionals to type the information manually, which increases the risk of transcription errors and cross-contamination. We propose an automated system that allows healthcare professionals to dictate clinical information, which is then transcribed and analyzed.
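The description does not name a transcription engine; as a stand-in, here is a minimal dictation-to-text sketch using the open-source Whisper model, where the model size, file name, and language are assumptions:

```python
import whisper  # pip install openai-whisper

# Load a small general-purpose ASR model (model size is an assumption).
model = whisper.load_model("base")

# Transcribe a dictated clinical note; downstream analysis is out of scope here.
result = model.transcribe("dictation.wav", language="en")
print(result["text"])
```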

Most existing audio fingerprinting systems are of limited use for highly specific audio retrieval at scale. In this work, we generate a low-dimensional representation from a short unit segment of audio and couple this fingerprint with a fast maximum inner-product search. To this end, we present a contrastive learning framework derived from the segment-level search objective. Each training update uses a batch consisting of a set of pseudo-labels, randomly selected original samples, and their augmented replicas.
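Once segment-level fingerprints are L2-normalized, maximum inner-product search coincides with cosine search and can be served by an exact inner-product index. A sketch using FAISS; the dimensionality and corpus size are assumptions, and the embedding model itself is omitted:

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 128                                              # fingerprint dimension (assumption)
db = np.random.randn(100_000, d).astype("float32")   # stand-in fingerprints
faiss.normalize_L2(db)        # unit-norm vectors: inner product == cosine

index = faiss.IndexFlatIP(d)  # exact maximum inner-product search
index.add(db)

query = np.random.randn(1, d).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)  # top-5 matching segments
```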

Acoustic components are built into many products. They are used not only in dedicated audio systems but also in a wide range of products such as home theaters, broadcast amplifier systems, TVs, computers, AI speakers, and game consoles. Many companies continue to work on improving the sound quality of these components, and high-quality sound will be demanded across an ever wider range of industrial fields.

We introduce HUMAN4D, a large and multimodal 4D dataset that contains a variety of human activities simultaneously captured by a professional marker-based MoCap system, a volumetric capture system, and an audio recording system. By capturing 2 female and 2 male professional actors performing various full-body movements and expressions, HUMAN4D provides a diverse set of motions and poses encountered as part of single- and multi-person daily, physical, and social activities (jumping, dancing, etc.), along with multi-RGBD (mRGBD), volumetric, and audio data. Despite the existence of multi-view color datasets captured with hardware synchronization, HUMAN4D is, to the best of our knowledge, the first public resource that provides volumetric depth maps with high synchronization precision.

Time Scale Modification (TSM) is a well-researched field; however, no effective objective measure of quality exists. This paper details the creation, subjective evaluation, and analysis of a dataset for use in developing an objective measure of quality for TSM. The dataset comprises two parts: the training component contains 88 source files processed using six TSM methods at 10 time scales, while the testing component contains 20 source files processed using three additional methods at four time scales.
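Assuming every source/method/time-scale combination is rendered once, that amounts to 88 × 6 × 10 = 5,280 processed training files and 20 × 3 × 4 = 240 processed testing files.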

The Heidelberg Spiking Datasets comprise two spike-based classification datasets: the Spiking Heidelberg Digits (SHD) dataset and the Spiking Speech Commands (SSC) dataset. The latter is derived from Pete Warden's Speech Commands dataset (https://arxiv.org/abs/1804.03209), whereas the former is based on a spoken digit dataset recorded in-house and included in this repository. Both datasets were generated by applying a detailed inner ear model to audio recordings. We distribute the input spikes and target labels in HDF5 format.
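Reading the distributed files takes only a few lines with h5py. A sketch, where the file name and the key names (`spikes/times`, `spikes/units`, `labels`) are assumptions that should be verified against the actual files:

```python
import h5py

# Open one of the distributed HDF5 files (file name is an assumption).
with h5py.File("shd_train.h5", "r") as f:
    times = f["spikes/times"][:]   # per-sample arrays of spike times (s)
    units = f["spikes/units"][:]   # per-sample arrays of input-channel indices
    labels = f["labels"][:]        # integer class targets

print(len(labels), "samples; first sample has", len(times[0]), "spikes")
```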

The dataset consists of EEG recordings obtained while subjects listened to different utterances: a, i, u, bed, please, sad. A limited number of EEG recordings were also obtained with the three vowels corrupted by white and babble noise at an SNR of 0 dB. Recordings were performed on 8 healthy subjects.
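Mixing at an SNR of 0 dB means scaling the noise so that signal and noise power are equal. A minimal sketch of the scaling rule; the function and variable names are illustrative:

```python
import numpy as np

def mix_at_snr(signal, noise, snr_db):
    """Scale `noise` so the mixture has the requested SNR, then add it."""
    noise = noise[: len(signal)]
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    # SNR_dB = 10*log10(p_signal / (scale**2 * p_noise))  =>  solve for scale
    scale = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10.0)))
    return signal + scale * noise

# At snr_db=0 the scale makes noise power equal to signal power.
```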
