A PMUT-Based Bone Conduction Microphone System for Enhancing Speech Recognition Accuracy

- Citation Author(s): Chongbin Liu (Wuhan University)
- Submitted by: Chongbin Liu
- Last updated:
- DOI: 10.21227/p5vj-qq91
- Data Format:
- Categories:
- Keywords:
Abstract
Speech recognition in noisy environments has long posed a challenge. The commonly used air conduction microphone (ACM) is susceptible to environmental noise.
In this work, a customized bone conduction microphone (BCM) system based on a piezoelectric micromachined ultrasonic transducer (PMUT) is developed to capture real-time bone conduction (BC) speech, while a commercial ACM is integrated to capture air conduction (AC) speech simultaneously. The system enables simpler and more robust BC speech capture. The captured BC speech achieves a signal-to-noise amplitude ratio over 5 times greater than that of AC speech in a 68 dB noise environment. Instead of using AC speech alone, both BC and AC speech are fed into a speech enhancement (SE) module. The noise-insensitive BC speech serves as a speech reference to adapt the SE backbone for AC speech. The two types of speech are fused, and noise suppression is applied to generate enhanced speech.
Compared with the original noisy speech, the enhanced speech achieves a character error rate reduction of over 20%, approaching the speech recognition accuracy of clean speech. The results indicate that the speech enhancement method based on the fusion of BC and AC speech efficiently integrates the features of both types of speech, thereby improving speech recognition accuracy in noisy environments.
This work presents an innovative system designed to efficiently capture BC speech and enhance speech recognition in noisy environments.
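For readers unfamiliar with the metric, the character error rate (CER) cited above is conventionally the character-level edit distance between the recognized text and the reference transcript, divided by the reference length. The following is a minimal sketch of that conventional definition, not the authors' evaluation script; the function names `edit_distance` and `character_error_rate` and the example strings are illustrative assumptions.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / number of reference characters."""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one substituted and one dropped character
# in a six-character Mandarin reference.
example_cer = character_error_rate("今天天气很好", "今天天汽很")
```

Under this definition, a lower CER for the enhanced speech than for the noisy speech means fewer character-level recognition errors.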
Instructions:
Each folder contains the speech recordings used in the section indicated by the folder name.
The sampling frequency for the BC and AC speech in Section 3 is 4 kHz.
In Section 4, the BC and AC speech has a sampling frequency of 16 kHz, which is consistent with the AISHELL-1 dataset used for acoustic model training.
The distinction between subfolders 1 and 2 in Section 4 is that the BC speech in folder 1 is obtained by resampling 4 kHz speech, while the BC speech in folder 2 is directly collected at 16 kHz.
The advantage of resampling 4 kHz speech is that it reduces the transmission burden on Bluetooth and extends recording time.
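The saving can be estimated with simple arithmetic. The sketch below assumes uncompressed 16-bit mono PCM, which is an assumption for illustration only; the actual sample format and Bluetooth payload of the system are not stated here. Whatever the bit depth, the raw data rate scales linearly with the sampling rate, so 4 kHz capture needs one quarter of the bandwidth of 16 kHz capture.

```python
def pcm_bytes_per_second(sample_rate_hz: int,
                         bits_per_sample: int = 16,  # assumed, not from the dataset
                         channels: int = 1) -> int:
    """Raw (uncompressed) PCM data rate in bytes per second."""
    return sample_rate_hz * bits_per_sample // 8 * channels


def recording_seconds(buffer_bytes: int, sample_rate_hz: int) -> float:
    """How long a fixed-size buffer lasts at a given sampling rate."""
    return buffer_bytes / pcm_bytes_per_second(sample_rate_hz)


rate_4k = pcm_bytes_per_second(4_000)
rate_16k = pcm_bytes_per_second(16_000)
bandwidth_ratio = rate_16k / rate_4k  # 16 kHz needs 4x the data rate

# A hypothetical 1 MiB buffer lasts four times longer at 4 kHz.
duration_ratio = (recording_seconds(1 << 20, 4_000)
                  / recording_seconds(1 << 20, 16_000))
```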
BC speech collected directly at a 16 kHz sampling rate is significantly affected by electromagnetic interference, which impacts its audibility. However, this type of noise can be effectively filtered out in the speech enhancement model, as electromagnetic interference is a form of steady-state noise.
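Furthermore, the acoustic features of BC speech above 2 kHz are attenuated to nearly zero. Because a 4 kHz sampling rate already represents content up to its Nyquist frequency of 2 kHz, capturing at 4 kHz and then resampling to 16 kHz does not result in a loss of acoustic characteristics.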
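The resampling step can be illustrated with a short sketch. It upsamples a 4 kHz signal to 16 kHz by zero-padding its spectrum (band-limited interpolation), using NumPy only. This is one standard technique and is not necessarily the method used to produce the files in folder 1; the function name `upsample_fft` and the 500 Hz test tone are illustrative assumptions.

```python
import numpy as np


def upsample_fft(x: np.ndarray, factor: int) -> np.ndarray:
    """Upsample a real signal by an integer factor via spectral zero-padding."""
    n = len(x)
    m = n * factor
    spectrum = np.fft.rfft(x)
    padded = np.zeros(m // 2 + 1, dtype=complex)
    padded[:len(spectrum)] = spectrum
    if n % 2 == 0:
        # The old Nyquist bin becomes an ordinary bin that irfft mirrors,
        # so halve it to keep its total energy unchanged.
        padded[n // 2] *= 0.5
    # irfft normalizes by the new length m, so rescale by m / n = factor.
    return np.fft.irfft(padded, m) * factor


# A 500 Hz tone (below the 2 kHz Nyquist limit) sampled at 4 kHz for 1 s.
fs_low, fs_high = 4_000, 16_000
t_low = np.arange(fs_low) / fs_low
tone_4k = np.sin(2 * np.pi * 500 * t_low)

tone_16k = upsample_fft(tone_4k, fs_high // fs_low)

# Compare with the same tone sampled directly at 16 kHz.
t_high = np.arange(fs_high) / fs_high
max_error = np.max(np.abs(tone_16k - np.sin(2 * np.pi * 500 * t_high)))
```

For content below 2 kHz, the upsampled signal matches what direct 16 kHz sampling would have produced, which is the sense in which the resampled BC speech loses nothing. The upsampled signal contains no energy between 2 kHz and 8 kHz.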
In the future, we will implement electromagnetic shielding to achieve higher-quality BC speech.