Figure 1: Target extraction using multi-channel signal processing
The figure above shows a sample multi-channel target enhancement scenario. Two competing speakers are recorded by a 5-channel microphone array in a noisy and reverberant environment (T60 = 0.6 s; source and interferer are each at a distance of 1.0 m from the centre of the array, outside the critical distance of the room (≈ 0.84 m)). The target localisation and subsequent extraction/enhancement using the approaches developed in [1] are presented below for this example. Both speaker signals have equal power. The background noise is diffuse white noise mixed at 10 dB below the signal power. Note that the noise is correlated across the microphones.
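The correlation of the diffuse noise across the microphones follows directly from the theory of spherically isotropic noise fields: the spatial coherence between two microphones spaced d metres apart is a sinc function of frequency. A minimal sketch (the 5 cm spacing is a hypothetical value, not taken from the setup above):

```python
import numpy as np

def diffuse_coherence(freqs, d, c=343.0):
    """Theoretical spatial coherence of a spherically diffuse noise
    field between two microphones spaced d metres apart:
    Gamma(f) = sin(2*pi*f*d/c) / (2*pi*f*d/c), written here with
    numpy's normalised sinc."""
    return np.sinc(2.0 * freqs * d / c)

freqs = np.array([0.0, 500.0, 2000.0, 8000.0])
coh = diffuse_coherence(freqs, d=0.05)  # 5 cm spacing (hypothetical)
# Coherence is 1 at DC and decays towards zero with frequency: diffuse
# noise is strongly correlated across closely spaced mics at low
# frequencies, which is why it cannot be treated as uncorrelated.
```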
The microphone signal: x1(n)
The output using delay-and-sum beamforming: yDSB(n)
The output using an adaptive soft-mask based on the target presence probability: ymask(n)
The output using a cepstro-temporally smoothed version [2] of the soft-mask above: ysmth(n)
The output using the parsimoniously excited generalised sidelobe canceller (PEG) algorithm: yPEG(n)
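The first of the approaches above, the delay-and-sum beamformer, simply time-aligns the channels towards the target direction and averages them. A minimal sketch, assuming the target time-differences-of-arrival have already been estimated by the localisation stage (the function and parameter names are illustrative, not from the cited work):

```python
import numpy as np

def delay_and_sum(x, delays, fs):
    """Delay-and-sum beamformer.
    x      : (M, N) array of M microphone signals
    delays : length-M target TDOAs in seconds (relative to a reference mic)
    fs     : sampling rate in Hz
    Each channel is advanced by its TDOA via the phase of its FFT
    (handles fractional delays), then the channels are averaged."""
    M, N = x.shape
    X = np.fft.rfft(x, axis=1)
    f = np.fft.rfftfreq(N, d=1.0 / fs)
    # e^{+j 2 pi f tau} advances a channel that was delayed by tau,
    # so the target component adds coherently across channels.
    X *= np.exp(2j * np.pi * f[None, :] * np.asarray(delays)[:, None])
    return np.fft.irfft(X, n=N, axis=1).mean(axis=0)
```

Interference and noise arriving from other directions do not add coherently and are attenuated only by the averaging, which is why the suppression of a DSB with few microphones is modest.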
As expected, the DSB offers some enhancement, but since it does not actively cancel interference and noise, the improvement is limited; in particular, the interferer remains clearly audible. The mask-based approach offers good noise and interference suppression, and the quality of the target speech is also rather good thanks to the use of a soft mask. However, some artifacts may be observed in the output. There is also one point, towards the latter part of the sentence, where target distortion is audible: the target has low energy in these frames and is not well localized, leading to a sudden dip in the voice that may impact intelligibility. The smoothed masks improve the target speech quality, but at the cost of reduced noise and interference suppression. The PEG approach offers good performance in terms of interference cancellation, noise cancellation, and preservation of the target signal, and it sounds the most natural.
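The trade-off between the raw and smoothed masks can be illustrated with a simplified sketch of cepstral-domain temporal smoothing in the spirit of [2]: the log mask is transformed to the cepstral domain, the cepstral coefficients are recursively averaged over time, and the result is transformed back. This uniform-smoothing version is an assumption-laden illustration; the actual method in [2] uses quefrency-dependent smoothing constants to protect pitch-related structure.

```python
import numpy as np

def cepstral_smooth_mask(mask, beta=0.6, floor=1e-3):
    """Simplified temporal smoothing of a spectral mask in the
    cepstral domain (illustrative sketch only).  Per frame:
      1. real cepstrum of the log mask,
      2. first-order recursive averaging over time (constant beta),
      3. inverse transform and exponentiation.
    mask : (F, T) soft mask in [0, 1], F = number of frequency bins."""
    F, T = mask.shape
    logm = np.log(np.maximum(mask, floor))   # floor avoids log(0)
    out = np.empty_like(mask)
    cep_prev = None
    for t in range(T):
        cep = np.fft.irfft(logm[:, t], n=2 * (F - 1))  # real cepstrum
        if cep_prev is not None:
            cep = beta * cep_prev + (1.0 - beta) * cep
        cep_prev = cep
        out[:, t] = np.exp(np.fft.rfft(cep)[:F].real)
    return out
```

Larger beta yields smoother masks and fewer musical-noise artifacts, but the mask reacts more slowly and suppresses interference less aggressively, matching the behaviour described above.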
[1] N. Madhu, “Acoustic source localization: Algorithms, applications and extensions to source separation”, Dissertation, Ruhr-Universität Bochum.
[2] N. Madhu, C. Breithaupt and R. Martin, “Temporal smoothing of spectral masks in the cepstral domain for speech separation”, Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2008.