Audio: MFCC: Add Voice Activity Detection based on Mel spectrum #10782
singalsu wants to merge 3 commits into
This is still WIP. I'd like to add a better audio feature header to the fake PCM stream. Successive PRs should start to use the compressed PCM type for MFCC output data. The MFCC config blob could enable discontinuous data in VAD mode, e.g. background-noise Mel spectrum values once per second, and frames at the FFT hop rate (e.g. every 10 ms) while speech is detected.
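The discontinuous-output idea above can be sketched as a per-frame emit decision. This is only an illustration of the proposal, not code from the PR: the function name `should_emit` and its parameters are hypothetical, and the 10 ms hop and 1 s noise interval are just the example values from the comment.

```python
def should_emit(frame_index: int, is_speech: bool,
                hop_ms: int = 10, noise_interval_ms: int = 1000) -> bool:
    """Decide whether a Mel frame is sent in a hypothetical discontinuous mode.

    Speech frames are always sent (one per FFT hop). During silence only
    one background-noise frame per noise interval is sent.
    """
    if is_speech:
        return True
    frames_per_noise_update = noise_interval_ms // hop_ms
    return frame_index % frames_per_noise_update == 0


# Ten seconds of silence at a 10 ms hop yields ten noise frames.
sent = sum(should_emit(i, False) for i in range(1000))
```

With these example values the output rate during silence drops by a factor of 100 relative to continuous output.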
Pull request overview
This PR introduces an optional MFCC Voice Activity Detection (VAD) feature that runs on the MFCC component’s Mel log spectrum and embeds a VAD flag into the MFCC/Mel output stream, along with updated host-side tuning/decoding tooling and documentation.
Changes:
- Add a new `mfcc_vad` module (state, initialization, per-frame update) and wire it into MFCC Mel-log-spectrum processing.
- Insert a per-frame VAD flag into the MFCC output stream immediately after the magic header word (gated by a new Kconfig option).
- Update tuning tools/documentation: add a live DSP-VAD-triggered Whisper transcription script, migrate README to Markdown, and extend `decode_mel.m` to extract VAD.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| src/include/sof/audio/mfcc/mfcc_vad.h | New public header for VAD state + API and tuning constants |
| src/audio/mfcc/mfcc_vad.c | New VAD implementation (noise floor tracking + weighted energy delta + hangover) |
| src/include/sof/audio/mfcc/mfcc_comp.h | Extend MFCC component state to carry VAD state and output bookkeeping |
| src/audio/mfcc/mfcc_common.c | Run VAD during Mel processing and emit VAD flag in stream output |
| src/audio/mfcc/mfcc_setup.c | Initialize/free VAD resources during MFCC setup/teardown |
| src/audio/mfcc/Kconfig | Add CONFIG_COMP_MFCC_VAD option controlling build + format change |
| src/audio/mfcc/CMakeLists.txt | Conditionally compile mfcc_vad.c |
| src/arch/host/configs/library_defconfig | Enable VAD in host library defconfig |
| src/audio/mfcc/tune/sof_mel_to_text_live_dsp_vad.py | New live capture + Whisper transcription tool using DSP-embedded VAD |
| src/audio/mfcc/tune/README.md | New Markdown documentation (replaces README.txt) |
| src/audio/mfcc/tune/decode_mel.m | Extend Mel decoder to parse VAD flag and plot it |
Comments suppressed due to low confidence (1)
src/audio/mfcc/mfcc_common.c:297
`vad_pending` is only set for `state->mel_only`. If VAD is meant to be emitted for all MFCC output frames (including cepstral output), this needs to be set for the non-`mel_only` path too; otherwise, please update docs to state the VAD flag is only present in Mel-log-spectrum output streams.
if (state->mel_only) {
state->out_data_ptr = state->mel_spectra->data;
#ifdef CONFIG_COMP_MFCC_VAD
state->vad_pending = true;
#endif
I think I'll remove CONFIG_COMP_MFCC_VAD and always build the VAD. Then it's simpler to make it a permanent part of the magic header. The configuration blob for Mel mode can enable computing it, while in MFCC mode it will be zeros unless it is also enabled there with the blob. Then the parsing scripts can always use the same data format.
Adding more features --> draft |
@singalsu another option would be to keep the |
True, but I find testing all the Kconfig and blob config permutations very time-consuming, so it is better to reduce variation when possible. VAD is so small that I don't think it matters. And the blob can switch it off, as it now does for MFCC ceps mode.
if (config->enable_vad)
	mfcc_vad_update(&cd->vad, state->mel_log_32);

/* Populate data header for this output frame */
state->header.energy = cd->vad.energy;
state->header.noise_energy = cd->vad.noise_energy;
state->header.vad_flag = cd->vad.is_speech ? 1 : 0;

/* Increment hop counter at end of hop processing */
state->hop_count++;

/* Send notification when VAD state changes */
if (config->update_controls) {

/* Initialize VAD switch control notification if enabled */
if (cd->config && cd->config->update_controls) {
	ret = mfcc_ipc_notification_init(mod);
	if (ret < 0)
		goto err;

	cd->vad_prev = false;
}

proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
This is just a demo script and this issue is going back and forth. I've not seen issues with this version of code so I prefer to keep it as is.
break
continue

buf += data

def decode_mel_frame(raw_ints):
    """Convert 80 int32 Q9.23 values to float32 mel coefficients."""
    return raw_ints.astype(np.float64) / (2 ** SOF_Q_FORMAT)
Yep, float32 is sufficient for this.
figure
subplot(2,1,1);
level = sum(mel(:,:));
Yep, forgot it from earlier test.
Add mfcc_vad module with A-weighted energy-based voice activity detection that operates on the Mel log spectrum produced by the MFCC component. The algorithm tracks a per-bin noise floor with instant-down and slow-rise behavior, then computes a weighted energy delta above the floor. Speech is declared when the delta exceeds a threshold (0.35 in Q9.23) with a 20-frame hangover to prevent rapid toggling. The VAD is gated on the new enable_vad flag in sof_mfcc_config.

Add struct mfcc_data_header with six int32 fields (magic, frame_number, reserved, energy, noise_energy, vad_flag) prepended to every output frame in all format paths (S16, S24, S32). This replaces the previous magic-word-only header. The header carries the VAD decision and energy values from the DSP for downstream consumers.

Extend sof_mfcc_config in user/mfcc.h with reserved16[3] padding for 32-bit alignment, and new boolean fields enable_vad, enable_dtx, update_controls, and reserved_bool[5]. The config blob size increases from 104 to 116 bytes.

Update Matlab/Octave decode scripts (decode_mel.m, decode_ceps.m, decode_all.m) and setup_mfcc.m for the expanded header and config struct. Regenerate topology2 configuration blobs (default.conf, mel80.conf) with the new blob size.

Signed-off-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com>
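The algorithm described in this commit message can be sketched roughly as follows. This is a host-side illustration under stated assumptions, not the code in mfcc_vad.c: the class name `VadSketch`, the `RISE_SHIFT` rise rate, the way the floor is seeded from the first frame, and the per-bin `weights` are all hypothetical. Only the 0.35 threshold in Q9.23 and the 20-frame hangover come from the commit message.

```python
from dataclasses import dataclass, field
from typing import List, Sequence

Q23 = 1 << 23                       # Q9.23 scale factor
THRESHOLD_Q23 = round(0.35 * Q23)   # 0.35 in Q9.23 (from the commit message)
HANGOVER_FRAMES = 20                # from the commit message
RISE_SHIFT = 8                      # assumption: floor rises by 1/256 of the gap per frame


@dataclass
class VadSketch:
    weights: Sequence[float]        # hypothetical per-bin weights (e.g. A-weighting)
    noise_floor: List[int] = field(default_factory=list)
    hangover: int = 0

    def update(self, mel_log: Sequence[int]) -> bool:
        """Process one Mel log spectrum frame (Q9.23 ints), return the VAD flag."""
        if not self.noise_floor:
            self.noise_floor = list(mel_log)   # assumption: seed floor from first frame
        delta = 0.0
        for i, value in enumerate(mel_log):
            floor = self.noise_floor[i]
            if value < floor:
                floor = value                           # instant down
            else:
                floor += (value - floor) >> RISE_SHIFT  # slow rise
            self.noise_floor[i] = floor
            delta += self.weights[i] * (value - floor)  # weighted energy above floor
        if delta > THRESHOLD_Q23:
            self.hangover = HANGOVER_FRAMES             # re-arm hangover on speech
            return True
        if self.hangover > 0:
            self.hangover -= 1                          # hold the flag after speech ends
            return True
        return False


vad = VadSketch(weights=[0.25] * 4)
quiet = [vad.update([0] * 4) for _ in range(5)]
loud = vad.update([Q23] * 4)
tail = [vad.update([0] * 4) for _ in range(21)]
```

In this sketch the flag stays set for 20 frames after the last frame that exceeded the threshold, which is one plausible reading of "20-frame hangover".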
Add sof_mel_to_text_live_dsp_vad.py that captures mel spectrogram frames from ALSA with embedded DSP VAD flag and performs live speech-to-text transcription using OpenVINO Whisper. The script buffers mel frames during speech and triggers Whisper inference when silence is detected after speech. Capture runs continuously in a separate thread during inference to avoid frame drops.

Replace the old README.txt with a comprehensive README.md that documents the MFCC tuning tools, testbench usage with run_mfcc.sh, output file formats, Matlab/Octave decode and plotting scripts, and the new live transcription workflow.

Signed-off-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com>
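The buffering behavior described in this commit message (collect frames during speech, hand the segment off when silence follows) might look like the sketch below. It is not taken from sof_mel_to_text_live_dsp_vad.py: `SpeechSegmenter`, `push`, and the `on_segment` callback are hypothetical names, and the callback stands in for whatever runs Whisper inference. The separate capture thread mentioned in the commit message is left out.

```python
from typing import Callable, List, Sequence


class SpeechSegmenter:
    """Collect mel frames while the DSP VAD flag is set, flush on silence."""

    def __init__(self, on_segment: Callable[[List[Sequence[int]]], None]):
        self.on_segment = on_segment   # hypothetical hook, e.g. queue for inference
        self.frames: List[Sequence[int]] = []
        self.in_speech = False

    def push(self, mel_frame: Sequence[int], vad_flag: int) -> None:
        if vad_flag:
            self.frames.append(mel_frame)     # buffer frames during speech
            self.in_speech = True
        elif self.in_speech:
            segment, self.frames = self.frames, []
            self.in_speech = False
            self.on_segment(segment)          # silence after speech: trigger


segments = []
segmenter = SpeechSegmenter(segments.append)
for index, flag in enumerate([0, 1, 1, 0, 0, 1, 0]):
    segmenter.push([index], flag)
```

In a live tool the callback would typically place the segment on a queue so that capture can continue while inference runs.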
Add IPC4 notification that sends the VAD state to user space via a switch control whenever the VAD decision changes between speech and silence. The notification is initialized during prepare and sent from the audio processing path on VAD state transitions.

The implementation follows the TDFB/sound_dose notification pattern: mfcc_ipc4.c contains the IPC4-specific notification init and send functions, while mfcc.c provides weak stubs so IPC3 builds link without the IPC4 dependencies.

Add handling for SOF_IPC4_SWITCH_CONTROL_PARAM_ID in mfcc_get_config and mfcc_set_config so the kernel driver can read back the current VAD state after receiving a notification. The switch control is read-only from the DSP side.

Both the notification init and the VAD state change detection are gated on the update_controls flag in the configuration blob struct.

Add a switch control (mixer) to the MFCC topology2 widget definition for the VAD notification.

Signed-off-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com>
state->magic_pending = false;
state->header_pending = false;
memset(&state->header, 0, sizeof(state->header));
state->header.magic = MFCC_MAGIC;

ret = mfcc_ipc_notification_init(mod);
if (ret < 0)
	return ret;

	int32_t noise_energy; /**< Weighted noise floor energy in Q9.23 */
	int32_t vad_flag; /**< VAD decision: 1 = speech, 0 = silence */
};