WakeNet is a wake word engine built upon neural networks and designed for low-power embedded MCUs. Currently, a WakeNet model supports up to 5 wake words.
The flow of WakeNet is shown below:

- **Speech Feature:**
  WakeNet uses MFCC to obtain the features of the input audio (16 kHz, 16-bit, mono). The window width and step width of each frame are both 30 ms.
- **Neural Network:**
  The neural network structure has been updated to its sixth edition:
  - WakeNet1 and WakeNet2 are no longer in use.
  - WakeNet3 and WakeNet4 are built upon the CRNN structure.
  - WakeNet5 and WakeNet6 are built upon the Dilated Convolution structure.
- **Keyword Triggering Method:**
  For continuous audio streams, we average the recognition results (M) over several frames to obtain a smoothed prediction, which improves the accuracy of keyword triggering. A trigger command is sent only when M exceeds the configured threshold; a sketch of this smoothing logic is given after this list.
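To make the triggering method concrete, here is a minimal, illustrative sketch of the smoothing logic described above. It is not WakeNet's internal implementation; the window length `SMOOTH_WINDOW`, the per-frame `frame_score` input, and the function name are assumptions made for illustration.

```c
#include <stdbool.h>

#define SMOOTH_WINDOW 7   // number of frames to average (assumed value)

// Ring buffer holding the most recent per-frame recognition scores.
static float scores[SMOOTH_WINDOW];
static int score_pos = 0;

// Feed one per-frame score in [0.0, 1.0]; returns true when the smoothed
// average M exceeds the triggering threshold.
bool keyword_triggered(float frame_score, float threshold)
{
    scores[score_pos] = frame_score;
    score_pos = (score_pos + 1) % SMOOTH_WINDOW;

    float m = 0.0f;
    for (int i = 0; i < SMOOTH_WINDOW; i++) {
        m += scores[i];
    }
    m /= SMOOTH_WINDOW;   // smoothed prediction M

    return m > threshold; // send a trigger command only above the threshold
}
```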
- **How to select the WakeNet model**

  Run `make menuconfig` and navigate to `Component config` >> `ESP Speech Recognition` >> `Wake word engine`.
- **How to select the wake words**

  Run `make menuconfig` and navigate to `Component config` >> `ESP Speech Recognition` >> `Wake words list`.

  Note that the `customized word` option only supports WakeNet5 and WakeNet6; WakeNet3 and WakeNet4 are only compatible with earlier versions. If you want to use your own wake words, overwrite the existing models in the `wake_word_engine` directory with your own wake word model.
- **How to set the triggering threshold**

  - The triggering threshold of a wake word can be set in the range (0, 0.9999) to adjust the accuracy of the wake word model. If a model supports more than one wake word, the threshold can be configured separately for each of them.
  - The smaller the triggering threshold, the higher the risk of false triggering (and vice versa). Configure the threshold according to your application.
  - The wake word engine predefines two thresholds for each wake word during initialization:

    ```c
    typedef enum {
        DET_MODE_90 = 0,  // Normal: response accuracy rate about 90%
        DET_MODE_95       // Aggressive: response accuracy rate about 95%
    } det_mode_t;
    ```

  - Use the `set_det_threshold()` function to configure the thresholds for different wake words after initialization, as sketched below.
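  A minimal sketch of this flow, assuming the `esp_wn_iface_t`-style interface of Espressif's speech recognition component (the header names and the `WAKENET_MODEL`/`WAKENET_COEFF` macros are assumptions and may differ between versions):

  ```c
  #include "esp_wn_iface.h"   // wake word engine interface (header name assumed)
  #include "esp_wn_models.h"  // maps the menuconfig selection to a model (assumed)

  static const esp_wn_iface_t *wakenet = &WAKENET_MODEL;
  static model_iface_data_t *model_data;

  void wakenet_setup(void)
  {
      // Create the model with the predefined "Normal" thresholds (~90% accuracy).
      model_data = wakenet->create(&WAKENET_COEFF, DET_MODE_90);

      // Override the threshold of wake word #1 after initialization.
      wakenet->set_det_threshold(model_data, 0.92, 1);
  }
  ```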
- **How to get the sampling rate and frame size**

  - Use `get_samp_rate` to get the sampling rate of the audio stream to be recognized.
  - Use `get_samp_chunksize` to get the number of sampling points per frame. The audio data is encoded as `signed 16-bit int`. A minimal feeding loop using both calls is sketched below.
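  The sketch below continues the example above and assumes the same `esp_wn_iface_t`-style interface, including a `detect` call that returns non-zero when a wake word is recognized; `read_audio_frame()` is a hypothetical capture function standing in for your I2S/codec driver. Note that at 16 kHz, a 30 ms frame corresponds to 480 samples, which is what `get_samp_chunksize` would return for WakeNet5.

  ```c
  #include <stdint.h>
  #include <stdlib.h>
  #include "esp_wn_iface.h"   // same assumed interface as above

  // Hypothetical capture function: fills `buf` with `n` 16-bit mono samples.
  extern void read_audio_frame(int16_t *buf, int n);

  void wakenet_loop(const esp_wn_iface_t *wakenet, model_iface_data_t *model_data)
  {
      int chunk = wakenet->get_samp_chunksize(model_data);               // samples per frame
      int frame_ms = 1000 * chunk / wakenet->get_samp_rate(model_data);  // e.g. 480 / 16000 Hz = 30 ms
      (void)frame_ms;  // shown only to illustrate the relation between the two calls

      int16_t *buffer = malloc(chunk * sizeof(int16_t));                 // signed 16-bit PCM

      while (1) {
          read_audio_frame(buffer, chunk);
          if (wakenet->detect(model_data, buffer)) {                     // non-zero: wake word detected
              // handle the wake word ...
          }
      }
  }
  ```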
The resource consumption of the WakeNet models is listed below:
| Model Type | Parameter Size | RAM | Average Running Time per Frame | Frame Length |
|---|---|---|---|---|
| Quantized WakeNet3 | 26 K | 20 KB | 29 ms | 90 ms |
| Quantized WakeNet4 | 53 K | 22 KB | 48 ms | 90 ms |
| Quantized WakeNet5 | 41 K | 15 KB | 7 ms | 30 ms |
The recognition accuracy of WakeNet5 under different distances and noise conditions:

| Distance | Quiet | Stationary Noise (SNR = 5 ~ 10 dB) | Speech Noise (SNR = 5 ~ 10 dB) | AEC Interruption (-5 ~ -10 dB) |
|---|---|---|---|---|
| 1 m | 97% | 90% | 88% | 89% |
| 3 m | 95% | 85% | 75% | 73% |
False triggering rate: 1 time in 20 hours
Note: The tests used the ESP32-LyraT-Mini development board and the WakeNet5 model. Performance is limited because the ESP32-LyraT-Mini has only one microphone; better recognition performance is expected when more microphones are used.
For details on how to customize your wake words, please see Espressif Speech Wake Word Customization Process.