WakeNet is a wake word engine built upon neural networks and designed for low-power embedded MCUs. Currently, a WakeNet model supports up to 5 wake words.
The flow of WakeNet is shown below:

- **Speech Feature:**
  WakeNet uses MFCC to obtain the features of the input audio (16 kHz, 16-bit, mono). The window width and step width of each frame are both 30 ms.
- **Neural Network:**
  The neural network structure has been updated to its sixth edition:
  - WakeNet1 and WakeNet2 are no longer in use.
  - WakeNet3 and WakeNet4 are built upon the CRNN structure.
  - WakeNet5 and WakeNet6 are built upon the Dilated Convolution structure.
- **Keyword Triggering Method:**
  For continuous audio streams, we average the recognition results (M) over several frames to obtain a smoothed prediction, which improves the accuracy of keyword triggering. A trigger command is sent only when M exceeds the configured threshold; a sketch of this smoothing logic is given after this list.
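To make the triggering method concrete, here is a minimal, illustrative sketch of the smoothing logic described above. It is not WakeNet's internal implementation; the window length `SMOOTH_WINDOW`, the per-frame `frame_score` input, and the function name are assumptions made for illustration.

```c
#include <stdbool.h>

#define SMOOTH_WINDOW 7   // number of frames to average (assumed value)

// Ring buffer holding the most recent per-frame recognition scores.
static float scores[SMOOTH_WINDOW];
static int score_pos = 0;

// Feed one per-frame score in [0.0, 1.0]; returns true when the smoothed
// average M exceeds the triggering threshold.
bool keyword_triggered(float frame_score, float threshold)
{
    scores[score_pos] = frame_score;
    score_pos = (score_pos + 1) % SMOOTH_WINDOW;

    float m = 0.0f;
    for (int i = 0; i < SMOOTH_WINDOW; i++) {
        m += scores[i];
    }
    m /= SMOOTH_WINDOW;   // smoothed prediction M

    return m > threshold; // send a trigger command only above the threshold
}
```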
- **How to select the WakeNet model**

  Run `make menuconfig` and navigate to `Component config` >> `ESP Speech Recognition` >> `Wake word engine`.
- **How to select the wake words**

  Run `make menuconfig` and navigate to `Component config` >> `ESP Speech Recognition` >> `Wake words list`.

  Note that the `customized word` option only supports WakeNet5 and WakeNet6; WakeNet3 and WakeNet4 are only compatible with earlier versions. If you want to use your own wake words, overwrite the existing models in the `wake_word_engine` directory with your own wake word model.
- **How to set the triggering threshold**

  - The triggering threshold of a wake word can be set in the range (0, 0.9999) to adjust the accuracy of the wake word model. If a model supports more than one wake word, the threshold can be configured separately for each of them.
  - The smaller the triggering threshold, the higher the risk of false triggering (and vice versa). Configure the threshold according to your application.
  - The wake word engine predefines two thresholds for each wake word during initialization:

    ```c
    typedef enum {
        DET_MODE_90 = 0,  // Normal: response accuracy rate about 90%
        DET_MODE_95       // Aggressive: response accuracy rate about 95%
    } det_mode_t;
    ```

  - Use the `set_det_threshold()` function to configure the thresholds for different wake words after initialization, as sketched below.
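  A minimal sketch of this flow, assuming the `esp_wn_iface_t`-style interface of Espressif's speech recognition component (the header names and the `WAKENET_MODEL`/`WAKENET_COEFF` macros are assumptions and may differ between versions):

  ```c
  #include "esp_wn_iface.h"   // wake word engine interface (header name assumed)
  #include "esp_wn_models.h"  // maps the menuconfig selection to a model (assumed)

  static const esp_wn_iface_t *wakenet = &WAKENET_MODEL;
  static model_iface_data_t *model_data;

  void wakenet_setup(void)
  {
      // Create the model with the predefined "Normal" thresholds (~90% accuracy).
      model_data = wakenet->create(&WAKENET_COEFF, DET_MODE_90);

      // Override the threshold of wake word #1 after initialization.
      wakenet->set_det_threshold(model_data, 0.92, 1);
  }
  ```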
- **How to get the sampling rate and frame size**

  - Use `get_samp_rate` to get the sampling rate of the audio stream to be recognized.
  - Use `get_samp_chunksize` to get the number of sampling points per frame. The audio data is encoded as `signed 16-bit int`. A minimal feeding loop using both calls is sketched below.
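  The sketch below continues the example above and assumes the same `esp_wn_iface_t`-style interface, including a `detect` call that returns non-zero when a wake word is recognized; `read_audio_frame()` is a hypothetical capture function standing in for your I2S/codec driver. Note that at 16 kHz, a 30 ms frame corresponds to 480 samples, which is what `get_samp_chunksize` would return for WakeNet5.

  ```c
  #include <stdint.h>
  #include <stdlib.h>
  #include "esp_wn_iface.h"   // same assumed interface as above

  // Hypothetical capture function: fills `buf` with `n` 16-bit mono samples.
  extern void read_audio_frame(int16_t *buf, int n);

  void wakenet_loop(const esp_wn_iface_t *wakenet, model_iface_data_t *model_data)
  {
      int chunk = wakenet->get_samp_chunksize(model_data);               // samples per frame
      int frame_ms = 1000 * chunk / wakenet->get_samp_rate(model_data);  // e.g. 480 / 16000 Hz = 30 ms
      (void)frame_ms;  // shown only to illustrate the relation between the two calls

      int16_t *buffer = malloc(chunk * sizeof(int16_t));                 // signed 16-bit PCM

      while (1) {
          read_audio_frame(buffer, chunk);
          if (wakenet->detect(model_data, buffer)) {                     // non-zero: wake word detected
              // handle the wake word ...
          }
      }
  }
  ```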
The resource consumption of the WakeNet models is listed below:
| Model Type | Parameter Size | RAM | Average Running Time per Frame | Frame Length |
|---|---|---|---|---|
| Quantized WakeNet3 | 26 K | 20 KB | 29 ms | 90 ms |
| Quantized WakeNet4 | 53 K | 22 KB | 48 ms | 90 ms |
| Quantized WakeNet5 | 41 K | 15 KB | 7 ms | 30 ms |
The recognition accuracy of WakeNet5 under different distances and noise conditions:

| Distance | Quiet | Stationary Noise (SNR = 5 ~ 10 dB) | Speech Noise (SNR = 5 ~ 10 dB) | AEC Interruption (-5 ~ -10 dB) |
|---|---|---|---|---|
| 1 m | 97% | 90% | 88% | 89% |
| 3 m | 95% | 85% | 75% | 73% |
False triggering rate: 1 time in 20 hours
Note: The tests used the ESP32-LyraT-Mini development board and the WakeNet5 model. Performance is limited because the ESP32-LyraT-Mini has only one microphone; better recognition performance is expected when more microphones are used.
For details on how to customize your wake words, please see Espressif Speech Wake Word Customization Process.