Wakeword Detection App — on cAInvas

Train a deep learning model to respond to a specific word.

Photo by Admin on Crowd4Test

Audio wake word systems respond to a specific phrase (Cainvas, in this case). You can choose your own wake word and later use it to wake up your own IoT device.

The main challenges in training models for low-power devices are the limited memory and computational resources, since the model runs on-device without cloud connectivity.

In this article, the audio wake word is a specific word (hotword). The notebook takes background and hotword recordings as input for training the model.

Implementation of the idea on cAInvas — here!

The Dataset

The dataset consists of recordings from the user, used both to train and to test the model.

Background: Recordings of your usual background sounds, something as simple as someone reading a book or people talking to each other.

Hotword: Recordings of you speaking your wake word (Cainvas, in this case) separated by 2-second gaps. This is the hotword you will use to wake your system up.

Noise: Silence and noise are added, as shown below, to enhance the quality of the dataset.

https://gist.github.com/arnabchakraborty97/abcaf761ad785cae8955d8fe99af1c5c
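The exact augmentation code lives in the gist above; as a rough illustration, mixing a noise recording into a clip and padding a clip with silence can be done with plain NumPy. The function names and SNR value below are illustrative, not taken from the notebook:

```python
import numpy as np

def mix_noise(clip, noise, snr_db=10.0):
    """Mix a noise recording into a clip at a given signal-to-noise ratio (dB)."""
    noise = noise[:len(clip)]  # trim noise to the clip length
    clip_power = np.mean(clip ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale noise so that 10*log10(clip_power / scaled_noise_power) == snr_db
    scale = np.sqrt(clip_power / (noise_power * 10 ** (snr_db / 10)))
    return clip + scale * noise

def pad_with_silence(clip, target_len):
    """Right-pad a clip with zeros (silence) up to target_len samples."""
    return np.pad(clip, (0, max(0, target_len - len(clip))))
```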

The total number of labeled samples is now 471.

The dataset is then divided into training and test sets at a ratio of 9:1.
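A 9:1 split can be sketched with a shuffled index. The variable names are hypothetical; the notebook may use a library helper instead:

```python
import numpy as np

def train_test_split_9_1(samples, labels, seed=42):
    """Shuffle and split arrays into 90% train / 10% test."""
    idx = np.random.default_rng(seed).permutation(len(samples))
    cut = int(0.9 * len(samples))
    train_idx, test_idx = idx[:cut], idx[cut:]
    return samples[train_idx], labels[train_idx], samples[test_idx], labels[test_idx]
```

With the 471 samples above, this yields 423 training and 48 test samples.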

Note
1. The quality of the resulting model depends on the quality of the dataset. With a small dataset, false positives (the model predicting background noise as the wake word) are more likely.
2. The dataset can be further improved by having a variety of speakers record in different environments with different levels of background noise.

The model

We use a spectrogram operator at the beginning of the model. It applies the short-time Fourier transform (STFT), a Fourier-related transform that determines the sinusoidal frequency and phase content of local sections of a signal as it changes over time.
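Conceptually, the STFT slices the waveform into short overlapping frames, windows each frame, and takes its Fourier transform. A minimal NumPy version (the frame and hop lengths here are illustrative, not the model's actual parameters):

```python
import numpy as np

def stft_magnitude(signal, frame_len=256, hop=128):
    """Magnitude spectrogram via a short-time Fourier transform."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft keeps the non-redundant half of the spectrum for real input
    return np.abs(np.fft.rfft(frames, axis=1))  # shape: (n_frames, frame_len // 2 + 1)
```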

The output is then fed through a chain of convolution layers and a dense layer, with a final sigmoid activation layer producing the prediction.

https://gist.github.com/arnabchakraborty97/f82055c7b139ec50917f5d088b48e09c

The model is compiled with the binary cross-entropy loss function, since the final layer uses a sigmoid activation. The Adam optimizer is used, and accuracy is tracked over the epochs.

The model achieved an accuracy of ~95.8%.

The metrics and predictions

A clip is used to test the model: a recording of the user talking and using the hotword now and then, to check whether the model detects it correctly. The graph below shows detections of the hotword as high and everything else as low.
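Producing that high/low trace amounts to sliding a fixed-length window over the clip, scoring each window with the model, and thresholding the score. A sketch with a stand-in scoring function (the window length, stride, and threshold are assumptions, not the notebook's values):

```python
import numpy as np

def detect(clip, score_fn, window=16000, stride=8000, threshold=0.5):
    """Return 1 for each window score_fn rates above threshold, else 0."""
    hits = []
    for start in range(0, len(clip) - window + 1, stride):
        prob = score_fn(clip[start:start + window])
        hits.append(1 if prob > threshold else 0)
    return np.array(hits)
```

In the real pipeline, score_fn would be the trained model's predict call on a preprocessed window.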

deepC

The deepC library, compiler, and inference framework are designed to enable deep learning on small form-factor devices such as microcontrollers, eFPGAs, CPUs, and other embedded platforms like Raspberry Pi, Odroid, Arduino, SparkFun Edge, RISC-V boards, mobile phones, and x86 and ARM laptops, among others.

Compiling the model using deepC —

https://gist.github.com/arnabchakraborty97/af6ce0f354c54bd6a400930f05b32484

Head over to the cAInvas platform (link to notebook given earlier) to run and generate your own .exe file!

Credit: Rohit Sharma

Written by: Arnab Chakraborty