
Speech Recognition on STM32 using Machine Learning

Composite image of a microphone icon, a waveform and an STM32 development board

As part of a university course, I worked on porting a machine learning algorithm to an STM32 Cortex-M4 microcontroller for simple speech recognition: it detects spoken keywords such as “yes”, “no”, “left” and “right”. For this, I used the TensorFlow Lite framework, which makes it easy to design and train a model on a PC using TensorFlow and then transfer it to the TFLite runtime running on the microcontroller.

The neural network operates on spectrograms, so the microphone recordings need to be preprocessed. For this, I convert the waveforms using the short-time Fourier transform (STFT). It works by sliding a window over the input signal and applying an FFT to each windowed segment; each FFT contributes one column to the spectrogram. The window is then shifted to the right by half the window size and the process is repeated until the spectrogram is complete.

Explanation of STFT
Short-time Fourier transform to create a spectrogram.
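
For a sense of scale, with illustrative parameters (the actual window size and sampling rate may differ): a 1 s recording sampled at 16 kHz with a 256-sample window has a hop of 128 samples, giving (16000 − 256) / 128 + 1 = 124 spectrogram columns, each holding the magnitudes of one 256-point FFT.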

To implement this on the microcontroller, I used the CMSIS-DSP library for the FFT and the Hanning window function. I compared the resulting spectrograms with the ones calculated on the computer (and used to train the model) to make sure they were correct. As the neural network operates on 8-bit integers, the spectrograms also need to be quantized; a sketch of this step follows the figure below.

Spectrograms
Spectrograms calculated (from left to right): using TensorFlow, using NumPy (float, uint8), and on the microcontroller (float, uint8).
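
For illustration, below is a minimal sketch of how one spectrogram column could be computed and quantized with CMSIS-DSP. The FFT length, the use of arm_hanning_f32 (which requires a CMSIS-DSP version recent enough to ship the window functions), and the quantization parameters are assumptions; the scale and zero point would come from the model's input tensor.

    #include "arm_math.h"

    #define FFT_LEN 256u  /* assumed window/FFT size */

    static arm_rfft_fast_instance_f32 fft;
    static float32_t window[FFT_LEN];    /* precomputed Hanning window */
    static float32_t frame[FFT_LEN];     /* windowed input frame */
    static float32_t spectrum[FFT_LEN];  /* packed complex FFT output */

    void stft_init(void)
    {
        arm_rfft_fast_init_f32(&fft, FFT_LEN);
        arm_hanning_f32(window, FFT_LEN);  /* window tables need a recent CMSIS-DSP */
    }

    /* Compute one spectrogram column from FFT_LEN input samples and
     * quantize it; scale/zero_point are taken from the model's input tensor. */
    void stft_column(const float32_t *samples, uint8_t *column,
                     float32_t scale, int32_t zero_point)
    {
        float32_t mag[FFT_LEN / 2];

        arm_mult_f32(samples, window, frame, FFT_LEN);  /* apply window */
        arm_rfft_fast_f32(&fft, frame, spectrum, 0);    /* forward real FFT */
        /* Note: spectrum[0] and spectrum[1] pack the DC and Nyquist terms. */
        arm_cmplx_mag_f32(spectrum, mag, FFT_LEN / 2);  /* bin magnitudes */

        for (uint32_t i = 0; i < FFT_LEN / 2; i++) {
            int32_t q = (int32_t)(mag[i] / scale) + zero_point;
            column[i] = (uint8_t)__USAT(q, 8);          /* clamp to 0..255 */
        }
    }

Between calls, the input pointer advances by FFT_LEN / 2 samples, which reproduces the 50% window overlap described above.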

Once the preprocessing works on the microcontroller, the TensorFlow model can be deployed to it. For this, the model is quantized and converted into a “.tflite” model, which in turn can be converted into a C array and compiled into the firmware. Once the TensorFlow Lite runtime compiles and runs on the target platform, it is straightforward to use; it was surprisingly easy to get the model running on the STM32.
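
As a rough sketch of the on-target inference, the snippet below uses the TensorFlow Lite for Microcontrollers C++ API, assuming the C array was generated from the .tflite file (e.g. with xxd -i). The array name, arena size, operator list and output shape are placeholders that depend on the actual model, and constructor details vary slightly between TFLite Micro versions.

    #include <cstring>

    #include "tensorflow/lite/micro/micro_interpreter.h"
    #include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
    #include "tensorflow/lite/schema/schema_generated.h"

    extern const unsigned char g_model_data[];  /* C array from the .tflite file */

    constexpr size_t kArenaSize = 40 * 1024;    /* model-dependent (assumed) */
    static uint8_t tensor_arena[kArenaSize];

    /* Run one inference on a quantized spectrogram and return the index of
     * the best-scoring keyword class. */
    int run_inference(const uint8_t *spectrogram, size_t len)
    {
        const tflite::Model *model = tflite::GetModel(g_model_data);

        /* Register only the operators the model actually uses (assumed set). */
        static tflite::MicroMutableOpResolver<4> resolver;
        resolver.AddConv2D();
        resolver.AddFullyConnected();
        resolver.AddReshape();
        resolver.AddSoftmax();

        /* Setup kept inline for brevity; in firmware this would run once at init. */
        static tflite::MicroInterpreter interpreter(model, resolver,
                                                    tensor_arena, kArenaSize);
        interpreter.AllocateTensors();

        /* Copy the quantized spectrogram into the input tensor. */
        TfLiteTensor *input = interpreter.input(0);
        std::memcpy(input->data.uint8, spectrogram, len);

        interpreter.Invoke();

        /* Pick the class with the highest score, assuming a [1, classes]
         * output tensor. */
        TfLiteTensor *output = interpreter.output(0);
        int best = 0;
        for (int i = 1; i < output->dims->data[1]; i++) {
            if (output->data.uint8[i] > output->data.uint8[best])
                best = i;
        }
        return best;
    }

The MicroMutableOpResolver keeps the binary small by linking in only the kernels the model actually needs, which matters on a microcontroller.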

In the final implementation, the model achieves an overall accuracy of about 80%. What really helped with performance was using the optimized TFLite kernels based on CMSIS-NN, which exploit the SIMD instructions of the Cortex-M4. This brings the inference time down to about 180 ms.
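
(For reference: in the upstream TFLite Micro Makefile build, these optimized kernels are selected with OPTIMIZED_KERNEL_DIR=cmsis_nn; a CMake-based setup would instead compile the cmsis_nn kernel variants in place of the reference ones.)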

Confusion matrix of the model performance
Confusion matrix showing the accuracy of the final implementation.

I used Git submodules to separate all the dependencies from the application code, allowing them to be updated easily. There is also no dependency on the STM32CubeIDE; instead, I use CMake as the build system. Because of this, it was easy to set up a CI pipeline to make sure that building and model training are repeatable.
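
In practice, a fresh checkout and build boil down to a git clone --recursive (which also fetches the submodules), followed by cmake -B build (typically with CMAKE_TOOLCHAIN_FILE pointing at an arm-none-eabi cross-toolchain file) and cmake --build build; the CI pipeline can run the same commands.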