TensorFlow Lite Tutorial Part 3: Speech Recognition on Raspberry Pi

553

2020-03-02 | By ShawnHymel

In the previous tutorial, we trained a convolutional neural network (CNN) using TensorFlow and Keras to respond to the spoken word “stop.” We saved that model into a file that we will read and convert to a TensorFlow Lite model file in this tutorial.

The TensorFlow Lite model file differs from a regular TensorFlow model file in that the weights and operations are stored as a FlatBuffer in the TensorFlow Lite file. A FlatBuffer is a special type of storage container that allows large amounts of data to be read in chunks from flash storage. This allows processors to stream the data without needing to load it all into memory first. Additionally, TensorFlow Lite model files are optimized for storage, which means they are perfect for use in embedded systems, like single board computers and microcontrollers.

In the rest of this tutorial, we will develop a Python program for a Raspberry Pi that reads the TensorFlow Lite model file and uses it to perform wake word recognition in real time. We can do this to develop our own voice assistant hardware, like the Amazon Echo, or create a new type of hardware interface.

If you prefer video, this tutorial can be viewed on YouTube here:

All code for this tutorial (and the previous tutorials in this series) can be found in this GitHub repository.

Creating a TensorFlow Lite Model File

The first step is to create a TensorFlow Lite model file. TensorFlow has a built-in command that we can call from within Python to handle the conversion for us. We just need to write a quick script.

TensorFlow Lite conversion

From there, we can copy the TensorFlow Lite model file (.tflite) to our Raspberry Pi. That will allow us to read it as a regular file in our real-time inference program.

In a new Python file or Jupyter Notebook, enter the following code. Make sure that the keras_model_filename points to the location of the .h5 file we created in the previous tutorial.

Copy Code

from tensorflow import lite
from tensorflow.keras import models

# Parameters
keras_model_filename = 'wake_word_stop_model.h5'
tflite_filename = 'wake_word_stop_lite.tflite'

# Convert model to TF Lite model
model = models.load_model(keras_model_filename)
converter = lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
open(tflite_filename, 'wb').write(tflite_model)

When you run this code, it should convert the Keras model into a TensorFlow Lite model file.

Real-Time Inference Overview

“Inference” (in the machine learning vocabulary) is the process of inferring meaning from a new set of unseen data. We want to perform real-time inference on the Raspberry Pi so that it will respond to spoken words as they occur. To do that, we need to copy the tflite model file to the Raspberry Pi. Then, we’ll have a microphone always listening, which converts every second of audio to the mel frequency cepstral coefficients (MFCCs). These MFCCs are our features which will be sent to an inference engine running our tflite model file.

Real-time inference

The output of the model gives us a score (essentially, the probability) of how close the model thinks the features match the wake word, “stop.” If that score is over some threshold (we’ll set it at 50%), we can have the Raspberry Pi perform some action. In this case, we’ll just flash an LED.

One concern is that the wake word might be split between 2 separate 1-second windows. To prevent this from happening, we can do a “sliding window” in code. We’ll take a 1-second slice to use as raw data for our inference engine, move the window up 0.5 seconds and perform inference again. This means we have to do twice as much computation with some overlap in the data, but it helps prevent words getting lost between windows.

For example, we might take the first 1 second of this clip, compute the MFCCs, and run it through our tflite model. That would give us some low output score, as the spoken word isn’t present.

Sliding window

We would then move the window over by 0.5 seconds and repeat the process. The word still isn’t present, so we would get another low score. Notice that “stop” is in the last part of the window, but it’s not the full word.

Sliding window

We move the window over again by 0.5 seconds and repeat. This time, the wake word is present in the audio, so we get a much higher score from the inference section. At this point, we would trigger our action of flashing an LED (or anything else you wanted).

Sliding window

Running Inference

First, copy the .tflite file over to your Raspberry Pi. You’ll want the file to be in the same directory as your code (or you’ll want to update the path in the code to point to the file).

You will want to plug in a USB microphone into your Raspberry Pi and install any necessary drivers. I recommend this USB microphone and to follow these instructions from Adafruit.

On the Raspberry Pi, make sure you are running Python 3 and Pip 3. Any code I show with python or pip, assume it is version 3. Install the following packages

Python -m pip install sounddevice numpy scipy timeit python_speech_features

To install the TensorFlow Lite interpreter, you will need to point pip to the appropriate wheel file. Go to the TensorFlow Lite quickstart guide and find the table showing the available wheel files. Copy the URL for the TensorFlow Lite package for your processor. For a Raspberry Pi running Raspbian Buster, this will likely be the ARM 32 package for Python 3.7. Install TensorFlow Lite with the following:

Python -m pip install <URL to TensorFlow Lite package>

You’ll want to connect an LED and limiting resistor (100 - 1k Ω) between board pin 8 (GPIO14) and a Ground pin. See here for the Raspberry Pi pinout guide.

Add the following code to a new Python file located in the same directory as your .tflite file:

Copy Code

"""
Connect a resistor and LED to board pin 8 and run this script.
Whenever you say "stop", the LED should flash briefly
"""

import sounddevice as sd
import numpy as np
import scipy.signal
import timeit
import python_speech_features
import RPi.GPIO as GPIO

from tflite_runtime.interpreter import Interpreter

# Parameters
debug_time = 1
debug_acc = 0
led_pin = 8
word_threshold = 0.5
rec_duration = 0.5
window_stride = 0.5
sample_rate = 48000
resample_rate = 8000
num_channels = 1
num_mfcc = 16
model_path = 'wake_word_stop_lite.tflite'

# Sliding window
window = np.zeros(int(rec_duration * resample_rate) * 2)

# GPIO 
GPIO.setwarnings(False)
GPIO.setmode(GPIO.BOARD)
GPIO.setup(8, GPIO.OUT, initial=GPIO.LOW)

# Load model (interpreter)
interpreter = Interpreter(model_path)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
print(input_details)

# Decimate (filter and downsample)
def decimate(signal, old_fs, new_fs):
    
    # Check to make sure we're downsampling
    if new_fs > old_fs:
        print("Error: target sample rate higher than original")
        return signal, old_fs
    
    # We can only downsample by an integer factor
    dec_factor = old_fs / new_fs
    if not dec_factor.is_integer():
        print("Error: can only decimate by integer factor")
        return signal, old_fs

    # Do decimation
    resampled_signal = scipy.signal.decimate(signal, int(dec_factor))

    return resampled_signal, new_fs

# This gets called every 0.5 seconds
def sd_callback(rec, frames, time, status):

    GPIO.output(led_pin, GPIO.LOW)

    # Start timing for testing
    start = timeit.default_timer()
    
    # Notify if errors
    if status:
        print('Error:', status)
    
    # Remove 2nd dimension from recording sample
    rec = np.squeeze(rec)
    
    # Resample
    rec, new_fs = decimate(rec, sample_rate, resample_rate)
    
    # Save recording onto sliding window
    window[:len(window)//2] = window[len(window)//2:]
    window[len(window)//2:] = rec

    # Compute features
    mfccs = python_speech_features.base.mfcc(window, 
                                        samplerate=new_fs,
                                        winlen=0.256,
                                        winstep=0.050,
                                        numcep=num_mfcc,
                                        nfilt=26,
                                        nfft=2048,
                                        preemph=0.0,
                                        ceplifter=0,
                                        appendEnergy=False,
                                        winfunc=np.hanning)
    mfccs = mfccs.transpose()

    # Make prediction from model
    in_tensor = np.float32(mfccs.reshape(1, mfccs.shape[0], mfccs.shape[1], 1))
    interpreter.set_tensor(input_details[0]['index'], in_tensor)
    interpreter.invoke()
    output_data = interpreter.get_tensor(output_details[0]['index'])
    val = output_data[0][0]
    if val > word_threshold:
        print('stop')
        GPIO.output(led_pin, GPIO.HIGH)

    if debug_acc:
        print(val)
    
    if debug_time:
        print(timeit.default_timer() - start)

# Start streaming from microphone
with sd.InputStream(channels=num_channels,
                    samplerate=sample_rate,
                    blocksize=int(sample_rate * rec_duration),
                    callback=sd_callback):
    while True:
        pass

When you run the code, you should see the output of the inference engine. It should be below 0.5 during times of silence and when you say any word but “stop.” If you say “stop,” it should show a score of above 0.5 and let you know it heard “stop.” The LED should also flash.

Note that this is far from a perfect model. Words close to “stop,” like “stuff,” can trigger the wake word action. Additionally, because the model was not trained on silence or ambient noise, the lack of spoken words can make the output probability higher than necessary. Finally, our sliding window system is quite crude: it requires re-computing MFCCs from the same 0.5-second chunks of audio, and it might trigger on variations of the wake word, like “stopped” or “stopping.”

However, this proves to be a good starting point to experiment with Edge AI systems that run on embedded Linux! Feel free to fix some of the above issues, train your own wake word, or try training with different features.

Going Further

Hopefully, these tutorials can act as a starting point for your Edge AI projects! As an example, you might create a secondary emergency stop circuit for your milling machine that responds to someone shouting “stop!” by turning off the machine.

Milling machine AI emergency stop