On-device Speech-to-Text (STT)

· 4 min read
Aditya Sahu
AI @ Synaptics

In real-time speech recognition, OpenAI's Whisper models are the state of the art, offering developers the ability to transcribe audio efficiently. Running whisper.cpp locally on embedded devices opens up exciting possibilities, especially for applications that require low latency, high reliability, and operation independent of cloud infrastructure. In this blog, we walk through the high-level approach to bringing whisper.cpp to life on the SL1680 processor using the Astra Machina development kit.

Cross-Compiling Whisper Binary for Machina

Deploying whisper.cpp on an embedded device necessitates cross-compilation, a process where code is compiled on one platform (the host) to run on a different platform (the target). Here's how you can approach cross-compiling whisper.cpp for the SL1680 processor on Astra Machina, which runs Yocto Linux:

The first step is to establish a cross-compilation environment on your host machine (for example, Ubuntu). This involves selecting the appropriate cross-toolchain that matches the architecture of Astra Machina. You can get a pre-built toolchain for Astra from here.

With this pre-built toolchain, you can set up a Poky environment on your host machine. Once that's in place, you move on to building the binary, which means compiling the whisper.cpp source code into an executable the SL1680 processor can actually run. That is the essence of cross-compilation: building code on one machine so it runs on another.
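The steps above can be sketched as a short shell session. The SDK install path, environment-setup script name, and binary name are assumptions here (Yocto SDKs conventionally install an `environment-setup-*` script, and recent whisper.cpp releases build `whisper-cli`; older ones build `main`), so adjust them to match the toolchain you downloaded:

```shell
# Sketch only -- paths and script names below are examples, not the exact
# Astra SDK layout. Source the Yocto SDK environment to get the cross
# compilers (this exports CC, CXX, and the target sysroot):
source /opt/poky/environment-setup-aarch64-poky-linux

git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp

# Point CMake at the SDK's toolchain file so the build targets aarch64:
cmake -B build \
  -DCMAKE_TOOLCHAIN_FILE="$OECORE_NATIVE_SYSROOT/usr/share/cmake/OEToolchainConfig.cmake"
cmake --build build -j"$(nproc)"

# Sanity check: the binary should report "ARM aarch64", not your host arch.
file build/bin/whisper-cli
```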

Deploying whisper.cpp on the Machina Development Kit

With the binary ready, the next phase involves deploying it onto the Machina board. The SL1680's CPU runs the binary directly; to manage workloads more efficiently on the GPU, you may need additional optimizations.

You also need to download a Whisper model (in ggml format) to the Machina board from Hugging Face.
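Deployment and a first test run might look like the sketch below. The board address and directories are placeholders, and the command-line flags (`-m` for the model, `-f` for the audio file, `-t` for threads) follow whisper.cpp's CLI; ggml model files are hosted in the ggerganov/whisper.cpp space on Hugging Face:

```shell
# Sketch only -- <board-ip> and target paths are placeholders.
# Fetch a ggml model from Hugging Face on the host:
curl -L -o ggml-tiny.en.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-tiny.en.bin

# Copy the cross-compiled binary and the model over to the board:
scp build/bin/whisper-cli ggml-tiny.en.bin root@<board-ip>:/home/root/

# Then, on the board, transcribe a sample using 4 threads
# (the thread count used in the benchmarks below):
./whisper-cli -m ggml-tiny.en.bin -f jfk.wav -t 4
```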

We evaluated the performance of various models in the whisper.cpp family when running the jfk.wav file containing the iconic speech by President John F. Kennedy: "And so, my fellow Americans: ask not what your country can do for you; ask what you can do for your country."

The metrics are summarized in the table below:

| Model Name         | Memory Used | Load Time [ms] | Encode Time [ms] | Decode Time [ms] | Total Time [ms] |
| ------------------ | ----------- | -------------- | ---------------- | ---------------- | --------------- |
| ggml-tiny.en       | 750 MB      | 129.50         | 4463.73          | 19.79            | 5851.29         |
| ggml-tiny.en-q8_0  | 706 MB      | 128.19         | 3103.68          | 6.52             | 4088.51         |
| ggml-base.en       | 892 MB      | 170.09         | 10964.45         | 104.66           | 13229.58        |
| ggml-base.en-q5_1  | 808 MB      | 125.37         | 8299.78          | 26.60            | 10015.75        |
info

  • Length of the input WAV file: 11 seconds.

  • Word Error Rate (WER) for all models: 0%.

  • Number of threads for all experiments: 4.

  • All models used in this experiment are English-only (.en) variants.
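One convenient way to read the table is to divide each total time by the 11-second clip length to get a real-time factor (RTF), where values below 1.0 mean faster-than-real-time transcription. A quick sketch using the totals above:

```shell
# Real-time factor = total processing time / audio length (11 s = 11000 ms),
# using the "Total Time" column from the benchmark table.
audio_ms=11000
for entry in "ggml-tiny.en:5851.29" "ggml-tiny.en-q8_0:4088.51" \
             "ggml-base.en:13229.58" "ggml-base.en-q5_1:10015.75"; do
  model=${entry%%:*}
  total_ms=${entry##*:}
  awk -v m="$model" -v t="$total_ms" -v a="$audio_ms" \
    'BEGIN { printf "%-18s RTF = %.2f\n", m, t / a }'
done
# -> ggml-tiny.en       RTF = 0.53
# -> ggml-tiny.en-q8_0  RTF = 0.37
# -> ggml-base.en       RTF = 1.20
# -> ggml-base.en-q5_1  RTF = 0.91
```

By this measure, only ggml-base.en falls short of real time on this clip; both tiny variants and the quantized base model keep up with the audio.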

Observations

The choice of model should be guided by the specific application requirements. For real-time, resource-constrained environments, smaller or quantized models strike the best balance between performance and accuracy.

  • The results clearly demonstrate the trade-off between model size and performance. The smaller models (ggml-tiny.en and ggml-tiny.en-q8_0) consume significantly less memory and CPU, resulting in faster load and processing times. These models are ideal for real-time applications on resource-constrained devices like the SL1680.

  • The quantized versions of the models (ggml-tiny.en-q8_0 and ggml-base.en-q5_1) offer a noticeable reduction in memory usage and CPU load, translating into quicker processing times. This makes them particularly suitable for edge AI applications where efficiency is paramount.

  • The larger models (ggml-base.en, and beyond it ggml-small.en, not benchmarked above) may provide better accuracy at the cost of increased resource consumption. These models are more appropriate when the target device has ample resources or when the highest transcription accuracy is critical.

Deploying advanced AI models like whisper.cpp on embedded systems is a testament to the growing potential of edge computing. By harnessing the power of the SL1680 processor on the Astra Machina development kit, you can bring sophisticated speech recognition capabilities to devices that operate independently of cloud infrastructure, opening up new possibilities for applications that require low latency and high reliability.

Further Reading

For those eager to explore further, here are some detailed, step-by-step tutorials to guide you through the process.