Speech-To-Text

Speech-to-text models convert audio into text, powering features like AI assistants, dictation, voice commands, and automatic note-taking. Here we'll look at running speech-to-text on edge devices.

OpenAI's Whisper

Whisper is a speech recognition model released by OpenAI in 2022. Because it was trained on a wider variety of data than previous approaches, it recognizes accents, background noise, and jargon more reliably.

We have a step-by-step tutorial for running Whisper using Whisper.cpp: follow the Whisper tutorial. If you want to run streaming Whisper, follow the Whisper Streaming tutorial.

Useful Sensors' Moonshine

We use Moonshine, a speech-to-text model built for devices with limited resources. Moonshine delivers fast, accurate results, achieving word error rates (WER) similar to OpenAI's Whisper while using about 5x less compute.

Moonshine scales its compute needs with the length of the audio, so shorter segments run much faster. For example, it processes 10-second clips about five times faster than Whisper while keeping similar or lower WER.
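
WER is the word-level edit distance (substitutions, insertions, and deletions) between a model's transcript and a reference transcript, divided by the number of reference words. A minimal sketch of the computation (not part of the Moonshine package):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words (1-row variant).
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cur = min(
                d[j] + 1,                            # deletion
                d[j - 1] + 1,                        # insertion
                prev + (ref[i - 1] != hyp[j - 1]),   # substitution / match
            )
            prev, d[j] = d[j], cur
    return d[len(hyp)] / len(ref)
```

A perfect transcript scores 0.0; one wrong word out of four scores 0.25.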

Moonshine comes in two sizes:

  • Tiny: ~190 MB
  • Base: ~400 MB

You can also download quantized base Moonshine models from here: Moonshine HuggingFace

Transcribe an Audio File

note

This assumes you are familiar with setting up your Astra board. If not, please refer to the setup tutorial.

info

This quick guide is compatible with all SL16xx boards. While inference speed may vary, the steps remain the same across all Astra SL-Series processors.

Once the prerequisites are done, run the following to transcribe the jfk.wav file on your board:

python3 -m speech_to_text.moonshine 'samples/jfk.wav'

The output will be a transcription of the speech in the audio:

And so my fellow Americans ask not what your country can do for you ask what you can do for your country

Live Captions

Use a USB microphone (e.g. a webcam's or headset's built-in mic) to run live captions:

python3 -m speech_to_text.pipeline

This command starts real-time transcription of the incoming audio. Press CTRL + C to quit.

Pipeline

The SpeechToTextPipeline example can be used as a simple input stream for any application that takes human speech as input; the handler is called for every detected utterance:

from speech_to_text.pipeline import SpeechToTextPipeline

def handle_results(text, inference_time):
    # Called for every detected utterance; inference_time is in seconds.
    if text:
        print(f"{text} {inference_time*1000:.0f}")

pipe = SpeechToTextPipeline(
    model="base",
    handler=handle_results,
    echo=False
)

pipe.run()
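
A handler can also accumulate utterances into a running transcript while tracking per-utterance latency. The collector below is plain Python and only assumes the `(text, inference_time)` handler signature shown above; the class name is illustrative, not part of the package:

```python
class TranscriptCollector:
    """Accumulates recognized utterances and records inference latency."""

    def __init__(self):
        self.utterances = []
        self.latencies_ms = []

    def __call__(self, text, inference_time):
        # Invoked once per detected utterance; skip empty results.
        if text:
            self.utterances.append(text)
            self.latencies_ms.append(inference_time * 1000)
            print(f"{text} {inference_time*1000:.0f}")

    @property
    def transcript(self):
        # Full transcript so far, utterances joined in arrival order.
        return " ".join(self.utterances)
```

An instance can then be passed as the pipeline's handler, e.g. `handler=TranscriptCollector()`, and queried for `.transcript` after the run.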