DeepSpeech basics

What is DeepSpeech?

From Mozilla's GitHub repo for DeepSpeech:

"DeepSpeech is an open source Speech-To-Text engine, using a model trained by machine learning techniques based on Baidu's Deep Speech research paper. Project DeepSpeech uses Google's TensorFlow to make the implementation easier."

Virtual environment

First, let's create a virtual environment for deepspeech.

Install deepspeech

The only required package is deepspeech, which can be installed with pip.

Download Model

A pre-trained English model is available for download.

Download audio files

You can also download some example audio files.

Run inference

We can now transcribe the audio file.

If you run inference with the same model as me, you should see something like "experience proofsless".

So it's not perfect, but we can try it on our own voice as well.

Record a wav file

For DeepSpeech to run inference correctly, you will need to record your voice with some specific parameters:

  • Sampling rate: 16 kHz
  • Channels: 1 (mono)
  • Bit rate: 256 kb/s (16-bit samples)
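These three values are tied together: for uncompressed PCM audio, the bit rate is the sampling rate times the sample size times the number of channels. Here is a small Python sketch of that arithmetic, assuming 16-bit samples (the sample size that gives 256 kb/s at 16 kHz mono). The function name is just for illustration.

```python
def pcm_bit_rate(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Return the bit rate of uncompressed PCM audio in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


# 16 kHz sampling rate, 16-bit samples (assumed), 1 channel
print(pcm_bit_rate(16_000, 16, 1) // 1000, "kb/s")  # prints: 256 kb/s
```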

We can achieve this using the sox package.

If you're on Ubuntu, install it with apt (sudo apt install sox).

On Arch Linux, install it with pacman (sudo pacman -S sox).


After installing sox you should have access to the rec command, which we will use to record our voice.

To begin recording your voice, run rec with the parameters listed above.
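As a rough sketch, the rec invocation can be assembled in Python from the parameters above. The -r, -c and -b options are sox's options for sampling rate, channel count and bits per sample. The helper name, the 16-bit sample size, and the file name my_voice.wav are my own assumptions for illustration.

```python
def rec_command(out_path, sample_rate_hz=16_000, channels=1, bits=16):
    """Build the argument list for sox's rec with the format described above."""
    return [
        "rec",
        "-r", str(sample_rate_hz),  # sampling rate in Hz
        "-c", str(channels),        # number of channels
        "-b", str(bits),            # bits per sample (assumed 16)
        out_path,                   # output file, format inferred from .wav
    ]


# Show the command line that would be run
print(" ".join(rec_command("my_voice.wav")))
# prints: rec -r 16000 -c 1 -b 16 my_voice.wav
```

You can type the printed command straight into a terminal, or pass the list to subprocess.run. Recording continues until you stop it with Ctrl+C.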

To make sure the audio was recorded in the proper format, we can install another package called mediainfo and run it on the recording.

You should see output reporting a 16 kHz sampling rate, 1 channel and a 256 kb/s bit rate.
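If you would rather not install anything extra, the same check can be sketched with Python's standard wave module, which reads the WAV header directly. This is an alternative to mediainfo rather than what it prints, and the function names are hypothetical.

```python
import wave


def wav_format(path):
    """Read the header of a PCM WAV file and return its format parameters."""
    with wave.open(path, "rb") as wav:
        sample_rate_hz = wav.getframerate()
        channels = wav.getnchannels()
        bits_per_sample = wav.getsampwidth() * 8  # getsampwidth() returns bytes
    return {
        "sample_rate_hz": sample_rate_hz,
        "channels": channels,
        "bits_per_sample": bits_per_sample,
        "bit_rate": sample_rate_hz * channels * bits_per_sample,
    }


def is_deepspeech_ready(path):
    """Return True if the file is 16 kHz, mono, 256 kb/s."""
    fmt = wav_format(path)
    return (fmt["sample_rate_hz"], fmt["channels"], fmt["bit_rate"]) == (16_000, 1, 256_000)
```

Calling is_deepspeech_ready on your recording, for example is_deepspeech_ready("my_voice.wav"), should return True if the recording matches the parameters listed earlier.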

Run inference

Now we can run inference on our own voice data.

Wrapping up

In the next article I'll go over running inference on a GPU.
