Automatic Speech Recognition

Automatic Speech Recognition (ASR) is a technique that uses machine learning to process human speech and convert it into readable text. The field has grown rapidly over the past decade, and ASR systems are now used in popular applications such as Instagram for real-time captions, Spotify for podcast transcriptions, and Zoom for meeting transcriptions.

The product of a speech recognition system is a transcription. A full verbatim transcription captures everything in the audio file, including pauses, filler words, laughter, and noises such as a door slamming or a phone ringing. It is used in situations where accuracy is critical and every small detail is relevant. In contrast, a clean verbatim transcription does not change the text's meaning or paraphrase it, but it removes unnecessary words from the speaker's speech. Filler words, stammering, and non-verbal sounds that do not add value to the content are left out. The goal of this mode of transcription is to balance readability and completeness, and the degree of deletion depends on the purpose of the application.
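The difference between the two modes can be illustrated with a minimal sketch that derives a clean verbatim transcript from a full verbatim one. The filler list, the square-bracket convention for non-speech events, and the function name `clean_verbatim` are assumptions made for this illustration, not part of any particular ASR system.

```python
import re

# Hypothetical filler list and event-tag convention; real systems tune these per application.
FILLERS = {"um", "uh", "er", "hmm"}
EVENT_TAG = re.compile(r"\[[^\]]*\]")  # e.g. [laughter], [door slams]


def clean_verbatim(full_verbatim: str) -> str:
    """Drop bracketed non-speech events, filler words, and stammered repeats."""
    text = EVENT_TAG.sub(" ", full_verbatim)
    kept = []
    for token in text.split():
        bare = token.strip(",.!?").lower()
        if bare in FILLERS:
            continue
        # Skip an immediate repetition of the previous word (simple stammer heuristic).
        if kept and bare == kept[-1].strip(",.!?").lower():
            continue
        kept.append(token)
    return " ".join(kept)


full = "Um, I I think [laughter] we should uh start now. [door slams]"
print(clean_verbatim(full))  # I think we should start now.
```

How aggressive the filtering should be depends on the application, so the filler list and the stammer heuristic are the parts one would expect to adjust.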

There are two main approaches to Automatic Speech Recognition: a traditional hybrid approach and a modern end-to-end deep learning approach.

Hybrid approach

The traditional hybrid approach has been used for the past fifteen years. It is still the approach of choice for many because there is a lot of accumulated knowledge about how to build a robust model. The drawback is that this technology seems to have reached a plateau, and it is very difficult to improve its accuracy further. To make transcription predictions, this approach combines at least three components: a lexicon model, an acoustic model, and a language model.

The lexicon model describes how words are pronounced phonetically. It requires a custom phoneme set for each language, handcrafted by expert phoneticians. The goal of the acoustic model is to predict which sound or phoneme is being spoken in each segment of the original audio. Finally, the language model is a statistical representation of language, i.e. of the sequences of words that are most likely to be spoken in a given language. Its goal is to predict which words will follow the current words, and with what probability.
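To make the role of the language model concrete, here is a minimal sketch of a bigram model that estimates which word follows the current one, and with what probability. The three-sentence corpus is made up for illustration, and the function name `next_word_probs` is hypothetical; production language models are trained on far larger text collections and use smoothing.

```python
from collections import Counter, defaultdict

# Toy corpus (made up for illustration).
corpus = [
    "recognize speech with a model",
    "recognize speech in real time",
    "wreck a nice beach",
]

# Count how often each word follows each preceding word.
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    words = ["<s>"] + sentence.split()
    for prev, nxt in zip(words, words[1:]):
        bigram_counts[prev][nxt] += 1


def next_word_probs(prev: str) -> dict:
    """Maximum-likelihood estimate of P(next | prev) from the toy counts."""
    counts = bigram_counts[prev]
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


print(next_word_probs("recognize"))  # {'speech': 1.0}
print(next_word_probs("speech"))     # {'with': 0.5, 'in': 0.5}
```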

The hybrid approach has several main downsides. Firstly, it requires training three separate models, which is very time and labor intensive. Secondly, the data required to train such models, especially the acoustic model, is not abundantly available. Finally, as pointed out above, the accuracy of the transcriptions generated with this technique has reached a plateau, and these systems are difficult to improve.

End-to-end approach

The end-to-end approach is the latest approach to this task. It follows in the footsteps of many other NLP tasks, such as machine translation. The core of an end-to-end approach is a single model that maps input directly to output: in our case, spoken language input to its corresponding transcription. The association between the two, i.e. how to translate from the spoken words to their transcription, is learned by the algorithm when it is presented with a huge quantity of data pairs. Such pairs normally consist of an audio file of a sentence as input and its correct transcription as output, so the approach relies on a large amount of training data.

It should be pointed out that while an end-to-end system can be built as a single model, i.e. without a lexicon model and a language model, in many cases end-to-end models still use a language model downstream. The reason is simple: because language models are a powerful representation of a given language and are able to predict words in context, they substantially increase the accuracy of an ASR system. From this perspective, the end-to-end approach is also a hybrid model, but a simpler one than the classic version.
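One common way to use a language model downstream is to rescore the candidate transcriptions produced by the end-to-end model. The sketch below assumes the model returns a short list of hypotheses with log-probabilities; all scores, the weight `LM_WEIGHT`, and the function name `rescore` are hypothetical values chosen for illustration.

```python
# Hypothetical n-best list from an end-to-end model: (transcript, log-probability).
n_best = [
    ("wreck a nice beach", -4.1),
    ("recognize speech", -4.3),
]

# Hypothetical language-model log-probabilities for the same transcripts.
lm_log_prob = {
    "wreck a nice beach": -9.0,
    "recognize speech": -3.0,
}

LM_WEIGHT = 0.5  # assumed interpolation weight; tuned on held-out data in practice


def rescore(hypotheses, lm_scores, lm_weight):
    """Return hypotheses sorted by model score plus weighted LM score, best first."""
    scored = [
        (text, model_lp + lm_weight * lm_scores[text])
        for text, model_lp in hypotheses
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


best_text, best_score = rescore(n_best, lm_log_prob, LM_WEIGHT)[0]
print(best_text)  # recognize speech
```

In this toy case the end-to-end model slightly prefers the first hypothesis on its own, but the language model considers it much less likely as a word sequence, so the combined score selects the second.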

The advantages of end-to-end systems are twofold: they are easier to train, and they require less human labor than the traditional approach. They also tend to produce transcriptions that are more accurate than those of traditional models.

End-to-end models are known to be data hungry. State-of-the-art ASR systems are trained on 100,000 hours of raw audio and video data. However, obtaining human transcriptions for this much training data is close to impossible, because manual transcription is slow.
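A back-of-the-envelope calculation shows the scale of the problem. Only the 100,000-hour figure comes from the text above; the transcription ratio and the number of working hours per year are assumptions picked purely for illustration, and the function name is hypothetical.

```python
AUDIO_HOURS = 100_000          # from the text above
WORK_HOURS_PER_AUDIO_HOUR = 5  # assumption: manual transcription takes 5x real time
WORK_HOURS_PER_YEAR = 2_000    # assumption: one full-time transcriber


def person_years(audio_hours, ratio, hours_per_year):
    """Estimate the person-years needed to transcribe the audio by hand."""
    return audio_hours * ratio / hours_per_year


print(person_years(AUDIO_HOURS, WORK_HOURS_PER_AUDIO_HOUR, WORK_HOURS_PER_YEAR))  # 250.0
```

Under these assumed numbers, a single transcriber would need centuries, and even a large team would need years.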

This is where self-supervised deep learning systems can help. Without the need for labeled data (transcriptions of the audio), it is possible to build a so-called foundation model. This model can then be fine-tuned for a specific task (here, transcription) using a much smaller amount of labeled data, which makes model building more accessible. This approach is very new and may open up new prospects, with ASR models becoming more accurate and more affordable.