# Machine Interpreting

Machine interpreting, also called automatic speech translation, speech-to-speech translation or spoken language translation, is an emerging area of artificial intelligence that aims at building machines able to translate speech from one language to another in real time, for immediate consumption. This can be done consecutively (sentence by sentence or paragraph by paragraph) or simultaneously (continuous translation without any interruption of the speaker). Machine interpreting shares many similarities with offline spoken language translation, which is used, for example, to translate pre-recorded videos.

Generally speaking, machine interpreting aims at reducing language barriers by means of spoken language translation performed without human intervention. Machine interpreting can be used in informal settings, for example to book a table at a restaurant or to follow the live stream of a YouTuber, or in formal ones, for example to translate a lecture or a political speech in real time.

Machine interpreting has seen exceptional performance improvements over the past few years. The discipline started from the rather artificial problem of translating oral utterances recorded under controlled conditions, with restricted vocabularies, strong domain limitations and a constrained speaking style. Nowadays, however, research as well as experimental and commercial applications are moving to the more ambitious task of translating real-life spoken language, without any particular constraints. The advances made in this domain can mostly be attributed to the use of modern machine learning algorithms, especially neural networks, and to the availability of large amounts of language data (at least for some languages).

There are at least two approaches to machine interpreting: the cascading and the end-to-end approach.

## Cascading approach

In the cascading approach, the process of oral translation is broken down into well-defined sub-processes that can be modeled in computer programs. The simplest approach uses three separate components: automatic speech recognition (ASR) transcribes the spoken words into written text, machine translation (MT) translates the written text from one language to another, and text-to-speech synthesis (TTS) produces a spoken version of the translation. These components are applied one after the other in a process called cascade translation, where the output of one component becomes the input of the next. This approach is used, for example, in the Google Translate app. Dividing the task into such a cascade of systems has some obvious advantages: it builds on available technologies, it is transparent as far as the sequence of tasks is concerned and, last but not least, it is extensible: new components can be added to the basic sequence introduced above. For example, a sentence segmenter can be added to divide the unfolding speech into smaller parts (sentences) so that a long speech can be translated continuously in real time.
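A minimal sketch of this composition follows. The component types are placeholders (any concrete ASR, MT or TTS backend could be plugged in); only the composition pattern itself is the point, and a sentence segmenter could be slotted in before the MT stage in the same way:

```python
from typing import Callable

# Placeholder component signatures; the names are illustrative.
ASR = Callable[[bytes], str]   # source audio -> source-language text
MT  = Callable[[str], str]     # source text  -> target-language text
TTS = Callable[[str], bytes]   # target text  -> target-language audio

def cascade(asr: ASR, mt: MT, tts: TTS) -> Callable[[bytes], bytes]:
    """Compose the three stages: each stage's output feeds the next."""
    def interpret(source_audio: bytes) -> bytes:
        transcript  = asr(source_audio)   # 1. speech recognition
        translation = mt(transcript)      # 2. text-to-text translation
        return tts(translation)           # 3. speech synthesis
    return interpret
```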

However, cascading systems suffer from several shortcomings. Errors propagate along the pipeline: a noisy transcription produced by the ASR (for example because of typical ASR errors such as the wrong disambiguation of homophones, or because of speaker performance errors, i.e. disfluencies) impairs the subsequent MT step. Moreover, the large translation models used are commonly trained on written data and consequently do not cope with phenomena peculiar to spoken language, such as hesitations or the lack of reliable punctuation in ASR output, which in turn causes problems for the MT.
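Error propagation can be illustrated with a deliberately toy example (hypothetical components, not a real system): a single homophone confusion made by the ASR survives unchanged into the final translation, because the MT stage has no access to the original audio.

```python
def toy_asr(audio: str) -> str:
    # Pretend ASR: the "audio" is represented by its true words, and the
    # recognizer systematically confuses the homophones "two" and "too".
    return audio.replace("two", "too")

def toy_mt(text: str) -> str:
    # Word-by-word English->German lookup; unknown words pass through.
    lexicon = {"i": "ich", "see": "sehe", "two": "zwei",
               "too": "auch", "cats": "Katzen"}
    return " ".join(lexicon.get(w, w) for w in text.lower().split())

spoken = "I see two cats"
print(toy_mt(spoken))            # ich sehe zwei Katzen  (correct)
print(toy_mt(toy_asr(spoken)))   # ich sehe auch Katzen  (ASR error propagated)
```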

## End-to-end approach

The end-to-end approach applies machine learning techniques similar to those used for MT or ASR directly to bilingual speech data, i.e. to original speech paired with its translated speech, in order to create models that directly translate speech in one language into speech in another, without relying on an intermediate text representation, i.e. a transcription. Since such systems do not divide the task into separate steps, they may offer a few advantages over the cascading approach described above, including lower latency, no compounding of errors between recognition and translation, better handling of proper names, and so forth. Since this technique is still in its infancy, however, output quality still lags behind that of conventional cascade systems.
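The following is a deliberately minimal sketch of the idea, assuming PyTorch; real end-to-end systems add attention, discrete speech units and a neural vocoder, all omitted here for brevity. The model maps source-language spectrogram frames directly to target-language spectrogram frames, with no text anywhere in the loop:

```python
import torch
import torch.nn as nn

class DirectS2ST(nn.Module):
    def __init__(self, n_mels: int = 80, d_model: int = 256):
        super().__init__()
        # The encoder reads source-language spectrogram frames ...
        self.encoder = nn.GRU(n_mels, d_model, batch_first=True)
        # ... and the decoder emits target-language spectrogram frames.
        self.decoder = nn.GRU(n_mels, d_model, batch_first=True)
        self.out = nn.Linear(d_model, n_mels)

    def forward(self, src_mels, tgt_mels):
        # src_mels, tgt_mels: (batch, time, n_mels)
        _, state = self.encoder(src_mels)           # summarize source speech
        hidden, _ = self.decoder(tgt_mels, state)   # teacher-forced decoding
        return self.out(hidden)                     # predicted target frames

model = DirectS2ST()
src = torch.randn(2, 400, 80)   # a toy batch of source speech (80 mel bins)
tgt = torch.randn(2, 350, 80)   # the paired target speech
pred = model(src, tgt[:, :-1])  # teacher forcing: inputs shifted by one frame
loss = nn.functional.mse_loss(pred, tgt[:, 1:])
```

Training is driven entirely by paired source/target audio: the loss compares predicted and reference target frames, so no transcription supervision is required.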

Between these two poles, hybrid systems can be created by merging some of the subcomponents of a cascading system in an end-to-end fashion. A promising approach, for example, combines speech-to-text translation (a single model that receives speech as input and directly produces a written translation of the original) with a speech synthesis module.
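Schematically (again with placeholder callables rather than a real API), such a hybrid reduces the pipeline to two stages, and no source-language transcript is ever produced:

```python
from typing import Callable

def hybrid(speech_translator: Callable[[bytes], str],
           tts: Callable[[str], bytes]) -> Callable[[bytes], bytes]:
    """speech_translator: source audio -> target text (a single model);
    tts: target text -> target audio."""
    def interpret(source_audio: bytes) -> bytes:
        target_text = speech_translator(source_audio)  # direct speech-to-text
        return tts(target_text)                        # translation, then TTS
    return interpret
```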

## Challenges and the future

Despite the quality improvements reported recently, machine interpreting still has to cope with many challenges of real-life multilingual communication in order to reach a “good enough” quality that makes its use suitable for certain contexts. Among other limitations, machine interpreting still operates exclusively at a purely linguistic level: it processes only the “surface” of communication, i.e. the language itself. It is not able to make sense of typical aspects of verbal communication such as inference from the situational context, the interpretation of prosodic features, the correction of imperfect speech, or pragmatics, to name just a few. While progress can be seen in many of these subareas, it must be acknowledged that real-life communication is a very complex phenomenon.

While the required level of understanding is still out of reach for AI, there is still potential to improve interpreting systems using state-of-the-art NLP. They can take advantage of improvements in the precision of the three basic components (ASR, MT, TTS), for example by means of domain adaptation; of computational models trained on real-life spoken communication; and of new processing layers such as speaker diarization (determining who is speaking when in the audio stream), emotion recognition from linguistic and paralinguistic properties of speech, and better source speech segmentation, to name just a few.
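One way to picture these additional layers is as annotations attached to each utterance before translation, which the downstream stages can then condition on. This is a hypothetical sketch; the names and the keyword parameters of `tts` are illustrative, not a real library API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Utterance:
    audio: bytes
    text: str = ""
    speaker: Optional[str] = None   # filled in by speaker diarization
    emotion: Optional[str] = None   # filled in by emotion recognition

def enriched_pipeline(utt: Utterance, asr, diarizer, emotion_model, mt, tts):
    utt.text    = asr(utt.audio)
    utt.speaker = diarizer(utt.audio)       # who is speaking, and when
    utt.emotion = emotion_model(utt.audio)  # paralinguistic cues
    # MT and TTS can condition on the annotations, e.g. choosing a
    # distinct synthetic voice per speaker or preserving emphasis.
    return tts(mt(utt.text), voice=utt.speaker, style=utt.emotion)
```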

Solutions that may lead to a true understanding of language, situation, communication goals, etc., and therefore to a better translation of spoken language, are still a long way off. However, even without significant new advances in natural language understanding (NLU), the exploitation of existing and emerging technologies is expected to bring machine interpreting into real-life scenarios. Depending on the context and on users’ expectations, such systems may prove good enough for some forms of multilingual communication. A mixed scenario in which both machines and humans deliver interpretation services may become reality.
