Language Models#

Modern Natural Language Processing is based on the use of Language Models (LM). In the most general terms, a language model is a statistical tool that analyzes the patterns of human language, typically written language, in order to predict words.

Language models are nothing but a mathematical representation of language. Codifying language in mathematical terms may seem odd at first, but there is nothing magical about it. We know, for example, that the meaning of a word can be derived by analyzing how words are combined in large language corpora. In 1957, John Rupert Firth famously summarized this principle as “you shall know a word by the company it keeps”. And this can be computed mathematically. Since then, a generation of translators and interpreters has grown up knowing how important corpus analysis and Firth’s principle are for understanding and producing language in a professional context [Zanettin, 2002].

Language models are a snapshot of language use. After a general learning phase with large amounts of data (training) and, whenever needed, a task-specific learning phase (fine-tuning), LMs have mathematically encoded a model of a language, which basically means, again, word combinations. When a sentence is sent to an LM, it is processed through some sort of word-combination analysis, not that different from what corpus linguists did at the time of Firth, but at an unprecedented scale and statistical depth. The fact that this is only superficially similar to what humans do is discussed in a paper by Emily Bender and Alexander Koller [Bender and Koller, 2020]. The main difference is simple: an LM’s ability is limited to superficial knowledge of words, not of what those words represent in the real world. Humans, on the contrary, triangulate words with what they mean in the physical world. A big difference. Notwithstanding this astonishing limitation, this mathematical representation of language allows LMs to transcribe spoken words, translate, and even create novel texts (see the famous GPT-3). And all this at an astonishing quality for a tool that has no connection with the real world and works only on the surface of a text!

From an algorithmic point of view, language models can be as simple as unigram or n-gram frequencies or as complex as neural networks. To understand what it means to mathematically encode some information about language, let’s look at the following abstract example.

How can we predict if a sentence makes sense in a simple mathematical way? Let’s have the following two sentences:

  • the cat chased the mouse

  • the shark chased the mouse

Which of the two sounds more natural, i.e. is more probable? Obviously the first one. We know this from our world knowledge. But how can a computer know it? In this case, by simple word distributions. Cats are more common than sharks, and cats and mice often appear together in texts. While the language model is unaware of the physical reality behind the two sentences [Bender and Koller, 2020], a simple calculation of how often cat<>mouse and shark<>mouse co-occur in a large corpus of texts will be enough to answer this question.
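As a minimal sketch of this idea (the toy corpus and the helper function below are invented purely for illustration, and a real system would count over millions of sentences):

```python
from collections import Counter
from itertools import combinations

# Toy corpus, invented for illustration.
corpus = [
    "the cat chased the mouse across the kitchen",
    "the cat watched the mouse near the wall",
    "a shark swam far from the coast",
    "the cat caught a mouse yesterday",
]

# Count how often each unordered pair of words appears in the same sentence.
pair_counts = Counter()
for sentence in corpus:
    words = sorted(set(sentence.split()))
    pair_counts.update(combinations(words, 2))

def cooccurrence(word1, word2):
    """How often word1 and word2 were seen in the same sentence."""
    return pair_counts[tuple(sorted((word1, word2)))]

# "cat" and "mouse" co-occur often, "shark" and "mouse" never, so the model
# prefers "the cat chased the mouse" without any knowledge of the real world.
print(cooccurrence("cat", "mouse"))    # 3
print(cooccurrence("shark", "mouse"))  # 0
```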

Let’s now look at a concrete example of a real-life algorithm that predicts the next word of a text we are typing. This is what we commonly use on our cell phone or in a word processor for auto-completion.

To get word suggestions while we are typing, we can use a simple statistical language model based on n-grams, where an n-gram is a contiguous sequence of n items from a given sample of text. The language model needs to predict the probability of the next word given the text typed so far.

To train an n-gram model, it is sufficient to calculate the frequency of all words following a given n-gram in a large corpus of texts. Let’s say that our model is based on n=3, i.e. on trigrams. To train our model, we identify all 2-word combinations (bigrams) in our corpus and count how many times each word follows those bigrams. We keep a record of this in a simple table, which would look something like this:

| Bigram   | 1st most frequent word | 2nd most frequent | 3rd most frequent | others |
|----------|------------------------|-------------------|-------------------|--------|
| There is | a                      | one               | the               | …      |
| I love   | you                    | my                | the               | …      |
| love my  | family                 | life              | dad               | …      |
When a user types, for example, the words “I love”, a simple lookup in the table containing all the calculated frequencies will reveal that the most frequent word following this bigram is “you”. This word will be suggested to the user and represents a good guess of what the user wanted to type. Of course, there is no certainty that this is exactly what the user had in mind. For this reason, it is common practice to offer alternatives: the words with lower frequencies (“my” being the second most frequent word after “you”). Should the user select “my”, the same lookup approach is repeated for the new bigram, “love my”: probable candidates will be “family”, “life”, etc.
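A minimal sketch of this training-and-lookup procedure might look as follows (the tiny training text and the function name suggest are invented for illustration; a real model would be trained on a far larger corpus):

```python
from collections import Counter, defaultdict

# Toy training corpus, invented for illustration.
text = (
    "there is a house . there is a dog . there is one cat . "
    "i love you . i love you . i love my family . love my life"
)
tokens = text.split()

# Training: for every bigram, count which words follow it in the corpus.
followers = defaultdict(Counter)
for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
    followers[(w1, w2)][w3] += 1

def suggest(w1, w2, k=3):
    """Return up to k candidate next words for the typed bigram (w1, w2)."""
    return [word for word, _ in followers[(w1, w2)].most_common(k)]

print(suggest("i", "love"))    # ['you', 'my']
print(suggest("love", "my"))   # ['family', 'life']
print(suggest("there", "is"))  # ['a', 'one']
```

The frequency table from above is simply the contents of followers: for each stored bigram, the words that were observed after it, ranked by how often they occurred.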

While this approach is very simple to implement, it has significant drawbacks. In real-life texts, a deep context influences the choice of the next word. Simple language models like the one presented above have no contextual knowledge beyond the sheer frequency of the words following the given n-gram. To improve the quality of the prediction, it is possible to increase the size of the n-gram. However, this also means that the probability of finding such a long sequence of words in the text corpus, and therefore the ability to calculate the frequency of the words coming after it, will be very low, making the approach useless (the sparsity problem). To compensate for this, bigger text corpora are needed during training. However, a wall is soon hit and no significant improvements are possible. More complex language models are required.
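To get an intuition for the sparsity problem, assume (as a round number chosen only for illustration) a vocabulary of V = 50,000 distinct word forms. The number of possible n-grams grows exponentially with n:

```latex
V^{n} = 50{,}000^{n}
\quad\Rightarrow\quad
V^{2} = 2.5 \times 10^{9},
\qquad
V^{5} \approx 3.1 \times 10^{23}
```

Even a corpus of billions of words can therefore contain only a vanishing fraction of all possible 5-grams, so the frequencies of most long sequences simply cannot be estimated.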

Nowadays, complex language models are based on neural networks (the Transformer [Vaswani et al., 2017] being one of the most popular architectures at the time of writing). They are again probabilistic representations of language, but this time they are built using large neural networks (deep learning) that take the context of words into account. Language models are trained on a huge quantity of raw documents, either written or spoken texts, and they learn correlations between words autonomously (see What is Machine Learning). This kind of language model can learn many features of the language it has been exposed to. Many pre-trained, general-purpose language models are open source and can be freely used (see spaCy for a popular library that ships with such models). They can perform many tasks out of the box, such as named entity recognition, part-of-speech tagging, dependency parsing, sentence segmentation, text classification, lemmatization, morphological analysis, entity linking and more. Language models can be trained for a single language (English, Farsi, etc.) or in a multilingual manner, such as mBERT.
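As a minimal sketch of what such a pre-trained pipeline can do out of the box (this assumes spaCy and its small English model en_core_web_sm are installed; the example sentence and the chosen model are only illustrative):

```python
import spacy

# Load a small pre-trained English pipeline
# (downloadable with: python -m spacy download en_core_web_sm).
nlp = spacy.load("en_core_web_sm")

doc = nlp("John Rupert Firth taught linguistics in London.")

# Part-of-speech tags and lemmas, provided out of the box.
for token in doc:
    print(token.text, token.pos_, token.lemma_)

# Named entities, also out of the box.
for ent in doc.ents:
    print(ent.text, ent.label_)
```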

Recently, so-called Large Language Models (LLMs) have been introduced. Their sheer size allows LLMs to generalize their knowledge of language even further. Applications like human-sounding text generation become possible. Given the computational costs of training them, these models are hard to replicate without significant capital, and generally only large corporations are able to create them. Examples of LLMs are OpenAI’s GPT-3 and Meta’s OPT [Zhang et al., 2022].

Large Language Models can perform impressive tasks. However, they are also characterized by harmful flaws. This will be discussed in the dedicated section Bias.

Bibliography#

1. Emily M. Bender and Alexander Koller. Climbing towards NLU: On meaning, form, and understanding in the age of data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 5185–5198. Association for Computational Linguistics, July 2020.

2. Federico Zanettin. Corpora in translation practice. Language Resources for Translation Work and Research, pages 10–14, January 2002.

3. Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. ArXiv, 2017.

4. Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. OPT: Open pre-trained transformer language models. ArXiv, 2022.

Further reading#