Decoding Data & AI: How do Large Language Models actually understand our language?
Have you ever asked yourself how these AI systems actually understand our language? Well, here is a quick and easy explanation. Once you understand these basic principles, you will have a much better sense of where the potential of GenAI language models lies, where their limitations still are, and what it would take to make LLMs transform the way you work.
Autocomplete, turbocharged
The first thing to understand is that GenAI systems essentially do the same thing as the good old Google autocomplete that you have known for years. They take some input and try to guess what the next word (or phrase) should be. To generate longer text, they quite simply play this guessing game over and over again, until they guess that the current text is long enough.
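As a mental model, that loop looks something like the sketch below. Everything in it is hypothetical; in particular, the tiny lookup table is just a stand-in for the actual deep learning model:

```python
# Toy stand-in for the model: maps the last word seen to a likely successor.
# A real LLM scores every possible next word instead of using a fixed table.
TOY_MODEL = {
    "the": "cat",
    "cat": "meows",
    "meows": "loudly",
    "loudly": "<stop>",
}

def predict_next_word(text: str) -> str:
    """Guess the next word from the last word of the text (toy version)."""
    last_word = text.split()[-1]
    return TOY_MODEL.get(last_word, "<stop>")

def generate(prompt: str, max_words: int = 20) -> str:
    text = prompt
    for _ in range(max_words):        # play the guessing game over and over
        next_word = predict_next_word(text)
        if next_word == "<stop>":     # the model judges the text long enough
            break
        text += " " + next_word
    return text

print(generate("the"))  # -> "the cat meows loudly"
```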
So why “turbocharged”? The first reason is that instead of a simple list of the most frequent search terms (like the old autocomplete had), these systems have a very complex Deep Learning model behind them. This requires a lot of computing power, and an equally huge amount of input data. Just as an example, GPT-4 is estimated to have been trained on the equivalent of 10,000 computers working at the same time, uninterrupted, for 4-5 months, with the entire openly available internet used as input data. You can see how Sam Altman seeking over a trillion dollars of funding for the next generation of AI is driven by this ravenous appetite for computing power (if you would like to know more about Deep Learning, dive into episode 1 of the Decoding Data & AI series).
The second reason is a key insight about language itself: every word has lots of relationships to other words. For instance, a cat is always a furry animal, often cuddly, but never a dog. That also means that if I want to understand the meaning of a word, I only need to know all these relationships.
Crucially, this gives us a way for computers, which deep down can only deal with the numbers 0 and 1, to understand words too. When ChatGPT reads the word “cat”, it represents it as this web of relationships to other words: a furry and cuddly animal, but not a dog. And in principle, all of these relationships can be expressed as numbers.
And the GenAI leaders have not been shy about creating these numbers: GPT-4, for instance, uses over 12,000 of these relationships to characterize the meaning of each word. Imagine, therefore, a huge table with a row for every word in the English language and 12,000 numbers for each of those words.
And now that LLMs can “understand” the meaning of a word, they can also make sense of how words have similar or opposing meanings. The word “meow”, for example, will have a very similar profile to “cat”: meowing is what furry, cuddly animals do, but not dogs.
In this way, ChatGPT “reads” the words in your prompt, transforms them into a very complex web of word relationships (i.e., “understands” them), and then uses this “understanding” to predict the best next word. When it wants to find out what comes after “cat”, for example, it may see that “meows” is a good fit due to the similar profile.
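To make this concrete, here is a tiny sketch of the idea. The words, the four “relationship” columns, and all the numbers are invented for illustration (real models use thousands of learned dimensions), and comparing profiles with cosine similarity, then simply picking the closest candidate, is a drastic simplification of the real prediction step:

```python
import math

# Hypothetical "meaning profiles": each word gets one number per relationship.
# The four columns (furry?, cuddly?, animal?, dog?) and all values are
# invented; GPT-4 is said to use over 12,000 such columns per word.
EMBEDDINGS = {
    "cat":   [0.9, 0.8, 0.9, -0.9],
    "meows": [0.8, 0.9, 0.7, -0.8],
    "dog":   [0.7, 0.6, 0.9,  0.9],
    "barks": [0.6, 0.5, 0.8,  0.9],
}

def cosine_similarity(a, b):
    """How similar two profiles are: 1.0 means pointing the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# "meows" has a profile much closer to "cat" than "barks" does ...
print(cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["meows"]))  # ~0.99
print(cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["barks"]))  # ~0.34

# ... so among these candidates, the best fit after "cat" is "meows".
candidates = ["meows", "barks", "dog"]
best = max(candidates,
           key=lambda w: cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS[w]))
print(best)  # meows
```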
Now, this is of course a simplification, as all these word values are also modified according to the words that precede the word we are looking at. ChatGPT must, for example, understand that a “toy cat” is actually not an “animal”. These modifications are the secret ingredient of the different AI models, and a main reason for the differences between them.
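Purely to illustrate the idea, here is a toy version of such a modification. The blending rule, the weight, and all the numbers are made up; real models use learned “attention” mechanisms that are far more sophisticated:

```python
# Made-up 4-number profiles: (furry?, cuddly?, is-an-animal?, is-a-dog?)
CAT = [0.9, 0.8, 0.9, -0.9]    # a furry, cuddly animal, not a dog
TOY = [0.1, 0.5, -0.9, 0.0]    # not furry, somewhat cuddly, not an animal

def contextualize(word_vec, context_vec, weight=0.6):
    """Blend a word's profile with the profile of a preceding word."""
    return [(1 - weight) * w + weight * c for w, c in zip(word_vec, context_vec)]

# "cat" on its own vs. "cat" preceded by "toy":
cat_in_context = contextualize(CAT, TOY)
print(cat_in_context)
# The "is-an-animal" value drops from 0.9 to about -0.18: a "toy cat"
# is no longer treated as an animal.
```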
Finally, you may ask: where does ChatGPT learn all these word meanings (i.e., the 12,000 relationships for each word)? The answer is that it reads lots and lots of text (the entire open internet) and works out how often words occur together with each other. This “training” process is what takes so long and requires so much computing power. And how much the models “know” about words as a result is a major cause of the performance differences between them.
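As a toy version of that raw training signal, you could count how often words appear together, as in the sketch below. The mini-corpus is invented, and real training (a neural network adjusting billions of parameters) is vastly more sophisticated than simple counting:

```python
from collections import Counter
from itertools import combinations

# Invented mini-corpus; the real thing is the entire open internet.
corpus = [
    "the cat meows at the dog",
    "the furry cat is cuddly",
    "the dog barks at the cat",
]

# Count how often each pair of words occurs in the same sentence.
co_occurrence = Counter()
for sentence in corpus:
    words = sentence.split()
    for a, b in combinations(words, 2):
        co_occurrence[tuple(sorted((a, b)))] += 1

# Words that keep turning up together end up strongly related.
print(co_occurrence[("cat", "the")])    # 5 -> very common pairing
print(co_occurrence[("cat", "meows")])  # 1
```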
What does this mean for you?
- LLMs are so big and so expensive because they must handle all possible inputs, and hence understand all possible meanings of all words. For task-specific language generation (chatbots, coding assistants), much smaller models can work just as well.
- LLMs do not “know” facts as such: they only really “know” the relationships between words, and these are learned from how those words are used in publicly available text. The quality of the training data is therefore crucial for the quality of an LLM's responses.
Now that you have a basic understanding of what LLMs like ChatGPT actually do when they read your prompt, we hope you will be able to use them more productively, with their inherent limitations in mind.
Want to learn more about OMMAX's expertise in data & AI? Get in touch with our experts through the form below and sign up for our Decoding Data & AI series!