8 Minutes Read By Felix Gerlsbeck

Decoding Data & AI: How do Large Language Models actually understand our language?


In our Decoding Data & AI series, we provide you with key insights for successful data & AI projects in a clear and easily understandable format, so that you can integrate them into your corporate strategy. In part two of the series, we take a look at Large Language Models (LLMs), the technology underpinning the current advances in AI text generation, be it OpenAI’s ChatGPT, Google’s Gemini, or Elon Musk’s Grok. As you probably know, these programs take in some text from the user (a “prompt”) and then deliver a response that answers your question, generates the summary you asked for, or simply has a nice little chat with you.

Have you ever asked yourself how these AI systems actually do this? Well, here is a fast and easy explanation for you. Once you understand these basic principles, you will have a much better sense of where the potential of GenAI language models lies, where their limitations still are, and what you would have to do to make LLMs transform the way you work.

Autocomplete, turbocharged

The first thing to understand is that GenAI systems do essentially the same thing as the good old Google autocomplete that you have known for years. They take some input and try to guess what the next word (or phrase) should be. To generate longer text, they quite simply play this guessing game over and over again, until their guess is that the text should end here.
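To make the guessing game concrete, here is a minimal sketch of an old-style autocomplete. It is only an illustration under our own assumptions: the tiny corpus and the function names are made up, and it simply counts which word most often follows another. Real LLMs replace this counting with a Deep Learning model, but the loop of guessing one word at a time is the same idea.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus; a real system learns from vastly more text.
CORPUS = (
    "the cat sat on a mat . the cat sat on a mat . "
    "the cat meows . a dog barks ."
)

def build_next_word_counts(text):
    """Count, for every word, which words follow it and how often."""
    counts = defaultdict(Counter)
    words = text.split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def generate(prompt, counts, max_words=10, stop_word="."):
    """Repeatedly append the most frequent next word until a stop word appears."""
    words = prompt.split()
    for _ in range(max_words):
        candidates = counts.get(words[-1])
        if not candidates:
            break  # nothing ever followed this word in the corpus
        next_word = candidates.most_common(1)[0][0]
        words.append(next_word)
        if next_word == stop_word:
            break  # the guess is that the text ends here
    return " ".join(words)

counts = build_next_word_counts(CORPUS)
print(generate("the", counts))
```

The `max_words` limit and the `"."` stop word are our stand-ins for the point at which a model decides that the text is long enough.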

We know that this sounds difficult to believe. ChatGPT is supposed to be a close relative of Google autocomplete? How could the LLMs do this thing so much better than the old autocomplete we know and love (or hate)?

The first reason is that, instead of a simple list of the most frequent search terms (like the old autocomplete has), LLMs have a very complex Deep Learning model behind them. Such a model requires a lot of computing power and an equally huge amount of input data. To give an example, GPT-4 is estimated to have been trained on the equivalent of 10,000 computers working at the same time, uninterrupted, for 4-5 months, with a large part of the openly available internet as input data. You can see how Sam Altman seeking over a trillion dollars of funding for the next generation of AI is driven by this ravenous appetite for computing power (if you would like to know more about Deep Learning, dive into episode 1 of the Decoding Data & AI series).

Understanding the Meaning of Words

The second reason is that, thanks to some extremely clever ideas developed about 10 years ago, mainly at Google but also elsewhere, these GenAI models can now really “understand” your prompt. How do they do that, even though they are really only very powerful calculators?

Essentially, researchers realized that every word has lots of relationships to other words. For instance, a cat is always a furry animal, often cuddly, but never a dog. That also means that if I want to understand the meaning of a word, I only need to know all these relationships.

Crucially, now I have a way in which computers, which deep down can only deal with the numbers 0 and 1, can also understand words. When ChatGPT reads the word “cat”, it represents it as this web of relationships to other words: a furry and cuddly animal, but not a dog. And in principle, all of these relationships can be characterized as numbers.

To be sure, the GenAI leaders put a great deal of effort into creating these numbers: GPT-3, an earlier OpenAI model, for instance, uses over 12,000 of them to characterize the meaning of each word. Imagine therefore a huge table with a row for every word in the English language and over 12,000 columns for each of these words.
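A miniature version of such a table might look like the sketch below. Everything in it is an assumption for illustration: we use three columns instead of thousands, we gave the columns readable names (real models learn unnamed columns), and the values are invented.

```python
# Hypothetical relationship columns; real models learn thousands of unnamed ones.
COLUMNS = ["furry", "cuddly", "is_dog"]

# Made-up values between 0 and 1, one row per word.
WORD_TABLE = {
    "cat": [0.9, 0.8, 0.0],
    "dog": [0.9, 0.7, 1.0],
    "car": [0.0, 0.0, 0.0],
}

def describe(word):
    """Return a word's row as a {column: value} mapping."""
    return dict(zip(COLUMNS, WORD_TABLE[word]))

print(describe("cat"))
```

Each row is what the model "sees" in place of the word itself: a list of numbers it can calculate with.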

From Understanding to Writing

And if computers love one thing, it is orderly tables with lots of numbers. Now they can actually run mathematical calculations on the meaning of words.

And now that an LLM can “understand” the meaning of a word, it can also make sense of how words have similar or opposing meanings. The word “meow”, for example, will have a very similar profile to “cat”: meowing is what furry, cuddly animals do, but not dogs.
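One common way to compare two such profiles is cosine similarity, which is close to 1 when two rows of numbers point in the same direction. The sketch below uses invented three-column profiles (furry, cuddly, is_dog); the numbers are our assumption, not values from any real model.

```python
import math

# Made-up profiles with the columns [furry, cuddly, is_dog].
VECTORS = {
    "cat": [0.9, 0.8, 0.0],
    "meow": [0.8, 0.7, 0.0],
    "dog": [0.9, 0.7, 1.0],
}

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat_meow = cosine_similarity(VECTORS["cat"], VECTORS["meow"])
cat_dog = cosine_similarity(VECTORS["cat"], VECTORS["dog"])
print(f"cat vs meow: {cat_meow:.2f}")
print(f"cat vs dog:  {cat_dog:.2f}")
```

With these invented numbers, “cat” comes out as more similar to “meow” than to “dog”, which mirrors the intuition in the text.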

In this way, ChatGPT “reads” the words in your prompt, transforms them into a very complex web of word relationships (i.e., “understands” them), and can then use this “understanding” to predict what the best next word is. When it wants to find out what the next word after “cat” is, for example, it may see that “meows” is a good fit due to the similar profile.
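The prediction step can be sketched as scoring every candidate word against the profile of the prompt and turning the scores into probabilities. This is a simplified illustration under our own assumptions: the vectors are invented, the candidate list has three words instead of a whole vocabulary, and we use a plain dot product followed by a softmax.

```python
import math

# Made-up profile for the prompt "the cat", columns [furry, cuddly, is_dog].
CONTEXT = [0.9, 0.8, 0.0]

# Made-up profiles for three candidate next words.
CANDIDATES = {
    "meows": [0.8, 0.7, 0.0],
    "barks": [0.7, 0.3, 1.0],
    "drives": [0.0, 0.0, 0.1],
}

def next_word_probabilities(context, candidates):
    """Score candidates by dot product and normalize with a softmax."""
    scores = {
        word: sum(c * v for c, v in zip(context, vector))
        for word, vector in candidates.items()
    }
    top = max(scores.values())  # subtracting the maximum keeps exp() stable
    exp_scores = {word: math.exp(s - top) for word, s in scores.items()}
    total = sum(exp_scores.values())
    return {word: e / total for word, e in exp_scores.items()}

probabilities = next_word_probabilities(CONTEXT, CANDIDATES)
best_word = max(probabilities, key=probabilities.get)
print(best_word)
```

Here the sketch always takes the single most likely word; chat systems usually add some randomness when picking among the likely candidates, which is why the same prompt can produce different answers.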

Now this is of course only a simplification, as all these word values are modified according to the words that precede the word we are looking at – ChatGPT must for example understand that a “toy cat” is actually not an “animal.” These modifications are the secret ingredients of the different AI models, and also a main reason for their differences.
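As a rough illustration of how context can shift a word's values, the sketch below blends the numbers for “cat” with those of the word before it. This is a deliberately crude stand-in that we made up for this article: real models use far more elaborate mechanisms, and the columns, values, and the `context_weight` parameter are all hypothetical.

```python
# Made-up columns: [animal, furry, object]. Real models learn their own.
VECTORS = {
    "cat": [1.0, 0.9, 0.0],
    "toy": [0.0, 0.1, 1.0],
}

def contextualize(word_vector, preceding_vectors, context_weight=0.5):
    """Blend a word's vector with the average of the vectors before it.

    context_weight is a hypothetical knob for this sketch,
    not a parameter of any real model.
    """
    if not preceding_vectors:
        return list(word_vector)  # no context, so nothing changes
    size = len(word_vector)
    average = [
        sum(vec[i] for vec in preceding_vectors) / len(preceding_vectors)
        for i in range(size)
    ]
    return [
        (1 - context_weight) * own + context_weight * ctx
        for own, ctx in zip(word_vector, average)
    ]

plain_cat = VECTORS["cat"]
toy_cat = contextualize(VECTORS["cat"], [VECTORS["toy"]])
print("animal score, plain cat:", plain_cat[0])
print("animal score, toy cat:  ", toy_cat[0])
```

After blending, the “animal” value of “cat” drops and its “object” value rises, which is the direction of change the “toy cat” example calls for.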

Finally, you may ask: where does ChatGPT learn all these word meanings (i.e., the 12,000 relationships for each word)? The answer is that it does this by reading lots and lots of text (much of the open internet) and finding out how often words occur together with each other. This “training” process is what takes so long and requires so much computing power. How much these models “know” about words is therefore a major cause of the performance differences between them.
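The idea of finding out how often words occur together can be sketched with a simple count over sentences. Please read this as an assumption-laden toy: the three sentences are invented, and actual training adjusts the numbers step by step while the model practices predicting the next word, rather than keeping an explicit count table like this one.

```python
from collections import Counter
from itertools import combinations

# Tiny made-up training text, one sentence per entry.
SENTENCES = [
    "the cat meows",
    "the furry cat purrs",
    "the dog barks",
]

def cooccurrence_counts(sentences):
    """Count how many sentences contain each pair of distinct words."""
    counts = Counter()
    for sentence in sentences:
        # Sorting gives every pair one fixed order, e.g. ("cat", "the").
        words = sorted(set(sentence.split()))
        for first, second in combinations(words, 2):
            counts[(first, second)] += 1
    return counts

pair_counts = cooccurrence_counts(SENTENCES)
print(pair_counts[("cat", "meows")])
print(pair_counts[("cat", "dog")])
```

Words that show up together often, such as “cat” and “meows”, end up with related profiles, while pairs that never meet, such as “cat” and “dog” in this toy text, do not.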

What does this mean for you?

  • LLMs are so big and so expensive because they must handle all possible inputs, and hence understand all the possible meanings of words. For task-specific language generation (chatbots, coding assistants), much smaller models can often work equally well.
  • LLMs do not “know” facts as such. They only really “know” the relationships between words, and these are learned from the use of those words in publicly available text. So the quality of the training data is crucial for the quality of the LLMs' responses.

Now that you have a basic understanding of what LLMs like ChatGPT actually do when they read your prompt, we hope you will be able to use them more productively and with their inherent limitations in mind.

Want to learn more about OMMAX's expertise in data & AI? Get in touch with our experts through the form below and sign up for our Decoding Data & AI series!
