Analysis: Large language models are what make generative AI software create text that sounds as human as possible
The rate of development and deployment of AI, especially generative AI, over the last 18 months has been dizzying. Most forecasts indicate that generative AI will have, or is already having, an impact on many of the tasks we perform across education, finance, law, the creative arts, and more.
The basis for most text-based generative AI applications is a type of model called a Large Language Model (LLM). Regularly we read headlines such as "OpenAI and Meta are on the brink of releasing new artificial intelligence models–Llama 3 and GPT-5–that will be capable of reasoning and planning". But what are these LLMs? LLMs are statistical models of the distributions and co-occurrences within any set of one-dimensional (e.g. text or DNA sequences), two-dimensional (e.g. images), three-dimensional (e.g. video) or higher-dimensional training data.
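The idea of a statistical model of co-occurrence can be sketched in a few lines of code. The tiny corpus and simple next-word counting below are purely illustrative assumptions, not how any production LLM actually works; real models learn from billions of words and take far more than the single preceding word into account:

```python
from collections import Counter, defaultdict

# A tiny made-up corpus standing in for the vast training data real LLMs use.
corpus = "the cat sat on the mat the cat ate the food".split()

# Count which word follows which: the simplest possible statistical
# model of co-occurrence in a one-dimensional sequence of text.
follows = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follows[current][nxt] += 1

def predict(word):
    """'Generate' by picking the statistically most frequent continuation."""
    return follows[word].most_common(1)[0][0]

print(predict("the"))  # prints "cat": it follows "the" most often here
```

Even this toy model "generates" plausible-looking continuations purely from counted statistics, which is the same basic principle, scaled up enormously, behind LLMs.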
LLMs are created by taking training data such as text and turning it into an embedding space. Imagine a bag of coloured balls. An embedding space is a flat surface where the balls are arranged in such a way that those with similar colours are placed closer together. In an embedding space for words, related words like "king" and "queen", or "computer" and "keyboard", would be placed closer together, while unrelated words like "queen" and "keyboard" would be further apart. For LLMs, the number of words to be arranged, the size of the embedding space, and the time needed to train or arrange the words are all enormous.
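The idea of distance in an embedding space can be illustrated with a short sketch. The vectors below are made up by hand for the example; real embeddings are learned from training data and have hundreds or thousands of dimensions, not three:

```python
import math

# Toy, hand-made 3-dimensional vectors for illustration only; real
# embedding values are learned from data, not chosen by hand.
embeddings = {
    "king":     [0.9, 0.8, 0.1],
    "queen":    [0.9, 0.7, 0.2],
    "computer": [0.1, 0.2, 0.9],
    "keyboard": [0.2, 0.1, 0.8],
}

def cosine_similarity(a, b):
    """Similarity of two vectors: close to 1.0 means 'placed near each other'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Related pairs score higher (are "closer") than unrelated pairs.
related = cosine_similarity(embeddings["king"], embeddings["queen"])
unrelated = cosine_similarity(embeddings["queen"], embeddings["keyboard"])
print(related > unrelated)  # prints True
```

The measure used here, cosine similarity, is one common way of defining "closeness" between words in an embedding space.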
From RTÉ Radio 1's Brendan O'Connor, ChatGPT... but flirtier? Host of podcast For Tech's Sake Elaine Burke reviews ChatGPT4o
But how enormous? The most popular LLM from OpenAI is GPT-4, used in ChatGPT. GPT-4 was trained on a collection of text so large that if it was printed as a set of books, stacked and arranged as if on a bookshelf, that bookshelf would stretch for 650km, longer than the distance between Malin Head and Mizen Head. On my MacBook Pro laptop, the time needed to compute that embedding space from the long bookshelf of training data would be 2.9 million years. If you were to print the whole of the embedding space as if it was an Excel sheet (ignore why you would want to do such a thing), then it would be 25 times larger than the Phoenix Park in Dublin.
While the most well-known LLM might be GPT-4, there are many more of them around now. Google's LLM is called Gemini, Meta's is called LLaMA, Anthropic have one called Claude, EY call theirs EYQ, Apple call theirs FERRET, and so on. Practically every large and medium-sized tech company has now developed their own LLM using their own training data.
When tech companies announce their latest LLMs, they tend to make a big play of how large that latest model is. Because we now have competition in this area, these teasers about model size are reminiscent of 25 years ago, when web search engines competed with each other by advertising how much of the web they had indexed. Google would have on their home page "we index 4 billion web pages", then a few weeks later AltaVista would say "we index 4.5 billion pages", and Inktomi would counter with 5 billion. This cold war eventually stopped when Google reached 8 billion pages, and then they all stopped comparing sizes. We may see the same happening with LLMs.
From RTÉ Radio 1's Today with Claire Byrne, Professor Alan Smeaton, from the Insight Centre for Data Analytics at DCU takes a look at how AI could impact jobs
Some commentators claim that larger LLMs will be able to reason and deduce, and even to plan how to perform tasks, at a level of sophistication greater than that of humans. It is true that in any kind of statistics, and LLMs are only statistical models, the more data we have, the more reliable and sophisticated those models will be. "We need more data" has always been the mantra of those working in AI and machine learning, and it is a universal truth that more data leads to more accurate and reliable outcomes.
But this is generative AI, where those rules do not necessarily hold true. We do see that bigger LLMs can give better performance on some logic and deduction tasks, but we also have counter-examples. The French company Mistral has a series of LLMs which are much smaller but which outperform monster LLMs like GPT-4 in tests for logic and reasoning. Mistral's LLMs focus on reducing their size by eliminating parts that are not used, making them more efficient and requiring less energy to create and use.
Read more: 5 quirks we found using AI to translate text into Gaeilge
So how does the logic and deduction and inference happen in LLMs? We simply do not know, because we do not yet understand the internal structures or the operation of the LLMs being created; they are too new. The worry is not that AI using LLMs will do planning and deduction and take over the world, but that big tech companies are now focusing on deploying AI rather than on understanding it. The reason for this is economics: there is a lot of money to be made in this area.
It is comforting to know that the EU AI Act will prohibit some of what companies have been doing with LLMs. Companies will not be able to use content they do not own or have legal access to, they must provide an inventory of their training data, and they must show bias detection and mitigation strategies for the data they use to train their models. This will create a more level playing field for LLMs and generative AI.
Follow RTÉ Brainstorm on WhatsApp and Instagram for more stories and updates
The views expressed here are those of the author and do not represent or reflect the views of RTÉ