Please call them AGIs, not LLMs
Let's be honest with ourselves: we are asking the AI systems we are building to be general intelligences and evaluating them as such.
What are GPT-4, Claude 3, Llama 3, or Mistral Large, really? Most people currently refer to them as LLMs, or Large Language Models. That might seem fair if you roughly know how they are built. But is calling them LLMs useful or clear? I believe it is not.
LLMs are the fundamental technology at the core of the current iteration of these systems, which are typically pretrained to predict words encountered in web data. These systems are, however, much more than that. When we evaluate them, we focus on how well they understand language and possibly images; on how much humans like them as assistants; on whether they can write code, use tools, act as agents, and much more. The community is constantly building new benchmarks, more general and more difficult than the ones before.
I feel uncomfortable calling a system evaluated on such a wide variety of tasks just an LLM. It feels like, if we want to call such a general system an LLM just because it predicts words, then… we should also call humans LLMs?
Jokes aside, I believe we are making the mistake of conflating the technology that powers a system with what the system really is. This issue has come up when comparing so-called “closed models”, which typically consist of several interacting pieces, to “open models”, which may involve only a single neural network. Something feels off about directly comparing the two, and especially about referring to both kinds of systems by the same name, “LLMs”.
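To make that distinction concrete, here is a minimal sketch in Python (all names are hypothetical, not any vendor's actual architecture): the system people interact with and benchmark wraps a base LLM in prompts, tools, and sampling policies, while an “open model” release may be the bare network alone.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class BaseLLM:
    """A single pretrained network: maps a prompt to a continuation."""
    name: str

    def generate(self, prompt: str) -> str:
        # Placeholder for next-word sampling from the underlying network.
        return f"<completion from {self.name}>"


@dataclass
class DeployedSystem:
    """What users actually talk to: the base LLM plus everything wrapped around it."""
    base_model: BaseLLM
    system_prompt: str = ""
    tools: List[Callable[[str], str]] = field(default_factory=list)
    reward_model: Optional[Callable[[str], float]] = None  # e.g. used for rejection sampling

    def respond(self, user_message: str) -> str:
        # Draw several candidate answers; if a reward model is available,
        # keep the best-scoring one (rejection sampling), otherwise the first.
        prompt = self.system_prompt + user_message
        candidates = [self.base_model.generate(prompt) for _ in range(4)]
        if self.reward_model is not None:
            return max(candidates, key=self.reward_model)
        return candidates[0]
```

Benchmarking a DeployedSystem against a bare BaseLLM is exactly the apples-to-oranges comparison described above.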
What’s a better name, then? I believe we should call these systems AGIs, Artificial General Intelligences. I’m not writing this post to argue for a particular definition of AGI, since many words have already been spent on the topic. The goal of the field of AI has long been thought to be building an AGI, which has put an excessive aura of sacredness around the expression.
Let’s appeal to our intuition, though, as opposed to a “goalpost definition”. We are asking these systems to be incredibly general, even more so than we’d normally ask of humans in their everyday life. And while they might still lack some of the adaptability or autonomy of humans or other intelligent beings, they are definitely simulating a high degree of intelligence in a remarkably general way.
When I tell other researchers that I believe ChatGPT-3.5 will be recorded in history books as the first general intelligence that humankind has ever built, many of them are not that surprised. We might have built better AGIs since, but the gut feeling of many fellow researchers is that we are indeed dealing with general intelligences.
Then why should we keep lying to ourselves? Let’s use the real word and simplify our terminology. This doesn’t mean that, now that we have widely accessible AGIs, the work of the field of artificial intelligence is done. It is not a very autonomous AGI, it is not an AGI that is perfect at reasoning, it is not an AGI that has familiarity with actuation in the physical world, it is not an AGI that is automating most of 2024’s human labor, and it is not a superintelligence. We can work on making our AGIs more general, robust, or controllable, and on deciding whether and how to take them to the next level. All of this, without denying that they are already AGIs.
To summarize, here are some of the reasons why I believe we should call these systems AGIs instead of LLMs:
Fixing a terminology inconsistency: most people are already thinking of GPT-4, Claude 3, or Llama 3 as general intelligences. This is well reflected in the variety of use cases that we ask them to be good at, and in the benchmarks that we use to evaluate them.
Clearer comparisons: it will encourage us to be clearer about which type of benchmarking we are doing. Are we benchmarking one AGI against another? Or are we benchmarking the underlying LLMs? The nature of the comparisons might differ wildly. As mentioned above, current benchmarking seems to be mostly of the first type, despite being advertised as the second.
Clarity about the release of AGIs and LLMs: one might decide to release a single component of an AGI, such as its image encoder, its base LLM, its prompts, or the reward function that is used online for rejection sampling. Or one can decide to release the entire AGI. I believe this terminology will help clarify what a release is really about.
Inspiration for new benchmarks: being 100% honest with ourselves about the fact that we are building general intelligences can inspire new benchmarks. What are the features that we really want these AGIs to have? As an example, there might be a stronger push on benchmarks capturing the creation of agents, or autonomy, as opposed to pure natural language tasks.
Agnosticism to architecture and training strategy: the next generation of AGIs might not use next-word prediction as their main pretraining objective. We want our mindset, intuitions, and benchmarks to be agnostic to the particular form of our current AGIs. We can then compare AGIs based on LLMs to other types of AGIs, using the same benchmarks, and without running into the uncomfortable situation of not knowing what to call those systems.
Next time you’re building or evaluating an AI system like this, try calling it something that reflects how you’re probably already thinking about it: an AGI. Yes, you can say you have built an AGI. A probably imperfect, flawed, and still largely improvable AGI.