Scaling laws for AI Large Language Models and the inverse scaling hypothesis
What are the scaling laws for AI Large Language Models and the inverse scaling hypothesis? How is that related to the Dunning-Kruger effect?
The last couple of years have been an AI model arms race involving a number of players from industry and research. Google, DeepMind, Meta and Microsoft, the latter in collaboration with both OpenAI and Nvidia, are the names most people will recognize.
But we also have the international research cooperation BigScience, Baidu and Tsinghua University from China, Yandex from Russia, Aleph Alpha from Germany, AI21 Labs from Israel and Stability AI from the UK, to name just some of the key organizations working in the field.
All of them are developing different versions of what are called Large Language Models. Large Language Models (LLMs) are trained on huge corpora of text and feature parameters that are measured in the billions. These are the types of models used to power headline-grabbing applications like ChatGPT as well as a growing number of applications across different domains ranging from content marketing to biology and law.
As this is a nascent domain which is a mix of engineering, art and science, more findings are gradually unearthed as more effort is put into developing Large Language Models. Initial empirical findings suggested that there is a correlation between the number of parameters in LLMs and their performance, something which came to be known as the scaling hypothesis or scaling law.
But is scaling all you really need to achieve better performance? Later findings suggest that this may not always be the case. In fact, sometimes, the correlation may be an inverse one. In other words, larger models may actually perform worse than smaller ones for specific tasks.
In this post, we’ll first introduce some of the arguments for and against scaling as the means towards the end goal of better AI models. We will then turn to the Inverse Scaling Prize (ISP), an initiative set up to investigate scaling laws for AI Large Language Models and the inverse scaling hypothesis. We wrap up with an analysis of the ISP’s findings, as discussed with Ian McKenzie, FAR AI Research Scientist. You can listen to the entire conversation with McKenzie in the podcast below.
Scaling laws for AI Large Language Models
The fact that LLMs demonstrate some impressive, almost human-like, features is plain to see. Even AI critics acknowledge the progress made so far, mostly based on training models with an ever-growing number of parameters using unsupervised learning on ever-growing datasets.
Where views diverge, however, is on whether that alone can lead to progressively better models, perhaps even to the point of reaching AGI (Artificial General Intelligence). Proponents of this approach argue that it can. In fact, recent findings suggest that LLMs show emergent abilities as their size grows.
An ability is considered to be emergent if it is not present in smaller models but is present in larger models. The thesis is that the existence of such emergence implies that additional scaling could further expand the range of capabilities of language models.
As to why these emergent abilities are manifested, some possible explanations offered are that tasks involving a certain number of steps may also require a model having an equal depth, and that it is reasonable to assume that more parameters and more training enable better memorization that could be helpful for tasks requiring world knowledge.
As the science of training LLMs progresses, researchers note, certain abilities may be unlocked for smaller models with new architectures, higher-quality data or improved training procedures. That means that both the abilities examined in this research, as well as others, may eventually be available to users of other AI models, too.
Prominent AI critics like Gary Marcus, on the other hand, worry about whether contemporary approaches to AI will ever provide solutions to four key aspects of thought that one ought to expect from any intelligent machine: reasoning, abstraction, compositionality, and factuality. Others point out a number of reasons – technical, scientific, philosophical, sociopolitical and economic – why it’s not wise to pursue ever-larger models indefinitely.
Both sides have compelling arguments to offer, and it’s important to be aware of them. It’s also important to keep an open mind about this and to be able to draw lessons from the history of technical and scientific progress, which is hardly ever a straight line.
The Inverse Scaling Prize
Ian McKenzie, FAR AI Research Scientist, is a key member of the team organizing the Inverse Scaling Prize (ISP). Scaling laws show that language models get predictably better (in terms of test loss and downstream performance) as the number of parameters, amount of compute used, and dataset size increase. The team behind ISP hypothesized that there are tasks with trends in the opposite direction.
That is, tasks in which performance gets monotonically, predictably worse as the overall test loss of the language model improves. This phenomenon is called inverse scaling, in contrast with the standard scaling laws. There are some tasks that appear to show inverse scaling under some conditions, but such tasks appear to be rare.
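To make the contrast concrete, here is a minimal sketch of how one might tell a standard scaling trend from an inverse one. It assumes that task accuracy is plotted against the logarithm of parameter count, used here as a stand-in for overall test loss. The function name `scaling_trend` and all the numbers are hypothetical and are not taken from the ISP.

```python
import math

def scaling_trend(param_counts, scores):
    """Return (slope, monotonic_decrease) for scores ordered by model size.

    slope is the least-squares change in score per 10x increase in parameters.
    """
    xs = [math.log10(n) for n in param_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(scores) / len(scores)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    # Monotonic decrease: no larger model scores better than the one before it
    monotonic_decrease = all(b <= a for a, b in zip(scores, scores[1:]))
    return slope, monotonic_decrease

# Hypothetical accuracies for four model sizes (illustrative, not ISP data)
sizes = [1e8, 1e9, 1e10, 1e11]
standard_task = [0.52, 0.61, 0.70, 0.79]
inverse_task = [0.50, 0.44, 0.37, 0.31]

for name, scores in [("standard", standard_task), ("inverse", inverse_task)]:
    slope, decreasing = scaling_trend(sizes, scores)
    print(f"{name}: slope per 10x parameters = {slope:+.3f}, "
          f"monotonic decrease = {decreasing}")
```

A negative slope together with a monotonic decrease is what an inverse scaling task would look like under these assumptions. Real evaluations are noisier and would need more model sizes and some measure of uncertainty.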
The ISP was set up to investigate scaling laws for AI Large Language Models and the inverse scaling hypothesis. The emphasis is on finding inverse scaling tasks of importance to the safe and responsible use of language models. The ISP team set up a competition with a number of prizes to encourage people to submit inverse scaling tasks, defined a framework for submission and evaluation, and did the hard work of evaluating submissions.
McKenzie got involved in the ISP effort in early 2022 via his NYU supervisor at the time, Ethan Perez. McKenzie’s research is focused on AI alignment, which aims to steer AI systems towards their designers’ intended goals and interests. In this context, inverse scaling tasks are important because they represent a mismatch between the behavior people want language models to exhibit and the behavior they get in practice from the training objectives and data used.
“LLMs are trained just to predict text, but this isn’t necessarily what we actually want them to do. We want them to reason well, to give good advice and to be generally helpful, rather than reproducing just anything that they see on the Internet. So this is the angle that I’m coming at it from”, McKenzie said.
McKenzie went on to add that this sentiment is shared by other people in the ISP, but not necessarily all. Some may just find this an interesting technical problem. It is indeed a rather specific and advanced topic, which nevertheless attracted a relatively high number of submissions. McKenzie thinks that the ISP presented a reasonably well-scoped problem, thus also enabling people who don’t have a very strong technical background to participate.
There were 2 ISP calls for submissions, one with a deadline in August 2022 and another one in October 2022. There were 43 submissions in the first round and 48 in the second round, with some overlap between first and second round submissions. This was by design, encouraging people who submitted in the first round to draw lessons and resubmit. It seems to have worked, in the sense that 2nd round submissions were on average significantly more elaborate.
Evaluating the inverse scaling hypothesis
There are six criteria in evaluating submissions in the ISP. As McKenzie noted, they are somewhat overlapping, but they capture the spirit of what the ISP is looking for. The first criterion is the inverse scaling strength. This is meant to capture how much worse larger models are than smaller models at a specific task.
Another criterion is generality. This is meant to express across how many model families a specific kind of inverse scaling trend is observed. A third criterion is task importance. In other words, how big of a deal a certain failure mode would be. The fourth criterion is novelty and surprisingness. That expresses how different a result is from everything that was seen before.
Task coverage is the fifth criterion, which is meant to express how completely the submitted examples cover the task being examined. Reproducibility is the last criterion, which is meant to express the degree to which a similar dataset would lead to observing similar behavior in a task. This is not reproducibility in the standard definition, i.e. being able to get the same results using the same input, and is related to task coverage.
The ISP team set up three prize tiers: 1 Grand Prize, 5 Second Prizes and 10 Third Prizes. As far as results go, the most striking thing was the absence of submissions deemed significant enough to warrant anything more than a Third Prize. What this shows, McKenzie believes, is that convincingly demonstrating inverse scaling is quite challenging for a number of reasons.
First, he noted, in order to make the contest accessible, the scope and types of submissions were limited and the focus was on narrow metrics. On the one hand, this helped make it easier for people to find interesting tasks. On the other hand, making convincing demonstrations this way was tricky, especially with the amount of time people were able to dedicate to this, McKenzie added.
Although no demographic data were collected, McKenzie’s impression is that most people who submitted were researchers early in their careers for whom the ISP was a side-project at best. Therefore, getting enough evidence and making a compelling enough case was challenging. In addition, McKenzie said, language models of the size the ISP is interested in often produce a lot of noise in the results, which makes it hard to be extremely sure of anything.
McKenzie noted there were some interesting preliminary results both in round 1 and in round 2 of the ISP, which demonstrate that there are effects going on. Just nothing crisp and clear enough to pass the strict criteria of the grand prize. Some of the tasks that were awarded a Third Prize were related to things like quote repetition, prompt injection and reasoning.
The Dunning-Kruger effect for LLMs and (bitter) lessons learned
The ISP team was able to extract valuable insights from the process of evaluating scaling laws for AI Large Language Models and the inverse scaling hypothesis. One thing they noticed is that, in inverse scaling tasks, the performance of smaller models often starts out at about chance level. If it’s a classification task, models randomly select between the options because they don’t really know how to solve the task, McKenzie noted.
The inverse scaling effect starts when the model “thinks” that it understands how to do the task but is really getting it wrong from our perspective. Therefore, the model becomes more confident in the wrong answer. This is a general pattern, and one way it can manifest is when a hard task contains an easier task as a sub-component.
The model has to solve the easier task and then solve the harder task. The smaller models are stuck altogether, but the larger models are able to solve the easier task but not the harder task. So models think they’re done and output the answer to the easier task, but that’s wrong. An example of this is negation, which is a pretty hard problem for language models as per McKenzie.
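A toy calculation can illustrate why solving only the easier sub-task can push accuracy below chance. This is our own simplified model, not something the ISP team proposed. The function `full_task_accuracy` and the probabilities `p_easy` and `p_hard` are hypothetical, and we assume a binary task where the sub-task answer is always the wrong answer to the full task, as with an ignored negation.

```python
def full_task_accuracy(p_easy, p_hard, chance=0.5):
    """Expected accuracy on a task made of an easy sub-task plus a harder step.

    Toy assumptions (ours, not the ISP's):
    - with probability 1 - p_easy the model is stuck and guesses at chance level;
    - with probability p_easy * p_hard it solves both steps and is correct;
    - with probability p_easy * (1 - p_hard) it outputs the sub-task answer,
      which is wrong for the full task.
    """
    return (1 - p_easy) * chance + p_easy * p_hard

# Hypothetical capability levels: none of these models can do the harder step yet
small = full_task_accuracy(p_easy=0.0, p_hard=0.0)   # stuck, so it guesses
medium = full_task_accuracy(p_easy=0.5, p_hard=0.0)  # sometimes solves the sub-task
large = full_task_accuracy(p_easy=1.0, p_hard=0.0)   # always stops at the sub-task

print(small, medium, large)  # 0.5 0.25 0.0
```

In this sketch, getting better at the sub-task alone makes the overall score steadily worse, which is the shape of an inverse scaling curve.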
Another pattern that emerged is the fact that models really love to repeat stuff, McKenzie said. For example, they like to repeat quotes that they’ve learned or to repeat patterns. McKenzie mentioned the fact that larger models may be “more stuck in their ways” as one potential explanation for failing in certain inverse scaling tasks. “They are more likely to have strongly memorized context from the internet and are thus more likely to repeat it”, as he put it.
We wondered what could potentially be done to alleviate this, enabling LLMs to retract and update part of the data they have been trained on. McKenzie pointed out that this is an open research question at this time, with techniques like Reinforcement Learning from Human Feedback being explored. McKenzie also mentioned the U-shaped scaling hypothesis.
Although McKenzie is not convinced that this will actually happen, he mentioned that there are some hints to suggest that as models get bigger, their performance in inverse scaling tasks may start to improve again. In harder task – easier task aggregations, for example, if getting bigger LLMs is enough to make them able to solve the harder task as well, then their performance will improve.
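As a rough illustration of what a U-shaped trend would look like in evaluation results, the sketch below checks whether the lowest score sits at an intermediate model size. The function `is_u_shaped`, its `tolerance` parameter and the example scores are all hypothetical, and a real analysis would also need to account for noise.

```python
def is_u_shaped(scores, tolerance=0.0):
    """Check for a dip at an intermediate model size.

    scores must be ordered from the smallest to the largest model. Returns True
    when the lowest score is at an interior position and both the smallest and
    the largest model beat it by more than `tolerance`.
    """
    lowest = min(range(len(scores)), key=scores.__getitem__)
    if lowest == 0 or lowest == len(scores) - 1:
        return False  # lowest score is at one end: no dip in the middle
    dip_from_start = scores[0] - scores[lowest]
    recovery_at_end = scores[-1] - scores[lowest]
    return dip_from_start > tolerance and recovery_at_end > tolerance

# Hypothetical accuracies by increasing model size (illustrative only)
print(is_u_shaped([0.50, 0.35, 0.20, 0.45, 0.80]))  # True: dips, then recovers
print(is_u_shaped([0.50, 0.44, 0.37, 0.31]))        # False: still getting worse
print(is_u_shaped([0.52, 0.61, 0.70, 0.79]))        # False: standard scaling
```

Setting `tolerance` above zero is one simple way to avoid labelling small fluctuations as a U-shape.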
With all the anthropomorphising going on, whether consciously or unconsciously, we could not help but notice that in some ways the U-shaped scaling hypothesis resembles the well-known (albeit later questioned) Dunning-Kruger effect. McKenzie himself was doubtful as to whether architectural improvements could help alleviate issues and produce more robust AI models, citing Rich Sutton’s “Bitter Lesson”, i.e. the conviction that more compute trumps anything else.
Either way, the Inverse Scaling Prize was an interesting experience for the team, one which will be further analyzed and documented in a research paper that will appear soon. The team is also considering the possibility of running the ISP again.