Xinzhiyuan Report | Editor: Haili
[Xinzhiyuan Introduction] Large Language Models (LLMs) are evolving at an unprecedented pace: METR has found that the length of tasks they can complete doubles roughly every 7 months. By 2030, a single model might finish, in just a few hours, work that would take a human engineer months. Don't blink, your job might already be on a countdown.
As large-model capabilities soar, evaluation benchmarks are springing up everywhere: from classics like MMLU and HellaSwag, to multimodal suites such as MMMU and MathVista, to Arena-style battles, agent tasks, and tool-use tests.
Yet how to scientifically measure an LLM's ability on long, complex, real-world tasks remains a crucial open question.
In March this year, METR published a significant study, "Measuring AI Ability to Complete Long Tasks," which for the first time proposed an exciting new metric:
The 50%-task-completion time horizon: the time it typically takes a human to complete the tasks that an AI model can finish with a 50% success rate.
Paper link: https://arxiv.org/pdf/2503.14499
Based on this, METR carried out a series of studies covering task difficulty design, human baseline time measurement, multi-model comparison experiments, and hierarchical statistical modeling.
Ultimately, the team quantified how fast AI capability is evolving and put forward a startling prediction:
At the current growth rate, within five years large models may be able to autonomously complete, in a single day, complex tasks that would take humans months to finish.
Don't blink, LLMs' power doubles every 7 months!
The METR team selected the strongest model from each period to build a precise "chronology," then quantitatively analyzed how model capability has grown over time.
The results show a clear exponential trend: over the past six years, the length of tasks models can complete (their time horizon) has doubled roughly every 7 months.
The shaded area in the graph represents the 95% confidence interval calculated by hierarchical bootstrap across task families, tasks, and task attempts.
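For intuition, here is a minimal sketch of how a doubling time can be read off such a trend. The (year, time-horizon) points below are illustrative placeholders rather than METR's actual measurements; only the method, a linear fit in log space, is the point.

```python
import numpy as np

# Illustrative (release year, 50% time horizon in human-minutes) pairs.
years = np.array([2019.5, 2020.5, 2022.0, 2023.2, 2024.1, 2024.8])
horizon_minutes = np.array([0.1, 0.7, 2.0, 8.0, 30.0, 59.0])

# Exponential growth is a straight line in log space:
# log2(horizon) ~ slope * year + intercept, so the doubling time is 1/slope years.
slope, intercept = np.polyfit(years, np.log2(horizon_minutes), 1)
print(f"Estimated doubling time: {12 / slope:.1f} months")
```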
Moreover, because this exponential trend is so steep, the forecast is quite tolerant of measurement error:
Even a 10x error in the absolute measurements would shift the predicted arrival time of a given capability level by only about 2 years.
The team's predictions for when different capability levels will be reached are therefore fairly robust.
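A quick sanity check of that claim, assuming the 7-month doubling time:

```python
import math

doubling_time_months = 7        # reported doubling time
error_factor = 10               # hypothetical 10x error in the measured horizon

# A 10x error corresponds to log2(10) ~ 3.3 doublings' worth of capability,
# so the forecast date shifts by that many doubling periods.
shift_months = math.log2(error_factor) * doubling_time_months
print(f"Forecast shift: {shift_months:.0f} months, about {shift_months / 12:.1f} years")
```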
Models vs. Humans: Measuring Large Model Intelligence with "Human Time Spent"
The core of METR's research is the metric they proposed: "task-completion time horizon."
In effect, this metric maps AI performance onto the time humans need for the same work:
Imagine a set of diverse tasks, each requiring different times for humans to complete.
These tasks are then given to AI models, and the tasks that AI can complete with a 50% success rate are identified (without considering the time AI takes).
Then, the corresponding time humans typically need to complete this category of tasks is found.
This human-required time is the model's 50%-task-completion time horizon, or "time horizon" for short.
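As a concrete illustration, here is a minimal sketch of how such a 50% horizon could be estimated from task-level records. The data and the logistic-regression-in-log-time fit are assumptions for illustration, not METR's exact procedure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical task records: human baseline time and whether the model succeeded.
human_minutes   = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480])
model_succeeded = np.array([1, 1, 1, 1, 1,  0,  1,   0,   0,   0])

# Fit success probability as a function of log2(human time).
X = np.log2(human_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, model_succeeded)

# The 50% horizon is where the predicted success probability crosses 0.5,
# i.e. where the fitted linear term equals zero.
log2_horizon = -clf.intercept_[0] / clf.coef_[0, 0]
print(f"Estimated 50% time horizon: about {2 ** log2_horizon:.0f} human-minutes")
```

Note that the quantity of interest is this crossover point expressed in human time; how long the model itself takes does not enter into it.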
To prove the effectiveness of this benchmark, the METR team conducted extensive statistical analysis.
The results show a negative correlation between the human baseline time to complete a task and the average success rate of each model on that task.
In short, the longer a task takes a human, the more likely the model is to fail at it.
Furthermore, this negative trend is captured very well by a simple exponential fit.
Regressing model success rate against the logarithm of human completion time yields an R² of roughly 0.83 and a correlation coefficient of 0.91, higher than the correlation between different models' average success rates on the same tasks.
Therefore, measuring task difficulty "in human time" is a very reasonable metric.
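Here is a sketch of that kind of validity check: correlate each task's mean model success rate with the log of its human baseline time. The per-task numbers below are hypothetical placeholders, not METR's data.

```python
import numpy as np

# Hypothetical per-task human baseline times and mean model success rates.
human_minutes = np.array([1, 3, 10, 30, 60, 120, 300, 600])
mean_success  = np.array([0.95, 0.90, 0.75, 0.60, 0.45, 0.30, 0.15, 0.05])

r = np.corrcoef(np.log2(human_minutes), mean_success)[0, 1]
print(f"correlation r = {r:.2f}, R^2 (simple linear fit) = {r ** 2:.2f}")
```

For a simple linear fit, R² is just the squared correlation, which is consistent with the reported pair of figures: 0.91² is roughly 0.83.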
Newer Models, Harder Tasks: Evidence of Capability Evolution
Having proven the effectiveness of this metric, the next step was to examine the performance of various models against it.
The team further examined the human time required for tasks that different models could complete.
The results were quite intuitive:
Models from before 2023 (such as GPT-2 and GPT-3) could only complete simple tasks that required writing a few sentences.
But on tasks that took humans more than 1 minute, they quickly failed.
In contrast, the latest frontier models (such as Claude 3.5 Sonnet and o1) can complete tasks that would take humans several hours, and even maintain a certain success rate on ultra-long tasks lasting over ten hours.
Human-Crushing Efficiency: 2030 Warning Has Been Sounded
Following this "doubling every 7 months" pace, the METR team arrived at a striking conclusion:
By 2030, the most advanced LLMs are expected to complete, with 50% reliability, a task that would take a human engineer working 40 hours a week for a month.
What's even more chilling is that LLMs' speed could far surpass humans' – perhaps taking only a few days, or even hours.
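For intuition, here is a rough back-of-the-envelope version of that extrapolation, assuming a current 50% horizon of about one hour as the starting point (an illustrative figure, not METR's exact estimate):

```python
import math

start_year = 2025.0
start_horizon_hours = 1.0        # assumed current 50% time horizon
target_horizon_hours = 167.0     # roughly one month of 40-hour workweeks
doubling_time_years = 7 / 12

doublings_needed = math.log2(target_horizon_hours / start_horizon_hours)
arrival_year = start_year + doublings_needed * doubling_time_years
print(f"About {doublings_needed:.1f} doublings -> around {arrival_year:.0f}")
```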
By 2030, LLMs might be able to easily start a company, write a decent novel, or significantly improve existing large models.
AI researcher Zach Stein-Perlman wrote in his blog that the advent of LLMs with such capabilities "will bring about significant impacts, both potential benefits and potential risks."
METR researcher Megan Kinniment acknowledges that a capability-doubling pace this fast is frightening, like the prelude to a sci-fi disaster.
However, she also states that in reality, many factors could influence and slow down this progress.
No matter how smart AI is, it may still be constrained by bottlenecks such as hardware and robotics.
References: https://spectrum.ieee.org/large-language-model-performance