You're Right, AGI Won't Arrive Within a Year! The Academic Definition of AGI Is Here, Fresh from 27 Institutions

Artificial General Intelligence (AGI) is potentially the most important technology in human history, but the term itself has long been vague and its standards constantly shifting. As narrow AI becomes increasingly capable of tasks that "seem to require human intelligence," the bar for "what counts as AGI" moves accordingly. As a result, discussions often devolve into slogans that neither help assess the actual gap nor aid governance and engineering planning, making it hard to see clearly how far today's AI really is from AGI.

To cut through the fog surrounding AGI, this paper, jointly released by 27 institutions including UC Berkeley and Oxford, provides a quantifiable and operational framework.

It pins down the often-ambiguous conversational term AGI as: an AI that can match or exceed a well-educated adult in cognitive versatility and proficiency.

The framework translates this definition into observable metrics and procedures. The core idea: general intelligence is not about being strong in a few narrow areas, but about versatility plus proficiency across every area. It ultimately draws a clear conclusion: by this yardstick, today's frontier models are still well short of AGI (GPT-4 scores 27%, GPT-5 scores 58%).

Rationale: Borrowing a Ruler from Human Cognitive Science

Humans are the only readily available sample of general intelligence, so the researchers base their framework on the most evidence-backed theory in human psychometrics: the Cattell–Horn–Carroll (CHC) theory. Refined over more than a century of factor-analytic work, CHC underpins mainstream clinical and educational assessments. It decomposes overall intelligence into several broad abilities and numerous narrow abilities (e.g., induction, associative memory, spatial scanning). Rather than relying on vague, catch-all tasks, the paper directly adapts these human testing methods for AI evaluation.

Note: the researchers repeatedly emphasize that their discussion of AGI concerns human-level mental capabilities. It is not equivalent to economic notions like "being able to earn a lot of money" or "nearly replacing all labor," nor does it cover physical abilities such as motor skills and manipulation.

The Ten Core Broad Abilities Required for AGI

The framework breaks AGI down into 10 core cognitive domains, each weighted equally at 10%; reaching 100% means reaching AGI. The goal is to emphasize breadth and prevent a few strong areas from carrying the score. The ten domains are: K Knowledge, RW Reading & Writing, M Math, R Fluid Reasoning, WM Working Memory, MS Long-Term Memory Storage, MR Long-Term Memory Retrieval, V Visual Perception, A Auditory Processing, and S Processing Speed. Each is further subdivided into operational sub-abilities with specific test methods.

The design philosophy here is quite interesting. In human assessments, fluid reasoning (fluid intelligence) correlates strongly with other tests: abilities are tightly coupled and complex tasks often span domains. For AI, the same correlation structure may not hold, so the authors do not give any single dimension (such as R) extra weight but keep all at 10%, explicitly to "reflect agnosticism" about the relative importance of each ability. The flip side is that a simple summed "AGI total score" can easily mask critical shortcomings: a system with MS = 0% could still score 90% overall, yet in practice it would be crippled by amnesia-like failures.

This method forces attention onto the weakest link: overall intelligence is like the output of a machine limited by its weakest gear. At present, several key "components" are still badly broken (above all long-term memory storage), which is why the total score isn't higher, and which largely determines how far we still are from artificial general intelligence.
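To make the scoring arithmetic concrete, here is a minimal Python sketch (not the paper's official tooling; the domain scores below are made up) of the equal-weight aggregate, and of reading a single near-zero domain as a hard bottleneck rather than a small deduction.

```python
# Minimal sketch of the equal-weight AGI score described above.
# Domain scores are illustrative placeholders, not the paper's measurements.

DOMAINS = ["K", "RW", "M", "R", "WM", "MS", "MR", "V", "A", "S"]
WEIGHT = 0.10  # each of the ten broad abilities contributes equally

def agi_score(domain_scores: dict[str, float]) -> float:
    """Equal-weight sum of domain scores, each expressed in [0, 1]."""
    return sum(WEIGHT * domain_scores[d] for d in DOMAINS)

def bottlenecks(domain_scores: dict[str, float], floor: float = 0.2) -> list[str]:
    """Domains so weak that they cap the system in practice (the 'weakest gear')."""
    return [d for d, s in domain_scores.items() if s < floor]

# Hypothetical jagged profile: strong K/RW/M, long-term memory storage at zero.
example = {"K": 0.9, "RW": 0.9, "M": 0.8, "R": 0.6, "WM": 0.7,
           "MS": 0.0, "MR": 0.4, "V": 0.5, "A": 0.6, "S": 0.2}

print(f"AGI score: {agi_score(example):.0%}")   # 56% overall, and yet...
print("Bottlenecks:", bottlenecks(example))     # ['MS'] dominates real-world use
```

The equal weights mirror the paper's stated agnosticism; the bottleneck check is simply a reminder that the headline percentage should always be read alongside the per-domain profile.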

1 Knowledge (K)

What is measured: Common sense + natural/social sciences + history + culture. Example questions:

"How did the Cold War end?" "The rise and impact of the Ottoman Empire?"

"Hearing I’m dreaming of a White… what's the next word?" (Pop culture) Standards: Five sections, 2% each; history/arts can be benchmarked against AP 5-point scale; common sense can use PIQA/ETHICS etc. as "baseline evidence."

2 Reading & Writing (RW)

What is measured: Literacy & Spelling (1%) + Reading (3%: sentences/paragraphs/long documents) + Writing (3%) + English usage proofreading (3%). Example questions:

Resolve sentence anaphora (Winograd-style); find the "battery warranty period" in warranty terms and judge whether the question is underdetermined;

Write an argumentative essay: "Should remote work be the default?"

Standards: long-document reading should meet thresholds combining CoQA/ReCoRD/LAMBADA/LongBench, etc., with a hallucination rate <1%; writing can be referenced against GRE Analytical Writing ≥4/6.

3 Math (M)

What is measured: Arithmetic / Algebra / Geometry / Probability / Calculus, 2% each ("basic 1% + proficiency 1%" per area). Example questions:

Geometry: Area of a rectangle inscribed in a quarter circle;

Calculus

Probability: keep adding club members until the chance of drawing a boy is 1/2.

Standards: meet the corresponding thresholds on GSM8K/MATH/AP Calculus AB & BC, etc., aligned with the human upper bound.
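The geometry item's exact wording isn't reproduced above; assuming the classic version of the problem (maximize the area of an axis-aligned rectangle inscribed in a quarter circle of radius r), a short worked sketch looks like this:

```python
# Worked sketch of the assumed geometry question: the largest axis-aligned
# rectangle inscribed in a quarter circle of radius r, with one corner at the
# center and the opposite corner on the arc.
import math

def max_inscribed_rectangle_area(r: float) -> float:
    # Area A(x) = x * sqrt(r^2 - x^2); setting A'(x) = 0 gives x = r / sqrt(2),
    # so the maximum area is r^2 / 2 (a square with side r / sqrt(2)).
    return r * r / 2

def area(x: float, r: float) -> float:
    return x * math.sqrt(r * r - x * x)

r = 1.0
# Numerical sanity check: no sampled rectangle beats the closed-form answer.
best_sampled = max(area(i / 10_000 * r, r) for i in range(10_001))
assert best_sampled <= max_inscribed_rectangle_area(r) + 1e-9
print(max_inscribed_rectangle_area(r))  # 0.5 for r = 1
```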

4 Fluid Reasoning (R)

What is measured: Deduction (2%) + Induction (4%) + Theory of Mind (2%) + Planning (1%) + Rule Transfer (1%). Example questions:

Formal logic multiple choice; Raven's Progressive Matrices pattern finding;

ToM: Does Mary "know" the can is moldy? (Answer: No)

Travel planning: arrange a 14-day itinerary under direct-flight constraints.

Standards: reach the human baseline on ToMBench/FANToM; planning tasks ≥90%; WCST total errors <15.

5 Working Memory (WM)

What is measured: Verbal (2%) / Auditory (2%) / Visual (4%) / Cross-modal (2%). Example questions:

"Add 40 then reverse this string of numbers";

Long video Q&A (ask about key scenes after watching);

Spatial navigation: where is the stove relative to the refrigerator in the kitchen?

Standards: dual-modal 2-back ≥85%; spatial and long-video tasks benchmarked with VSI-Bench, MindCube, and long-video QA sets.
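The "dual-modal 2-back ≥85%" threshold refers to the classic n-back working-memory probe. The sketch below is a simplified, single-stream illustration of how such a run is scored; it is not the paper's actual test harness.

```python
# Simplified n-back scoring: at each step, the test-taker must say whether the
# current stimulus matches the one seen n steps earlier.
import random

def run_n_back(stimuli: list[str], responses: list[bool], n: int = 2) -> float:
    """Fraction of positions (from index n onward) answered correctly.

    responses[i] is True if the test-taker claims stimuli[i] matches stimuli[i - n].
    """
    total = correct = 0
    for i in range(n, len(stimuli)):
        target = stimuli[i] == stimuli[i - n]
        total += 1
        correct += responses[i] == target
    return correct / total if total else 0.0

# Hypothetical run: random letters and a perfect responder.
random.seed(0)
letters = [random.choice("ABC") for _ in range(40)]
perfect = [i >= 2 and letters[i] == letters[i - 2] for i in range(40)]
print(f"2-back accuracy: {run_n_back(letters, perfect):.0%}  (paper's threshold: >= 85%)")
```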

6 Long-term Memory Storage (MS)

What is measured: Writing new information into long-term memory (recallable even after session changes). Example questions:

Remembering "new expense report format" or "colleague preferences" the next day;

Reciting a phone number/limerick verbatim 48 hours later;

Recalling the layout of a schematic/circuit diagram.

Standards: every task must be run in a new session with external retrieval disabled, so it tests genuine "writing" to memory, not context caching.
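The "new session, retrieval disabled" requirement can be pictured as a test harness like the following. The Model/Session interface here is hypothetical, purely to illustrate the protocol the paper describes.

```python
# Sketch of the long-term memory storage (MS) protocol: teach in one session,
# probe in a brand-new session, with tools/retrieval off. Interface is hypothetical.
from typing import Protocol

class Session(Protocol):
    def send(self, message: str) -> str: ...
    def close(self) -> None: ...

class Model(Protocol):
    def new_session(self, tools_enabled: bool) -> Session: ...

def test_memory_storage(model: Model, fact: str, probe: str, expected: str) -> bool:
    """Pass only if the fact was truly written to persistent memory,
    not merely cached in the still-open context window."""
    s1 = model.new_session(tools_enabled=False)
    s1.send(f"Please remember this for later: {fact}")
    s1.close()  # context window discarded; in the paper, hours or days elapse here

    s2 = model.new_session(tools_enabled=False)  # fresh session, no external retrieval
    answer = s2.send(probe)
    return expected.lower() in answer.lower()
```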

7 Long-term Memory Retrieval (MR)

What is measured: Retrieving information from long-term memory both quickly and accurately. Example questions:

List as many "uses for a pencil/round objects" as possible in 1 minute (fluency);

Fact-checking: "Did Churchill say 'Ask not what your country...' in 1961?" (No: that line is from Kennedy's 1961 inaugural address.)

Standards: six types of fluency at 1% each; hallucination resistance: SimpleQA hallucination rate <5% with tools disabled.
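SimpleQA-style grading labels each answer correct, incorrect, or not attempted. The paper's exact formula for the hallucination rate isn't spelled out above, so the sketch below uses one reasonable reading: the share of attempted answers that are wrong.

```python
# Illustrative hallucination-rate computation over SimpleQA-style grades.
from collections import Counter

def hallucination_rate(grades: list[str]) -> float:
    """Share of attempted answers graded 'incorrect'."""
    counts = Counter(grades)
    attempted = counts["correct"] + counts["incorrect"]
    return counts["incorrect"] / attempted if attempted else 0.0

# Hypothetical grading of a 100-question run.
grades = ["correct"] * 90 + ["incorrect"] * 4 + ["not attempted"] * 6
print(f"hallucination rate: {hallucination_rate(grades):.1%}  (target: < 5%)")
```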

8 Visual Perception (V)

What is measured: Perception (4%) / Generation (3%) / Visual Reasoning (2%) / Spatial Scanning (1%). Example questions:

Find anomalies and impossible physics in images/videos;

Draw "a clearly labeled elephant schematic" or generate "a short video of keyboard typing";

Folding/unfolding, mental rotation, reading charts.

Standards: reach the established thresholds on ImageNet/IntPhysics2/SpatialViz, etc.

9 Auditory Processing (A)

What is measured: Phonological Encoding (1%) / Speech Recognition (4%) / Speech Synthesis (3%) / Prosody (1%) / Musical Judgment (1%). Example questions:

Transcribe speech, scored by word error rate (WER);

Reading "Wait, you mean the tickets were free this whole time?" with natural continuity;

Following a beat, distinguishing dissonance.

Standards: LibriSpeech test-clean WER <5.83%, test-other <12.69%, etc.
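The speech-recognition thresholds above are word error rates (WER): the word-level edit distance between the hypothesis and the reference transcript, divided by the number of reference words. A minimal implementation, independent of the paper's harness:

```python
# WER = (substitutions + deletions + insertions) / number of reference words,
# computed with a standard word-level edit distance (Levenshtein).

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the tickets were free this whole time",
          "the tickets are free this time"))  # 2 errors / 7 words ~ 0.29
```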

10 Processing Speed (S)

What is measured: perceptual search, perceptual comparison, reading speed, writing speed, mental math, simple reaction time, choice reaction time, inspection time, comparison time, and pointer fluency: ten components at 1% each. Example questions:

Read a passage in 60 seconds and answer "what is feelies";

Respond immediately to a prompt, or quickly press a button under multiple-choice rules;

Draw as many circles as possible with a mouse/virtual mouse in 30 seconds.

Standards: compared against the speed baseline of a well-educated adult; time spent "thinking" counts toward the clock.
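"Thinking counts toward the clock" means the timer runs from the moment the prompt is issued until the final answer comes back, so any hidden deliberation latency is included. A minimal sketch of such a timing harness (the model stub here is hypothetical):

```python
# Wall-clock timing from prompt to final answer, including any internal
# "thinking" latency. `answer_fn` stands in for whatever model call is timed.
import time
from typing import Callable

def timed_response(answer_fn: Callable[[str], str], prompt: str) -> tuple[str, float]:
    start = time.perf_counter()
    answer = answer_fn(prompt)          # hidden chain-of-thought time is included
    elapsed = time.perf_counter() - start
    return answer, elapsed

def slow_stub(prompt: str) -> str:      # hypothetical model stub for demonstration
    time.sleep(0.2)                     # pretend deliberation
    return "42"

answer, seconds = timed_response(slow_stub, "What is 6 * 7?")
print(f"answer={answer!r}, latency={seconds:.2f}s (compared against a human baseline)")
```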

Final Results: AGI Has Not Arrived Yet

Evaluated under this framework, GPT-4 scores 27% and GPT-5 scores 58%.

GPT-5 improves on knowledge, reading & writing, math, visual and auditory perception, fluid reasoning, and working memory, but long-term memory storage remains at 0%, and processing speed did not improve either. The capability profile is distinctly jagged: some abilities score very high while others sit close to zero.

From this, the researchers draw two conclusions. First, current models are strong in areas that lean on large-scale pattern learning from data (knowledge, reading & writing, math) but have severe shortcomings in underlying cognitive "mechanisms," especially writing to long-term memory. Second, although overall progress is rapid, there is still a significant gap to human-like, comprehensive, and stable general intelligence.

Two Typical "Capability Distortions"

The researchers caution against mistaking engineering workarounds for genuine possession of the corresponding cognitive components:

Using extremely long context (WM) to stand in for long-term memory (MS): relying on a huge "working memory" to cram a day's or even a week's worth of material into the context window can make a model look capable, but it is computationally inefficient, unstable, and cannot support accumulation from day to day or week to week. A real solution must be able to write new experiences into the model's persistent memory.

Using external retrieval (RAG) to stand in for internal retrieval (MR): retrieval can reduce hallucinations, but it papers over two problems: first, the model cannot reliably access its own parametric knowledge; second, it lacks a private, updatable "experiential memory." For AGI, RAG is not a long-term answer and cannot substitute for memory.

Obstacles and Outlook

Reaching "full marks" requires overcoming a series of challenges: abstract reasoning (e.g., ARC-AGI), intuitive physics and video anomaly understanding, spatial navigation memory, low-hallucination precise retrieval, and genuine long-term continual learning. The paper's first author also wrote on social media that AGI is unlikely to appear within a year, but is quite likely to be achieved within this decade.

The future is here, let's journey together!
