Closer to AGI? Running Google's AlphaEvolve and UBC's DGM for Just 0.31 Yuan?

Recently, two particularly interesting projects have appeared in the AI community: Google DeepMind's AlphaEvolve and the University of British Columbia's (UBC) Darwin Gödel Machine (DGM).


During the holiday, I spent 0.31 RMB to run both systems using the DeepSeek model. The results were astonishing:

AlphaEvolve improved the performance of a function optimization algorithm by 8.52% in 3 minutes.

DGM even boosted the performance of a sorting algorithm by 345%, evolving it directly from a simple bubble sort into a highly optimized quicksort.

It was like watching AI reinvent algorithms right in front of me.


Stunning Cost Comparison: a single run of DGM's official experiments costs roughly 22,000 USD in compute. In contrast, I used the domestic DeepSeek model and spent only 0.31 RMB to experience AI's core self-improvement capability. Before you argue with me: if you want to run the full SWE-bench with Claude 3.6 Sonnet and o3-mini, 0.31 RMB would certainly not be enough. What I mean is the experience of running DGM's main code with the DeepSeek-R1-0528 model.


This sent me an important signal: AI self-improvement technology is accelerating, accelerating, and accelerating...

What's even more astonishing is that AlphaEvolve improved on the Strassen matrix multiplication algorithm for the first time in 56 years; this had been an open problem in mathematics since 1969. Both systems share a common ambition: to let AI improve its own code, with no human intervention needed to optimize algorithms.

Figure: AlphaEvolve high-level overview

Figure: Darwin Gödel Machine system overview. DGM iteratively builds a growing archive of agents by alternating self-modification with downstream task evaluation.

What is "Self-Improving" AI? It's Not Just About Tuning Parameters

Traditional Method vs. Self-Improvement:

Traditional AutoML/Hyperparameter Optimization: Operates within a human-designed framework, like changing tires on a car, but the car's basic structure remains unchanged.

AlphaEvolve and DGM: Let the car decide whether to grow wings, whether to become a submarine, or even redesign the entire concept of the vehicle.

The core of this self-improvement is that the system can modify its own source code, not just adjust parameters. What does this mean?

It means AI can change:

Its own algorithmic logic

Tool combinations

Entire workflows

Complex mathematical operations

Areas humans have not yet explored

...

It's like a programmer who can not only debug code but also refactor architecture and invent new programming paradigms.

AlphaEvolve: The Evolutionary Engine for Scientific Discovery

How Google Makes AI "Evolve" Code

AlphaEvolve's operation is quite similar to biological evolution, but much smarter than natural selection.

Core Mechanisms:

Program Database: Stores various versions of algorithm code.

Mutation Operator: Uses LLMs like Gemini 2.0 to analyze existing code and propose improvements.

Automatic Evaluation: Filters code through evaluation functions, only retaining better performing code.

Fully Automated Evolutionary Loop:

The prompt sampler selects well-performing code from the program database as "parents".

The LLM generates modifications (output as diffs) based on the parent code and task context.

The evaluator runs and scores the new code.

High-scoring code is added back to the database.

You can imagine it as a never-ending Code Review and Refactoring process, except all participants are AI.
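To make the loop concrete, here is a minimal sketch of the evolve-and-evaluate cycle in Python. Everything in it (the function names, the sampling and scoring logic) is a hypothetical illustration of the idea, not OpenEvolve's or AlphaEvolve's actual API.

```python
import random

def evolve(database, llm_propose, evaluate, iterations=5):
    """Minimal sketch of an AlphaEvolve-style loop (hypothetical, not a real API).

    database:    list of (score, code) tuples, the "program database"
    llm_propose: callable that takes parent code and returns a modified variant
    evaluate:    callable that runs the code and returns a numeric score
    """
    for _ in range(iterations):
        # 1. Sample a well-performing parent (bias toward higher scores).
        _, parent_code = max(
            random.sample(database, k=min(3, len(database))),
            key=lambda item: item[0],
        )
        # 2. Ask the LLM for a modification (real systems use diff output).
        child_code = llm_propose(parent_code)
        # 3. Score the child; broken variants are simply dropped.
        try:
            child_score = evaluate(child_code)
        except Exception:
            continue
        # 4. Keep the child; selection pressure comes from score-biased sampling.
        database.append((child_score, child_code))
    return max(database, key=lambda item: item[0])
```

The essential point the sketch captures: the LLM only proposes; selection pressure comes entirely from the automated evaluator and score-biased sampling.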

Image

From Matrix Multiplication to Mathematical Puzzles, AlphaEvolve Can Handle It All

What is AlphaEvolve's most impressive achievement? It solved a bunch of problems that human experts couldn't crack for decades.

Historical Breakthrough in Matrix Multiplication:

Historical Problem: The minimum number of scalar multiplications needed for 4x4 matrix multiplication has been an open problem.

Strassen's Algorithm: Proposed in 1969; applied recursively to 4x4 matrices, it uses 49 scalar multiplications, a bound that stood unimproved for 56 years.

AlphaEvolve's Breakthrough: Found an algorithm requiring only 48 multiplications over the complex numbers.
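To see why 48 is the number that matters: Strassen's trick multiplies two 2x2 matrices with 7 scalar multiplications instead of 8, and a 4x4 matrix can be viewed as a 2x2 matrix whose entries are 2x2 blocks, so applying the trick at both levels gives 7 times 7 = 49. As a quick worked check:

```latex
% Strassen (1969): a 2x2 matrix product needs only 7 multiplications.
% Recursing on a 4x4 matrix viewed as 2x2 blocks of 2x2 matrices:
\[
  M(4) = 7 \cdot M(2) = 7 \cdot 7 = 49 .
\]
% AlphaEvolve's construction works over the complex numbers and
% achieves one multiplication fewer:
\[
  M_{\mathbb{C}}(4) \le 48 .
\]
```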

Figure: Capability comparison between AlphaEvolve and its predecessor, FunSearch

Broader Mathematical Achievements: Researchers applied AlphaEvolve to over 50 mathematical construction problems:

Erdős's minimum overlap problem

11-dimensional kissing number problem

Various geometric packing problems

Remarkable Success Rate:

75% of problems: Rediscovered known optimal solutions.

20% of problems: Found better constructions than known solutions.

What does this success rate indicate? It shows that AI already possesses the ability to make discoveries that surpass human experts in certain fields.

Figure: Examples of AlphaEvolve's breakthrough mathematical constructions

Evolution is Not Random Search, But Strategic Exploration

You might think this sounds like brute-force search, but AlphaEvolve's strategy is actually quite sophisticated.

Evaluation Cascade Mechanism:

Newly generated solutions are first validated on simple test cases.

Only if they pass will they proceed to more complex evaluation stages.

It's like multiple rounds of interviews during recruitment, avoiding wasted computational resources.
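Here is a hedged sketch of what such a cascade might look like; the stage functions and thresholds are hypothetical, not AlphaEvolve's actual implementation.

```python
def cascade_evaluate(code, stages):
    """Evaluate `code` through successively more expensive stages.

    `stages` is a list of (evaluate_fn, pass_threshold) pairs, ordered from
    cheap smoke tests to expensive full benchmarks. Evaluation stops at the
    first stage the candidate fails, saving compute on weak candidates.
    Returns the list of stage scores earned (empty list = failed at once).
    """
    scores = []
    for evaluate_fn, threshold in stages:
        score = evaluate_fn(code)
        if score < threshold:
            break  # candidate rejected; skip the more expensive stages
        scores.append(score)
    return scores

# Hypothetical usage: cheap smoke test first, full benchmark last.
# stages = [(smoke_test, 0.5), (small_benchmark, 0.7), (full_benchmark, 0.0)]
# scores = cascade_evaluate(candidate_code, stages)
```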

Multi-objective Optimization Strategy:

Simultaneously pursuing improvements across multiple evaluation metrics.

Even if only one specific metric is of concern, multi-objective optimization often yields better results.

Different evaluation criteria produce excellent programs with different structures, inspiring more creative solutions from the LLM.

Figure: AlphaEvolve's code-change process for discovering faster matrix multiplication algorithms

Verifying AlphaEvolve: What Were the Results?

From Theory to Reality: The Evolution Process of a Function Optimization Task

After all this theory, you might be curious about what these systems actually look like in operation.

My Experiment Setup:

Model: DeepSeek-V3

Project: OpenEvolve, the open-source version of AlphaEvolve (see Reference at the end of the article)

Task: Classic function minimization problem

Time: Approximately 3 minutes

Iterations: 5 code evolutions

The results were indeed impressive: not "astonishing" in an exaggerated sense, but a tangible, visible improvement.

Figure: Actual running process of AlphaEvolve (OpenEvolve) using the DeepSeek model for function optimization

Data Doesn't Lie: A Leap from 0.9035 to 0.9886

Performance Improvement Data:

Initial algorithm score: 0.9035

After 5 iterations: 0.9886

Improvement: 8.52%

You might think this improvement seems small, but keep in mind that this improvement was achieved on an already quite optimized benchmark task. In real engineering scenarios, an 8% performance improvement often means:

Millions in cost savings

Significant improvement in user experience

A More Interesting Trade-off Strategy: the system's scores moved differently across dimensions:

speed_score: Decreased from 1.0000 to 0.9229 (slight decline)

value_score, distance_score, standard_deviation_score: All showed significant improvement.

This shows that the AI learned to trade a slight increase in computational cost for better solution quality, exactly the kind of decision an experienced engineer would make.
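How OpenEvolve folds these component scores into one overall number is not visible in my logs; a common pattern is a weighted average, sketched below with purely hypothetical weights.

```python
def combined_score(metrics, weights=None):
    """Fold several component metrics into one scalar fitness value.

    The weighting below is a hypothetical illustration; OpenEvolve's actual
    evaluator may combine its metrics differently.
    """
    weights = weights or {
        "value_score": 0.35,               # quality of the minimum found
        "distance_score": 0.35,            # closeness to the known optimum
        "standard_deviation_score": 0.15,  # run-to-run stability
        "speed_score": 0.15,               # wall-clock cost
    }
    return sum(weights[name] * metrics[name] for name in weights)

# With weights like these, trading a small speed_score drop (1.0 -> 0.92)
# for large gains in the quality metrics raises the combined score,
# which matches the 0.9035 -> 0.9886 trajectory reported above.
```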

Real-World Imperfections: Error Handling and System Robustness

During the run, some interesting phenomena appeared that let us see how the system behaves in practice.

Syntax Error Handling:

Some code variants generated by the system contained syntax errors (e.g., unclosed parentheses).

This is common in automatic code generation.

The key is: AlphaEvolve's evaluation system can automatically detect these errors.

It discards problematic code and continues evolving with other valid variants.
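A minimal sketch of how an evaluator can reject syntactically broken variants before they waste a full run (hypothetical code, not OpenEvolve's actual implementation):

```python
import ast

def is_valid_python(source: str) -> bool:
    """Cheap gate: reject variants that do not even parse."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

# In the evolution loop, broken variants are dropped and evolution
# simply continues from the remaining valid candidates:
# candidates = [c for c in candidates if is_valid_python(c)]
```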

Engineering Issues:

Encountered Unicode encoding problems (common in Windows environments).

Did not affect core algorithm execution.

Reminds us: When deploying research prototypes to production environments, various engineering issues still need to be considered.
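For reference, the usual workaround on Windows (where the console often defaults to a legacy code page such as GBK) is to force UTF-8 for Python's I/O:

```python
import sys

# Force UTF-8 output so logged code diffs containing non-ASCII characters
# do not raise UnicodeEncodeError on Windows consoles (Python 3.7+).
sys.stdout.reconfigure(encoding="utf-8")
sys.stderr.reconfigure(encoding="utf-8")

# Alternatively, set an environment variable before launching:
#   PYTHONIOENCODING=utf-8   or   PYTHONUTF8=1
```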

DGM: The True "Self-Referential" Improvement System

From Theory to Practice: The Real-World Version of the Gödel Machine

The name Darwin Gödel Machine sounds very academic; its inspiration comes from Schmidhuber's Gödel Machine theory proposed in 2006.

Theory vs. Practice Conversion:

Original Gödel Machine: Required the system to mathematically prove that a modification would be beneficial before making it.

Practical Problem: This is almost impossible to achieve in practice.

DGM's Innovation: Abandoned the mathematical proof requirement, opting instead for empirical verification—judging the effect of modifications through programming benchmarks.

DGM's Core Feature: Self-Referentiality

The biggest difference between DGM and AlphaEvolve lies in "self-referentiality":

DGM not only solves programming tasks

But also improves its own ability to solve programming tasks

And the process of improving itself is itself a programming task.

This forms a recursive loop:

Better programming ability → Better self-improvement ability → Better programming ability

The Ingenuity: Improvements in downstream tasks directly reflect the system's enhanced self-improvement capability.
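Here is an illustrative skeleton of that recursive loop; the names and structure are my own, not the real implementation (which is in the official repo linked in the References).

```python
import random

def dgm_loop(initial_agent, self_modify, evaluate, generations=10):
    """Skeleton of a Darwin Gödel Machine-style loop (illustrative only).

    Each agent is a coding agent whose own source can be edited. Because
    "improve my source code" is itself a coding task, a better coder is
    also a better self-improver: the recursion described above.
    """
    archive = [(evaluate(initial_agent), initial_agent)]
    for _ in range(generations):
        # Pick a parent from the FULL archive; low scorers stay eligible,
        # which is the "open-ended exploration" idea discussed below.
        _, parent = random.choice(archive)
        child = self_modify(parent)     # the agent edits its own source code
        score = evaluate(child)         # scored on a downstream coding benchmark
        archive.append((score, child))  # kept regardless of score, never rolled back
    return max(archive, key=lambda item: item[0])
```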

Open-Ended Exploration: Why Retain "Failed" Attempts

DGM adopts a strategy called "open-ended exploration," which is particularly noteworthy.

Core Concept:

The system maintains an archive containing all historical agents.

Even if some agents appeared to perform poorly at the time, they are not discarded.

Reason: Seemingly "failed" attempts may contain valuable innovations.

Figure: DGM's automated improvement process

Real-World Case Proof: In SWE-bench experiments:

DGM's performance temporarily declined during the 4th and 56th iterations.

The system did not roll back to previous versions.

It continued to explore based on these "regressed" versions.

Result: These seemingly regressed branches ultimately produced excellent agents that surpassed all predecessors.

It's like "accidental discoveries" in scientific research—many significant breakthroughs come from seemingly failed experiments.

From 20% to 50%, DGM's Breakthroughs in Programming Tasks

Benchmark Performance:

SWE-bench: Improved from 20.0% to 50.0%.

Polyglot: Improved from 14.2% to 30.7%.

But more importantly, how it did it—the system automatically discovered:

Automatically Discovered Improvements:

More granular file editing tools (supporting line-by-line viewing and string replacement).

Improved multi-round trial mechanism.

Peer review mechanism (using another foundation model to evaluate and select the best solution).

Key Point: These improvements were not pre-designed by humans; the system discovered them entirely on its own during self-exploration.
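To give a flavor of the first item, here is a minimal sketch of a string-replacement editing tool of the kind DGM reportedly evolved for itself. The function is a hypothetical illustration, not the tool DGM actually generated.

```python
def edit_file(path: str, old: str, new: str) -> str:
    """Replace an exact snippet in a file; refuse ambiguous edits.

    Fine-grained tools like this let an agent change one call site
    without regenerating (and possibly corrupting) the whole file.
    """
    with open(path, encoding="utf-8") as f:
        text = f.read()
    count = text.count(old)
    if count != 1:
        # Ambiguity guard: the agent must supply enough surrounding
        # context to pin down a unique occurrence.
        return f"error: expected exactly 1 match for snippet, found {count}"
    with open(path, "w", encoding="utf-8") as f:
        f.write(text.replace(old, new, 1))
    return "ok"
```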


DGM in Practice: AI's Evolutionary Path from Bubble Sort

Intelligent Decisions Behind a 345% Performance Increase

Compared to AlphaEvolve's incremental optimization, DGM demonstrated a more aggressive self-improvement strategy.

My Experiment Results:

Model: DeepSeek

Task: Sorting algorithm optimization demonstration

Iterations: 3 rounds

Performance jump: From 16.97 to 83.63

Overall improvement: 345.4%

More importantly, we can clearly see how AI carried out "algorithm refactoring" step by step; this improvement far exceeded the scope of traditional parameter tuning.

Figure: The complete process of DGM using the DeepSeek model for sorting-algorithm self-improvement

Not Parameter Tuning, But Algorithm Reinvention

First Round of Improvement: The Most Shocking Algorithmic Paradigm Shift

AI directly abandoned the original bubble sort implementation and completely rewrote it into an iterative quicksort.

This is not simple code optimization, but a fundamental shift in algorithmic paradigm:

From: O(n²) bubble sort

To: O(n log n) quicksort

AI "realized" the inherent flaws of bubble sort and chose a more suitable algorithm structure. This decision-making ability is already close to the level of a senior algorithm engineer.

Second and Third Rounds: Deep Algorithmic Optimization

These rounds demonstrated the AI's deep understanding of algorithmic details:

Hybrid sorting strategy: Using insertion sort for small arrays.

Median-of-three pivot selection.

Stack space usage pattern optimization.

These are textbook-level quicksort optimization techniques, proving that AI has mastered the core principles of algorithm design, not just imitating existing code.
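Again as an illustrative reconstruction (not DGM's literal output), rounds two and three correspond to refinements like these:

```python
INSERTION_SORT_CUTOFF = 16  # hypothetical threshold; real tuning varies

def insertion_sort(a, lo, hi):
    """Fast for tiny ranges; used as the base case of the hybrid strategy."""
    for i in range(lo + 1, hi + 1):
        key, j = a[i], i - 1
        while j >= lo and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key

def median_of_three(a, lo, hi):
    """Pick the median of first/middle/last as pivot to dodge the worst
    cases on sorted or reverse-sorted inputs."""
    mid = (lo + hi) // 2
    if a[mid] < a[lo]:
        a[lo], a[mid] = a[mid], a[lo]
    if a[hi] < a[lo]:
        a[lo], a[hi] = a[hi], a[lo]
    if a[hi] < a[mid]:
        a[mid], a[hi] = a[hi], a[mid]
    a[mid], a[hi] = a[hi], a[mid]   # stash the median at hi as the pivot
    return a[hi]

def quicksort_optimized(a):
    a = list(a)
    stack = [(0, len(a) - 1)]
    while stack:
        lo, hi = stack.pop()
        if hi - lo < INSERTION_SORT_CUTOFF:
            insertion_sort(a, lo, hi)   # hybrid strategy for small ranges
            continue
        pivot = median_of_three(a, lo, hi)
        i = lo
        for j in range(lo, hi):
            if a[j] < pivot:
                a[i], a[j] = a[j], a[i]
                i += 1
        a[i], a[hi] = a[hi], a[i]
        # Push the larger side first so the smaller side is processed next,
        # keeping the explicit stack O(log n) deep.
        if (i - 1 - lo) > (hi - i - 1):
            stack.append((lo, i - 1))
            stack.append((i + 1, hi))
        else:
            stack.append((i + 1, hi))
            stack.append((lo, i - 1))
    return a
```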

The True Exploration Process: Progress and Regression are Normal

DGM's operation truly reflects the uncertainty of exploration.

Reality of Performance Fluctuations:

Second round score: 91.36

Third round score: 83.63

Phenomenon: The third round actually declined compared to the second.

System behavior: Did not simply roll back to the previous version.

This "tolerance for temporary regression" strategy is the essence of open-ended exploration—sometimes a seeming step backward can pave the way for a greater breakthrough.

Multi-dimensional Trade-off Capability: We can observe AI's trade-off strategies across different dimensions:

Algorithm correctness

Execution efficiency

Code readability

Memory usage

This multi-objective optimization capability indicates that DGM already possesses quite mature engineering judgment.

Core Differences Between Specialized vs. General Systems

Differentiation in Application Areas: Scientific Discovery vs. Programming Agents

Although both AlphaEvolve and DGM use evolutionary algorithms and LLM-driven code modification, their application focuses are entirely different.

AlphaEvolve: Scientific Discovery Engine

Positioning: Specifically designed to solve scientific and engineering problems with clear evaluation criteria.

Application areas:

Matrix multiplication

Mathematical constructions

System optimization

Strengths: Capable of handling a wide range of problem types, from mathematical constructions to engineering optimization.

DGM: General Intelligent Agent

Positioning: Building systems capable of continuous self-improvement.

Focus area: Programming tasks.

Core Hypothesis: If the system can write code better, it can improve itself better.

Theoretical Potential: A self-referential design with, in principle, unbounded improvement potential.

Different Choices in Technical Architecture

AlphaEvolve's Architectural Features:

Distributed asynchronous architecture: Can run thousands of evaluation tasks simultaneously.

Applicable scenarios: Computationally intensive scientific problems.

Evaluation cascade: Filters with simple tests first, then proceeds to in-depth evaluation.

Advantage: Greatly improves efficiency.

DGM's Architectural Features:

Relatively simple architecture: But focuses on "open-ended exploration."

Parent selection mechanism: Considers both performance and the number of existing offspring (see the sketch after this list).

Balancing strategy: Both leverages excellent solutions and maintains exploratory diversity.

Traceability: Each agent's modification history is fully recorded.
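Here is a hedged sketch of such a selection rule. The weighting below is a simplified stand-in for DGM's actual formula, keeping only the two ingredients the list above names: performance helps, and many existing children hurt.

```python
import math
import random

def select_parent(archive):
    """Pick a parent, balancing exploitation and exploration.

    `archive` is a list of dicts like {"agent": ..., "score": float,
    "children": int}. The weighting is a simplified stand-in for DGM's
    rule: higher score increases the weight, but many existing children
    decrease it, keeping under-explored (even low-scoring) agents in play.
    """
    def weight(entry):
        exploit = 1.0 / (1.0 + math.exp(-10.0 * (entry["score"] - 0.5)))
        explore = 1.0 / (1.0 + entry["children"])
        return exploit * explore

    weights = [weight(e) for e in archive]
    return random.choices(archive, weights=weights, k=1)[0]
```

The 1/(1 + children) factor is what keeps "failed" agents alive as potential stepping stones, matching the open-ended exploration strategy described earlier.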

Practical Applications: What These Systems Can Bring to Your AI Project

AlphaEvolve's Engineering Value: From Algorithm Optimization to System Acceleration

If you are developing AI products that require high-performance computing, the capabilities demonstrated by AlphaEvolve are highly valuable.

Google's Practical Applications: Researchers used it to optimize several key components of Google's computing stack:

Data center scheduling algorithms

Matrix multiplication kernels for LLM training

Arithmetic circuits within TPUs

Transformer attention computation acceleration

These are critical bottlenecks in actual production environments, and even a minor improvement can bring immense economic value. Note, however, that AlphaEvolve's source code is not public and requires applying to Google for access; the OpenEvolve version run above is a third-party reproduction.

Implications for Your Project: If AlphaEvolve were applied to optimize your inference services, the system might automatically discover:

New batching strategies

Memory management methods

Algorithm combinations you never thought of

Key Advantage: This optimization is end-to-end; you don't need to pre-define the search space, as the system will explore various possibilities on its own.

DGM's Product Insights: Self-Improving Agent Architecture

DGM's value is more reflected at the system architecture level.

Example Application Scenarios: If you are building complex AI agent systems, such as:

Your customer service bot not only answers user questions

But also automatically improves its dialogue strategy based on user feedback

Optimizes knowledge retrieval methods

And even improves the entire interaction process

Experimental Verification: DGM proved that this self-improvement is not wishful thinking:

SWE-bench: Performance is already close to open-source SOTA level.

Polyglot: Even surpassed Aider, a tool hand-optimized by human experts.

This shows that, given enough autonomy and appropriate feedback mechanisms, AI can indeed achieve continuous self-improvement.


Challenges: The Ideal Is Plump, but Reality Is Bony

Computational Cost: Money-Burning Self-Improvement

When it comes to practical deployment, we have to face a real issue: the computational costs of these systems are not low.

Current Cost Status:

DGM: A full run on SWE-bench takes about 2 weeks and costs about 22,000 USD in API calls, as noted at the beginning of this article.

AlphaEvolve: Although improved in sampling efficiency, it still requires a large number of LLM calls for complex problems.

Return on Investment: From another perspective, if the system can automatically discover groundbreaking improvements like the matrix multiplication algorithm, the one-time investment is entirely worthwhile. In other words, it depends on what kind of key discoveries you hope such a self-evolving system will make; if you think it's worth it, run it...

Key Strategy: Choose suitable application scenarios—core algorithms and system components that can bring long-term benefits after improvement.

Security: The Double-Edged Sword of Self-Modification

Letting an AI system modify its own code sounds a bit dangerous.

DGM's Security Measures: Researchers carefully considered security issues:

Sandbox environment

Time limits

Human supervision

Complete modification tracking

Real-world Challenges: But honestly, these measures are definitely not enough in a true production environment. Pandora's box has already been opened; just be ready to pull the plug~

AlphaEvolve's Relative Advantages: It's relatively more conservative in this regard:

Mainly targets scientific problems with clear evaluation criteria.

Risks are relatively controllable (a judgment based only on the paper and the reproduced code).

If this self-modification capability is to be applied to a wider range of AI systems, security mechanisms require more research and improvement.

Limitations of Foundation Models: A Clever Cook Cannot Make a Meal Without Rice

Both systems heavily rely on the capabilities of underlying large language models.

Constraints of Model Capabilities:

AlphaEvolve's experiments show that using stronger models indeed yields better results.

The system's upper limit is constrained by current LLM capabilities.

If the underlying model cannot understand complex concepts in a certain field, even the most ingenious evolutionary algorithm will be of no avail.

Some Inspirations

Rethinking AI System Design Patterns

Perhaps the most important revelation from these two projects is: We need to rethink AI system design patterns.

Traditional vs. New Paradigm:

Traditional approach: Humans design the architecture, and AI learns and optimizes within the framework.

New possibility: AI already possesses the ability to participate in or even lead system design.

Design Suggestions: When designing your next AI product, you might consider leaving some "evolvable" space:

Design certain key components as replaceable modules.

Configure automated evaluation mechanisms.

Allow the system to experiment with different implementation schemes.

Carefully absorb the essence of these codebases, and your product might also gain the capacity for continuous self-improvement. A minimal sketch of such an "evolvable" component follows.
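As one way to leave that evolvable space, here is a hedged sketch of a replaceable-module registry paired with an automated evaluation hook; all names are hypothetical design choices, not a library API.

```python
from typing import Callable, Dict

class EvolvableModule:
    """A swappable component plus the harness that scores candidate
    replacements. Hypothetical design sketch, not a library API."""

    def __init__(self, name: str, impl: Callable,
                 evaluate: Callable[[Callable], float]):
        self.name = name
        self.impl = impl            # current implementation
        self.evaluate = evaluate    # automated benchmark for this module
        self.history: Dict[str, float] = {}

    def try_replace(self, candidate: Callable) -> bool:
        """Adopt `candidate` only if the automated evaluation improves."""
        old_score = self.evaluate(self.impl)
        new_score = self.evaluate(candidate)
        self.history[getattr(candidate, "__name__", "candidate")] = new_score
        if new_score > old_score:
            self.impl = candidate
            return True
        return False

# Usage sketch: a retrieval module whose replacements (human- or
# AI-generated) are auto-vetted against a fixed benchmark.
# retriever = EvolvableModule("retriever", baseline_retrieve, retrieval_benchmark)
# retriever.try_replace(candidate_retrieve)
```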

The Importance of Evaluation Mechanisms: No Evolution Without Feedback

Both systems emphasized the importance of automated evaluation, which is highly insightful for our AI product design.

Core Requirements: If you want your AI system to continuously improve, you must design mechanisms that can:

Quickly and accurately evaluate system performance.

Measure the ultimate effect.

Provide sufficient signals to guide the direction of improvement.

Design Principle: Find "proxy metrics" that are easy to evaluate automatically and genuinely reflect the system's core capabilities.

DGM chose programming benchmarks as its evaluation standard because programming ability and self-improvement ability are directly linked.

Perhaps a New Path to AGI?

Self-Improvement: An Essential Path to AGI

In a sense, self-improvement capability might be one of the necessary conditions for AGI.

Characteristics of Human Intelligence: A key characteristic of human intelligence is the ability to:

Reflect on and improve one's own way of thinking.

Learn to learn.

Learn to think.

Current Progress: AlphaEvolve and DGM have made significant explorations in this direction, demonstrating that AI systems can indeed acquire a certain degree of self-improvement capability.

Realistic Assessment: Of course, these systems are currently far from reaching AGI levels; their self-improvement is still confined to specific domains.

But this beginning is very important—just as the earliest neural networks could only recognize simple patterns, but laid the foundation for the deep learning revolution.

Automation of Scientific Discovery: A New Mode of Human-Machine Collaboration

AlphaEvolve's success in mathematical and algorithmic discovery shows us the possibility of automating scientific research.

Future Research Mode: Future scientific discovery may no longer be a purely human activity, but rather:

Deep integration of human intuition + AI computational power.

Humans provide problem definitions and evaluation criteria.

AI is responsible for large-scale exploration and verification.

Experimental Verification: This mode has been verified in AlphaEvolve's mathematical problem research:

Many problems were suggested by mathematicians Javier Gómez-Serrano and Terence Tao.

Then the AI system was tasked with finding solutions.

This human-machine collaboration mode may become a new paradigm for future scientific research.

Both... and Also...

Anyway, AlphaEvolve and DGM both represent an important milestone in AI development.

They tell us that AI is no longer content with:

Passively executing human-designed tasks

But has begun to:

Actively explore possibilities for self-improvement

As AI product developers, we must:

Seize the opportunities brought by this technological advancement

And also seriously address the challenges and risks

Final Question: Are you ready to embrace this era of AI self-improvement? Google, UBC, and others ran their AI self-evolution systems on OpenAI and Claude models; you should at least, like me, run the code with DeepSeek and experience it for yourself.

Reference:

AlphaEvolve

Paper: https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/AlphaEvolve.pdf

Code: https://github.com/codelion/openevolve (Note: This is not Google's official source code; please verify)

DGM

Paper: https://arxiv.org/pdf/2505.22954

Code: https://github.com/jennyzzt/dgm

The future is here, let's walk together.


<End of Article, Author: Xiū Māo>

Please contact me for reprinting


👉WeChat ID: xiumaoprompt

Please state your purpose when adding!
