Recently, two particularly interesting projects have appeared in the AI community: Google DeepMind's AlphaEvolve and the University of British Columbia's (UBC) Darwin Gödel Machine (DGM).
During the holiday, I spent 0.31 RMB to run both systems using the Deepseek model. The results were astonishing:
AlphaEvolve improved the performance of a function optimization algorithm by 8.52% in 3 minutes.
DGM even boosted the performance of a sorting algorithm by 345%—evolving directly from simple bubble sort to highly optimized quicksort.
It was like watching AI reinvent algorithms right in front of me.
Stunning Cost Comparison: DGM's official experiments cost approximately 22,000 USD in compute for a single run. In contrast, I used the domestic DeepSeek model and spent only 0.31 RMB to experience AI's core self-improvement capability. Before you argue with me: if you need to run the full SWE-bench with Claude 3.5 Sonnet and o3-mini, 0.31 RMB would certainly not be enough. What I mean is the experience of running DGM's main code with the DeepSeek-R1-0528 model.
This sent me a clear signal: AI self-improvement technology is accelerating, accelerating, and accelerating still.
What's even more astonishing is that AlphaEvolve improved on the Strassen-based construction for 4x4 matrix multiplication for the first time in 56 years; this had been an open problem in the mathematical community since 1969! Both systems share a common ambition: letting AI improve its own code, with no human intervention needed to optimize the algorithms.
Figure: AlphaEvolve high-level overview
Figure: Darwin Gödel Machine system overview. DGM iteratively builds a growing archive of agents by alternating between self-modification and downstream task evaluation.
What is "Self-Improving" AI? It's Not Just About Tuning Parameters
Traditional Method vs. Self-Improvement:
Traditional AutoML/Hyperparameter Optimization: Operates within a human-designed framework, like changing tires on a car, but the car's basic structure remains unchanged.
AlphaEvolve and DGM: Let the car decide whether to grow wings, whether to become a submarine, or even redesign the entire concept of the vehicle.
The core of this self-improvement is that the system can modify its own source code, not just adjust parameters. What does this mean?
It means AI can change:
Its own algorithmic logic
Tool combinations
Entire workflows
Complex mathematical operations
Areas humans have not yet explored
...
It's like a programmer who can not only debug code but also refactor architecture and invent new programming paradigms.
AlphaEvolve: The Evolutionary Engine for Scientific Discovery
How Google Makes AI "Evolve" Code
AlphaEvolve's operation is quite similar to biological evolution, but much smarter than natural selection.
Core Mechanisms:
Program Database: Stores various versions of algorithm code.
Mutation Operator: Uses LLMs like Gemini 2.0 to analyze existing code and propose improvements.
Automatic Evaluation: Filters code through evaluation functions, only retaining better performing code.
Fully Automated Evolutionary Loop:
The prompt sampler selects well-performing code from the program database as "parents".
The LLM generates new code modifications (output in diff format) based on this code and the task context.
The evaluator runs and scores the new programs.
Excellent code is added to the database.
You can imagine it as a never-ending Code Review and Refactoring process, except all participants are AI.
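To make the loop concrete, here is a minimal Python sketch of the idea. Everything in it (the function names, the database format, the top-3 parent choice) is my own illustration, not AlphaEvolve's actual code or API:

```python
def evolve(program_db, propose_child, evaluate, iterations=100):
    """Minimal sketch of an AlphaEvolve-style loop (illustrative only).

    program_db:    list of {"code": str, "score": float}
    propose_child: callable(parents) -> new code string (LLM-backed in practice)
    evaluate:      callable(code) -> float score, or None if the code fails
    """
    for _ in range(iterations):
        # 1. Prompt sampler: pick well-performing programs as "parents"
        parents = sorted(program_db, key=lambda p: p["score"], reverse=True)[:3]

        # 2. Mutation operator: an LLM proposes a modified child program
        child_code = propose_child(parents)

        # 3. Automatic evaluation: run and score the candidate
        score = evaluate(child_code)

        # 4. Only valid, scored code survives into the database
        if score is not None:
            program_db.append({"code": child_code, "score": score})

    return max(program_db, key=lambda p: p["score"])
```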
From Matrix Multiplication to Mathematical Puzzles, AlphaEvolve Can Handle It All
What is AlphaEvolve's most impressive achievement? It solved a bunch of problems that human experts couldn't crack for decades.
Historical Breakthrough in Matrix Multiplication:
Historical Problem: Finding the optimal algorithm for 4x4 matrix multiplication has long been an open problem.
Strassen's Algorithm: Proposed in 1969; applied recursively to 4x4 matrices it uses 49 scalar multiplications, a count that stood unimproved for 56 years.
AlphaEvolve's Breakthrough: Found an algorithm requiring only 48 multiplications (valid over complex-valued matrices), a significant breakthrough.
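For intuition about what "counting scalar multiplications" means, here is the textbook Strassen construction at the smallest scale: two 2x2 matrices multiplied with 7 scalar multiplications instead of the naive 8. Applying this idea recursively to 4x4 blocks gives the 49-multiplication baseline that AlphaEvolve cut to 48; the snippet below is only the classic 2x2 version, not AlphaEvolve's new algorithm:

```python
def strassen_2x2(A, B):
    """Multiply two 2x2 matrices using Strassen's 7 scalar multiplications."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B

    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)

    return ((m1 + m4 - m5 + m7, m3 + m5),
            (m2 + m4,           m1 - m2 + m3 + m6))

# Sanity check against the naive product
assert strassen_2x2(((1, 2), (3, 4)), ((5, 6), (7, 8))) == ((19, 22), (43, 50))
```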
Figure: capability comparison between AlphaEvolve and its predecessor, FunSearch
Broader Mathematical Achievements: Researchers applied AlphaEvolve to over 50 mathematical construction problems:
Erdős's minimum overlap problem
11-dimensional kissing number problem
Various geometric packing problems
Remarkable Success Rate:
75% of problems: Rediscovered known optimal solutions.
20% of problems: Found better constructions than known solutions.
What does this success rate indicate? It shows that AI already possesses the ability to make discoveries that surpass human experts in certain fields.
Figure: examples of AlphaEvolve's breakthrough mathematical constructions
Evolution is Not Random Search, But Strategic Exploration
You might think this sounds like brute-force search, but AlphaEvolve's strategy is actually quite sophisticated.
Evaluation Cascade Mechanism:
Newly generated solutions are first validated on simple test cases.
Only if they pass will they proceed to more complex evaluation stages.
It's like multiple rounds of interviews during recruitment, avoiding wasted computational resources.
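In code, such a cascade can be as simple as the sketch below; the stage functions and thresholds are hypothetical, purely to illustrate the fail-fast structure:

```python
def cascade_evaluate(candidate, stages):
    """Evaluate a candidate through progressively more expensive stages.

    stages: list of (evaluate_fn, passing_threshold) tuples, ordered from
            cheapest to most expensive. Returns the final score, or None
            if the candidate is rejected early.
    """
    score = None
    for evaluate_fn, threshold in stages:
        score = evaluate_fn(candidate)
        if score is None or score < threshold:
            return None          # fail fast: don't waste compute downstream
    return score

# Hypothetical usage: a quick smoke test first, then the full benchmark
# stages = [(run_small_tests, 0.5), (run_full_benchmark, 0.0)]
```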
Multi-objective Optimization Strategy:
Simultaneously pursuing improvements across multiple evaluation metrics.
Even if only one specific metric is of concern, multi-objective optimization often yields better results.
Different evaluation criteria produce excellent programs with different structures, inspiring more creative solutions from the LLM.
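One toy way to express "keep several metrics in view at once" is a Pareto filter that only discards candidates beaten on every dimension. This is my own illustration of the principle, not AlphaEvolve's actual mechanism:

```python
def dominates(a, b):
    """True if candidate `a` is at least as good as `b` on every metric
    and strictly better on at least one (higher is better, same keys)."""
    keys = a.keys()
    return all(a[k] >= b[k] for k in keys) and any(a[k] > b[k] for k in keys)

def pareto_front(candidates):
    """Keep every candidate that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates if other is not c)]

# Toy example with two metrics per program
programs = [
    {"value_score": 0.98, "speed_score": 0.92},
    {"value_score": 0.90, "speed_score": 1.00},
    {"value_score": 0.88, "speed_score": 0.95},   # dominated by the second entry
]
print(pareto_front(programs))   # keeps the first two
```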
Figure: AlphaEvolve's code-change process for discovering faster matrix multiplication algorithms
Verifying AlphaEvolve: What Were the Results?
From Theory to Reality: The Evolution Process of a Function Optimization Task
After all this theory, you might be curious about what these systems actually look like in operation.
My Experiment Setup:
Model: DeepSeek-V3
Project: OpenEvolve, an open-source reproduction of AlphaEvolve (see the References at the end of this article)
Task: Classic function minimization problem
Time: Approximately 3 minutes
Iterations: 5 code evolutions
The results were indeed impressive—not "astonishing" in an exaggerated sense, but rather a tangible, visible improvement.
Figure: Actual running process of AlphaEvolve (OpenEvolve) using the Deepseek model for function optimization
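Before the numbers, here is a hypothetical sketch of what the evaluation side of such a function-minimization task can look like. The test function, score names, and weights are mine; OpenEvolve's real evaluator and configuration differ:

```python
import math

# Hypothetical objective: a simple quadratic bowl with its minimum at (1.0, -0.5)
def objective(x, y):
    return (x - 1.0) ** 2 + (y + 0.5) ** 2

def evaluate(search_fn, trials=20):
    """Score a candidate optimizer on best value found, distance to the
    optimum, and run-to-run stability (weights are arbitrary, for illustration)."""
    points = [search_fn(objective) for _ in range(trials)]
    values = [objective(x, y) for x, y in points]
    dists = [math.hypot(x - 1.0, y + 0.5) for x, y in points]

    value_score = 1.0 / (1.0 + min(values))
    distance_score = 1.0 / (1.0 + sum(dists) / trials)
    stability_score = 1.0 / (1.0 + (max(values) - min(values)))

    return 0.4 * value_score + 0.3 * distance_score + 0.3 * stability_score
```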
Data Doesn't Lie: A Leap from 0.9035 to 0.9886
Performance Improvement Data:
Initial algorithm score: 0.9035
After 5 iterations: 0.9886
Improvement: 8.52%
You might think this improvement seems small, but keep in mind that this improvement was achieved on an already quite optimized benchmark task. In real engineering scenarios, an 8% performance improvement often means:
Millions in cost savings
Significant improvement in user experience
More Interesting Trade-off Strategy: System performance across different dimensions:
speed_score: Decreased from 1.0000 to 0.9229 (slight decline)
value_score, distance_score, standard_deviation_score: All showed significant improvement.
This shows that the AI learned to trade a slight increase in computational complexity for better solution quality; it is exactly the kind of trade-off an experienced programmer would make.
Real-World Imperfections: Error Handling and System Robustness
During the runtime, some interesting phenomena appeared, allowing us to see the system's true performance.
Syntax Error Handling:
Some code variants generated by the system contained syntax errors (e.g., unclosed parentheses).
This is common in automatic code generation.
The key is: AlphaEvolve's evaluation system can automatically detect these errors.
It discards problematic code and continues evolving with other valid variants.
Engineering Issues:
Encountered Unicode encoding problems (common in Windows environments).
Did not affect core algorithm execution.
Reminds us: When deploying research prototypes to production environments, various engineering issues still need to be considered.
DGM: The True "Self-Referential" Improvement System
From Theory to Practice: The Real-World Version of the Gödel Machine
The name Darwin Gödel Machine sounds very academic; its inspiration comes from Schmidhuber's Gödel Machine theory proposed in 2006.
Theory vs. Practice Conversion:
Original Gödel Machine: Required the system to mathematically prove that a modification would be beneficial before making it.
Practical Problem: This is almost impossible to achieve in practice.
DGM's Innovation: Abandoned the mathematical proof requirement, opting instead for empirical verification—judging the effect of modifications through programming benchmarks.
DGM's Core Feature: Self-Referentiality
The biggest difference between DGM and AlphaEvolve lies in "self-referentiality":
DGM not only solves programming tasks
But also improves its own ability to solve programming tasks
And the process of improving itself is itself a programming task.
This forms a recursive loop:
Better programming ability → Better self-improvement ability → Better programming ability
The Ingenuity: Improvements in downstream tasks directly reflect the system's enhanced self-improvement capability.
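A conceptual sketch of that recursion, again with made-up helper names rather than DGM's real code:

```python
def dgm_iteration(archive, select_parent, run_agent, evaluate_on_benchmark):
    """One conceptual DGM step (illustrative; not the actual DGM code).

    archive:               list of {"agent_code": str, "score": float}; never pruned
    select_parent:         picks an entry to branch from (not necessarily the best)
    run_agent:             executes an agent's code on a task, returning new agent code
    evaluate_on_benchmark: scores an agent on downstream coding tasks
    """
    parent = select_parent(archive)
    # Self-reference: the parent agent is the "programmer" editing its own source
    child_code = run_agent(parent["agent_code"], task="improve your own coding agent")
    # Empirical benchmarking replaces the original Goedel Machine's proof requirement
    score = evaluate_on_benchmark(child_code)
    archive.append({"agent_code": child_code, "score": score})
    return archive
```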
Open-Ended Exploration: Why Retain "Failed" Attempts
DGM adopts a strategy called "open-ended exploration," which is particularly noteworthy.
Core Concept:
The system maintains an archive containing all historical agents.
Even if some agents appeared to perform poorly at the time, they are not discarded.
Reason: Seemingly "failed" attempts may contain valuable innovations.
Figure: DGM's automated improvement process
Real-World Case Proof: In SWE-bench experiments:
DGM's performance temporarily declined during the 4th and 56th iterations.
The system did not roll back to previous versions.
It continued to explore based on these "regressed" versions.
Result: These seemingly regressed branches ultimately produced excellent agents that surpassed all predecessors.
It's like "accidental discoveries" in scientific research—many significant breakthroughs come from seemingly failed experiments.
From 20% to 50%, DGM's Breakthroughs in Programming Tasks
Benchmark Performance:
SWE-bench: Improved from 20.0% to 50.0%.
Polyglot: Improved from 14.2% to 30.7%.
But more importantly, how it did it—the system automatically discovered:
Automatically Discovered Improvements:
More granular file editing tools (supporting line-by-line viewing and string replacement).
Improved multi-round trial mechanism.
Peer review mechanism (using another foundation model to evaluate and select the best solution).
Key Point: These improvements were not human-pre-designed; they were entirely discovered by the system during its self-exploration process.
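To give a feel for what "a more granular file editing tool" means, here is a minimal hypothetical version of such a tool. It is my own sketch of the idea, not the tool DGM actually generated:

```python
from pathlib import Path

def view_lines(path, start, end):
    """Return a numbered slice of a file so the agent can inspect a small region."""
    lines = Path(path).read_text().splitlines()
    return "\n".join(f"{i}: {line}" for i, line in enumerate(lines[start - 1:end], start))

def replace_string(path, old, new):
    """Replace an exact string once; refuse ambiguous or missing matches."""
    text = Path(path).read_text()
    count = text.count(old)
    if count != 1:
        return f"error: expected exactly one match, found {count}"
    Path(path).write_text(text.replace(old, new, 1))
    return "ok"
```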
DGM in Practice: AI's Evolutionary Path from Bubble Sort
Intelligent Decisions Behind a 345% Performance Increase
Compared to AlphaEvolve's incremental optimization, DGM demonstrated a more aggressive self-improvement strategy.
My Experiment Results:
Model: Deepseek
Task: Sorting algorithm optimization demonstration
Iterations: 3 rounds
Performance jump: From 16.97 to 83.63
Overall improvement: 345.4%
More importantly, we can clearly see how AI carried out "algorithm refactoring" step by step; this improvement far exceeded the scope of traditional parameter tuning.
Figure: The complete process of DGM using the Deepseek model for sorting algorithm self-improvement
Not Parameter Tuning, But Algorithm Reinvention
First Round of Improvement: The Most Shocking Algorithmic Paradigm Shift
AI directly abandoned the original bubble sort implementation and completely rewrote it into an iterative quicksort.
This is not simple code optimization, but a fundamental shift in algorithmic paradigm:
From: O(n²) bubble sort
To: O(n log n) quicksort
AI "realized" the inherent flaws of bubble sort and chose a more suitable algorithm structure. This decision-making ability is already close to the level of a senior algorithm engineer.
Second and Third Rounds: Deep Algorithmic Optimization
Demonstrated AI's deep understanding of algorithmic details:
Hybrid sorting strategy: Using insertion sort for small arrays.
Median-of-three pivot selection.
Stack space usage pattern optimization.
These are textbook-level quicksort optimization techniques, proving that AI has mastered the core principles of algorithm design, not just imitating existing code.
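A representative reconstruction of those techniques, combining the explicit stack, the small-array insertion sort, and median-of-three pivoting, looks like the sketch below. This is my own code in the same spirit, not the verbatim output DGM produced:

```python
def quicksort(arr, small_cutoff=16):
    """Iterative quicksort with insertion sort for small ranges and
    median-of-three pivot selection (illustrative reconstruction)."""
    arr = list(arr)                       # sort a copy
    stack = [(0, len(arr) - 1)]           # explicit stack instead of recursion

    while stack:
        lo, hi = stack.pop()
        if hi - lo + 1 <= small_cutoff:
            # Insertion sort is faster than partitioning on tiny ranges
            for i in range(lo + 1, hi + 1):
                key, j = arr[i], i - 1
                while j >= lo and arr[j] > key:
                    arr[j + 1] = arr[j]
                    j -= 1
                arr[j + 1] = key
            continue

        # Median-of-three pivot: reduces the chance of worst-case splits
        mid = (lo + hi) // 2
        pivot = sorted((arr[lo], arr[mid], arr[hi]))[1]

        # Classic Hoare-style partition around the pivot value
        i, j = lo, hi
        while i <= j:
            while arr[i] < pivot:
                i += 1
            while arr[j] > pivot:
                j -= 1
            if i <= j:
                arr[i], arr[j] = arr[j], arr[i]
                i, j = i + 1, j - 1

        # Push the two remaining sub-ranges onto the explicit stack
        if lo < j:
            stack.append((lo, j))
        if i < hi:
            stack.append((i, hi))
    return arr

assert quicksort(list(range(50, 0, -1))) == list(range(1, 51))
```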
The True Exploration Process: Progress and Regression are Normal
DGM's operation truly reflects the uncertainty of exploration.
Reality of Performance Fluctuations:
Second round score: 91.36
Third round score: 83.63
Phenomenon: The third round actually declined compared to the second.
System behavior: Did not simply roll back to the previous version.
This "tolerance for temporary regression" strategy is the essence of open-ended exploration—sometimes a seeming step backward can pave the way for a greater breakthrough.
Multi-dimensional Trade-off Capability: We can observe AI's trade-off strategies across different dimensions:
Algorithm correctness
Execution efficiency
Code readability
Memory usage
This multi-objective optimization capability indicates that DGM already possesses quite mature engineering judgment.
Core Differences Between Specialized vs. General Systems
Differentiation in Application Areas: Scientific Discovery vs. Programming Agents
Although both AlphaEvolve and DGM use evolutionary algorithms and LLM-driven code modification, their application focuses are entirely different.
AlphaEvolve: Scientific Discovery Engine
Positioning: Specifically designed to solve scientific and engineering problems with clear evaluation criteria.
Application areas:
Matrix multiplication
Mathematical constructions
System optimization
Strengths: Capable of handling various problem types, from mathematical constructions to engineering optimization.
DGM: General Intelligent Agent
Positioning: Building systems capable of continuous self-improvement.
Focus area: Programming tasks.
Core Hypothesis: If the system can write code better, it can improve itself better.
Theoretical Potential: A self-referential design with, in principle, unbounded improvement potential.
Different Choices in Technical Architecture
AlphaEvolve's Architectural Features:
Distributed asynchronous architecture: Can run thousands of evaluation tasks simultaneously.
Applicable scenarios: Computationally intensive scientific problems.
Evaluation cascade: Filters with simple tests first, then proceeds to in-depth evaluation.
Advantage: Greatly improves efficiency.
DGM's Architectural Features:
Relatively simple architecture: But focuses on "open-ended exploration."
Parent selection mechanism: Considers performance and the number of existing offspring.
Balancing strategy: Both leverages excellent solutions and maintains exploratory diversity.
Traceability: Each agent's modification history is fully recorded.
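A toy version of that parent-selection idea might look like the following; the weighting is mine and only illustrates balancing performance against how many children an agent already has, not DGM's actual formula:

```python
import random

def select_parent(archive):
    """Pick a parent agent: favor high scores, but discount heavily-expanded nodes.

    archive entries: {"agent_code": str, "score": float, "children": int}
    The weighting below is illustrative, not DGM's real formula.
    """
    weights = [
        max(entry["score"], 1e-6) / (1.0 + entry["children"])
        for entry in archive
    ]
    # Sampling (rather than taking the argmax) keeps some probability mass on
    # "worse" branches, which is what lets seemingly regressed agents stay in play
    return random.choices(archive, weights=weights, k=1)[0]
```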
Practical Applications: What These Systems Can Bring to Your AI Project
AlphaEvolve's Engineering Value: From Algorithm Optimization to System Acceleration
If you are developing AI products that require high-performance computing, the capabilities demonstrated by AlphaEvolve are highly valuable.
Google's Practical Applications: Researchers used it to optimize several key components of Google's computing stack:
Data center scheduling algorithms
Matrix multiplication kernels for LLM training
Arithmetic circuits within TPUs
Transformer attention computation acceleration
These are critical bottlenecks in real production environments, where even a minor improvement can bring immense economic value. Note, however, that AlphaEvolve's source code is not public (access requires applying to Google); the OpenEvolve version run above is only a reproduction.
Implications for Your Project: If AlphaEvolve were applied to optimize your inference services, the system might automatically discover:
New batching strategies
Memory management methods
Algorithm combinations you never thought of
Key Advantage: This optimization is end-to-end; you don't need to pre-define the search space, as the system will explore various possibilities on its own.
DGM's Product Insights: Self-Improving Agent Architecture
DGM's value is more reflected at the system architecture level.
Example Application Scenarios: If you are building complex AI agent systems, such as:
Your customer service bot not only answers user questions
But also automatically improves its dialogue strategy based on user feedback
Optimizes knowledge retrieval methods
And even improves the entire interaction process
Experimental Verification: DGM proved that this self-improvement is not wishful thinking:
SWE-bench: Performance is already close to open-source SOTA level.
Polyglot: Even surpassed Aider, a tool hand-optimized by human experts.
This shows that, given enough autonomy and appropriate feedback mechanisms, AI can indeed achieve continuous self-improvement.
Challenges: The Ideal Is Rich, but Reality Is Lean
Computational Cost: Money-Burning Self-Improvement
When it comes to practical deployment, we have to face a real issue: the computational costs of these systems are not low.
Current Cost Status:
DGM: A full run on SWE-bench takes about two weeks, and the API costs, as mentioned at the start of this article, run to roughly 22,000 USD.
AlphaEvolve: Although improved in sampling efficiency, it still requires a large number of LLM calls for complex problems.
Return on Investment Thinking: From another perspective, if the system can automatically discover a groundbreaking improvement like the new matrix multiplication algorithm, that one-time investment is entirely worthwhile. In other words, it depends on what kind of key discovery you hope such a self-evolving system will make; if you decide it is worth it, run it.
Key Strategy: Choose suitable application scenarios—core algorithms and system components that can bring long-term benefits after improvement.
Security: The Double-Edged Sword of Self-Modification
Letting an AI system modify its own code sounds a bit dangerous.
DGM's Security Measures: Researchers carefully considered security issues:
Sandbox environment
Time limits
Human supervision
Complete modification tracking
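For intuition, the most basic form of "sandbox plus time limit" is just running generated code in a separate process with a hard timeout, as in this minimal sketch (nowhere near production-grade isolation):

```python
import subprocess
import sys
import tempfile

def run_untrusted(code: str, timeout_s: int = 30) -> str:
    """Run generated code in a separate Python process with a hard time limit.
    Only an illustration; real isolation needs containers, restricted
    filesystem/network access, and resource limits."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, text=True, timeout=timeout_s)
        return result.stdout if result.returncode == 0 else f"error: {result.stderr}"
    except subprocess.TimeoutExpired:
        return "error: time limit exceeded"
```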
Real-world Challenges: But honestly, these measures are definitely not enough in a true production environment. Pandora's box has already been opened; just be ready to pull the plug~
AlphaEvolve's Relative Advantages: It's relatively more conservative in this regard:
Mainly targets scientific problems with clear evaluation criteria.
Risks are relatively controllable (a judgment based only on the paper and the reproduced code).
If this self-modification capability is to be applied to a wider range of AI systems, security mechanisms require more research and improvement.
Limitations of Foundation Models: A Clever Cook Cannot Make a Meal Without Rice
Both systems heavily rely on the capabilities of underlying large language models.
Constraints of Model Capabilities:
AlphaEvolve's experiments show that using stronger models indeed yields better results.
The system's upper limit is constrained by current LLM capabilities.
If the underlying model cannot understand complex concepts in a certain field, even the most ingenious evolutionary algorithm will be of no avail.
Some Inspirations
Rethinking AI System Design Patterns
Perhaps the most important revelation from these two projects is: We need to rethink AI system design patterns.
Traditional vs. New Paradigm:
Traditional approach: Humans design the architecture, and AI learns and optimizes within the framework.
New possibility: AI already possesses the ability to participate in or even lead system design.
Design Suggestions: When designing your next AI product, you might consider leaving some "evolvable" space:
Design certain key components as replaceable modules.
Configure automated evaluation mechanisms.
Allow the system to experiment with different implementation schemes.
Carefully study the essence of this code; that way, your product might also gain the potential for continuous self-improvement.
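One lightweight way to leave that "evolvable" space is to register interchangeable implementations behind a single interface and let an automated benchmark choose among them. A hedged sketch, with names and structure of my own invention:

```python
from typing import Callable, Dict, List

# Registry of interchangeable implementations for one "evolvable" component
STRATEGIES: Dict[str, Callable[[List[float]], List[float]]] = {}

def register(name: str):
    """Decorator that makes an implementation selectable by the evaluator."""
    def wrap(fn):
        STRATEGIES[name] = fn
        return fn
    return wrap

@register("baseline")
def rank_by_score(items: List[float]) -> List[float]:
    return sorted(items, reverse=True)

@register("candidate")                      # e.g., a variant proposed later by an LLM
def rank_by_rounded_score(items: List[float]) -> List[float]:
    return sorted(items, key=lambda x: round(x, 2), reverse=True)

def pick_best(evaluate: Callable[[Callable], float]) -> str:
    """Automated evaluation decides which registered implementation ships."""
    return max(STRATEGIES, key=lambda name: evaluate(STRATEGIES[name]))
```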
The Importance of Evaluation Mechanisms: No Evolution Without Feedback
Both systems emphasized the importance of automated evaluation, which is highly insightful for our AI product design.
Core Requirements: If you want your AI system to continuously improve, you must design mechanisms that can:
Quickly and accurately evaluate system performance.
Measure the ultimate effect.
Provide sufficient signals to guide the direction of improvement.
Design Principle: Find "proxy metrics" that are easy to evaluate automatically and that genuinely reflect the system's core capabilities.
DGM chose programming benchmarks as its evaluation standard because programming ability and self-improvement ability are directly linked.
Perhaps a New Path to AGI?
Self-Improvement: An Essential Path to AGI
In a sense, self-improvement capability might be one of the necessary conditions for AGI.
Characteristics of Human Intelligence: A key characteristic of human intelligence is the ability to:
Reflect on and improve one's own way of thinking.
Learn to learn.
Learn to think.
Current Progress: AlphaEvolve and DGM have made significant explorations in this direction, demonstrating that AI systems can indeed acquire a certain degree of self-improvement capability.
Realistic Assessment: Of course, these systems are currently far from reaching AGI levels; their self-improvement is still confined to specific domains.
But this beginning is very important—just as the earliest neural networks could only recognize simple patterns, but laid the foundation for the deep learning revolution.
Automation of Scientific Discovery: A New Mode of Human-Machine Collaboration
AlphaEvolve's success in mathematical and algorithmic discovery shows us the possibility of automating scientific research.
Future Research Mode: Future scientific discovery may no longer be a purely human activity, but rather:
Deep integration of human intuition + AI computational power.
Humans provide problem definitions and evaluation criteria.
AI is responsible for large-scale exploration and verification.
Experimental Verification: This mode has been verified in AlphaEvolve's mathematical problem research:
Many problems were suggested by the mathematicians Javier Gómez-Serrano and Terence Tao.
Then the AI system was tasked with finding solutions.
This human-machine collaboration mode may become a new paradigm for future scientific research.
Both Opportunities and Challenges
Anyway, AlphaEvolve and DGM both represent an important milestone in AI development.
They tell us that AI is no longer content with:
Passively executing human-designed tasks
But has begun to:
Actively explore possibilities for self-improvement
As AI product developers, we must:
Seize the opportunities brought by this technological advancement
And also seriously address the challenges and risks
Final Question: Are you ready to embrace this era of AI self-improvement? Now that Google, UBC, and others have successfully run AI self-evolution systems on OpenAI and Claude models, you should at least, like me, run the code with DeepSeek and experience it for yourself.
References:
AlphaEvolve
Paper: https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/AlphaEvolve.pdf
Code: https://github.com/codelion/openevolve (Note: This is not Google's official source code; please verify)
DGM
Paper: https://arxiv.org/pdf/2505.22954
Code: https://github.com/jennyzzt/dgm
The future is here, let's walk together.
<End of Article, Author: Xiū Māo>
Please contact me for reprinting
🎉Let's create more beauty together!🎉
👉WeChat ID: xiumaoprompt
Please state your purpose when adding!