In March 2016, professional Go player Lee Sedol faced DeepMind’s AlphaGo in a five-game match. During the second game, the machine played a move so unexpected that commentators initially dismissed it as a mistake.
But later, people realized it was actually a brilliant strategy.
That move, now known as “Move 37,” showed how AI could think in surprising and creative ways, changing how people see the game.
Jump ahead to 2025. The “board” is no longer a 19 × 19 Go grid—it’s the entire landscape of computer science, mathematics, and large-scale systems engineering.
The new contender isn’t exploring a few hundred game positions per second; it’s AlphaEvolve: a squad of frontier large language models (LLMs) that rewrite, critique, and evolve complete codebases while an automated test harness keeps score. Where AlphaGo blended neural networks with tree search, AlphaEvolve blends LLM creativity with evolutionary selection.
Watching AlphaEvolve gives the impression of experiencing numerous Move 37 moments in quick succession. First, it discovered a shortcut for multiplying two 4 × 4 grids of numbers, needing just 48 basic calculations—breaking the long-standing record of 49 steps set in the late 1960s. Next, it tackled over fifty stubborn math puzzles and solved roughly one-fifth of them with brand-new answers. On the engineering front, it fine-tuned a key routine inside Google’s Gemini training software, making it 23% faster and shaving days off every training run. These breakthroughs weren’t flashes of human inspiration; they emerged from an automated laboratory that never sleeps and is already optimizing parts of the very infrastructure that powers it.
The Go board taught us that a machine’s odd-looking move might be genius in disguise; AlphaEvolve suggests the same can be true for code.
In this newsletter, we’ll dissect:
How AlphaEvolve works.
What it has achieved so far.
Why its evolutionary feedback loop could make software, algorithms, and even hardware design the next great frontier of automated discovery.
Let's get started.
At its core, AlphaEvolve is an AI “coding agent” that can invent and optimize computer algorithms automatically.
In essence, it combines the creativity of Google’s Gemini language models with an automated evaluation pipeline that rigorously tests and scores every proposed change.
Anyone who has worked in the tech industry knows that software and algorithm design is still dominated by the time-intensive loop of “ideate → code → test → debug”—a rhythm measured in sprint hours or days. Google DeepMind’s own paper notes that “discovering new high-value knowledge … generally requires a prolonged process of ideation, exploration, back-tracking, experimentation, and validation” before a single breakthrough appears. That latency leaves many classic problems—matrix multiplication ranks, data center scheduling, compiler kernels—effectively frozen in time. Large language models promised to accelerate the creative step, but anyone who has probed their raw outputs knows the flip side: hallucination. A model that “sounds” correct but compiles to garbage still demands a human in the loop.
AlphaEvolve closes that gap with ruthless, machine-grade feedback. Every candidate program is executed inside an automated evaluator; only those that earn better scores survive to mutate again, which “allows AlphaEvolve to avoid any incorrect suggestions from the base LLM” and push far beyond earlier agent pipelines. The result is a 24/7 laboratory where thousands of ideas are tried, scored, and either discarded or propagated—no hallucinations slip through untested.
Imagine handing a thousand bright interns a snippet of code, a stopwatch, and a one-line brief: “make this run faster.” Now replace the interns with a pair of cutting-edge language models. Feed their proposals into an automatic scoreboard that accepts or rejects them in seconds. Keep that loop humming day and night. The result is AlphaEvolve: an autonomous laboratory where variation is supplied by LLMs and selection by an unforgiving test harness. Over many generations, the codebase mutates, competes, and gradually climbs your metric—just as biological populations climb a fitness landscape.
The user defines the task and provides an evaluation function that can automatically score any candidate solution (for example, a unit test or performance metric).
Then AlphaEvolve takes over in a loop:
Prompt sampler: It picks a promising existing program (from its database of solutions) and builds a prompt for the AI. The prompt includes the current code and hints on what to improve.
LLM ensemble: It sends this prompt to an ensemble of Gemini language models. In practice, DeepMind used two: Gemini Flash (a fast, efficient model to explore many ideas) and Gemini Pro (a powerful model to refine the best ideas). Together, they propose code variants—either small “diffs” or whole code blocks—that might improve the algorithm.
Evaluators: Each new “child” program is then run automatically on test inputs. An evaluator script computes scores (accuracy, speed, etc.) for each candidate. These scores tell us which ideas worked best.
Selection and database: All scored programs (and their results) go into a “Program Database”. An evolutionary selection algorithm (inspired by methods like MAP-Elites) chooses which solutions become parents of the next generation. Poor solutions are dropped, while top performers “survive” to be mutated further.
The process repeats for many generations, constantly producing better algorithms. Since everything is coded and tested by a machine, AlphaEvolve can try out hundreds or thousands of variations in parallel, much more than any human team could.
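The loop above can be sketched in a few lines of Python. This is our own greedy simplification, not DeepMind's code: the real system runs an asynchronous pipeline with a MAP-Elites-style program database, and here evaluate and propose_variant stand in for the user's scoring harness and the Gemini models.

```python
import random

def evolve(seed_program, evaluate, propose_variant, generations=100, pop=20):
    # Program database: (score, program) pairs, best-first after each round.
    database = [(evaluate(seed_program), seed_program)]
    for _ in range(generations):
        # Prompt sampler: pick a promising parent via a small tournament.
        parent = max(random.sample(database, min(3, len(database))))[1]
        # LLM ensemble: propose_variant stands in for Gemini Flash / Pro.
        child = propose_variant(parent)
        # Evaluator: score the child automatically.
        database.append((evaluate(child), child))
        # Selection: only the top `pop` candidates survive to mutate again.
        database = sorted(database, reverse=True)[:pop]
    return max(database)
```

Plug in any evaluate that returns a comparable score and any propose_variant (in AlphaEvolve itself, a prompted LLM) and the loop climbs the metric generation by generation.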
Because AlphaEvolve keeps quantitative scores for each candidate, it can systematically optimize code. DeepMind notes that this makes it especially suited to domains where progress can be clearly measured—for example, mathematical problems or system optimizations where correctness and performance are testable. In such domains, AlphaEvolve effectively grades every idea and searches for higher scores, just as a coach might run tournaments to find the best strategy.
In practical terms, AlphaEvolve is like a genetic algorithm crossed with a code-writing AI: new code “mutations” are proposed by LLMs and then ranked by an automated judge. Over time, good code snippets propagate and combine, leading to increasingly strong algorithms. (This is analogous to AlphaZero’s self-play, except here the “game” is solving an engineering or math challenge. In each round, the best program solutions teach the models to propose even more promising offspring.)
The AlphaEvolve paper offers a realistic, hands-on example of how AlphaEvolve actually works when applied to a common task: improving a JAX-based image classification pipeline.
Think of this figure as a snapshot of a “before and after”— a playground where the AI agent iteratively tweaks a machine learning model to make it better.
Let’s walk through the process in three steps.
This is where you (the developer or scientist) come in. You provide:
An existing codebase, like a JAX program that trains a simple ConvNet (a convolutional neural network).
Special markers in the code (comments such as # EVOLVE-BLOCK-START and # EVOLVE-BLOCK-END) that tell AlphaEvolve which parts of the code it is allowed to modify.
An evaluation function, like evaluate(), which runs the model and returns scores such as accuracy or loss.
At this stage, you’re giving AlphaEvolve a cooking recipe and saying, “You can mess with the spices and the cooking time, but leave the main ingredients alone—and judge success by how good the food tastes.”
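Concretely, the task setup might look like this toy sketch. The EVOLVE-BLOCK marker names follow the paper; the one-parameter "model" and three-point dataset are our own stand-ins for a real JAX ConvNet, kept tiny so the example is self-contained.

```python
def predict(x, weight):
    # EVOLVE-BLOCK-START
    # The agent may rewrite anything between these markers: change the
    # model, add terms, swap the formula, and so on.
    return weight * x
    # EVOLVE-BLOCK-END

def evaluate():
    # Fixed scoring function the user supplies: runs the candidate code
    # and returns metrics the evolutionary loop can compare automatically.
    data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy dataset: y = 2x
    loss = sum((predict(x, 2.0) - y) ** 2 for x, y in data)
    return {"loss": loss}
```

Everything outside the markers, including evaluate(), stays fixed; only the marked region is up for mutation.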
AlphaEvolve constructs a prompt to send to a language model. This prompt includes:
The current version of the code (the “baseline” model).
Performance stats of that version (e.g., accuracy = 86.2%).
Instructions, like: “Act as an expert software developer. Your task is to iteratively improve the provided codebase... Propose a new idea...”
It may also show previously successful programs with their scores to give the model some inspiration. This is like showing an AI chef several good recipes and asking, “Can you come up with a better one—maybe tastier, faster to make, or healthier?”
If you want to try a similar optimization loop with your own choice of LLMs, you can recreate this setup for your own code:
Annotate your own code with EVOLVE-BLOCK markers as shown above, integrate a scoring function via evaluate(), and build prompts for your favorite LLM. Congratulations—you’ve got the skeleton of your own AlphaEvolve-style optimization loop.
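A minimal prompt builder along those lines might look like this. The instruction wording paraphrases the paper's description; the exact production prompt is not public, so treat this as a sketch.

```python
def build_prompt(parent_code, parent_score, inspirations):
    # Assemble an AlphaEvolve-style prompt: instructions, a few prior
    # high-scoring programs for inspiration, then the current program.
    lines = [
        "Act as an expert software developer. Your task is to iteratively",
        "improve the provided codebase.",
        "",
    ]
    for code, score in inspirations:
        lines += [f"# Previous attempt (score={score}):", code, ""]
    lines += [
        f"# Current program (score={parent_score}):",
        parent_code,
        "",
        "Propose a new idea as a SEARCH/REPLACE diff.",
    ]
    return "\n".join(lines)
```

Feed the result to any chat-completion API; the response becomes the next candidate to evaluate.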
The LLM responds with diff blocks, showing specific “search and replace” edits to make. For example, for the JAX code they optimized, it returned something like:
```
<<<<<<< SEARCH
return optax.adam(learning_rate)
=======
return optax.adamw(learning_rate, weight_decay=1e-4)
>>>>>>> REPLACE
```
This improved the code by swapping the optimizer from Adam to AdamW with weight decay for better regularization. AlphaEvolve then runs the new version, scores it using your provided evaluate() function, and, if it performs better, uses it to inspire future generations.
This is like the AI chef saying, “Add garlic, double the simmer time, and use coconut oil instead of butter—here’s your new dish. Taste test it!”
Why is this remarkable? It’s transparent: you get readable diffs, not a mysterious blob of model weights. It’s iterative: if the change works, great—if not, it gets discarded. Finally, it’s automated: AlphaEvolve handles proposing, modifying, running, and scoring—freeing you to guide direction instead of micro-editing code.
AlphaEvolve isn’t just theory—it has already yielded impressive results in Google’s own systems and in open scientific challenges. The team applied it to a variety of problems, demonstrating broad capabilities:
Data center scheduling: By framing Google’s cluster scheduling as an optimization problem, AlphaEvolve discovered a new heuristic function to assign jobs to machines. The result is a simple Python function that scores job–machine matches based on CPU and memory ratios. Deployed in production for over a year, this rule has freed up about 0.7% of Google’s total compute resources, roughly the equivalent of an entire data center’s worth of efficiency. Importantly, the solution is just a small piece of code, so it’s human-readable and easy to verify.
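To give a feel for what "a small, readable heuristic" means here, a function of the same general shape might look like the sketch below. The exact formula is not something we reproduce; the body is purely illustrative of scoring a job-machine pairing from CPU and memory ratios.

```python
def schedule_score(required_cpu, required_mem, free_cpu, free_mem):
    # Reject placements that don't fit at all.
    if required_cpu > free_cpu or required_mem > free_mem:
        return float("-inf")
    cpu_ratio = required_cpu / free_cpu
    mem_ratio = required_mem / free_mem
    # Illustrative scoring idea: prefer placements that consume CPU and
    # memory in balance, leaving fewer "stranded" resources behind.
    return -abs(cpu_ratio - mem_ratio)
```

The scheduler would evaluate this score for each candidate machine and place the job on the highest-scoring one; the point is that the whole policy fits in a few auditable lines.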
Hardware circuit design: Given the Verilog code for a key matrix-multiplication circuit in Google’s AI chips, AlphaEvolve found a clever rewrite that removes unnecessary logic bits. The proposed change was automatically checked (to ensure it was functionally correct) and then integrated into a future TPU design. In other words, the AI acted like a chip designer: it suggested an optimized circuit in standard Verilog, which engineers could review and include in hardware.
AI training and inference kernels: To speed up DeepMind’s own model training, AlphaEvolve optimized low-level compute kernels. For example, it learned a smarter way to tile (split) matrix multiplication in Gemini’s architecture. This tweak sped up the critical GEMM (matrix multiplication) kernel by 23%, shaving about 1% off Gemini’s overall training time. It also tackled GPU code for FlashAttention (an important Transformer operation) and achieved up to a 32.5% speedup there. Such optimizations are remarkable because these low-level routines are normally already finely tuned by experts. Incorporating the AI-generated improvements saves engineers weeks of work, cutting “expert effort to days” in some cases.
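To make "tiling" concrete, here is a toy loop-tiled matrix multiply in plain Python. This only illustrates the idea of processing the matrices block by block so each block stays hot in fast memory; the production Gemini kernels and the tile shapes AlphaEvolve actually tuned are not public.

```python
def tiled_matmul(A, B, tile=2):
    # Compute C = A @ B for square list-of-lists matrices, iterating over
    # tile x tile blocks. Picking the tile size to match the memory
    # hierarchy is the knob that kernel tuning turns.
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # Multiply-accumulate one block of C from blocks of A and B.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, n)):
                        for k in range(k0, min(k0 + tile, n)):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

The result is identical to an untiled multiply; only the traversal order (and hence cache behavior) changes.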
Mathematical algorithms: One of AlphaEvolve’s most striking feats is in pure algorithm discovery. Starting from a minimal program framework for matrix multiplication, it invented new algorithmic components. In particular, it found a method to multiply two 4×4 complex-number matrices using only 48 scalar multiplications, beating the long-standing best-known bound. This is the first improvement on this problem in 56 years, since Strassen’s classic 1969 algorithm. (By contrast, DeepMind’s earlier AlphaTensor system was specialized for matrix math but had only improved the special case of binary, i.e., mod-2, arithmetic—AlphaEvolve went further, to general complex numbers.)
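A quick way to see where the old record comes from: applying Strassen's 7-multiplication scheme recursively to a 4 × 4 matrix costs 7 × 7 = 49 scalar multiplications, which is the count AlphaEvolve's 48-multiplication algorithm for complex-valued matrices beats.

```python
def strassen_mult_count(n):
    # Scalar multiplications used by recursive Strassen on an n x n matrix
    # (n a power of two): 7 half-size block products per level of
    # recursion, so 7 ** log2(n) in total.
    if n == 1:
        return 1
    return 7 * strassen_mult_count(n // 2)

print(strassen_mult_count(4))  # 49, the bound AlphaEvolve improved to 48
```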
Open scientific problems: The team also tested AlphaEvolve on about 50 open problems in pure math and CS (from combinatorics to geometry). They gave it the problem definitions and code templates, then let it run. In roughly 75% of cases, the system rediscovered the known best solutions. Impressively, in about 20% of cases, it improved the state of the art. For example, on the 11-dimensional “kissing number” problem (arranging non-overlapping spheres around a central sphere), AlphaEvolve found a configuration of 593 outer spheres, setting a new lower bound in that dimension. In other words, it made machine-verified progress on an open math problem, beating the best previously known human construction.
DeepMind emphasizes the importance of having readable, verifiable code. For instance, the scheduling heuristic is a short Python function that humans can easily understand. This transparency makes it practical to adopt the results in critical systems.
Well, the “AI is coming for your job” joke is back—and this time, it brought a compiler.
But before you dust off your résumé or start networking with sentient GPUs, take a breath. AlphaEvolve isn’t about replacing software engineers; it’s about rewriting what’s possible with them, and who gets to participate in the most complex layers of computing.
For years, writing high-performance kernels, optimizing matrix multiplication routines, or tweaking scheduling heuristics for distributed systems was the domain of deeply specialized engineers—often Ph.D.s working at hyperscalers. The rest of us built on those foundations, rarely venturing into those depths unless we had to.
Now imagine this: with AlphaEvolve as a co-pilot, even engineers without a background in numerical optimization or Verilog might be able to frame a performance problem, define a measurable goal, and let the system iterate toward high-efficiency solutions. You bring the intent and the scaffolding— AlphaEvolve explores the search space. Suddenly, work that was once “only for the algorithm elite” becomes far more accessible to everyday engineers, researchers, and even early-career devs.
With tools like AlphaEvolve, even a junior engineer working with high-level frameworks such as JAX or PyTorch can begin exploring low-level optimization techniques—changing network architecture, experimenting with regularization, or tuning training strategies—and allow the system to propose, test, and evolve those ideas. It’s no longer about finding the best tweak. It’s about framing the right experiment and letting the AI handle the details.
This isn’t about automating away the engineer. It’s about automating away the bottlenecks—the painstaking trial-and-error, the guesswork, the long feedback cycles. You still decide what matters, set the constraints, and define success. The machine just helps you get there faster, and sometimes uncovers ideas no human would have thought to try.
While impressive, AlphaEvolve has clear limitations.
The biggest requirement is that the problem must have a clear, automatic evaluation metric. In other words, the solution must be something we can programmatically test for correctness or performance. DeepMind explains that any task needing human trial-and-error or subjective judgment is out of scope. This means AlphaEvolve shines in math, computer science, and engineering tasks (where answers can be verified by code) but cannot, for example, write a novel or design a new product by itself.
Another practical constraint is compute cost. Running thousands of code variations and evaluations can be expensive, especially for heavy tasks (one example in the paper mentioned some solutions taking ~100 compute hours to test). DeepMind addresses this by parallelizing many evaluations, but in general, using AlphaEvolve on a new problem is more like launching an experimental research project than running a quick API call.
Looking ahead, DeepMind expects AlphaEvolve to grow stronger as its underlying AI models improve. They specifically note that as Gemini (and future LLMs) become “even better at coding,” AlphaEvolve’s ideas will improve too. To help users experiment, DeepMind is working on a user-friendly interface and plans an early-access program for academics.
Finally, because AlphaEvolve is fundamentally general-purpose, the team is excited about applying it beyond current domains. Any field where solutions can be cast as algorithms might benefit—DeepMind mentions materials science, drug discovery, sustainability, and more. For now, success stories have focused on math and Google’s infrastructure, but the vision is that this kind of AI agent could eventually tackle any algorithm design problem with a testable goal.
AlphaEvolve represents a bold new frontier in AI-assisted programming: a system that can invent and refine algorithms on its own, combining the creativity of large language models with the rigor of automated testing. For software engineers, it signals a future where AI helps shoulder the burden of deep optimization, freeing humans to focus on the big picture—design, direction, and intent.
By turning algorithm discovery into a game of generate-and-score, AlphaEvolve has already outperformed experts in areas like cluster scheduling, chip design, and pure mathematics. And it’s only just getting started. As the technology matures, more parts of the software stack—from performance-critical kernels to elegant mathematical routines—may be co-developed with these evolutionary AI agents.
If you’re wondering how to stay ahead of the curve and actually start building with the tools behind AlphaEvolve, we've got you covered with the following courses:
These aren’t just trendy topics but the building blocks of tomorrow’s engineering toolkit. Whether you’re aiming to accelerate your career or just nerd out on cutting-edge AI, now is the perfect time to jump in. So go ahead, experiment boldly, and don’t be afraid to evolve your code—or your skills.