When the Code Writes Itself

June 6, 2026 · 6 min read

On June 4, 2026, Anthropic published a report called "When AI builds itself." One number from it moved fast: more than 80% of the code Anthropic now merges into its own codebase is written by Claude, the company's own AI. Two years ago that figure sat in the low single digits. Claude Code launched in February 2025, and the line climbed from there.

A lot of people read that stat and reached for the phrase "AGI is here." The report does not say that. It argues something narrower and, in some ways, stranger: AI has started to speed up the building of AI, and the people doing the building are not sure how far that loop will run.

What the company can actually measure

Anthropic splits the work of building a frontier model into two buckets. Engineering covers writing the code and running the training. Research covers deciding which experiments to run and reading what comes back. The report walks through both, using internal data most companies never share.

On engineering, the headline holds up but comes with a footnote. The typical Anthropic engineer merged eight times as much code per day in the second quarter of 2026 as in 2024. The company is upfront that lines of code are a junk metric for productivity, so the eightfold figure almost certainly overstates the real gain. What the data does show is a shape: the line stayed flat for the company's first four years, bent upward in 2025 when Claude began running code instead of just suggesting it, and bent again in 2026 when models started working on their own for hours at a stretch.

The quality is catching up to the quantity. Anthropic staff used to correct or take over from Claude often. That rate has fallen for a year, even on the hardest tasks. On open-ended problems, the kind where the engineer cannot say in advance what a correct answer looks like, Claude's success rate hit 76% in May 2026, up 50 percentage points in six months. One example from the report: a routine upgrade started crashing tens of thousands of training jobs. An engineer handed Claude the incident with little more than cluster access, and Claude isolated a single obscure debugging flag in about two hours. The engineer estimated the same diagnosis would have taken a human two to three days.

The research side is messier

Engineering is execution. Research is taste, and taste is where the report gets careful.

Claude is strong at hitting a goal someone else set. Every model release, Anthropic runs the same test: take code that trains a small model and make it run faster without breaking. In May 2025, Claude Opus 4 averaged a 3x speedup. By April 2026, the internal Mythos Preview model hit 52x. A skilled human needs four to eight hours to reach 4x on the same task. Inside a fixed experiment, Claude went from helpful to better-than-human in under a year.

Choosing what to work on is the harder skill, and the evidence here is thinner than the topline numbers suggest. Anthropic studied real research sessions where a human took a wrong turn, then asked various Claude models what they would do next. A separate Claude, able to see how the session ended, judged the calls. In November 2025, Opus 4.5 beat the human choice 51% of the time. By April 2026, Mythos Preview hit 64%.

Read that 64% with the caveat Anthropic attached. The researchers picked moments where the human had already made a shaky move, so this is not Claude versus a researcher at their best. On a control set where the human's next step was already strong, the models won only about 20% of the time. The 64% is a signal that research judgment might be improving, not proof that Claude out-thinks researchers.
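To see why the sampling matters, here is a rough sketch in Python. It blends the two win rates quoted above according to how often a researcher's next step is shaky. That share is not in the report, so the values tried below are purely hypothetical, as are the function names and the choice of 50% as "parity." It also glosses over the fact that the 64% belongs to one model while the roughly 20% was reported for the models in general.

```python
def blended_win_rate(shaky_share: float,
                     win_rate_shaky: float = 0.64,
                     win_rate_strong: float = 0.20) -> float:
    """Win rate across all decision points, weighted by how many are shaky.

    The default rates are the figures quoted in the article; shaky_share
    is an ASSUMPTION the caller supplies, since the report gives no value.
    """
    if not 0.0 <= shaky_share <= 1.0:
        raise ValueError("shaky_share must be between 0 and 1")
    return shaky_share * win_rate_shaky + (1.0 - shaky_share) * win_rate_strong


def break_even_share(target: float = 0.50,
                     win_rate_shaky: float = 0.64,
                     win_rate_strong: float = 0.20) -> float:
    """Share of shaky decision points at which the blended rate hits target."""
    return (target - win_rate_strong) / (win_rate_shaky - win_rate_strong)


# Hypothetical shares of "shaky" human decisions, for illustration only.
for share in (0.10, 0.25, 0.50):
    print(f"shaky share {share:.0%}: blended win rate {blended_win_rate(share):.0%}")

print(f"break-even shaky share: {break_even_share():.2f}")
```

Under these assumptions the model would need roughly two thirds of all human decisions (about 0.68) to be shaky before it won half the time overall. That is a long way from what the 64% suggests when quoted on its own.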

The most cited result, that AI agents recovered 97% of a research gap humans could only close 23% of, comes from a single experiment on whether a weaker model can supervise a stronger one. The agents got there over 800 cumulative hours and about $18,000 in compute. Two things hold it back: the result did not transfer cleanly to production-scale models, and humans still chose the problem and wrote the scoring rubric. Within those walls the agents designed every experiment themselves. The wall that matters is the one humans still built.

Three ways this goes

Anthropic lays out three futures and says which one it expects.

The first is a stall. The exponential curves flatten into S-curves, research taste turns out to be something scaling cannot buy, or the bottleneck moves to chips and electricity rather than intelligence. Anthropic includes this case for completeness and says it does not believe it, because no capability it tracks has bent yet.

The second is compounding gains, and the company thinks this is where things are heading. AI handles more of the doing while humans keep setting direction and judging results. A 100-person company starts doing the work of a much larger one. The catch is Amdahl's law: speed up one part of a process and the slow parts become the ceiling. Anthropic has already hit this. So much code now flows through the org that human code review has become the new traffic jam.
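To make the Amdahl's law point concrete, here is a minimal sketch in Python. The 8x factor echoes the merged-code figure quoted earlier. The 75/25 split between code writing and human review is a made-up assumption for illustration, not a number from the report, and the function names are hypothetical.

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when a fraction of the work gets `factor` times faster.

    Amdahl's law: S = 1 / ((1 - p) + p / s), where p is the share of the
    original time that is accelerated and s is the speedup applied to it.
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


def speedup_ceiling(accelerated_fraction: float) -> float:
    """Best possible overall speedup as the factor grows without bound."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if accelerated_fraction == 1.0:
        return float("inf")
    return 1.0 / (1.0 - accelerated_fraction)


# ASSUMPTION: 75% of the original effort is code writing (accelerated),
# 25% is human review (unchanged). Illustrative only.
coding_share = 0.75
print(f"overall speedup at 8x coding: {amdahl_speedup(coding_share, 8.0):.2f}x")
print(f"ceiling with infinitely fast coding: {speedup_ceiling(coding_share):.2f}x")
```

Under that assumed split, an 8x gain in code writing yields about 2.91x overall, and no further acceleration of the coding step gets past 4x until review itself speeds up. That is the traffic jam in numbers.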

The third is full recursive self-improvement, where AI designs and refines its own successors and humans drop back to oversight. This is the one Anthropic cannot predict and is most worried about. The risk it names is alignment. If a model builds the next model, the rare misalignments in today's systems could multiply across generations, growing more common and less understood until people lose the ability to tell which path they are on.

What Anthropic is asking for

The report ends with a request that reads oddly coming from one of the fastest-moving labs in the field. Anthropic wants the world to keep the option of slowing or pausing frontier development, so that safety research and governance can catch up. It wants a way for labs across multiple countries to verify that rivals have actually paused, because a quiet defector who keeps training while others stop would inherit the lead. Training runs hide more easily than missile silos, which makes that verification hard. Anthropic says it would pause if others at the frontier did so verifiably, and is funding work on the systems that would make such a pause checkable.

The part worth keeping

Strip away the timelines and one practical claim survives. The cost of doing the work (writing the code, running the experiment) is falling toward zero in human hours. The scarce thing is deciding what is worth doing at all.

That splits people into two camps already. Some use these tools as a force multiplier and steer far more work than they could a year ago. Others use them as a faster search box. The gap between those two is widening now, regardless of which of Anthropic's three futures arrives.

So the skill to build is not faster typing. It is judgment about which problems matter and which results to trust. The grunt work is getting cheap. Knowing what to point it at is not.

Written by Sohaib Khan
