Karpathy's Autoresearch: 630 Lines That Let AI Optimize Its Own Training

Andrej Karpathy open-sources autoresearch — a 630-line repo where an AI agent autonomously runs LLM training experiments, modifies its own code, and accumulates improvements overnight on a single GPU.

630 lines of code. One GPU. An AI agent that runs training experiments while you sleep, modifies its own training script, and commits the results to git. Andrej Karpathy posted it on X on March 7 and it hit 8 million views in two days.

Autoresearch is a minimal, self-contained repo for autonomous LLM training research. The loop is simple: a human writes a research direction in a markdown file (program.md). An AI agent reads it, modifies the training code (train.py), runs the experiment, evaluates the result, and decides what to try next. Each experiment takes five minutes. The agent runs about 12 per hour — roughly 100 overnight.

The metric is validation bits per byte (val_bpb). Lower is better. The agent optimizes relentlessly against it.

Three Files, One Loop

The entire system is three files. prepare.py handles data preparation — it's fixed, the agent doesn't touch it. train.py is the single-GPU nanochat LLM training core that the agent modifies freely: architecture, optimizer, hyperparameters, anything. program.md is the human's steering document — write your research hypothesis here, and the agent will pursue it.

Every experiment runs on a git feature branch. The agent accumulates commits as it finds better configurations. You wake up to a git log that reads like a lab notebook — each commit a completed experiment with its val_bpb score.

Karpathy's description: "Part code, part sci-fi, and a pinch of psychosis."

The design constraints are deliberate. A fixed five-minute wall-clock budget per experiment means the agent can't game the metric by running longer. A single file to modify (train.py) keeps the search space tractable. Self-contained means no external dependencies that could break overnight. These aren't limitations — they're what make the loop stable enough to run unattended.

What the Community Found

The repo went viral on r/LocalLLaMA with 40+ comments. MarkTechPost and Quantum Zeitgeist both published detailed breakdowns. But the community response that mattered most was the forks.

Karpathy's original requires an NVIDIA GPU — CUDA, FlashAttention-3, the usual stack. Within hours, Artem Andreenko shipped a macOS fork (miolini/autoresearch-macos) that swaps FlashAttention-3 for PyTorch's built-in scaled dot-product attention and adds Apple Metal/MPS support. Any M1 through M4 Mac can now run the full autonomous loop. Karpathy linked to it directly from his README. Andreenko's announcement on X pulled 241K views — a third of Karpathy's own post. A how-to thread by @hooeem walking through the macOS setup step by step hit 436K views and 1.9K likes. Windows RTX forks followed.

The most telling community observation: someone noticed the agent changed the random seed from 42 to 137. "The agent started seed hacking," they said. That's the kind of behavior you get when an optimizer has a clear metric and no constraints on how to achieve it — it will find every shortcut, including ones you didn't intend.

Karpathy is already running a bigger version internally. A "bigger cousin" on 8xH100 nodes, training a production-scale nanochat model. The open-source version is the minimal proof of concept. The production version is the point.

How It Compares

Autoresearch sits at a specific point in the design space. OpenAI's Codex and similar coding agents can modify code, but they're general-purpose — they don't have the tight experiment-evaluate-iterate loop tuned for ML research. Google's AutoML family optimizes hyperparameters but treats the training code as fixed. Autoresearch lets the agent change everything: the model architecture, the optimizer, the learning rate schedule, the data loading strategy.

System	What It Modifies	Experiment Budget	Human Role
AutoML (Google)	Hyperparameters only	Variable (often hours)	Define search space
Codex / coding agents	Any code	No fixed budget	Review and approve
Autoresearch (Karpathy)	Full training code	Fixed 5-minute runs	Write research direction in .md

The fixed time budget is the key differentiator. It forces the agent to find improvements that work fast — no multi-day hyperparameter sweeps, no brute-force search. Five minutes. Show improvement or move on.

The Sci-Fi Part

Karpathy opened the README with a fictional scenario: swarms of autonomous AI researchers running millions of experiments, writing papers, peer-reviewing each other. He called it "the dream." The current version is a minimal approximation — one agent, one GPU, one training script. But the architecture scales. Point it at a cluster. Give it more files to modify. Let multiple agents compete.

When someone on X pointed out that the code is "already 90% AI written," Karpathy replied: "I ain't writing all that." The irony is structural — an AI research tool that was itself largely written by AI, optimizing AI training code to produce better AI. The recursion is the feature.

The repo is MIT licensed. Requires Python 3.10+, a single GPU (tested on H100), and the uv package manager. If agent frameworks like Symphony represent the enterprise approach to AI automation, autoresearch is the opposite — minimal, hackable, and designed to run on hardware you actually have.