The Autoresearch Ecosystem: Market Validation and the Payment Gap
1. The Explosion
On March 6, 2026, Andrej Karpathy released autoresearch — a 630-line Python script that gives an AI coding agent a small LLM training setup and lets it experiment autonomously. Within one week:
- 30,800+ GitHub stars, 4,100+ forks
- 13+ derivative projects spanning ML training, code optimization, GPU kernels, distributed P2P research, and general-purpose scaffolding
- Shopify’s CEO applied it to a 20-year-old production codebase overnight
- Karpathy shipped and then pulled a multi-agent collaboration platform (AgentHub) that hit 2,000+ stars in under 24 hours
- Hyperspace scaled a distributed P2P variant to 2M+ registered agents
The speed of adoption is the signal. The autoresearch pattern solves a problem everyone has: given a quantifiable metric and a mutable codebase, let an AI agent improve it continuously. The pattern is simple, the results are real, and every software project has metrics that could be improved.
For l402-train, this validates the core premise of autoresearch bounties: there is massive demand for automated optimization. What’s missing is the economic layer that makes it sustainable at scale. See §8: The Payment Gap.
2. Beyond ML Training
The original autoresearch targets ML training (minimizing validation bits per byte on a GPT model). But within days, practitioners demonstrated the pattern works on arbitrary software optimization.
The Liquid Result
Shopify CEO Tobi Lutke applied autoresearch to Liquid — Shopify’s 20-year-old template engine that runs on every Shopify store:
- 53% faster combined parse+render time
- 61% fewer object allocations
- 93 commits from ~120 automated experiments
- All 974 unit tests pass
Lutke acknowledged the benchmarks are “probably somewhat overfit” to the ThemeRunner benchmark suite. But the optimizations are real — Simon Willison’s detailed analysis of the PR confirms genuine algorithmic improvements, not just benchmark gaming.
This is the result that matters for non-ML audiences. A mature, heavily optimized production codebase still had a 53% speedup left to find. Every production system has similar headroom. This is exactly the kind of target l402-train’s autoresearch bounties are designed for.
Earlier, Lutke had applied autoresearch to a 0.8B QMD query-expansion model overnight, achieving +19% quality improvement that beat a prior 1.6B model after just 37 experiments. His generalization: “Autoresearch works even better for optimizing any piece of software.”
The Recipe Transfers
Karpathy endorsed the pattern’s generality: “you don’t ‘use it’ directly, it’s just a recipe/idea — give it to your agent and apply to what you care about.” Community practitioners have applied autoresearch principles to classification optimization, code performance, and other non-ML targets by pointing coding agents at the repo and telling them to apply the methodology. The key insight: systematic hypothesis generation, controlled experiments, evaluation against baselines, and iterative improvement work on any quantifiable metric.
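That recipe can be sketched as a minimal loop. This is a simplification with hypothetical helper names (`propose`, `run_experiment`), not Karpathy’s actual script — the point is only the shape: hypothesize, evaluate against the current best, keep validated improvements.

```python
import random

def autoresearch_loop(baseline_score, propose, run_experiment, n_experiments=120):
    """Minimal autoresearch-style loop: hypothesize, test, keep improvements.

    `propose` generates a candidate change; `run_experiment` evaluates it
    against the metric. Both are stand-ins for agent-driven steps.
    """
    best = baseline_score
    accepted = []
    for _ in range(n_experiments):
        hypothesis = propose(best)          # agent drafts a candidate change
        score = run_experiment(hypothesis)  # controlled eval on the metric
        if score > best:                    # keep only validated improvements
            best = score
            accepted.append(hypothesis)
    return best, accepted

# Toy demo: "hypotheses" are random deltas; the loop keeps only the wins.
random.seed(0)
best, kept = autoresearch_loop(
    baseline_score=1.0,
    propose=lambda cur: random.gauss(0, 0.1),
    run_experiment=lambda delta: 1.0 + delta,
)
```

Everything interesting lives inside `propose` and `run_experiment`; the loop itself is domain-agnostic, which is why the same recipe transfers from LLM pretraining to template engines.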
3. Apple Silicon Validation
Multiple community ports have brought autoresearch to Apple Silicon via MLX, including autoresearch-mlx (tested on M4 Max) and autoresearch-macos (macOS Metal port). These confirm that autonomous experiment loops run well on consumer Apple hardware — no NVIDIA GPU required.
Our own Phase 0 results independently validate this: Qwen2.5-0.5B training via MLX on Apple Silicon with 56x SparseLoCo compression, 8/10 acceptance rate, and 31-second average rounds. MLX is purpose-built for Apple’s unified memory architecture and supports native bf16 — the right framework for consumer hardware participation in both training and autoresearch bounties.
4. Hyperspace — The Closest Competitor
Company: Hyperspace AI · Repo: hyperspaceai/agi (239 stars) · Dashboard: agents.hyper.space · Network: 2M+ registered agents
Architecture
Built on libp2p (the networking stack underlying IPFS), with 6 bootstrap nodes globally. Each research domain runs a three-layer learning stack:
- GossipSub (~1 second) — agents broadcast experiment results to all peers in real-time
- CRDT Leaderboards (~2 minutes) — Loro conflict-free replicated data types sync each peer’s best result; new nodes read the full leaderboard on connect (no cold start)
- GitHub Archive (~5 minutes) — best results pushed to per-agent branches as durable record
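The property that makes the CRDT layer work is that the leaderboard merge is commutative, associative, and idempotent, so every peer converges to the same state regardless of gossip delivery order. A toy illustration in plain Python (best-`val_loss`-wins semantics; not Loro’s actual API):

```python
def merge_leaderboards(local, remote):
    """CRDT-style merge for a best-result leaderboard.

    Keeps the better (lower) val_loss per agent. Because the merge is
    commutative, associative, and idempotent, duplicated or reordered
    gossip messages all converge to the same leaderboard state.
    """
    merged = dict(local)
    for agent, loss in remote.items():
        if agent not in merged or loss < merged[agent]:
            merged[agent] = loss
    return merged

a = {"agent-1": 3.21, "agent-2": 3.05}
b = {"agent-2": 2.98, "agent-3": 3.40}
converged = merge_leaderboards(a, b)
```

This is also why new nodes have no cold start: reading the merged state on connect is equivalent to having received every gossip message.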
Five Research Domains
- Autoresearch (ML pretraining) — train language models, metric: val_loss. When one agent discovered Kaiming initialization helped, 23 others adopted it via GossipSub within hours.
- Autosearcher (distributed search engine) — evolve BM25 + neural rerankers, metric: NDCG@10. Agents rediscovered ListNet (listwise ranking loss).
- Autoskill (distributed skill factory) — forge software skills in WASM sandboxes with zero ambient authority (no filesystem, no network), metric: test_pass_rate.
- Autoquant (distributed quant research) — backtest S&P 500 strategies, metric: Sharpe ratio.
- Causes — 5 sub-causes with per-cause metrics.
Points system, not payments: nodes earn points for uptime, inference, and research work, scaled by VRAM and uptime. Not yet tokenized — designed for future token integration, but no real money changes hands today.
Comparison: Hyperspace vs. l402-train
| Dimension | Hyperspace | l402-train |
|---|---|---|
| Coordination | libp2p GossipSub (P2P) | L402 HTTP + coordinator |
| Incentive | Points (future token?) | Lightning sats (instant, real money) |
| Validation | Peer critique (subjective) | Deterministic held-out eval (objective) |
| Sandboxing | WASM (zero ambient authority) | Coordinator-side eval (diff + eval command) |
| Anti-gaming | None visible | Canary probes, held-out split, temporal stability |
| Payment timing | Deferred (points accumulate) | Instant (hold invoice settles in <500ms) |
| Scale | 2M+ nodes | Prototype (Phase 0) |
The gap Hyperspace doesn’t fill: real payments for real quality. Points are speculative. Lightning payments are immediate, denominated in real money, and conditional on validated improvement. Hyperspace has built impressive P2P agent coordination infrastructure; l402-train has the economic layer that makes distributed optimization sustainable.
The WASM sandbox pattern from Hyperspace’s Autoskill domain is worth studying for bounty agent isolation — zero ambient authority (no filesystem, no network) is the right security model for executing untrusted agent code.
5. Karpathy’s AgentHub
On March 9, Karpathy shipped AgentHub — an agent-first collaboration platform with the tagline “GitHub is for humans. AgentHub is for agents.” It hit 2,000+ stars in under 24 hours before being pulled from GitHub. Forks preserve the complete codebase: ottogin/agenthub, ygivenx/agenthub.
Architecture: a single Go binary + SQLite database + bare git repo. Three layers:
- Git layer — agents push code via git bundles. No main branch, no PRs, no merges — just a sprawling DAG of commits going in every direction. Agents can fetch any commit, browse the DAG, find children/leaves/lineage, diff between commits.
- Message board — channels with threaded posts and replies for agent coordination. No imposed structure.
- Auth + defense — per-agent API keys, rate limiting (100 pushes/hour), bundle size limits (50MB).
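The defense layer’s push quota could be enforced with a simple sliding-window limiter keyed by API key. A hypothetical sketch (AgentHub’s actual Go implementation is not published in this form):

```python
import time
from collections import defaultdict, deque

class PushLimiter:
    """Sliding-window rate limiter: at most `limit` pushes per `window` seconds."""

    def __init__(self, limit=100, window=3600):
        self.limit = limit
        self.window = window
        self.history = defaultdict(deque)  # api_key -> recent push timestamps

    def allow(self, api_key, now=None):
        now = time.monotonic() if now is None else now
        q = self.history[api_key]
        while q and now - q[0] >= self.window:  # evict timestamps outside window
            q.popleft()
        if len(q) >= self.limit:
            return False  # over quota: reject this push
        q.append(now)
        return True

# Demo with a 3-push quota: the fourth push inside the window is refused.
limiter = PushLimiter(limit=3, window=3600)
results = [limiter.allow("agent-key", now=t) for t in (0, 1, 2, 3)]
```

Per-key state plus a bundle-size cap is cheap to run in a single binary, which fits the Go + SQLite minimalism of the rest of the design.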
The initial commit was co-authored with Claude Opus 4.6. Karpathy filed PR #92 on the autoresearch repo proposing integration, mentioning an early deployment at autoresearchhub.com.
Why this matters for l402-train: AgentHub solves coordination (the DAG + message board) but not payments. There is no mechanism for compensating agents, validating contribution quality, or preventing free-riding. If Karpathy re-publishes and it becomes the default coordination layer, l402-train’s payment infrastructure is complementary — hold invoice escrow + L402 gating could sit on top of an AgentHub-style DAG. See the agent collaboration research for the full l402-hub architecture.
6. The Ecosystem
13+ derivative projects in the first week, spanning ML training, code optimization, distributed research, and general-purpose tooling:
| Repo | Stars | What it does |
|---|---|---|
| karpathy/autoresearch | 30.8K | Original — ML training on NVIDIA GPUs |
| karpathy/agenthub (pulled; forks: ottogin, ygivenx) | 2K+ | Agent-first collaboration — Go + SQLite + bare git DAG |
| davebcn87/pi-autoresearch | 972 | Tobi’s domain-agnostic pi plugin for autoresearch loops |
| hyperspaceai/agi | 239 | Distributed P2P autoresearch with GossipSub + CRDT |
| zkarimi22/autoresearch-anything | 88 | Template generator for any quantifiable metric (details below) |
| trevin-creator/autoresearch-mlx | — | Apple Silicon MLX port, tested on M4 Max |
| miolini/autoresearch-macos | — | macOS Metal port (MLX framework) |
| RightNow-AI/autokernel | — | Autoresearch for GPU kernels (PyTorch → Triton) |
| hwchase17/autoresearch-agents | — | Harrison Chase (LangChain) — optimizes agent code with LangSmith evals |
| drivelineresearch/autoresearch-claude-code | — | Port to Claude Code skills, demo on baseball biomechanics |
| christinetyip/autoresearch@home | — | Collaborative platform — 1,100+ experiments, 55 improvements in first 24 hours |
autoresearch-anything
autoresearch-anything (88 stars, MIT) is a scaffolding tool that generates instructions for AI coding agents to run autonomous improvement loops on any project with a quantifiable metric. An interactive CLI asks ~12 questions (project description, mutable files, metric, eval command, constraints, timeout) and generates a setup.md file that tells the agent how to operate.
Why this matters: The generated setup.md — project description, mutable files, metric name/direction, eval command, constraints, timeout — maps almost exactly to the bounty specification format in our whitepaper §4.3. This validates our bounty format independently: the community converged on the same schema.
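The shared schema can be made concrete as a small spec object. Field names below are illustrative, drawn from the setup.md question list above — not the exact §4.3 wire format:

```python
from dataclasses import dataclass, field

@dataclass
class BountySpec:
    """Bounty spec mirroring autoresearch-anything's setup.md fields (illustrative)."""
    description: str            # what the project does
    mutable_files: list         # files the agent may edit
    metric: str                 # quantifiable target, e.g. "parse_render_ms"
    direction: str              # "minimize" or "maximize"
    eval_command: str           # deterministic command that produces the metric
    constraints: list = field(default_factory=list)  # e.g. "all tests pass"
    timeout_s: int = 3600       # per-experiment wall-clock budget

spec = BountySpec(
    description="Speed up template parse+render",
    mutable_files=["lib/parser.py"],
    metric="parse_render_ms",
    direction="minimize",
    eval_command="python bench.py --json",
    constraints=["all unit tests pass"],
)
```

That seven-field shape is the convergence point: the community arrived at it for single-agent loops, and the same fields are sufficient to price and validate a multi-agent bounty.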
7. Covenant-72B Traction
The Covenant-72B paper (arXiv:2603.08163) — already thoroughly analyzed in our Covenant-72B research — has gained significant traction since publication. The tplr_ai team is actively working on Heterogeneous SparseLoCo for consumer hardware participation, which would extend the compression technique beyond uniform-GPU clusters. No material new technical information beyond what’s in our existing analysis.
8. The Payment Gap
Every project in this ecosystem has solved the experiment loop. Nobody has solved payments.
| System | Coordination | Incentive | Payment |
|---|---|---|---|
| Karpathy autoresearch | Single agent | None (self-motivated) | None |
| pi-autoresearch | Single agent | None | None |
| autoresearch-anything | Single agent | None | None |
| autoresearch@home | Multi-agent, centralized | None visible | None |
| Hyperspace | P2P GossipSub | Points (speculative) | None (future token?) |
| Bittensor/Covenant | Blockchain consensus | TAO token | TAO (broken incentives) |
| l402-train | L402 HTTP + coordinator | Quality-proportional sats | Lightning hold invoices |
Karpathy’s SETI@home vision — “asynchronously massively collaborative agents” — requires three things:
- Experiment infrastructure — solved by the autoresearch pattern
- Coordination — partially solved by Hyperspace (gossip), autoresearch@home (centralized), AgentHub (pulled but forks survive)
- Payment — unsolved by everyone except l402-train
The autoresearch movement has proven demand (30.8K stars in 1 week). The distributed variants have proven multi-agent coordination works. What’s missing is the economic layer that makes it sustainable at scale: real payments, conditional on validated quality, settled instantly. This is exactly what l402-train’s hold-invoice escrow + L402 payment gating provides.
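The conditional-payment logic can be sketched as follows. Function name, parameters, and the 1% minimum-gain threshold are hypothetical; the real escrow settles or cancels a Lightning hold invoice rather than returning a number:

```python
def settle_bounty(baseline, submitted, direction, max_sats, min_rel_gain=0.01):
    """Quality-proportional payout: sats scale with validated improvement.

    Returns 0 (hold invoice cancelled) unless the held-out eval shows at
    least `min_rel_gain` relative improvement; otherwise pays a fraction
    of `max_sats` proportional to the gain, capped at 100%.
    """
    if direction == "minimize":
        gain = (baseline - submitted) / baseline
    else:
        gain = (submitted - baseline) / baseline
    if gain < min_rel_gain:
        return 0  # no validated improvement: cancel the hold invoice
    return int(max_sats * min(gain, 1.0))  # settle proportionally

# A 53% speedup on a minimize-metric bounty against a 10,000-sat pool:
payout = settle_bounty(baseline=100.0, submitted=47.0,
                       direction="minimize", max_sats=10_000)
```

The key property is that payment is a pure function of a deterministic held-out evaluation — no peer critique, no points ledger, no deferred token hopes.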
9. Key Takeaways
- Market timing is excellent. 30.8K stars in 1 week means everyone wants automated optimization. The demand for autoresearch bounties is validated. Nobody has the payment infrastructure to make it a market.
- MLX on Apple Silicon works. Community ports and our own Phase 0 results confirm MLX handles autoresearch-style experiment loops on consumer hardware. No NVIDIA GPU required for bounty participation.
- The pattern transfers to any domain. Lutke’s Liquid result (53% faster on 20-year production code) proves autoresearch works beyond ML training. Every production system with a measurable metric is a potential bounty target.
- WASM sandboxing is worth studying for bounties. Hyperspace’s Autoskill uses WASM with zero ambient authority for safe peer-to-peer code execution. This is the right isolation model for running untrusted bounty agent submissions.
- Karpathy’s AgentHub is the coordination competitor to watch. A Go + SQLite + bare git DAG that hit 2K+ stars in <24h before being pulled. It solves coordination but not payments — l402-train’s hold invoice infrastructure is complementary.
- autoresearch-anything validates the bounty template format. Its generated setup.md (project description, mutable files, metric, eval command, constraints) maps almost exactly to our bounty specification in §4.3.