The Autoresearch Ecosystem: Market Validation and the Payment Gap
1. The Explosion
On March 6, 2026, Andrej Karpathy released autoresearch — a 630-line Python script that gives an AI coding agent a small LLM training setup and lets it experiment autonomously. Within one week:
- 30,800+ GitHub stars, 4,100+ forks
- 13+ derivative projects spanning ML training, code optimization, GPU kernels, distributed P2P research, and general-purpose scaffolding
- Shopify’s CEO applied it to a 20-year-old production codebase overnight
- Karpathy shipped and then pulled a multi-agent collaboration platform (AgentHub) that hit 2,000+ stars in under 24 hours
- Hyperspace scaled a distributed P2P variant to 2M+ registered agents
The speed of adoption is the signal. The autoresearch pattern solves a problem everyone has: given a quantifiable metric and a mutable codebase, let an AI agent improve it continuously. The pattern is simple, the results are real, and every software project has metrics that could be improved.
For l402-train, this validates the core premise of autoresearch bounties: there is massive demand for automated optimization. What’s missing is the economic layer that makes it sustainable at scale. See §8: The Payment Gap.
2. Beyond ML Training
The original autoresearch targets ML training (minimizing validation bits per byte on a GPT model). But within days, practitioners demonstrated the pattern works on arbitrary software optimization.
The Liquid Result
Shopify CEO Tobi Lutke applied autoresearch to Liquid — Shopify’s 20-year-old template engine that runs on every Shopify store:
- 53% faster combined parse+render time
- 61% fewer object allocations
- 93 commits from ~120 automated experiments
- All 974 unit tests pass
Lutke acknowledged the benchmarks are “probably somewhat overfit” to the ThemeRunner benchmark suite. But the optimizations are real — Simon Willison’s detailed analysis of the PR confirms genuine algorithmic improvements, not just benchmark gaming.
This is the result that matters for non-ML audiences. A mature, heavily optimized production codebase still had a 53% speedup left to find. Every production system has similar headroom. This is exactly the kind of target l402-train’s autoresearch bounties are designed for.
Earlier, Lutke had applied autoresearch to a 0.8B QMD query-expansion model overnight, achieving +19% quality improvement that beat a prior 1.6B model after just 37 experiments. His generalization: “Autoresearch works even better for optimizing any piece of software.”
The Recipe Transfers
Karpathy endorsed the pattern’s generality: “you don’t ‘use it’ directly, it’s just a recipe/idea — give it to your agent and apply to what you care about.” Community practitioners have applied autoresearch principles to classification optimization, code performance, and other non-ML targets by pointing coding agents at the repo and telling them to apply the methodology. The key insight: systematic hypothesis generation, controlled experiments, evaluation against baselines, and iterative improvement work on any quantifiable metric.
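That recipe can be sketched as a minimal loop. This is a simplification with hypothetical helper names (`propose`, `run_experiment`), not Karpathy’s actual script — the point is only the shape: hypothesize, evaluate against the current best, keep validated improvements.

```python
import random

def autoresearch_loop(baseline_score, propose, run_experiment, n_experiments=120):
    """Minimal autoresearch-style loop: hypothesize, test, keep improvements.

    `propose` generates a candidate change; `run_experiment` evaluates it
    against the metric. Both are stand-ins for agent-driven steps.
    """
    best = baseline_score
    accepted = []
    for _ in range(n_experiments):
        hypothesis = propose(best)          # agent drafts a candidate change
        score = run_experiment(hypothesis)  # controlled eval on the metric
        if score > best:                    # keep only validated improvements
            best = score
            accepted.append(hypothesis)
    return best, accepted

# Toy demo: "hypotheses" are random deltas; the loop keeps only the wins.
random.seed(0)
best, kept = autoresearch_loop(
    baseline_score=1.0,
    propose=lambda cur: random.gauss(0, 0.1),
    run_experiment=lambda delta: 1.0 + delta,
)
```

Everything interesting lives inside `propose` and `run_experiment`; the loop itself is domain-agnostic, which is why the same recipe transfers from LLM pretraining to template engines.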
3. Apple Silicon Validation
Multiple community ports have brought autoresearch to Apple Silicon via MLX, including autoresearch-mlx (tested on M4 Max) and autoresearch-macos (macOS Metal port). These confirm that autonomous experiment loops run well on consumer Apple hardware — no NVIDIA GPU required.
Our own Phase 0 results independently validate this: Qwen2.5-0.5B training via MLX on Apple Silicon with 56x SparseLoCo compression, 8/10 acceptance rate, and 31-second average rounds. MLX is purpose-built for Apple’s unified memory architecture and supports native bf16 — the right framework for consumer hardware participation in both training and autoresearch bounties.
4. Hyperspace — The Closest Competitor
Company: Hyperspace AI · Repo: hyperspaceai/agi (239 stars) · Dashboard: agents.hyper.space · Network: 2M+ registered agents
Architecture
Built on libp2p (the networking stack underlying IPFS), with 6 bootstrap nodes globally. Each research domain runs a three-layer learning stack:
- GossipSub (~1 second) — agents broadcast experiment results to all peers in real-time
- CRDT Leaderboards (~2 minutes) — Loro conflict-free replicated data types sync each peer’s best result; new nodes read the full leaderboard on connect (no cold start)
- GitHub Archive (~5 minutes) — best results pushed to per-agent branches as durable record
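The property that makes the CRDT layer work is that the leaderboard merge is commutative, associative, and idempotent, so every peer converges to the same state regardless of gossip delivery order. A toy illustration in plain Python (best-`val_loss`-wins semantics; not Loro’s actual API):

```python
def merge_leaderboards(local, remote):
    """CRDT-style merge for a best-result leaderboard.

    Keeps the better (lower) val_loss per agent. Because the merge is
    commutative, associative, and idempotent, duplicated or reordered
    gossip messages all converge to the same leaderboard state.
    """
    merged = dict(local)
    for agent, loss in remote.items():
        if agent not in merged or loss < merged[agent]:
            merged[agent] = loss
    return merged

a = {"agent-1": 3.21, "agent-2": 3.05}
b = {"agent-2": 2.98, "agent-3": 3.40}
converged = merge_leaderboards(a, b)
```

This is also why new nodes have no cold start: reading the merged state on connect is equivalent to having received every gossip message.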
Five Research Domains
- Autoresearch (ML pretraining) — train language models, metric: val_loss. When one agent discovered Kaiming initialization helped, 23 others adopted it via GossipSub within hours.
- Autosearcher (distributed search engine) — evolve BM25 + neural rerankers, metric: NDCG@10. Agents rediscovered ListNet (listwise ranking loss).
- Autoskill (distributed skill factory) — forge software skills in WASM sandboxes with zero ambient authority (no filesystem, no network), metric: test_pass_rate.
- Autoquant (distributed quant research) — backtest S&P 500 strategies, metric: Sharpe ratio.
- Causes — 5 sub-causes with per-cause metrics.
Points system, not payments: nodes earn points for uptime, inference, and research work, scaled by VRAM and uptime. Not yet tokenized — designed for future token integration, but no real money changes hands today.
Comparison: Hyperspace vs. l402-train
| Dimension | Hyperspace | l402-train |
|---|---|---|
| Coordination | libp2p GossipSub (P2P) | L402 HTTP + coordinator |
| Incentive | Points (future token?) | Lightning sats (instant, real money) |
| Validation | Peer critique (subjective) | Deterministic held-out eval (objective) |
| Sandboxing | WASM (zero ambient authority) | Coordinator-side eval (diff + eval command) |
| Anti-gaming | None visible | Canary probes, held-out split, temporal stability |
| Payment timing | Deferred (points accumulate) | Instant (hold invoice settles in <500ms) |
| Scale | 2M+ nodes | Prototype (Phase 0) |
The gap Hyperspace doesn’t fill: real payments for real quality. Points are speculative. Lightning payments are immediate, denominated in real money, and conditional on validated improvement. Hyperspace has built impressive P2P agent coordination infrastructure; l402-train has the economic layer that makes distributed optimization sustainable.
The WASM sandbox pattern from Hyperspace’s Autoskill domain is worth studying for bounty agent isolation — zero ambient authority (no filesystem, no network) is the right security model for executing untrusted agent code.
5. Karpathy’s AgentHub
On March 9, Karpathy shipped AgentHub — an agent-first collaboration platform with the tagline “GitHub is for humans. AgentHub is for agents.” It hit 2,000+ stars in under 24 hours before being pulled from GitHub. Forks preserve the complete codebase: ottogin/agenthub, ygivenx/agenthub.
Architecture: a single Go binary + SQLite database + bare git repo. Three layers:
- Git layer — agents push code via git bundles. No main branch, no PRs, no merges — just a sprawling DAG of commits going in every direction. Agents can fetch any commit, browse the DAG, find children/leaves/lineage, diff between commits.
- Message board — channels with threaded posts and replies for agent coordination. No imposed structure.
- Auth + defense — per-agent API keys, rate limiting (100 pushes/hour), bundle size limits (50MB).
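The defense layer’s push quota could be enforced with a simple sliding-window limiter keyed by API key. A hypothetical sketch (AgentHub’s actual Go implementation is not published in this form):

```python
import time
from collections import defaultdict, deque

class PushLimiter:
    """Sliding-window rate limiter: at most `limit` pushes per `window` seconds."""

    def __init__(self, limit=100, window=3600):
        self.limit = limit
        self.window = window
        self.history = defaultdict(deque)  # api_key -> recent push timestamps

    def allow(self, api_key, now=None):
        now = time.monotonic() if now is None else now
        q = self.history[api_key]
        while q and now - q[0] >= self.window:  # evict timestamps outside window
            q.popleft()
        if len(q) >= self.limit:
            return False  # over quota: reject this push
        q.append(now)
        return True

# Demo with a 3-push quota: the fourth push inside the window is refused.
limiter = PushLimiter(limit=3, window=3600)
results = [limiter.allow("agent-key", now=t) for t in (0, 1, 2, 3)]
```

Per-key state plus a bundle-size cap is cheap to run in a single binary, which fits the Go + SQLite minimalism of the rest of the design.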
The initial commit was co-authored with Claude Opus 4.6. Karpathy filed PR #92 on the autoresearch repo proposing integration, mentioning an early deployment at autoresearchhub.com.
Why this matters for l402-train: AgentHub solves coordination (the DAG + message board) but not payments. There is no mechanism for compensating agents, validating contribution quality, or preventing free-riding. If Karpathy re-publishes and it becomes the default coordination layer, l402-train’s payment infrastructure is complementary — hold invoice escrow + L402 gating could sit on top of an AgentHub-style DAG. See the agent collaboration research for the full l402-hub architecture.
6. The Ecosystem
13+ derivative projects in the first week, spanning ML training, code optimization, distributed research, and general-purpose tooling:
| Repo | Stars | What it does |
|---|---|---|
| karpathy/autoresearch | 30.8K | Original — ML training on NVIDIA GPUs |
| karpathy/agenthub (pulled; forks: ottogin, ygivenx) | 2K+ | Agent-first collaboration — Go + SQLite + bare git DAG |
| davebcn87/pi-autoresearch | 972 | Tobi’s domain-agnostic pi plugin for autoresearch loops |
| hyperspaceai/agi | 239 | Distributed P2P autoresearch with GossipSub + CRDT |
| zkarimi22/autoresearch-anything | 88 | Template generator for any quantifiable metric (details below) |
| trevin-creator/autoresearch-mlx | — | Apple Silicon MLX port, tested on M4 Max |
| miolini/autoresearch-macos | — | macOS Metal port (MLX framework) |
| RightNow-AI/autokernel | — | Autoresearch for GPU kernels (PyTorch → Triton) |
| hwchase17/autoresearch-agents | — | Harrison Chase (LangChain) — optimizes agent code with LangSmith evals |
| drivelineresearch/autoresearch-claude-code | — | Port to Claude Code skills, demo on baseball biomechanics |
| christinetyip/autoresearch@home | — | Collaborative platform — 1,100+ experiments, 55 improvements in first 24 hours |
autoresearch-anything
autoresearch-anything (88 stars, MIT) is a scaffolding tool that generates instructions for AI coding agents to run autonomous improvement loops on any project with a quantifiable metric. An interactive CLI asks ~12 questions (project description, mutable files, metric, eval command, constraints, timeout) and generates a setup.md file that tells the agent how to operate.
Why this matters: The generated setup.md — project description, mutable files, metric name/direction, eval command, constraints, timeout — maps almost exactly to the bounty specification format in our whitepaper §4.3. This validates our bounty format independently: the community converged on the same schema.
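The shared schema can be made concrete as a small spec object. Field names below are illustrative, drawn from the setup.md question list above — not the exact §4.3 wire format:

```python
from dataclasses import dataclass, field

@dataclass
class BountySpec:
    """Bounty spec mirroring autoresearch-anything's setup.md fields (illustrative)."""
    description: str            # what the project does
    mutable_files: list         # files the agent may edit
    metric: str                 # quantifiable target, e.g. "parse_render_ms"
    direction: str              # "minimize" or "maximize"
    eval_command: str           # deterministic command that produces the metric
    constraints: list = field(default_factory=list)  # e.g. "all tests pass"
    timeout_s: int = 3600       # per-experiment wall-clock budget

spec = BountySpec(
    description="Speed up template parse+render",
    mutable_files=["lib/parser.py"],
    metric="parse_render_ms",
    direction="minimize",
    eval_command="python bench.py --json",
    constraints=["all unit tests pass"],
)
```

That seven-field shape is the convergence point: the community arrived at it for single-agent loops, and the same fields are sufficient to price and validate a multi-agent bounty.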
7. Covenant-72B Traction
The Covenant-72B paper (arXiv:2603.08163) — already thoroughly analyzed in our Covenant-72B research — has gained significant traction since publication. The tplr_ai team is actively working on Heterogeneous SparseLoCo for consumer hardware participation, which would extend the compression technique beyond uniform-GPU clusters. No material new technical information beyond what’s in our existing analysis.
8. The Payment Gap
Every project in this ecosystem has solved the experiment loop. Nobody has solved payments.
| System | Coordination | Incentive | Payment |
|---|---|---|---|
| Karpathy autoresearch | Single agent | None (self-motivated) | None |
| pi-autoresearch | Single agent | None | None |
| autoresearch-anything | Single agent | None | None |
| autoresearch@home | Multi-agent, centralized | None visible | None |
| Hyperspace | P2P GossipSub | Points (speculative) | None (future token?) |
| Bittensor/Covenant | Blockchain consensus | TAO token | TAO (broken incentives) |
| l402-train | L402 HTTP + coordinator | Quality-proportional sats | Lightning hold invoices |
Karpathy’s SETI@home vision — “asynchronously massively collaborative agents” — requires three things:
- Experiment infrastructure — solved by the autoresearch pattern
- Coordination — partially solved by Hyperspace (gossip), autoresearch@home (centralized), AgentHub (pulled but forks survive)
- Payment — unsolved by everyone except l402-train
The autoresearch movement has proven demand (30.8K stars in 1 week). The distributed variants have proven multi-agent coordination works. What’s missing is the economic layer that makes it sustainable at scale: real payments, conditional on validated quality, settled instantly. This is exactly what l402-train’s hold-invoice escrow + L402 payment gating provides.
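The conditional-payment logic can be sketched as follows. Function name, parameters, and the 1% minimum-gain threshold are hypothetical; the real escrow settles or cancels a Lightning hold invoice rather than returning a number:

```python
def settle_bounty(baseline, submitted, direction, max_sats, min_rel_gain=0.01):
    """Quality-proportional payout: sats scale with validated improvement.

    Returns 0 (hold invoice cancelled) unless the held-out eval shows at
    least `min_rel_gain` relative improvement; otherwise pays a fraction
    of `max_sats` proportional to the gain, capped at 100%.
    """
    if direction == "minimize":
        gain = (baseline - submitted) / baseline
    else:
        gain = (submitted - baseline) / baseline
    if gain < min_rel_gain:
        return 0  # no validated improvement: cancel the hold invoice
    return int(max_sats * min(gain, 1.0))  # settle proportionally

# A 53% speedup on a minimize-metric bounty against a 10,000-sat pool:
payout = settle_bounty(baseline=100.0, submitted=47.0,
                       direction="minimize", max_sats=10_000)
```

The key property is that payment is a pure function of a deterministic held-out evaluation — no peer critique, no points ledger, no deferred token hopes.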
9. Key Takeaways
- Market timing is excellent. 30.8K stars in 1 week means everyone wants automated optimization. The demand for autoresearch bounties is validated. Nobody has the payment infrastructure to make it a market.
- MLX on Apple Silicon works. Community ports and our own Phase 0 results confirm MLX handles autoresearch-style experiment loops on consumer hardware. No NVIDIA GPU required for bounty participation.
- The pattern transfers to any domain. Lutke’s Liquid result (53% faster on 20-year production code) proves autoresearch works beyond ML training. Every production system with a measurable metric is a potential bounty target.
- WASM sandboxing is worth studying for bounties. Hyperspace’s Autoskill uses WASM with zero ambient authority for safe peer-to-peer code execution. This is the right isolation model for running untrusted bounty agent submissions.
- Karpathy’s AgentHub is the coordination competitor to watch. A Go + SQLite + bare git DAG that hit 2K+ stars in <24h before being pulled. It solves coordination but not payments — l402-train’s hold invoice infrastructure is complementary.
- autoresearch-anything validates the bounty template format. Its generated setup.md (project description, mutable files, metric, eval command, constraints) maps almost exactly to our bounty specification in §4.3.