Autoresearch Bounties: Use Cases, Integration, and Market Opportunity

Date: 2026-03-13 · Scope: Autoresearch concept and origin, concrete use cases across 12 domains, integration with l402-train bounty protocol, economics, anti-gaming, comparison to AutoML/Kaggle/Bittensor


1. The Autoresearch Pattern

In March 2026, Andrej Karpathy released autoresearch — a 630-line Python script that gives an AI coding agent a small but real LLM training setup and lets it experiment autonomously. The pattern is simple:

  1. Read the mutable source file
  2. Hypothesize an improvement (change learning rate, modify architecture, adjust data pipeline)
  3. Edit the code
  4. Evaluate against a fixed metric (validation bits per byte in Karpathy’s case)
  5. Keep if improved (git commit), discard if not (git reset)
  6. Repeat

The system achieves approximately 12 experiments per hour — roughly 100 experiments overnight on a single GPU. One GPU, one file, one metric.
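
The six-step loop above can be sketched in a few lines of Python. This is an illustrative reconstruction, not Karpathy's actual script: the evaluate, propose, and revert hooks stand in for the frozen metric, the agent's edit step, and git reset respectively.

```python
def autoresearch_loop(evaluate, propose, revert, n_experiments):
    """Keep-or-revert experiment loop.

    evaluate -- the frozen metric (lower is better); in Karpathy's setup
                this is evaluate_bpb() in the immutable prepare.py
    propose  -- the agent's hypothesize-and-edit step on the mutable file
    revert   -- undo the last edit (git reset in Karpathy's setup)
    """
    best = evaluate()
    kept = 0
    for _ in range(n_experiments):
        propose()                  # steps 2-3: hypothesize and edit
        score = evaluate()         # step 4: frozen metric
        if score < best:           # step 5: keep if improved (git commit)
            best, kept = score, kept + 1
        else:
            revert()               #         discard if not (git reset)
    return best, kept
```

At ~12 experiments per hour, n_experiments is roughly 100 for one overnight run.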

The Three-File Architecture

The design is deliberately constrained to three files:

  • prepare.py — immutable. Data preparation, the evaluation function (evaluate_bpb()), and utilities. The agent cannot touch this file.
  • train.py — the sole mutable file. Architecture, optimizer, hyperparameters, batch sizes. This is what the agent experiments on.
  • program.md — human-authored instructions for the agent. Defines objectives and constraints. Includes the autonomy rule: “Once the experiment loop has begun, do NOT pause to ask the human if you should continue… The human might be asleep… You are autonomous.”

The critical design principle is the frozen metric: the evaluation function lives in an immutable file. “A system that can rewrite both the exam and the answers will always pass.” The metric (val_bpb — validation bits per byte) is vocabulary-size-independent, so architectural changes are fairly compared. This principle maps directly to l402-train’s bounty design, where the held-out evaluation set is controlled by the sponsor, not the agent.

Results

Karpathy left the agent running for two days on a depth-12 model. It made ~700 autonomous changes, kept roughly 20 additive improvements, and stacking them dropped the “Time to GPT-2” metric from 2.02 to 1.80 hours — an 11% efficiency gain. Critically, discoveries at the depth-12 proxy transferred to depth-24 production models, meaning the small-scale experiments produced transferable knowledge.

The post garnered 8.6 million views in two days. The reaction was immediate and practical:

  • Shopify CEO Tobi Lutke applied the pattern to a production 0.8B model overnight, achieving +19% quality improvement — beating a prior 1.6B model after just 37 experiments. Lutke generalized: “Autoresearch works even better for optimizing any piece of software.”
  • autokernel (RightNow AI) applied it to GPU kernel optimization, running ~40 experiments/hour
  • HFT firms reported agents discovering “techniques I understand to be proprietary” through overnight experimentation
  • API latency reduction of 40% achieved overnight in production systems
  • Marketing optimization (Eric Siu, Single Grain): replaced training script with landing page, measured positive reply rate. “The companies that win won’t have better marketers, they’ll have faster experiment loops.”

Notably, Lutke’s result — a 0.8B model beating a 1.6B model after overnight optimization — demonstrates that autoresearch makes small models on cheap hardware competitive with large models on expensive hardware. This aligns perfectly with l402-train’s target of consumer hardware participation.

Karpathy’s Vision

Karpathy frames the shift bluntly: “Frontier AI research used to be done by meat computers in between eating, sleeping, having other fun… That era is long gone. Research is now entirely the domain of autonomous swarms of AI agents.” Humans shift from experimenting to designing experiments — programming program.md files rather than Python.

His SETI@home analogy frames the coordination challenge: “asynchronously massively collaborative agents … the goal is not to emulate a single PhD student, it’s to emulate a research community.”

This is explicitly a coordination problem. Scaling from one agent on one GPU to hundreds of competing agents exploring different approaches simultaneously requires coordination infrastructure: bounty publication, agent discovery, submission management, validation, and payment. This is exactly what l402-train’s coordinator + L402 + hold invoice infrastructure provides.


2. Use Cases

The autoresearch pattern works on anything with a quantifiable metric. Below are concrete use cases organized by domain, with the metric being optimized, the mutable target, and example bounty economics.

2.1 ML Model Optimization

Metric: validation loss, accuracy, F1 score, inference speed
Target: training scripts, model architecture, hyperparameters

This is where autoresearch was born. An AI agent can optimize learning rate schedules, experiment with architecture modifications (layer sizes, attention patterns, activation functions), tune data preprocessing pipelines, and optimize quantization parameters for deployment.

Real result: Karpathy’s 11% training efficiency gain, Lutke’s +19% quality improvement.

Example bounty: “Reduce val_bpb of this 0.5B model training script by 5% within 5-minute training runs. 50,000 sats available.”
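
As an illustration, such a bounty might be posted as a small structured record. All field names below are hypothetical, not a fixed l402-train schema:

```python
# Hypothetical bounty record for the example above; field names are
# illustrative, not part of any fixed l402-train schema.
ml_bounty = {
    "metric": "val_bpb",
    "direction": "minimize",
    "target_improvement": 0.05,     # the 5% reduction the sponsor asks for
    "mutable_files": ["train.py"],  # frozen-metric rule: prepare.py stays immutable
    "eval_budget_seconds": 300,     # 5-minute training runs
    "pool_sats": 50_000,
    "constraints": {"max_diff_lines": 200, "tests_must_pass": True},
}
```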

2.2 Code Performance Optimization

Metric: execution time, throughput, memory usage, p99 latency
Target: hot paths, algorithms, data structures, SQL queries

Production software has measurable performance characteristics. An agent can profile and optimize hot code paths, replace algorithms with faster alternatives, optimize memory allocation patterns, rewrite SQL queries for better execution plans, and tune thread pool sizes, buffer sizes, and cache parameters.

Real result: 40% API latency reduction achieved overnight. autokernel running ~40 kernel optimizations/hour.

Example bounty: “Reduce p99 latency of this API endpoint from 200ms to under 150ms. Test suite must pass. 100,000 sats.”

2.3 GPU Kernel Optimization

Metric: FLOPS utilization, kernel execution time, memory bandwidth
Target: CUDA/Metal/Triton kernels

GPU kernels are pure performance targets with clear metrics. An agent can optimize thread block dimensions and shared memory usage, experiment with memory coalescing patterns, tune occupancy and instruction-level parallelism, and explore fusion opportunities between sequential kernels.

Real result: autokernel achieves ~40 experiments/hour, systematically exploring the optimization space.

Example bounty: “Improve this attention kernel throughput by 10%+ on A100. Correctness tests must pass. 200,000 sats.”

2.4 Prompt Engineering

Metric: task accuracy, consistency score, token efficiency, cost per correct answer
Target: system prompts, few-shot examples, output format instructions

Prompt engineering is mostly trial and error today. An agent can systematically vary instruction phrasing and measure accuracy, optimize few-shot example selection, reduce prompt token count while maintaining performance, tune output format constraints for downstream parsing reliability, and A/B test chain-of-thought vs. direct prompting strategies.

Example bounty: “Improve classification accuracy of this prompt from 78% to 85%+ on eval set. Max 2,000 tokens. 30,000 sats.”

2.5 Compiler and Build Optimization

Metric: binary size, compilation time, runtime performance
Target: compiler flags, build configurations, link-time optimization settings

Build systems have hundreds of tunable parameters. An agent can explore compiler flag combinations (-O2 vs -O3, LTO, PGO), optimize include hierarchies to reduce compilation time, tune linker settings for binary size vs. performance, and experiment with precompiled header configurations.

Example bounty: “Reduce this project’s clean build time from 8 minutes to under 6 minutes without degrading runtime performance. 75,000 sats.”

2.6 Database and Query Optimization

Metric: query execution time, throughput (QPS), storage efficiency
Target: schema design, indexes, query plans, configuration parameters

Database performance is highly tunable and measurable. An agent can optimize index selection and composite index design, rewrite queries for better plan generation, tune database configuration parameters (buffer pool, WAL settings, connection pool), and improve materialized view definitions.

Example bounty: “Reduce average query latency on this test workload by 20%. Schema changes allowed, data integrity tests must pass. 150,000 sats.”

2.7 Test Suite Optimization

Metric: coverage percentage, execution time, mutation score
Target: test files, test configuration, test data

Test suites can be slow and incomplete. An agent can generate tests to increase coverage of uncovered branches, optimize test execution order for faster failure detection, reduce test suite runtime through parallelization and fixture optimization, and improve mutation testing scores by adding discriminating tests.

Example bounty: “Increase branch coverage from 72% to 85%+ with execution time under 60 seconds. 40,000 sats.”

2.8 Infrastructure and DevOps

Metric: deployment time, resource utilization, cost per request
Target: container configs, CI/CD pipelines, infrastructure-as-code

Infrastructure has clear performance metrics. An agent can optimize Docker image sizes and build times (multi-stage builds, layer ordering), tune Kubernetes resource requests/limits for cost efficiency, optimize CI/CD pipeline parallelism and caching, and reduce cloud infrastructure costs through right-sizing.

Example bounty: “Reduce this Docker image from 1.2 GB to under 500 MB. All integration tests must pass. 50,000 sats.”

2.9 Data Pipeline Optimization

Metric: throughput (records/sec), latency, resource cost
Target: ETL scripts, stream processing configs, batch job parameters

Data pipelines process measurable workloads. An agent can optimize batch sizes and parallelism settings, improve serialization/deserialization performance, tune partition strategies for better data locality, and optimize window and watermark configurations in stream processing.

Example bounty: “Improve this ETL pipeline throughput from 10K to 15K+ records/sec on the test dataset. Output must match reference. 80,000 sats.”

2.10 Energy and Resource Efficiency

Metric: energy per operation (J/inference, W/throughput), carbon intensity
Target: power management configs, scheduling algorithms, hardware utilization

Sustainability metrics are increasingly measurable. An agent can optimize batch scheduling to reduce idle GPU power draw, tune DVFS (dynamic voltage/frequency scaling) parameters, improve workload placement for thermal efficiency, and optimize sleep/wake patterns for intermittent workloads.

Example bounty: “Reduce energy per inference from 0.8 J to under 0.6 J on this model. Throughput must not decrease. 60,000 sats.”

2.11 Security Hardening

Metric: vulnerability count, fuzzing coverage, time-to-exploit
Target: configuration files, security policies, code patterns

Security has quantifiable metrics from scanners and fuzzers. An agent can reduce static analysis warnings through code fixes, improve fuzzing coverage percentage, harden configurations against CIS benchmarks, and optimize security headers and CSP policies. BountyBench (NeurIPS 2025) showed AI agents can already detect, exploit, and patch real-world vulnerabilities.

Example bounty: “Reduce SAST findings from 47 to under 10 without breaking tests. 120,000 sats.”

2.12 Scientific and Research Applications

Metric: domain-specific (binding affinity, yield prediction, simulation accuracy)
Target: model parameters, simulation configs, analysis scripts

Any computational science workflow with measurable output is a candidate: drug discovery (molecular property prediction), materials science (simulation parameter tuning), climate modeling (parameterization schemes), bioinformatics (sequence alignment scoring). Self-driving labs are already demonstrating this at scale — Sandia ran 300 experiments in 5 hours on metasurface light emission, and fully autonomous synthesis of 29 organosilicon compounds (8 previously unknown) was reported in Nature in 2024.

Example bounty: “Improve binding affinity prediction RMSE from 1.2 to under 1.0 on the test set. 500,000 sats.”


3. Integration with l402-train

The l402-train protocol’s existing infrastructure — L402 payment gating, hold invoice escrow, coordinator validation, peer discovery — maps directly to autoresearch bounties.

3.1 Shared Infrastructure

Component          Training Use                               Bounty Use
L402 endpoint      Submit compressed gradients                Download bounty materials + submit improvements
Hold invoices      Lock payment during gradient validation    Lock payment during held-out evaluation
Coordinator        Validate gradient quality                  Run held-out eval, check for gaming
Payment scaling    Proportional to loss improvement           Proportional to metric improvement
Peer discovery     Find training tasks                        Find available bounties

3.2 Bounty Lifecycle

  1. Sponsor creates bounty via coordinator API:
    • Target files (git repo or tarball)
    • Eval command + public eval dataset
    • Held-out eval set (hash committed, data secret)
    • Total sats available + payment schedule + deadline
    • Constraints: max diff size, required tests, forbidden patterns
  2. Agents discover and download via L402-gated endpoint:
    • Pay small access fee (anti-spam + bandwidth cost)
    • Receive baseline code + public eval framework
    • Run autonomous experiments locally
  3. Agents submit improvements:
    • Code diff + claimed score on public eval
    • Hold invoice created (payment locked)
  4. Coordinator validates:
    • Apply diff to baseline
    • Run eval on held-out dataset (not public)
    • Check for gaming: canary probes, distribution shift, temporal stability
    • Score improvement magnitude
  5. Payment settles:
    • 80% immediate: payment = bounty_pool × (improvement / target)
    • 20% holdback: released after 24–72 hour temporal stability check
    • Auto-refund on timeout if validation fails
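
A minimal sketch of the payout arithmetic in step 5, assuming the 80/20 split described above (the function and its parameters are illustrative, not protocol-defined):

```python
def settle(pool_sats, improvement, target, holdback=0.20):
    """Split a validated improvement into an immediate payout and a
    holdback released after the 24-72 h temporal stability check.
    improvement and target are relative metric gains; the payout is
    proportional and capped at the bounty pool."""
    if improvement <= 0:
        return 0, 0                              # validation failed: auto-refund
    earned = min(pool_sats, int(pool_sats * improvement / target))
    held = int(earned * holdback)
    return earned - held, held                   # (immediate 80%, holdback 20%)
```

A 5% validated improvement against a 10% target on a 100K-sat bounty would pay 40,000 sats immediately and hold 10,000 sats for the stability window.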

3.3 Why This Works Better Than Training

Training requires synchronized gradient exchange every ~70 seconds across all peers. Bounties are embarrassingly parallel — agents work independently, submit independently, and get paid independently. This means:

  • No synchronization overhead. Agents can take minutes, hours, or days per experiment
  • No hardware minimum. Any computer that can run a coding agent participates
  • Simpler validation. “Did the metric improve?” is a binary question with a deterministic answer
  • Natural market pricing. Sponsors set bounty amounts; agents self-select based on profitability
  • Immediate utility. Every accepted improvement directly helps the sponsor’s production system
  • The scarce resource shift. Execution has become cheap; judgment and verification are now the scarce resource. The coordinator sells verification, not compute — a sustainable role as agent capabilities increase

4. Anti-Gaming

Goodhart’s law — “when a measure becomes a target, it ceases to be a good measure” — is the primary risk. Agents will optimize the metric, not necessarily improve the system.

4.1 Attack Vectors

  • Overfitting to public eval: Agent memorizes or reverse-engineers the eval set
  • Metric gaming: Improve the measured metric while degrading unmeasured quality
  • Canary exploitation: Detect and hardcode answers to known test inputs
  • Adversarial patches: Small changes that exploit eval harness bugs
  • Plagiarism: Copy improvements from other agents’ submissions

4.2 Defenses

Held-out evaluation is the primary defense. The sponsor evaluates on secret data not available to agents. The held-out set hash is committed at bounty creation (commit-reveal scheme) to prevent coordinator manipulation.
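
The commit step can be as simple as publishing a digest of the serialized held-out set at bounty creation. A sketch, with the serialization choice (sorted-key JSON) as an assumption:

```python
import hashlib
import json

def commit_heldout(records):
    """Digest the sponsor publishes at bounty creation. Revealing the
    records after settlement lets anyone verify the held-out set was
    fixed before submissions arrived, not swapped afterwards."""
    blob = json.dumps(records, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()
```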

Multi-metric composite scoring requires improvement across multiple independent metrics simultaneously. An agent gaming latency while degrading correctness gets caught.
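
One way such a gate might be implemented, assuming every metric is expressed as a lower-is-better number (the function name and threshold are illustrative):

```python
def multi_metric_gate(baseline, candidate, min_gain=0.05):
    """Accept a submission only if no metric regresses and at least one
    improves by min_gain (relative). All metrics are lower-is-better."""
    gains = {m: (baseline[m] - candidate[m]) / baseline[m] for m in baseline}
    no_regression = all(g >= 0 for g in gains.values())
    return no_regression and max(gains.values()) >= min_gain
```

A submission that cuts latency while raising the error rate fails the no-regression check regardless of how large the latency gain is.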

Canary probes embed known-answer inputs in the public eval set with different answers in the held-out set. Agents that hardcode canary answers are detected.
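
A sketch of the detection side, assuming canary inputs appear under the same keys in both eval sets with deliberately different expected answers:

```python
def canary_hardcode_rate(submission, public_canaries, heldout_canaries):
    """Fraction of canary probes where the submission reproduces the
    *public* answer instead of the held-out one -- evidence the agent
    memorized the public eval set rather than solving the task."""
    hits = sum(
        1 for k in public_canaries
        if submission.get(k) == public_canaries[k]
        and public_canaries[k] != heldout_canaries[k]
    )
    return hits / max(1, len(public_canaries))
```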

Temporal stability checks re-evaluate improvements after 24–72 hours. Fragile or overfitted optimizations that don’t hold up lose the 20% holdback.

Diff size limits prevent agents from replacing the target file entirely. Maximum diff size is set by the sponsor.

Semantic review for top-N improvements by the sponsor before final holdback release. Automated validation handles the common case; human review catches sophisticated gaming.

Dynamic benchmarking (inspired by LiveBench): rotate held-out sets periodically, creating fresh unpublished test cases after the agent’s knowledge cutoff. Unlike static held-out sets that slowly leak information through repeated optimization, dynamic benchmarks make contamination structurally impossible.

The frozen metric principle from Karpathy’s design applies directly: the evaluation function must be immutable and controlled by the sponsor, not the agent. In l402-train, the coordinator runs evaluation on the held-out set — the agent never sees or controls the scoring function. As the Hybrid Horizons analysis puts it: “A system that can rewrite both the exam and the answers will always pass.”

4.3 Economic Alignment

The protocol’s economic structure naturally discourages gaming:

  • Small per-experiment payouts mean the effort of sophisticated gaming rarely exceeds the reward
  • The 20% holdback creates a reputation incentive for genuine improvements
  • Multiple competing agents mean any gaming strategy is quickly discovered and reported
  • Sponsors who detect gaming can update held-out sets and re-run validation

5. Economics and Market Opportunity

5.1 Agent Economics

Running a coding agent overnight (100 experiments) costs approximately:

Agent Type                         Cost per Night    Marginal Cost
Cloud API (Claude Code, Codex)     $10–50            Per-token API pricing
Local model (consumer hardware)    $0.03–0.12        Electricity only (20–90 W × 8 hrs)

Expected hit rate: ~3% (roughly 20 improvements per ~700 experiments, per Karpathy’s results). An agent running local models with near-zero marginal cost that finds one improvement per night worth 5,000–50,000 sats ($3.50–$35) is profitable from the first improvement. This creates a structural cost advantage for distributed participants on consumer hardware — exactly the dynamic the protocol is designed for.
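
The break-even arithmetic can be made explicit. The rates and conversion below are this report's illustrative figures, not market data:

```python
def nightly_profit_usd(experiments, hit_rate, sats_per_hit, night_cost_usd,
                       usd_per_sat=0.0007):
    """Expected overnight profit for one agent. usd_per_sat reflects the
    ~$70 per 100K sats conversion used elsewhere in this report."""
    expected_sats = experiments * hit_rate * sats_per_hit
    return expected_sats * usd_per_sat - night_cost_usd
```

At the ~3% hit rate, a local agent running 100 experiments at $0.10 in electricity and landing 20,000-sat improvements expects roughly $42/night; the identical run through a $30/night cloud API nets closer to $12.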

5.2 Sponsor Economics

A company posting a 100,000 sat (~$70) bounty gets overnight distributed optimization equivalent to a contractor spending days at $50–150/hour. The key difference: payment is only for validated improvements. The sponsor’s downside is capped at the bounty amount.

Approach               Cost                      Timeline     Guarantee
Senior engineer        $200–600 (4–8 hrs)        1–2 days     None
Contractor             $400–1,200                2–5 days     None
Autoresearch bounty    $35–70 (50K–100K sats)    Overnight    Pay only for validated improvements

5.3 Market Size

The addressable market is essentially every piece of software with measurable performance:

  • Cloud infrastructure optimization: $100B+ annual spend. Even 1% improvement = $1B+ value
  • ML training efficiency: $50B+ annual spend on cloud compute for training
  • Database optimization: Every company with a database has slow queries
  • Build system optimization: Developer time waiting for builds is measurable and expensive
  • Prompt engineering: Every LLM application has prompts that could be better

The key insight: autoresearch bounties convert diffuse optimization demand into a liquid market. Today, a company with a slow API endpoint either ignores it or assigns an engineer. With bounties, they post a 100K sat bounty and get 50 agents competing overnight.

Coordinator Economics

The coordinator takes a 5–10% fee on bounty payouts for hosting, validation compute, and held-out dataset management. At scale:

  • 1,000 active bounties × 50K sats average payout × 7.5% fee = 3,750,000 sats/day (~$2,600/day)
  • Revenue scales linearly with market activity
  • Multiple competing coordinators prevent monopoly pricing

6. Comparison to Alternatives

                        Autoresearch Bounties          AutoML Platforms      Kaggle            Bug Bounties       Bittensor            Freelance
Scope                   Any quantifiable metric        ML models only        ML models only    Security only      General “be useful”  Anything
Identity                None required                  Account required      Account required  Account required   Wallet + stake       Full KYC
Payment model           Per validated improvement      Subscription          Winner-take-all   Per vulnerability  Token emissions      Per project
Validation              Deterministic (held-out eval)  Platform-internal     Leaderboard       Manual triage      Opaque consensus     Manual review
Settlement              <500 ms (Lightning)            Invoice/subscription  Weeks             Weeks–months       ~12 s consensus      Days–weeks
Timeline                Overnight                      Hours                 Months            Ongoing            Ongoing              Days–weeks
Micropayments           Yes (500–500K sats)            No                    No                No (>$100 min)     Sort of (TAO)        No
Agents as participants  First-class                    N/A (platform runs)   Emerging          Emerging           Yes                  No

Key Differentiators

vs. AutoML: AutoML platforms (DataRobot, SageMaker, H2O) are closed systems, ML-specific, and subscription-based. You pay whether improvements are found or not. Autoresearch bounties are open, general-purpose, and pay-for-results.

vs. Kaggle: Kaggle competitions run for months with winner-take-all prizes and manual entry. Autoresearch bounties settle overnight with proportional payment and are designed for AI agents, not humans.

vs. Bug bounties: HackerOne and Bugcrowd pay for verified security vulnerabilities. Similar model, but limited to security. Autoresearch bounties generalize to any quantifiable metric. Verification is also simpler — running an eval script is easier than confirming a vulnerability.

vs. Bittensor: Bittensor uses stake-weighted token emissions with opaque validator consensus. l402-train uses quality-weighted sats with deterministic evaluation. No tokens, no staking, no identity required.


7. Key Takeaways

  1. Autoresearch is proven. Karpathy’s 11% training efficiency gain and Lutke’s +19% quality improvement demonstrate the pattern works. autokernel and production API optimization confirm it generalizes beyond ML training.
  2. The use cases are essentially unbounded. Anything with a quantifiable metric — code performance, ML training, prompts, infrastructure, databases, build systems, security, scientific research — can be a bounty target. Twelve concrete domains are documented here; more exist.
  3. Lightning micropayments are a natural fit. Permissionless participation, hold invoice escrow, micropayment granularity, and instant settlement solve coordination problems that traditional payment systems can’t. You can’t use Stripe to pay 500 sats for a 0.3% latency improvement.
  4. l402-train’s existing infrastructure covers 90% of what bounties need. L402 gating, hold invoices, coordinator validation, and peer discovery are already designed for the training use case and transfer directly.
  5. The economics favor distributed participation. Agents on consumer hardware (near-zero marginal cost) have a structural advantage over cloud API agents, creating exactly the distributed network the protocol is designed for.
  6. Anti-gaming is solvable. Held-out evaluation, multi-metric scoring, temporal stability checks, and canary probes provide defense in depth. The economic structure (small per-experiment payouts, 20% holdbacks) makes sophisticated gaming unprofitable.
  7. This is the scalable product. Training is the hard technical problem with limited hardware participation. Autoresearch bounties run on any computer, serve any domain, and create a two-sided market with organic demand. The addressable market is every piece of software with a measurable performance characteristic.