Implementation Plan
Research prototype — proving the core thesis with real code, real payments, and real numbers.
Project Philosophy
This is a research prototype, not a production system. The goal is to prove the core thesis — "Lightning micropayments can coordinate quality-verified compute contributions" — with real code, real payments, and real numbers. Start small, validate incrementally, publish results.
Together AI proved decentralized training could work at meaningful scale — then abandoned it because centralized infrastructure was a better business. The thesis of this project is that Lightning micropayments change the equation: per-contribution payment granularity, near-zero transaction costs, and no token overhead make the economics work where token-based systems failed.
The protocol has two modes that share the same L402 infrastructure: training coordination (gradient exchange with quality-proportional payment) and autoresearch bounties (AI agents compete to optimize any quantifiable metric, paid per validated improvement). Training is the hard technical problem that proves the protocol. Autoresearch bounties are the scalable product — they require no GPU, run on any hardware, and have an essentially unbounded addressable market. Both are developed in parallel.
Two Tracks, Shared Infrastructure
| | Track A: Training | Track B: Autoresearch |
|---|---|---|
| What | Decentralized model training with gradient exchange | AI agents compete to optimize anything with a metric |
| Hardware | GPU / Apple Silicon (16+ GB VRAM) | Any computer that can run a coding agent |
| Coordination | Synchronized ~70s rounds, SparseLoCo compression | Fully independent — agents never coordinate |
| Verification | Gradient quality scoring (loss delta) | Deterministic: did the held-out metric improve? |
| Shared infra | L402 payment gating, hold invoice escrow, coordinator validation, Lightning settlement | (same) |
| Phases | 0 → 1 → 2 → 3 | B0 → B1 → B2 (starts at Phase 1) |
Track B starts as soon as Phase 1’s L402 infrastructure is working. The bounty coordinator is a simpler application of the same payment flow — no gradient compression, no model checkpoints, just "submit a diff, validate against held-out eval, pay for improvements." This means the autoresearch product can ship months before multi-peer training is battle-tested.
TRACK A: TRAINING
Phase 0: Local End-to-End Loop
Goal: Single-machine simulation running the complete protocol loop: local training → gradient compression → validation scoring → payment settlement. All on the MacBook with regtest Lightning.
Why this first: Before involving any networking, peers, or real money, prove the software architecture works end-to-end. Get a tight eval loop running fast.
Components
- `sparseloco.py` — SparseLoCo compression in MLX (compression math sketched below)
  - Top-k sparsification (k=64 per chunk of 4096)
  - 2-bit quantization of selected values
  - Index encoding
  - Error feedback buffer (decay=0.95)
  - Test on Qwen2.5-0.5B: train locally for 30 steps, compute pseudo-gradient (weight diff), compress, decompress, verify round-trip fidelity
  - Metric: compression ratio achieved + loss degradation from compress/decompress round-trip vs dense gradient
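A minimal numpy sketch of the compression path using the parameters above (k=64, chunk size 4096, decay 0.95). The `compress` signature and the 4-level code mapping are illustrative assumptions, not the final design; the MLX port swaps the `np` calls for their `mx` equivalents.

```python
import numpy as np

CHUNK = 4096   # chunk size from the plan
K = 64         # top-k entries kept per chunk
DECAY = 0.95   # error-feedback decay from the plan

def compress(grad: np.ndarray, err: np.ndarray):
    """Top-k sparsification + 2-bit quantization with error feedback.

    grad and err are flat float32 arrays whose length is a multiple of CHUNK.
    Returns (indices, 2-bit codes, per-chunk scales, updated error buffer).
    """
    g = grad + err                                    # fold in last round's residual
    chunks = g.reshape(-1, CHUNK)
    # indices of the K largest-magnitude entries in each chunk
    idx = np.argpartition(np.abs(chunks), -K, axis=1)[:, -K:]
    vals = np.take_along_axis(chunks, idx, axis=1)
    # 2-bit quantization: one of 4 signed levels per value, per-chunk scale
    scale = np.abs(vals).max(axis=1, keepdims=True) + 1e-12
    codes = np.clip(np.round(vals / scale * 1.5), -2, 1).astype(np.int8)
    # reconstruct what the receiver will decode, to measure what was dropped
    deq = codes.astype(np.float32) / 1.5 * scale
    recon = np.zeros_like(chunks)
    np.put_along_axis(recon, idx, deq, axis=1)
    new_err = DECAY * (g - recon.reshape(g.shape))    # carry the loss forward
    return idx, codes, scale, new_err
```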
- `validator.py` — Gauntlet-style loss scoring (sketch below)
  - Take compressed gradient, decompress, apply to model checkpoint
  - Measure loss on held-out validation batch before and after
  - Output: quality score (loss delta) normalized against baseline
  - Pure function: `f(checkpoint, gradient, val_data) → score`. Deterministic replay is free
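A sketch of the oracle as a pure function. `decompress` and `apply_update` are hypothetical stand-ins for sparseloco.py internals, and the normalization is one plausible choice; the property that matters is that identical inputs always produce the identical score.

```python
def score_gradient(checkpoint, compressed_grad, val_batch, loss_fn, lr=1.0):
    """Pure validation oracle: f(checkpoint, gradient, val_data) -> score.

    decompress/apply_update are hypothetical sparseloco.py helpers;
    loss_fn must be deterministic so anyone can replay the score.
    """
    loss_before = loss_fn(checkpoint, val_batch)
    update = decompress(compressed_grad)                 # undo top-k + 2-bit coding
    candidate = apply_update(checkpoint, update, lr)     # checkpoint + lr * pseudo-gradient
    loss_after = loss_fn(candidate, val_batch)
    # positive score means the gradient actually reduced held-out loss
    return (loss_before - loss_after) / max(loss_before, 1e-9)
```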
- Regtest Lightning — two LND nodes in Docker via `lightning-agent-tools`
  - Coordinator node + simulated peer node (Tier 2 security — local keys, restricted perms)
  - Create payment channel between them
  - Test: issue hold invoice → pay → settle on validation pass / expire on fail (REST sketch below)
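A hedged sketch of that hold-invoice test against LND's REST API (the `/v2/invoices/*` endpoints from invoicesrpc, which must be enabled at build time). Host, port, macaroon path, and cert path are assumptions for the regtest stack.

```python
import os, base64, hashlib, requests

LND = "https://localhost:8080"                       # assumed regtest REST port
MACAROON = open("invoice.macaroon", "rb").read().hex()
HEADERS = {"Grpc-Metadata-macaroon": MACAROON}

preimage = os.urandom(32)
payment_hash = hashlib.sha256(preimage).digest()

# 1. Coordinator creates a hold invoice locked to the payment hash
inv = requests.post(f"{LND}/v2/invoices/hodl", headers=HEADERS, verify="tls.cert",
                    json={"hash": base64.b64encode(payment_hash).decode(),
                          "value": "100"}).json()

# 2. Peer pays inv["payment_request"]; the HTLC stays pending because the
#    coordinator has not revealed the preimage yet.

# 3a. Validation passes -> settle by revealing the preimage
requests.post(f"{LND}/v2/invoices/settle", headers=HEADERS, verify="tls.cert",
              json={"preimage": base64.b64encode(preimage).decode()})

# 3b. Validation fails -> cancel, releasing the peer's payment
# requests.post(f"{LND}/v2/invoices/cancel", headers=HEADERS, verify="tls.cert",
#               json={"payment_hash": base64.b64encode(payment_hash).decode()})
```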
- `protocol_sim.py` — single-machine protocol loop; each round:
  1. Peer trains locally for 30 steps (MLX)
  2. Peer compresses pseudo-gradient (sparseloco.py)
  3. Coordinator issues hold invoice for reward
  4. Peer "submits" gradient (local function call, no HTTP)
  5. Coordinator validates (validator.py) → quality_score
  6. If quality_score > threshold: settle hold invoice
  7. Else: let hold invoice expire
  8. Log: round, quality_score, payment_settled, compression_ratio
Economic Benchmarking
Phase 0 also establishes baseline economics. Measure actual performance and power draw against the break-even analysis:
- Training throughput (tok/s) on Mac Mini M4 Pro (target: 150–200 tok/s on 3B)
- Real power draw during sustained training (target: 30–50W)
- Sats/hr break-even at measured power (electricity-only target: 9 sats/hr; formula below)
- Validation compute overhead as % of training compute (target: <5%)
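The break-even targets follow from a one-line conversion, using the electricity price and BTC price from the Target Hardware table:

```python
# sats/hr to cover electricity = watts/1000 * $/kWh / BTC_usd * 1e8
def break_even_sats_per_hr(watts, usd_per_kwh=0.16, btc_usd=70_000):
    return watts / 1000 * usd_per_kwh / btc_usd * 1e8

break_even_sats_per_hr(40)    # Mac Mini M4 Pro -> ~9 sats/hr
break_even_sats_per_hr(450)   # RTX 4090 system -> ~103 sats/hr
```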
Validates
- SparseLoCo compression works on MLX (not just PyTorch/CUDA)
- Validation oracle produces meaningful quality scores
- Hold invoice conditional settlement works mechanically
- Real numbers for: compression ratio, validation compute cost, payment latency
- Economic viability: are the break-even numbers realistic?
Dependencies
- Docker (for regtest LND nodes)
- `lightning-agent-tools` repo (Docker Compose stack)
- MLX + mlx-lm (already available)
Phase 1: L402-Gated HTTP Exchange
Goal: Split coordinator and peer into separate processes communicating over HTTP with L402 payment gating. Still on one machine, but real HTTP and real L402 flows.
Components
- `coordinator.py` — FastAPI service behind Aperture proxy
  - `PUT /gradient` — L402-gated gradient submission (peer pays submission fee)
  - `GET /checkpoint` — L402-gated checkpoint download
  - `GET /reward-schedule` — public endpoint showing current bounty rates
  - Validation runs server-side after gradient upload
  - Hold invoice issued at upload time, settled or expired based on validation score
- `peer.py` — client using `lnget` for automatic L402 payment (handshake sketched below)
  - Training loop → compress → `lnget PUT /gradient` → receive payment (or not)
  - `--max-cost` flag enforces per-request spending caps
- Aperture configuration
  - Pricing: ~100 sats submission fee for `PUT /gradient`, ~50 sats for `GET /checkpoint`
  - Macaroon caveats: per-peer spending limits, time-bounded sessions
L402 Ecosystem Notes
- Aperture is LND-only — no CLN or LDK support (ecosystem survey). Fine for prototype, limits future peer implementations.
- lightning-mcp-server provides 18 read-only monitoring tools (check balance, list channels, query invoices) — useful for coordinator observability.
- Fewsats l402-python (`pip install l402`) is an alternative/supplement to `lnget` for Python-native peers.
Validates
- L402 works for gradient exchange (the core protocol interaction)
- Payment latency is acceptable within the 70-second training round window
lnget+ Aperture stack works as described in whitepaper architecture
Phase 2: Two-Machine Proof of Concept
Goal: Run the protocol across two separate machines over the real internet with real (small) Lightning payments.
Components
- Coordinator on Hetzner VPS
- Deploy coordinator service + LND (Neutrino light client) + Aperture
- Channel capacity: minimal for testing (100K–1M sats, ~$70–$700 at BTC = $70,000)
- Primary test peer: Mac Mini M4 Pro 24 GB
- MLX training, LND light client, direct payment channel to coordinator
- The sweet spot hardware: $799, 30–50W, 150–200 tok/s on 3B model
- Real Lightning payments: submit gradients, receive rewards
- Stretch: RTX 4090 peer (CUDA path)
- PyTorch + CUDA training, validates cross-framework gradient exchange
- 500+ tok/s on 3B, 450W — tests the power/performance tradeoff
- Testnet → Mainnet
- Start on Bitcoin testnet (free, no real money)
- Move to mainnet when stable (budget: ~$100–500)
Economic Validation
- Measure real sats earned per hour per hardware tier
- Compare to Vast.ai market rates (RTX 4090 hosts earn 158–243 sats/hr equivalent)
- Calculate coordinator cost per peer per day at target payment rates
- Answer: "At 200 sats/hr per peer, 100 peers = $14/hr. Is this sustainable for the training value produced?"
Validates
- Protocol works over real internet
- Real Lightning payment latency over real network hops
- Gradient upload/download times at realistic bandwidth
- Channel management and rebalancing with real channels
- Economics: are actual sats/hr in the "worth my time" range (300+ sats/hr)?
Deliverable: conference demo
Phase 3: Multi-Peer Simulation + Byzantine Testing
Goal: Simulate 3–5 peers submitting varying quality gradients + 1 real peer on MacBook. Test incentive mechanics and Byzantine resistance.
Verification of untrusted computation is the hardest unsolved problem in decentralized training. Gensyn's Verde (probabilistic proof-of-learning) has been in development since 2022 and remains in testnet. Prime Intellect's TOPLOC works but is narrow (RL rollouts only). l402-train's approach — deterministic loss scoring on held-out data — is simpler and immediately testable, but must prove it catches real attack vectors.
Simulated Peer Profiles
- Honest peer — real gradients from actual training
- Free-rider — random/noise gradients (zero compute)
- Plagiarist — copies another peer's gradient
- Poisoner — adversarial gradients designed to degrade model
- Mediocre — real gradients from undertrained model (low quality but honest)
- Stale — submits gradients computed on an outdated checkpoint (desync attack from Gauntlet analysis)
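Minimal generators for the adversarial profiles above. Names, shapes, and the sign-flip poisoning strategy are illustrative assumptions; the honest, mediocre, and stale profiles come from real training runs rather than synthesis.

```python
import numpy as np

def free_rider(shape) -> np.ndarray:
    # random noise, zero compute spent
    return (np.random.randn(*shape) * 1e-3).astype(np.float32)

def plagiarist(other_peers_grad: np.ndarray) -> np.ndarray:
    # byte-identical copy of another peer's submission
    return other_peers_grad.copy()

def poisoner(honest_grad: np.ndarray) -> np.ndarray:
    # sign-flipped gradient, one simple way to push held-out loss up
    return -honest_grad
```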
Test Questions
- Does quality-proportional payment correctly reward good and reject bad?
- Do submission fees effectively prevent spam?
- Does validation catch free-riders and poisoning?
- What is the validation compute overhead relative to training?
Deliverable: technical paper with empirical results — real Lightning payments + real gradient validation + Byzantine resistance is novel. Nobody has demonstrated this.
TRACK B: AUTORESEARCH BOUNTIES
Phase B0: Bounty Runner Framework
Goal: Build the bounty coordinator as a second mode of the existing coordinator service. Same L402 infrastructure, different task type.
Components
- `bounty_coordinator.py` — FastAPI endpoints behind the same Aperture proxy
  - `GET /bounties` — public listing of active bounties
  - `GET /bounty/{id}` — L402-gated baseline download (code + public eval set)
  - `POST /bounty/{id}/submit` — submit improvement (diff + claimed score)
  - Validation: apply diff to baseline, run eval on held-out set, score improvement (sketch below)
  - Hold invoice created at submission, settled proportional to improvement
- `bounty_agent.py` — reference agent client
  - Downloads bounty baseline via L402
  - Runs autoresearch loop locally (Karpathy pattern: edit → eval → keep/discard)
  - Submits improvements to coordinator
  - Works with any coding agent backend (Claude Code, Codex, local models)
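A sketch of the validation step, assuming the held-out eval script prints a single float. The function names and the git-apply approach are illustrative choices, not fixed protocol.

```python
import shutil, subprocess, tempfile

def validate_submission(baseline_dir: str, diff_text: str,
                        eval_cmd: list[str], best_score: float) -> float:
    """Apply the submitted diff to a throwaway copy of the baseline,
    run the held-out eval, and return the improvement over best_score."""
    workdir = tempfile.mkdtemp()
    shutil.copytree(baseline_dir, workdir, dirs_exist_ok=True)
    # git apply reads the patch from stdin and rejects malformed diffs
    subprocess.run(["git", "apply"], input=diff_text.encode(),
                   cwd=workdir, check=True)
    out = subprocess.run(eval_cmd, cwd=workdir, capture_output=True,
                         text=True, timeout=300, check=True)
    score = float(out.stdout.strip())
    return score - best_score   # settle the hold invoice only if positive
```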
Why This Is Simpler Than Training
- No gradient compression (SparseLoCo not needed — submissions are code diffs)
- No model checkpoints (coordinator stores eval framework, not multi-GB models)
- No synchronization (agents work independently, submit whenever ready)
- Validation is running an eval script, not forward-pass loss computation
- Same hold invoice escrow, same L402 gating, same coordinator architecture
Validates
- L402 payment flow works for bounty submissions (not just gradient exchange)
- Held-out validation catches naive metric gaming
- Hold invoice economics make sense for bounty-scale payments (500–50,000 sats)
Phase B1: First Live Bounties
Goal: Post real bounties with real sats, have real agents compete. Prove the two-sided market works.
First Bounties
- Prompt optimization — improve a classification system prompt against a labeled eval corpus. Clear metric (accuracy), fast eval (<30s), bounty: 50,000–100,000 sats
- Regex pattern improvement — improve detection patterns against a test corpus. Composite metric (detection_rate × 0.7 + (1 - false_positive_rate) × 0.3), bounty: 25,000–50,000 sats
- Open bounty — any target with a quantifiable metric and fast eval (<5 minutes). Posted publicly to attract external agents
Anti-Gaming Validation
- 80/20 public/held-out eval split with commit-reveal on held-out set hash (sketch below)
- Canary probes in public eval set (known-answer inputs that differ in held-out)
- Temporal stability: 20% holdback released after 48-hour re-evaluation
- Diff size limits to prevent wholesale file replacement
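The commit-reveal piece is small. A sketch assuming the held-out set serializes to JSON: the coordinator publishes only the hash at bounty creation, then reveals the set after the bounty closes.

```python
import hashlib, json

def commit(heldout_examples: list) -> str:
    """Hash published with the bounty, before any submissions arrive."""
    blob = json.dumps(heldout_examples, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

def verify_reveal(heldout_examples: list, published_hash: str) -> bool:
    """After reveal, agents check they were scored on the committed set."""
    return commit(heldout_examples) == published_hash
```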
Validates
- Real agents can discover and compete for bounties
- Anti-gaming measures catch metric hacking in practice
- Bounty economics: are improvements worth the sats paid?
- Agent diversity: do different agents find different improvements?
Deliverable: working bounty marketplace with real payments — standalone product, no GPU required.
Phase B2: Multi-Sponsor Marketplace
Goal: Open the bounty coordinator for external sponsors to post their own bounties. Two-sided marketplace: sponsors post bounties, agents compete.
Components
- Sponsor onboarding
- Sponsor deposits bounty pool via Lightning (held in coordinator channel)
- Uploads target files, eval script, public eval dataset
- Coordinator generates held-out eval set or accepts sponsor-provided held-out hash
- Public bounty board
- Browse active bounties with: description, metric, bounty amount, deadline, current best score
- Leaderboard per bounty (anonymized agent IDs + scores)
- Historical data: completed bounties, total sats paid, average improvements
- Coordinator economics
- 5–10% fee on bounty payouts (covers validation compute + infrastructure)
- L402 access fees on baseline downloads (covers bandwidth)
- Self-sustaining business model independent of training revenue
Deliverable: open-source bounty marketplace — the "SETI@home for software optimization" that Karpathy envisioned, coordinated by Lightning.
Target Hardware
Training hardware requirements are based on the consumer hardware guide and economics analysis. Autoresearch bounties have no minimum hardware — any computer that can run a coding agent (Claude Code, Codex, or a local model) can compete.
| Tier | Hardware | Model Range | tok/s (3B) | Power | Break-even* |
|---|---|---|---|---|---|
| Entry | MacBook Air M3 16 GB | 0.5B–1B | 40–60 | 20 W | 5 sats/hr |
| Sweet spot | Mac Mini M4 Pro 24 GB | 0.5B–7B | 150–200 | 40 W | 9 sats/hr |
| Workhorse | Mac Studio M2 Ultra 192 GB | 0.5B–30B | ~475 | 90 W | 21 sats/hr |
| Power | RTX 4090 system (24 GB) | 0.5B–13B | 500–628 | 450 W | 103 sats/hr |
Not viable: Raspberry Pi, AMD RX 580 and older, 8 GB machines.
*Electricity-only break-even at US average $0.16/kWh, BTC = $70,000
Competitive Landscape
Based on the landscape survey of 12 projects:
What exists: Only Prime Intellect (INTELLECT-1/2/3) and Together AI (GPT-JT, before pivoting) have trained competitive models via decentralized infrastructure. Bittensor is an inference marketplace with empirically demonstrated stake-weighted rewards. Gensyn has been in testnet for 3+ years. Every project except Hivemind requires a custom token.
Where l402-train fits: The only protocol using Bitcoin Lightning for payment coordination. No token, no staking, quality-proportional rewards via hold invoices. The tradeoff is starting with a single coordinator and small models (0.5B–3B), which is the honest scope for a research prototype. See the L402 ecosystem survey for how the protocol extends L402 bidirectionally.
What to Skip for Prototype
| Whitepaper Feature | Skip? | Why |
|---|---|---|
| DLC-bound settlement | Yes | Hold invoices sufficient for PoC |
| Federated multi-validator | Yes | Single coordinator fine; deterministic replay is what matters |
| 72B scale | Yes | 0.5B–3B on MLX. Proving the mechanism, not training a model |
| Heterogeneous SparseLoCo | Yes | Single-tier peers only |
| USDT (Taproot Assets) | Yes | Sats-only for prototype |
Key Risks
- SparseLoCo on MLX — No existing implementation. Top-k + quantization straightforward; error feedback buffer management is the hard part
- Aperture custom validation — L402 gating is supported, but "validate before settling hold invoice" may need to be handled outside Aperture
- LND on VPS — 4 GB RAM may be tight alongside existing services. May need to run LND on MacBook instead
- MLX scale gap — 0.5B proof of concept is fine, but gap to publishable 7B+ results requires renting GPU time
Deliverables Summary
| Phase | Track | Deliverable | Publishable? |
|---|---|---|---|
| 0 | Training | Single-machine simulation with economics data | No — but provides all the numbers |
| 1 | Training | L402-gated gradient exchange | Blog post / tweet thread |
| B0 | Autoresearch | Bounty runner framework | Blog post / tweet thread |
| 2 | Training | Two-machine PoC over real internet | Conference demo |
| B1 | Autoresearch | First live bounties with real sats | Open-source product launch |
| 3 | Training | Multi-peer + Byzantine resistance | Technical paper with empirical results |
| B2 | Autoresearch | Multi-sponsor bounty marketplace | Standalone product |