Federated Learning vs. Decentralized Training: Where l402-train Fits

Date: 2026-03-12 · Scope: Federated learning, decentralized training, DiLoCo, gradient privacy, l402-train positioning


1. Federated Learning: The Google Approach

If you've heard the phrase "AI training without sharing data," you've probably heard about federated learning. Google pioneered it, and the clearest example is your phone keyboard.

1.1 How It Works

Federated learning flips the usual machine learning setup. Instead of collecting everyone's data into one place and training a model on it, the model travels to the data:

  1. A central server sends the current model to thousands of devices (phones, tablets, edge hardware).
  2. Each device trains locally on its own data — your typing patterns, your photos, your health metrics. The raw data never leaves the device.
  3. Each device computes a model update (the mathematical adjustments that make the model slightly better) and sends only that update back to the server.
  4. The server averages all the updates together, producing a new, improved model.
  5. Repeat.
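As a sketch, the five steps above compress into a few lines. The linear model, step size, and synthetic "device data" below are illustrative assumptions, not Gboard's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def local_update(global_w, local_data):
    # Steps 2-3: one gradient step on data that never leaves the device;
    # only the updated weights are sent back to the server.
    X, y = local_data
    grad = 2 * X.T @ (X @ global_w - y) / len(X)
    return global_w - 0.1 * grad

global_w = np.zeros(3)                      # step 1: server ships the model
devices = [(rng.normal(size=(8, 3)), rng.normal(size=8)) for _ in range(5)]

updates = [local_update(global_w, d) for d in devices]   # steps 2-3
global_w = np.mean(updates, axis=0)                      # step 4: average
```

Real deployments weight the average by each device's data size and add secure aggregation, but the core loop is exactly this.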

The privacy benefit is real: Google never sees what you typed. They only see the aggregated mathematical adjustments from millions of users, which (in theory) don't reveal any individual's data.

1.2 Where It's Used

  • Google Gboard: Next-word prediction, autocorrect, and smart compose. Your keyboard learns from your typing without sending your keystrokes to Google. All Gboard neural network language models now ship with differential privacy guarantees.
  • Apple on-device learning: Siri suggestions, QuickType keyboard, photo recognition. Apple processes training locally and uses differential privacy before sending any updates.
  • Healthcare: Hospitals train models on patient records without sharing protected health information across institutions.
  • Finance: Banks collaborate on fraud detection models without exposing customer transaction data.

1.3 The Trust Model

Here is the critical detail: federated learning still has a central orchestrator. Google (or Apple, or whoever runs the server) controls:

  • Which model gets trained
  • Which devices participate in each round
  • How updates are aggregated
  • When training starts and stops
  • What the model is used for after training

The participants (your phone) have no say in any of this. You trust the orchestrator to behave honestly — to actually aggregate fairly, to not extract information from your updates, to not modify the model in adversarial ways.

This is a trust-the-operator model. The data stays local, but the power stays centralized.

1.4 Limitations

Single point of failure. The central server is a bottleneck and a target. If it goes down, training stops. If it's compromised, the entire model is at risk.

No payment, no incentive. Participants contribute compute (battery life, processing time) and get nothing in return except a marginally better keyboard. There's no mechanism for compensating contributors, which limits participation to devices the operator already controls.

Communication costs. Sending full model updates from millions of devices is expensive. Techniques like gradient compression help, but the architecture inherently requires frequent communication with the central server.

Heterogeneity problems. Devices have wildly different compute capabilities, data distributions, and availability. Some phones are powerful; some are old. Some users type a lot; some barely use the keyboard. The central server must handle all of this.

Governance vacuum. Who decides what model gets trained? What data is appropriate? Who audits the aggregation? In federated learning, the answer is always: the operator. There's no governance mechanism for participants.


2. Decentralized Training: No Central Controller

Decentralized training takes a fundamentally different approach. Instead of a central server orchestrating everything, independent machines collaborate as peers to train a shared model. No single entity controls the process.

2.1 DiLoCo (Google DeepMind)

DiLoCo — Distributed Low-Communication training — is the most important recent advance in decentralized training. Published by Google DeepMind in late 2023 and extended through 2025, it answers a simple question: what if each machine trained on its own for a while, then they all compared notes?

How It Works (Plain Language)

Imagine a study group preparing for an exam. In the traditional approach (synchronous training), everyone reads the same page together, discusses it, then moves to the next page. Nobody can read ahead. If one person is slow, everyone waits.

DiLoCo works like this instead:

  1. Everyone gets a copy of the textbook (the model).
  2. Everyone goes home and studies independently for a week (H inner steps — typically 30–500 gradient updates using the AdamW optimizer).
  3. At the end of the week, everyone meets up and shares what they learned — not their notes, but a compressed summary of how their understanding changed (a pseudo-gradient: the difference between their updated model and the starting model).
  4. The group combines these summaries using a specific technique (Nesterov momentum) to produce an updated shared understanding.
  5. Everyone takes the updated textbook home and studies for another week.

The key insight: you don't need to communicate every step. Training locally for many steps, then syncing, works almost as well as syncing after every single step — but uses 500x less communication bandwidth.


Why This Matters

DiLoCo on 8 workers matches the performance of fully synchronous training while communicating 500 times less. The 2025 follow-up, "Streaming DiLoCo," further improves efficiency at the 1B, 10B, and 100B parameter scale by overlapping communication with computation.

DiLoCo is also resilient: if a worker drops out mid-round, training continues. If a new worker joins, it can sync up and start contributing. This is exactly the property you need for training across unreliable internet connections with unknown participants.

2.2 Hivemind

Hivemind is an open-source PyTorch library for decentralized deep learning across the internet. It was built specifically for the scenario federated learning ignores: training one large model across hundreds of volunteers with unreliable computers communicating over the internet.

Key properties:

  • No central server. Peers coordinate using a distributed hash table (DHT) built on libp2p.
  • Fault tolerant. Any peer can fail or leave at any time. Training continues.
  • Heterogeneous. Peers can have different hardware, bandwidth, and availability.
  • Volunteer-driven. Hivemind was used to train a text-to-image model across volunteer compute.

The limitation: Hivemind doesn't have a built-in incentive or payment mechanism. Volunteers contribute out of goodwill, which limits the scale and reliability of participation.

2.3 Covenant-72B and SparseLoCo

The most dramatic demonstration of decentralized training is Covenant-72B — a 72-billion parameter language model trained by 70+ anonymous peers over the internet using the Bittensor network. The core innovation is SparseLoCo (Sparse Low-Communication optimization), a DiLoCo variant whose spatial and bit-level compression together shrink each communicated gradient roughly 146x, on top of syncing far less often:

  • Temporal compression: Sync every 30 steps instead of every step (~30x reduction)
  • Spatial compression: Keep only the 64 largest values out of every 4,096 (1.56% density, ~64x reduction)
  • Bit compression: 2-bit quantization + 12-bit position indices (~2.3x reduction)
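The factors above multiply out as follows (back-of-envelope arithmetic only, using the constants quoted in the bullets; the ~146x per-sync figure is the spatial and bit factors combined, while the temporal factor reduces how often a sync happens at all):

```python
# Back-of-envelope check of SparseLoCo's compression factors.
PARAMS = 72e9            # 72B parameters
DENSE_BITS = 32          # float32 gradient values
BLOCK = 4096             # values per block
KEPT = 64                # top-k values kept per block

spatial = BLOCK / KEPT                      # 64x: keep 1.56% of values
bits_per_kept = 2 + 12                      # 2-bit value + 12-bit index
bitwise = DENSE_BITS / bits_per_kept        # ~2.3x

per_sync = spatial * bitwise                # ~146x per communicated gradient
dense_gb = PARAMS * DENSE_BITS / 8 / 1e9    # ~288 GB uncompressed
sparse_gb = dense_gb / per_sync             # ~2 GB after compression

print(f"per-sync compression: {per_sync:.0f}x")
print(f"{dense_gb:.0f} GB -> {sparse_gb:.1f} GB per sync round")
```

Note that a 12-bit index is exactly enough to address any of the 4,096 positions in a block, which is why the block size and index width come as a pair.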

Result: a 72B model's gradient shrinks from ~290 GB to ~2 GB per sync round. At 110 Mb/s uplink, that's 70 seconds of transfer time against 20 minutes of computation — 94.5% compute utilization.


3. Comparison Table

Dimension                 | Federated Learning                  | Decentralized (DiLoCo/Hivemind)              | l402-train
Who orchestrates          | Central server (Google, Apple)      | No single coordinator (DHT, peer-to-peer)    | Coordinator service, but replaceable and trust-minimized
Where data lives          | On each device; never leaves        | On each peer; never leaves                   | On each peer; never leaves
Who decides what to train | The operator                        | Whoever initiates the run                    | Whoever posts a training bounty
Trust model               | Trust the operator completely       | Trust the protocol / open-source code        | Trust-minimized: hold invoices, deterministic validation, coordinator is replaceable
Payment                   | None; participants volunteer        | None (Hivemind) or blockchain tokens (Bittensor) | Bitcoin via Lightning; sats per validated gradient
Verification              | Operator-defined; not transparent   | Varies: reputation or on-chain scoring       | Deterministic, replayable: loss improvement on held-out data
Who can participate       | Only devices the operator controls  | Anyone (permissionless)                      | Anyone with compute and a Lightning wallet
Communication             | Full model updates, every round     | Compressed pseudo-gradients, every H steps   | SparseLoCo-compressed pseudo-gradients + L402 gating
Fault tolerance           | Server is single point of failure   | Resilient to peer dropout                    | Resilient; hold invoices auto-cancel on timeout
Open/closed               | Closed; operator's infrastructure   | Open source (Hivemind), semi-open (Bittensor) | Open protocol, open source

4. Privacy Implications

Distributing training across many machines introduces privacy questions that don't exist in centralized training. When you share model updates — even compressed ones — you're sharing information derived from your training data. How much can an attacker learn from those updates?

4.1 Gradient Leakage Attacks

Gradient leakage (also called gradient inversion) is a class of attacks where an adversary reconstructs the original training data from shared gradient updates. The attack works by:

  1. Observing the gradient update a participant sends.
  2. Creating fake "dummy" data.
  3. Computing what gradient the dummy data would produce.
  4. Iteratively adjusting the dummy data until its gradient matches the observed gradient.
  5. When the gradients match, the dummy data approximates the real training data.
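The iterative procedure above is needed for deep networks, but for a single fully connected layer with a bias the leakage is exact and requires no iteration at all: backprop gives dL/dW = delta·xᵀ and dL/db = delta, so dividing any row of the weight gradient by the matching bias-gradient entry recovers the input directly. A toy illustration (the numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)

# One fully connected layer: logits = W @ x + b.
# For any loss, dL/dW = delta * x^T and dL/db = delta (delta = dL/dlogits).
x_real = rng.normal(size=5)          # the victim's private input
delta = rng.normal(size=3)           # stand-in for dL/dlogits
dW = np.outer(delta, x_real)         # gradient the victim "shares"
db = delta

# Attacker: any row of dW divided by the matching entry of db yields x.
i = int(np.argmax(np.abs(db)))       # pick a row with nonzero bias gradient
x_recovered = dW[i] / db[i]

print(np.allclose(x_recovered, x_real))   # True
```

This is why "just sharing gradients" is not a privacy boundary by itself.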

This is not theoretical. Research has shown that images, text, and medical records can be reconstructed from gradients with alarming fidelity. A 2025 study on retinal images found that over 92% of participants in a training set could be identified from reconstructed retinal vessel structures alone — even with some privacy defenses in place.

The practical constraint: these attacks work best when the attacker can observe a single participant's gradient computed on a small batch of data. In real deployments with large batches, multiple local training steps, and many participants, the attacks become much harder.

4.2 Model Inversion Attacks

Model inversion is a broader category: given access to a trained model (not the gradients, but the final model itself), an attacker tries to reconstruct representative examples of the training data. Recent work (2024–2025) has shown:

  • Stepwise gradient inversion (SGI) significantly improves reconstruction quality using evolutionary algorithms.
  • Geminio exploits vision-language models to reshape loss surfaces and extract specific targeted samples.
  • Activation inversion attacks (AIA) can reconstruct training text from activations shared in decentralized training — not just from gradients.
  • Multilingual models are more vulnerable to inversion attacks than monolingual ones.

The arms race between attack and defense is ongoing and active.

4.3 How Gradient Compression Helps Privacy

Here's where it gets interesting for l402-train. SparseLoCo-style compression doesn't just reduce bandwidth — it also reduces privacy leakage.

Why sparse gradients leak less:

When you keep only 1.56% of gradient values (64 out of every 4,096), you're discarding 98.44% of the information an attacker needs to reconstruct your data. The attacker sees which positions had the largest changes and their approximate magnitudes, but nothing about the other 98.44% of the model's response to your data.

Research confirms this: compressed SGD offers stronger resistance to gradient inversion attacks than uncompressed gradients. Gradient pruning (zeroing out small-magnitude values) specifically disrupts the reconstruction process because the large-magnitude gradients that survive pruning are the ones most important for reconstruction, but without the context of the smaller values, reconstruction quality degrades significantly.

The limits: Compression is not a complete defense. A 2024 paper (Deep Leakage from Compressed Gradients, DLCG) showed that recognizable images can still be reconstructed from gradients with 80–90% sparsity. At SparseLoCo's 98.44% sparsity, reconstruction becomes much harder, but "much harder" is not "impossible."

Additional privacy layers in l402-train's design:

  • Multiple local steps (H=30): Each pseudo-gradient reflects 30 steps of training, not one. This blends information from many data points, making it harder to isolate any single example.
  • Averaging across peers: The coordinator sees the average of all peers' compressed pseudo-gradients, further diluting individual contributions.
  • 2-bit quantization: Reducing precision from 32-bit floats to 2 bits destroys fine-grained information that reconstruction attacks rely on.

The combination of 30 local steps, 98.44% sparsification, 2-bit quantization, and multi-peer averaging creates multiple independent barriers to reconstruction. Each one alone is imperfect; together, they make practical gradient leakage attacks extremely difficult.

4.4 Differential Privacy

Differential privacy (DP) is the gold standard mathematical framework for privacy guarantees. The idea: add carefully calibrated noise to gradients before sharing them, guaranteeing that no single training example significantly influences the output.

How it works in federated learning:

  1. Gradient clipping: Limit the maximum contribution of any single data point by capping gradient magnitudes.
  2. Noise addition: Add Gaussian noise proportional to the clipping threshold.
  3. Privacy accounting: Track the cumulative privacy loss (epsilon, the "privacy budget") across rounds.
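Steps 1 and 2 can be sketched in a few lines of numpy. The clipping threshold and noise multiplier below are arbitrary illustrative values, and step 3's privacy accounting is omitted entirely:

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_aggregate(per_example_grads, clip_norm=1.0, noise_mult=1.1):
    # Step 1: clip each example's gradient to bound any single
    # data point's influence on the aggregate.
    clipped = [g * min(1.0, clip_norm / max(np.linalg.norm(g), 1e-12))
               for g in per_example_grads]
    total = np.sum(clipped, axis=0)
    # Step 2: Gaussian noise scaled to the clipping threshold,
    # which is the sensitivity of the clipped sum.
    noise = rng.normal(scale=noise_mult * clip_norm, size=total.shape)
    return (total + noise) / len(per_example_grads)

grads = [rng.normal(size=4) for _ in range(32)]
noisy_avg = dp_aggregate(grads)
```

The coupling between the two steps is the whole design: the noise scale only makes sense because clipping caps what any one example can contribute.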

The tradeoff is brutal. Google's own research on Gboard shows that differential privacy reduces model accuracy by roughly 20 percentage points on simple tasks (MNIST: 75% with DP vs. 95% without). On harder tasks with non-uniform data distributions, DP can render models nonviable at reasonable privacy budgets.

The clipping dilemma: A smaller clipping threshold means less noise needed (better privacy per unit of noise), but more gradient information destroyed (worse model quality). A larger threshold preserves gradient fidelity but requires more noise. There is no universal optimal value — it depends on the model, data, and acceptable privacy-utility tradeoff.

For l402-train: DP is not part of the initial protocol design. The compression-based privacy (SparseLoCo's 98.44% sparsity + quantization) provides practical privacy at zero accuracy cost, which is a better tradeoff for a system where participants are paid for useful computation. Adding formal DP guarantees is a potential future enhancement, but the accuracy penalty is a significant concern when peers are economically motivated to produce high-quality gradients.


5. DiLoCo: The Engine Behind Decentralized Training

DiLoCo deserves a deeper explanation because it's the algorithmic foundation that makes all of this work — and its structure maps remarkably well to Lightning payment rounds.

5.1 The Two-Optimizer Architecture

DiLoCo uses two optimizers working at different timescales:

Inner optimizer (AdamW) — runs on each worker independently:

  • This is the standard training loop. Each worker takes its copy of the model, feeds it training data, computes gradients, and updates the model weights.
  • AdamW maintains per-parameter learning rate adaptation and momentum — the state-of-the-art for training language models.
  • The inner optimizer state (Adam's running averages) is never shared between workers. Each worker maintains its own.
  • The worker runs for H steps (typically 30–500) before communicating.

Outer optimizer (Nesterov momentum SGD) — runs on the synchronized global model:

  • After H inner steps, each worker computes its pseudo-gradient: the total change in model weights from start to end of the round.
      pseudo_gradient = model_weights_now - model_weights_at_round_start
  • All workers send their pseudo-gradients. These are averaged.
  • The outer optimizer applies the averaged pseudo-gradient to the global model using Nesterov momentum — a technique that "looks ahead" based on the direction of recent updates, improving convergence.
  • The updated global model is distributed back to all workers for the next round.
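A toy numpy version of the outer update, following the pseudo-gradient definition above (the outer learning rate and momentum coefficient here are illustrative, not DiLoCo's tuned values):

```python
import numpy as np

def diloco_outer_step(global_w, worker_ws, velocity, outer_lr=0.7, beta=0.9):
    # Pseudo-gradient per worker: (weights now) - (weights at round start),
    # averaged across workers, then negated so it acts as a descent direction.
    outer_grad = -np.mean([w - global_w for w in worker_ws], axis=0)
    # Nesterov momentum SGD (PyTorch-style formulation).
    velocity = beta * velocity + outer_grad
    global_w = global_w - outer_lr * (outer_grad + beta * velocity)
    return global_w, velocity

# Two workers that each drifted from the shared model during H inner steps:
global_w = np.zeros(4)
workers = [np.array([0.1, 0.2, 0.0, -0.1]),
           np.array([0.3, 0.0, 0.2, -0.3])]
global_w, vel = diloco_outer_step(global_w, workers, np.zeros(4))
# global_w has moved toward the average of the workers' positions.
```

Note what is absent: no inner AdamW state crosses the wire, only the weight deltas.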

5.2 Why Local Training Works

The surprising result is that training locally for many steps and then syncing works almost as well as syncing every step. The intuition:

  • Early in a round: each worker's model diverges from the others as it learns from its own data shard. This is fine — each worker is learning real, useful information.
  • At sync time: the pseudo-gradients capture "what each worker learned." Averaging them is similar to averaging gradients from a very large batch — it smooths out noise and preserves the signal.
  • Error feedback: In SparseLoCo, information that gets compressed away isn't lost — it's stored in an error buffer and added back next round. Over time, all information gets communicated; it just takes a few rounds instead of one step.
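The error-feedback mechanism can be sketched in a few lines (toy sizes here; real SparseLoCo applies top-k per 4,096-value block):

```python
import numpy as np

def topk_with_error_feedback(grad, err, k):
    # Fold in the residual from last round, keep the k largest-magnitude
    # values, and stash everything we didn't send back in the error buffer.
    g = grad + err
    idx = np.argsort(np.abs(g))[-k:]
    sparse = np.zeros_like(g)
    sparse[idx] = g[idx]
    return sparse, g - sparse

rng = np.random.default_rng(0)
g = rng.normal(size=16)              # same local gradient every round (toy)
err = np.zeros(16)
total_sent = np.zeros(16)
for _ in range(8):
    sparse, err = topk_with_error_feedback(g, err, k=2)
    total_sent += sparse

# Invariant: sent + still-buffered == everything produced. Nothing is
# permanently lost; it just arrives over multiple rounds.
print(np.allclose(total_sent + err, 8 * g))   # True
```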

DiLoCo is resilient to data heterogeneity (different workers seeing different data distributions) and worker failures (if a worker drops out, the others continue).

5.3 Why This Maps to Lightning Payment Rounds

The DiLoCo round structure creates natural payment boundaries:

  1. Round starts: Coordinator issues hold invoices to all participating peers. The sats are locked but not yet settled.
  2. Peers train locally for H steps. This takes minutes to hours depending on model size and hardware.
  3. Peers submit compressed pseudo-gradients via L402-gated endpoints.
  4. Coordinator validates: Applies each pseudo-gradient to a held-out test batch and measures loss improvement. This is deterministic and replayable — anyone can verify the scoring.
  5. Settlement: Peers whose gradients improved the model have their hold invoices settled (sats released to them). Peers whose gradients were harmful or useless have their invoices cancelled (sats returned to the coordinator).
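The validation step (step 4) can be sketched with a toy model. The squared-error loss, learning rate, and settle rule below are illustrative assumptions, not the protocol's actual scoring function:

```python
import numpy as np

def held_out_loss(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

def validate(global_w, pseudo_grad, X_held, y_held, outer_lr=0.7):
    # Deterministic and replayable: the same inputs always give the
    # same verdict, so anyone can audit the coordinator's decision.
    before = held_out_loss(global_w, X_held, y_held)
    after = held_out_loss(global_w + outer_lr * pseudo_grad, X_held, y_held)
    return after < before, before - after   # settle iff loss improved

rng = np.random.default_rng(1)
X = rng.normal(size=(16, 4))
y = X @ rng.normal(size=4)
w = np.zeros(4)

good = -0.05 * (2 / len(X)) * X.T @ (X @ w - y)   # small descent step
bad = -good                                        # same step, wrong direction

settle_good, _ = validate(w, good, X, y)   # True  -> settle hold invoice
settle_bad, _ = validate(w, bad, X, y)     # False -> cancel hold invoice
```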

Each DiLoCo round is one payment cycle. The economic incentive (sats) is directly tied to the training contribution (validated gradient quality). No tokens, no staking, no blockchain — just conditional Lightning payments settled on proof of useful work.


6. Where l402-train Sits on the Spectrum

l402-train is not federated learning. There's no central data controller deciding what gets trained on whose data. Participants choose to join, choose their hardware, and can leave at any time.

l402-train is not fully trustless either. There is a coordinator — a service that posts training tasks, collects pseudo-gradients, validates them, and settles payments. This is a real role with real responsibility.

What makes l402-train trust-minimized rather than trusted:

Hold invoices create atomic settlement. The coordinator can't steal payments. Hold invoices lock sats in the Lightning channel; they're either settled (released to the peer) or cancelled (returned to the payer). The coordinator cannot settle an invoice without the peer first submitting a gradient, and the peer cannot receive payment without the coordinator validating the gradient. Neither party can cheat the other.
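The lifecycle that makes this atomic can be sketched as a small state machine. The type and function names below are assumptions for this sketch, not a real Lightning node API:

```python
from enum import Enum, auto

class InvoiceState(Enum):
    ACCEPTED = auto()     # sats locked in the channel, preimage withheld
    SETTLED = auto()      # preimage revealed: sats released to the peer
    CANCELLED = auto()    # HTLC cancelled: sats returned to the payer

def resolve(state, gradient_validated, timed_out):
    # Settlement requires a validated gradient; a timeout always refunds.
    # No path lets one party keep both the sats and the work.
    if state is not InvoiceState.ACCEPTED:
        return state
    if timed_out:
        return InvoiceState.CANCELLED
    return InvoiceState.SETTLED if gradient_validated else InvoiceState.CANCELLED
```

The timeout branch is what prevents a stalled coordinator from holding peer liquidity hostage indefinitely.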

Validation is deterministic and replayable. The coordinator scores gradients by measuring loss improvement on a held-out dataset. This computation is deterministic — given the same model state, gradient, and test data, any independent observer will get the same score. A dishonest coordinator can be caught by anyone willing to replay the validation.

The coordinator is replaceable. If a coordinator behaves dishonestly (rejecting valid gradients, demanding unfair terms), peers can switch to a different coordinator. There's no lock-in. The protocol is open; the coordinator is just a service provider.

No custody of user data. Peers train on their own data and submit only compressed pseudo-gradients (1.56% density, 2-bit quantized). The coordinator never sees training data, full gradients, or inner optimizer state.

This puts l402-train in a specific position on the trust spectrum:

Dimension           | Federated Learning                         | Bittensor                                  | l402-train
Trust assumption    | Trust operator completely                  | Trust blockchain + subnet validators       | Trust-minimized: deterministic validation + conditional payments
Payment rails       | None                                       | TAO token (requires exchange)              | Bitcoin/Lightning (direct, no token)
Validator incentive | Operator self-validates                    | Blockchain consensus + staking             | Hold invoices: validator pays if gradient is good, auto-cancelled otherwise
Switching cost      | Cannot switch (operator owns infrastructure) | Migrate to different subnet              | Point peer client at different coordinator URL
Privacy model       | Data stays local; operator sees full updates | Data stays local; subnet sees compressed updates | Data stays local; coordinator sees 1.56%-sparse, 2-bit quantized pseudo-gradients

The design philosophy: minimize trust, not eliminate it. A single coordinator is simpler, faster, and cheaper than blockchain consensus. The hold invoice mechanism ensures that even a minimally trusted coordinator can't profitably cheat. And if they do, they're trivially replaceable.


References

  1. Douillard et al. "DiLoCo: Distributed Low-Communication Training of Language Models." arXiv:2311.08105 (2023).
  2. DeepMind. "Streaming DiLoCo with overlapping communication." (2025). deepmind.google
  3. Prime Intellect. "OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training." primeintellect.ai
  4. Hivemind: Decentralized deep learning in PyTorch. github.com/learning-at-home/hivemind
  5. Covenant-72B. arXiv:2603.08163 (2026).
  6. Google Research. "Federated Learning of Gboard Language Models with Differential Privacy." arXiv:2305.18465 (2023).
  7. Zhu et al. "Deep Leakage from Gradients." NeurIPS 2019.
  8. Shan et al. "Geminio: Language-Guided Gradient Inversion Attacks in Federated Learning." ICCV 2025.
  9. Dai et al. "Activation Inversion Attacks in Decentralized Training." (2025).
  10. "Improved gradient leakage attack against compressed gradients in federated learning." Neurocomputing (2024).
  11. "Preserving data privacy in federated learning through large gradient pruning." ScienceDirect (2022).
  12. Li et al. "Auditing Privacy Defenses in Federated Learning via Generative Gradient Leakage." CVPR 2022.
  13. "Gradient Leakage Attacks in Federated Learning: Research Frontiers, Taxonomy, and Future Directions." IEEE Network (2024).