Covenant-72B: Deep Technical Analysis

Date: 2026-03-12 · Paper: arxiv.org/abs/2603.08163 · Model: HuggingFace · License: Apache 2.0


Executive Summary

Covenant-72B is a 72-billion parameter LLM pre-trained over the internet by 70+ permissionless peers using the Bittensor blockchain (Subnet 3) for coordination. It is the largest model ever trained in a fully decentralized, non-whitelisted manner. The core technical innovation is SparseLoCo, a communication-efficient optimizer achieving 146x gradient compression, which makes synchronization over commodity internet feasible. The model was trained on ~1.1T tokens of DCLM web text and achieves results competitive with LLaMA-2-70B (trained on 2T tokens in a centralized datacenter).

Bottom line: A legitimate and impressive systems/coordination achievement. The model itself is mediocre by 2026 standards — roughly LLaMA-2-70B quality, which is 2–3 generations behind current frontier open models. The significance is entirely in proving the method works at scale, not in producing a useful model.


1. Technical Architecture

1.1 SparseLoCo Algorithm

SparseLoCo (Sparse Low-Communication) is a variant of DiLoCo (Distributed Low-Communication) from DeepMind. The key idea: instead of synchronizing gradients every step (which requires enormous bandwidth), each peer trains locally for many steps, then shares a heavily compressed summary of what it learned.

Algorithm steps per round:

  1. Local training (H=30 inner steps): Each peer runs AdamW on its own data shard for 30 optimization steps, accumulating ~192 sequences per batch (seq_len=2048).
  2. Pseudo-gradient computation: After 30 steps, compute the difference between the weights at the start of the round and the current weights:
    Delta_r = theta_start_of_round - theta_current
    This "pseudo-gradient" captures what the peer learned. The sign convention (start minus end, as in DiLoCo) lets the outer optimizer subtract it like an ordinary gradient in step 8.
  3. Error-feedback accumulation: Combine the new pseudo-gradient with previously discarded information:
    combined = 0.95 * error_buffer + Delta_r
    The error buffer remembers what got thrown away last round (critical for convergence).
  4. Top-k sparsification: Divide the combined tensor into chunks of 4096 elements. Within each chunk, keep only the k=64 largest-magnitude values (1.56% density). Everything else goes back into the error buffer.
  5. 2-bit quantization: Quantize the surviving values to 2 bits.
  6. Index encoding: Each selected value's position within its chunk is encoded in 12 bits (log2(4096) = 12, enough to address any chunk position; the information-theoretic minimum for choosing 64 of 4096 positions is ~7.36 bits per value, so this encoding leaves some slack).
  7. All-gather: Peers upload compressed pseudo-gradients to Cloudflare R2 object storage. Other peers download and average them.
  8. Global update: Apply the averaged compressed pseudo-gradient:
    theta_new = theta_old - alpha * (1/R) * sum(compressed_deltas)
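The round above can be condensed into a NumPy sketch. This is illustrative, not the released code: the H=30 local AdamW steps are elided, the 2-bit quantizer is replaced by a crude sign-and-scale stand-in, and index encoding and the R2 all-gather are omitted. The pseudo-gradient uses the DiLoCo sign convention (start minus end), so the outer update subtracts it like a gradient.

```python
import numpy as np

CHUNK, K = 4096, 64   # chunk size and top-k per chunk (1.56% density)
EF_DECAY = 0.95       # error-feedback decay

def compress_round(theta_start, theta_end, error_buffer):
    """Steps 2-5 of one SparseLoCo round (simplified sketch).

    Returns the sparse update that would be transmitted and the
    new error buffer holding everything that was discarded.
    """
    delta = theta_start - theta_end              # step 2: pseudo-gradient
    combined = EF_DECAY * error_buffer + delta   # step 3: error feedback
    sent = np.zeros_like(combined)
    for s in range(0, combined.size, CHUNK):     # step 4: per-chunk top-k
        chunk = combined[s:s + CHUNK]
        top = np.argpartition(np.abs(chunk), -K)[-K:]  # k largest magnitudes
        vals = np.zeros_like(chunk)
        # step 5 (stand-in for the 2-bit quantizer): sign * mean magnitude
        vals[top] = np.sign(chunk[top]) * np.abs(chunk[top]).mean()
        sent[s:s + CHUNK] = vals
    return sent, combined - sent                 # discarded mass re-enters buffer

def outer_update(theta, peer_updates, alpha=0.65):
    """Step 8: subtract the average of the compressed pseudo-gradients."""
    return theta - alpha * np.mean(peer_updates, axis=0)
```

The key invariant is that compression loses nothing permanently: the transmitted update plus the new error buffer always equals the combined tensor, which is why the error buffer is described as critical for convergence.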

1.2 The 146x Compression Ratio

This number is the per-message compression ratio, composed of two stages:

  • Spatial compression (top-k): Keep 64 out of 4096 elements per chunk → 64x fewer values
  • Bit compression: 2-bit values + 12-bit indices (14 bits per kept value) instead of 32-bit floats → a further ~2.3x

Combined: a dense FP32 chunk is 4096 × 32 = 131,072 bits; a compressed chunk is 64 × 14 = 896 bits, a ratio of ~146x. Temporal compression (synchronizing every H=30 steps instead of every step) cuts how often these messages are sent by a further ~30x, but that factor is not part of the 146x figure.

Practical impact: A 72B model's full gradient is ~290 GB in FP32. With 146x compression, each peer sends/receives ~2 GB per round. At 110 Mb/s uplink, that takes ~150 seconds. Actual measured communication time: 70 seconds per round (likely due to overlap and pipeline optimization). Computation per round: 20 minutes. Resulting compute utilization: ~94.5%.
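Both the ratio and the transfer-time estimate can be reproduced as back-of-envelope arithmetic from the stated chunk parameters:

```python
CHUNK_ELEMS = 4096            # elements per chunk
K = 64                        # values kept per chunk
DENSE_BITS = 32               # FP32 baseline
VALUE_BITS, INDEX_BITS = 2, 12

dense_bits = CHUNK_ELEMS * DENSE_BITS            # 131,072 bits per dense chunk
compressed_bits = K * (VALUE_BITS + INDEX_BITS)  # 896 bits per compressed chunk
ratio = dense_bits / compressed_bits             # ~146.3x per synced message

# Transfer-time sanity check: ~290 GB dense gradient, 110 Mb/s uplink
payload_gb = 290 / ratio                         # ~2 GB per round
seconds = payload_gb * 8 * 1000 / 110            # ~144 s at line rate
print(round(ratio, 1), round(payload_gb, 2), round(seconds))
```

The ~144-second line-rate estimate versus the measured 70 seconds is consistent with the paper's overlap of communication and computation.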

1.3 Gauntlet Validator

The Gauntlet is the system that prevents free-riders, lazy peers, and adversarial attacks in a permissionless network. It runs on Bittensor's blockchain infrastructure.

Scoring mechanisms:

  • LossScore: The primary signal. The validator takes a peer's submitted pseudo-gradient and measures the loss improvement on a held-out batch before vs. after applying it. If your gradient doesn't help, you score poorly.
  • Assigned vs. unassigned data check: Each peer is assigned specific data shards. The validator checks whether the gradient helps more on the peer's assigned data than on random data — this catches peers who copy gradients from others rather than doing their own training.
  • Norm calibration: Pseudo-gradients are scaled relative to the median norm across all submissions. This prevents a peer from submitting outsized or undersized updates.
  • OpenSkill ranking: Scores are accumulated over time using an ELO-like system (OpenSkill), creating a reputation that's hard to game with a single round.
  • Liveness and sync checks: Validators verify that peers are actually synchronized with the current model state — you can't submit stale gradients from an old checkpoint.

Key design point: Not every peer is evaluated every round. A random subset of peers is scored on a random subset of data, keeping validation costs manageable while maintaining statistical deterrence.
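The two core scoring checks can be sketched as follows. Function names, the outer learning rate default, and the toy loss interface are illustrative assumptions, not the Gauntlet codebase:

```python
def loss_score(loss_fn, theta, peer_delta, heldout_batch, alpha=0.65):
    """Score a peer by the loss improvement its pseudo-gradient yields on a
    held-out batch when applied via the outer update: positive = it helped."""
    before = loss_fn(theta, heldout_batch)
    after = loss_fn(theta - alpha * peer_delta, heldout_batch)
    return before - after

def assigned_data_check(loss_fn, theta, peer_delta, assigned_batch, random_batch):
    """Flag gradient copying: an honest peer's update should help more on its
    own assigned shard than on unrelated data."""
    return (loss_score(loss_fn, theta, peer_delta, assigned_batch)
            > loss_score(loss_fn, theta, peer_delta, random_batch))
```

In the real system this is run by validators on a random subset of peers and data each round, so a peer cannot predict when it will be audited.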

1.4 Blockchain Component (Bittensor Subnet 3)

Bittensor is a decentralized network where "subnets" run specific AI tasks. Each subnet has:

  • Miners: Do the actual work (in this case, training the model)
  • Validators: Score the miners' contributions (run the Gauntlet)
  • TAO token: The native cryptocurrency. Validators stake TAO and set weights on miners. The Bittensor consensus mechanism (Yuma Consensus) translates these weights into TAO emissions — miners who contribute better gradients earn more TAO.

Covenant runs on Subnet 3 (also called "Templar" or "tplr"). The team behind it is Covenant AI (formerly Templar), led by Samuel Dare, with researchers including Joel Lidin, Amir Sarfi, and Eugene Belilovsky (Mila/Concordia).

The incentive loop: Peers invest ~8x B200 GPUs → train honestly → Gauntlet scores them highly → they earn TAO emissions → TAO has market value → covers GPU costs (ideally). This creates an economic flywheel where the better the model gets, the more valuable participation becomes (in theory).

1.5 Peer Coordination

  • No central cluster. Peers discover each other through the Bittensor blockchain.
  • Dynamic participation. Peers join and leave freely. Average active peers: 24.4 per round. Average actually contributing to aggregation: 16.9 (capped at 20). Over the full run: 70+ unique participants.
  • Asynchronous communication. Compressed pseudo-gradients are uploaded to Cloudflare R2 (object storage). Other peers download asynchronously. This avoids the need for all peers to be online simultaneously.
  • Fault tolerance. If a peer drops out, the round continues without it. The outer learning rate was adjusted during training (reduced from 1.0 to 0.65 at step 110K) based on training dynamics, but this was a manual intervention, not an automatic mechanism.

1.6 Hardware Requirements

Per peer minimum: 8x NVIDIA B200 (or equivalent, e.g., 8x H200)

This is NOT consumer hardware. An 8x B200 node costs roughly $200–300K+ to purchase, or $15–25/hour to rent. The "commodity internet" claim refers to bandwidth (commodity 500 Mb/s down / 110 Mb/s up), not to commodity hardware. You need serious GPUs.

Parallelism: Dynamic FSDP (Fully Sharded Data Parallel) across the 8 GPUs within each peer. The error-feedback buffer is sharded using the same FSDP strategy to avoid doubling memory.

Network: 500 Mb/s downlink, 110 Mb/s uplink — this is the key claim. Regular internet, not InfiniBand.

1.7 Data Pipeline

Pre-training (main phase, ~1.09T tokens):

  • Dataset: DCLM-baseline-1.0 (DataComp for Language Models), curated web text
  • English only
  • Pre-tokenized and hosted on object storage
  • Each peer is assigned its own data shards (assignments may partially overlap across peers)
  • Background shard downloading for seamless replacement

Annealing phase (~14.2B tokens):

  • Higher-quality data mixture:
    • Instruction data: ~27%
    • Synthetic web: ~20%
    • Code: ~15%
    • Math: ~13%
    • Pre-training replay: ~25%
  • Outer learning rate reduced + rapid inner LR decay
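The annealing mixture above can be expressed as a tiny weighted sampler. The proportions come from the text; the sampler itself is a generic illustration, not the project's data loader:

```python
import random

ANNEAL_MIX = {                # proportions from the annealing phase
    "instruction": 0.27,
    "synthetic_web": 0.20,
    "code": 0.15,
    "math": 0.13,
    "pretrain_replay": 0.25,
}

def sample_source(rng=random):
    """Pick the source of the next training document by mixture weight."""
    return rng.choices(list(ANNEAL_MIX), weights=list(ANNEAL_MIX.values()), k=1)[0]
```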

Post-training (Covenant-72B-Chat):

  • Stage 1: 36,500 SFT steps at 4K context, batch size 256
  • Stage 2: 20,500 SFT steps at 8K context + 20% pre-training replay
  • Both stages: AdamW, weight decay 0.01, gradient clipping 1.0

2. Model Architecture

| Parameter | Value |
|---|---|
| Parameters | 72,747,327,488 (72.7B) |
| Layers | 80 |
| Hidden size | 8192 |
| Intermediate size | 28672 |
| Query heads | 64 |
| KV heads | 8 (GQA) |
| Head dimension | 128 |
| RoPE base frequency | 500,000 |
| Vocabulary size | 262,208 |
| Tokenizer | Gemma 3 SentencePiece |
| Pre-training context length | 2048 |
| Post-training context length | 8192 |

Architecture is LLaMA-style (LlamaForCausalLM) with Grouped Query Attention (8 KV heads for 64 query heads). Nothing novel about the architecture itself — the innovation is entirely in the training method.

Context length note: 2048 during pre-training is very short by 2026 standards. The SFT stage extends to 8K, which is still short. Modern models support 128K+ (LLaMA 3.1, Qwen 2.5).
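The headline parameter count can be roughly reproduced from the listed shapes, assuming standard LLaMA-style layers with untied embeddings and no biases. Small deviations (e.g. from vocabulary padding) are expected:

```python
V, D, LAYERS, I = 262_208, 8_192, 80, 28_672  # vocab, hidden, layers, intermediate
KV_HEADS, HEAD_DIM = 8, 128                   # GQA key/value heads, head dimension

embed = 2 * V * D                             # input embedding + untied lm_head
attn = 2 * D * D + 2 * D * (KV_HEADS * HEAD_DIM)  # q/o plus narrow k/v projections
mlp = 3 * D * I                               # gate, up, down projections
norms = 2 * D                                 # two RMSNorms per layer
total = embed + LAYERS * (attn + mlp + norms) + D  # + final norm

print(f"{total:,}")   # lands within ~0.002% of the reported 72,747,327,488
```

The near-match also confirms that the released checkpoint does not tie its embedding and output matrices.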


3. Benchmarks & Quality Assessment

3.1 Pre-Training Benchmarks (0-shot)

| Benchmark | Covenant-72B | LLM360 K2 (65B, 1.4T tok) | LLaMA-2-70B (2T tok) |
|---|---|---|---|
| ARC-Challenge | 56.8 | 53.8 | 57.4 |
| ARC-Easy | 80.9 | 76.0 | 79.6 |
| PIQA | 81.6 | 82.5 | 82.6 |
| OpenBookQA | 44.0 | 48.0 | 49.4 |
| HellaSwag | 80.6 | 82.9 | 84.3 |
| WinoGrande | 75.9 | 76.4 | 80.4 |
| MMLU | 67.1 | 65.5 | 65.6 |

Interpretation: Covenant-72B wins on MMLU (+1.5 over LLaMA-2-70B) and ARC-Easy (+1.3), but loses on HellaSwag (-3.7), WinoGrande (-4.5), OpenBookQA (-5.4), and PIQA (-1.0). Overall it's roughly competitive with LLaMA-2-70B — a fair characterization given the 1.1T vs 2T token disadvantage.

3.2 Decentralized Baselines

| Model | Size | Tokens | MMLU | ARC-C | Whitelisted? |
|---|---|---|---|---|---|
| INTELLECT-1 | 10B | 1T | 32.7 | 44.8 | Yes (curated participants) |
| Psyche Consilience | 40B | 1.2T | 24.2 | 31.1 | Yes |
| Covenant-72B | 72B | 1.1T | 67.1 | 56.8 | No (permissionless) |

Covenant-72B dramatically outperforms prior decentralized efforts. Psyche Consilience at 40B with 1.2T tokens scoring 24.2 on MMLU is striking: that is below the 25% random-guess baseline, which suggests the project suffered serious training instability. INTELLECT-1 at 10B/32.7 MMLU is more reasonable for its scale.

3.3 Chat Model (5-shot)

| Benchmark | Covenant-72B-Chat | K2-Chat (65B) | LLaMA-2-70B-Chat |
|---|---|---|---|
| ARC-Challenge | 64.2 | 62.0 | 65.4 |
| GSM8K | 63.9 | 79.0 | 52.2 |
| MMLU | 67.4 | 67.9 | 63.1 |
| IFEval | 64.7 | 45.5 | 40.7 |
| MATH | 26.3 | 19.1 | 10.7 |
| MMLU-Pro | 40.9 | 45.4 | 35.2 |

The chat model shows clear IFEval and MATH advantages over LLaMA-2-70B-Chat, but falls short of K2-Chat on GSM8K and MMLU-Pro.

3.4 Honest Comparison to Modern 70B Models (2026 Context)

This is where the picture gets sobering:

| Metric | Covenant-72B | LLaMA 3.1 70B | Qwen 2.5 72B |
|---|---|---|---|
| MMLU | 67.1 | 79.3 | ~85 |
| ARC-Challenge | 56.8 | 92.9 | ~90+ |
| Training tokens | 1.1T | 15T+ | 18T |
| Context length | 2K (8K chat) | 128K | 128K |

The gap is enormous. LLaMA 3.1 70B scores 79.3 on MMLU vs. Covenant's 67.1. On ARC-Challenge, it's 92.9 vs. 56.8 — a 36-point gap. Qwen 2.5 72B is even further ahead. This isn't surprising: those models were trained on 14–16x more data with state-of-the-art data curation.


4. Heterogeneous SparseLoCo (Follow-Up Paper)

Paper: arxiv.org/abs/2601.02360
Authors: Yazan Obeidi, Amir Sarfi, Joel Lidin (Covenant AI), Paul Janson, Eugene Belilovsky (Mila/Concordia)

This paper addresses the biggest practical limitation of Covenant-72B: every peer needs identical high-end hardware (8x B200). Heterogeneous SparseLoCo allows peers with different hardware to participate.

How it works:

  • Peers with enough GPU memory host a full model replica (standard SparseLoCo)
  • Peers with less memory split the model across GPUs using pipeline parallelism
  • Inter-stage activations (which normally require high bandwidth within a pipeline) are compressed using subspace projection — project activations onto a low-rank subspace via random orthonormal matrix U
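The subspace-projection idea can be sketched directly: fix a random orthonormal basis U of rank r (reproducible from a shared seed, so it never has to be transmitted), send only the r coefficients per activation vector, and reconstruct on the receiving stage. Function names are illustrative; with d=64 and r=8 this matches the paper's 87.5% compression setting:

```python
import numpy as np

def make_subspace(d, r, seed=0):
    """Random orthonormal d x r basis, reproducible from a shared seed."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.normal(size=(d, r)))  # reduced QR: q is (d, r)
    return q

def compress(acts, U):
    """Project activations (batch, d) to (batch, r) coefficients.
    This smaller tensor is what crosses the network between stages."""
    return acts @ U

def decompress(coeffs, U):
    """Reconstruct approximate (batch, d) activations on the next stage."""
    return coeffs @ U.T
```

Reconstruction is exact for anything already inside the subspace and a least-squares approximation otherwise, which is where the few-percent loss degradation comes from.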

Key results (tested at 178M – 1B scale):

  • At 87.5% activation compression: 3.3–3.8% loss degradation
  • At 99% compression: 7.4–8.1% degradation
  • Heterogeneous setups (mix of compressed and uncompressed) outperform uniform compression
  • The advantage grows with more aggressive compression (2.6 percentage points at 99.9%)

Practical implication:

This could eventually allow peers with 4x or even 2x GPUs to participate alongside 8x GPU peers, dramatically lowering the barrier to entry. But it's only been validated at small scale (up to 1B parameters). Whether it works at 72B is unproven.


5. Significance & Critical Assessment

5.1 What's genuinely impressive

  1. First permissionless large-scale training. INTELLECT-1 and Psyche both whitelisted participants. Covenant let anyone join. Making this work with Byzantine fault tolerance is a real systems achievement.
  2. 94.5% compute utilization. Despite training a model 7.2x larger than INTELLECT-1, Covenant achieved higher utilization (94.5% vs. 82.1%) with lower per-round communication overhead (70 seconds vs. 8.3 minutes). The compression engineering is excellent.
  3. Convergence despite extreme compression. 146x compression with error feedback actually works — the model converges to competitive performance. This is a non-obvious result.
  4. Scale milestone. 72B is the largest model ever trained this way, by a factor of ~2x over the next largest (Psyche at 40B, which didn't even work well).

5.2 What's not impressive / limitations

  1. The model is weak by 2026 standards. LLaMA-2-70B quality puts it roughly 2–3 years behind the frontier. You would never choose Covenant-72B for any practical application when LLaMA 3.1, Qwen 2.5, or DeepSeek V3 exist.
  2. "Commodity internet" is misleading — the hardware isn't commodity at all. 8x B200 GPUs is a ~$200–300K investment per peer. The 70+ "unique participants" likely includes many cloud instances rented by a small number of entities, not 70 different organizations.
  3. Only 1.1T tokens. Modern models train on 15–18T tokens. The team could argue "same compute budget comparison is fair," but the result is a model that's not useful in practice.
  4. 2048 context length is absurdly short. Even the chat model only extends to 8K. This alone makes it impractical.
  5. Benchmark cherry-picking. Comparing against LLaMA-2-70B (July 2023) in March 2026 lets you claim "competitive" while avoiding embarrassing comparisons to current models.
  6. The crypto angle. Bittensor's TAO token creates real incentives but also real conflicts of interest. The project announcement is structured to pump the token as much as to advance science.

5.3 Comparison to prior decentralized training

| Project | Date | Scale | Permissionless? | Algorithm | Quality |
|---|---|---|---|---|---|
| DiLoCo (DeepMind) | 2023 | Research paper | N/A (internal) | DiLoCo | Proof of concept |
| INTELLECT-1 (PrimeIntellect) | 2024 | 10B, 1T tokens | Whitelisted | DiLoCo + int8 | Weak (32.7 MMLU) |
| Psyche Consilience | 2025 | 40B, 1.2T tokens | Whitelisted | DiLoCo variant | Broken (24.2 MMLU) |
| Covenant-72B | 2026 | 72B, 1.1T tokens | Permissionless | SparseLoCo | Decent (67.1 MMLU) |

Covenant is clearly the best result in this lineage. The jump from 10B to 72B with better quality is meaningful. The permissionless aspect is a genuine advance.


6. Bittensor / Token Economics

  • TAO is Bittensor's native token with a fixed supply cap of 21 million (modeled after Bitcoin)
  • Subnets receive TAO emissions based on their "weight" in the network (set by validators and root network)
  • Within Subnet 3 (Covenant/Templar), miners earn TAO proportional to their Gauntlet scores
  • Validators stake TAO and set weights on miners — their staking weight determines how much influence their scoring has
  • The economic proposition: spend $X on GPU rental → earn $Y in TAO → if Y > X, mining is profitable
  • This creates a market-driven compute allocation — if TAO price rises, more miners join, more compute is available
  • Criticism: This is ultimately a proof-of-stake system where the wealthy (large TAO holders) control which miners get rewarded. The "decentralization" is real but not as egalitarian as it sounds.

7. Relevance Assessment

Is this relevant for running local models?

No. This is about training, not inference. The resulting model (72B, 2048 context, LLaMA-2-tier quality) is not interesting for local inference — there are dramatically better options at every size.

Is this relevant for decentralized AI infrastructure?

Yes, significantly. If you believe that compute concentration is a problem (a few companies control frontier training), this is the most credible demonstration that decentralized training can work at meaningful scale. The SparseLoCo algorithm and Gauntlet validator are genuine contributions.

Could this pattern be used for fine-tuning?

Likely yes, and more practically. Fine-tuning requires far less compute and communication than pre-training. SparseLoCo's communication efficiency would be even more beneficial for fine-tuning, where you could have many more participants with smaller hardware. This could be a more practical near-term application.

Could this catch up to frontier models?

Theoretically, if scaled. The team's argument: "70 contributors is just the proof of concept. The whole bet is that the approach scales to thousands." If they could aggregate 10x more peers and train on 15T+ tokens, the quality gap could close. Whether that's economically viable via TAO emissions is the real question.

What's the practical takeaway?

Watch the method, ignore the model. The SparseLoCo algorithm and the proof that permissionless Byzantine-tolerant training works at 72B scale are the contributions. The model itself is an artifact of the proof, not a useful product. If this team (or someone using their methods) can scale to modern data volumes, it becomes much more interesting.


8. Team & Organization

Authors (paper): Joel Lidin, Amir Sarfi, Erfan Miahi, Quentin Anthony, Shivam Chauhan, Evangelos Pappas, Benjamin Therien, Eugene Belilovsky, Samuel Dare

Affiliations:

  • Covenant AI (formerly Templar project)
  • Mila / Concordia University (Eugene Belilovsky, academic advisor)

Entity: @tplr_ai on X (Templar). @covenant_ai is the project account. The team is relatively small, with academic connections to Mila (Montreal's AI institute).


References

  1. Covenant-72B paper: arxiv.org/abs/2603.08163
  2. Heterogeneous SparseLoCo: arxiv.org/abs/2601.02360
  3. Model weights: huggingface.co/1Covenant/Covenant-72B
  4. Chat model: huggingface.co/1Covenant/Covenant-72B-Chat
  5. DCLM dataset: huggingface.co/datasets/mlfoundations/dclm-baseline-1.0-parquet
  6. INTELLECT-1: primeintellect.ai/blog/intellect-1
  7. DiLoCo (DeepMind): Douillard et al., 2023