Consumer Hardware for AI Training: A Practical Guide
What your computer can actually do — and what it can’t.
This guide covers what hardware matters for contributing training compute to a decentralized AI training protocol, what performance to expect from the machine you already own, and how background training works in practice. No ML expertise required.
1. What Hardware Matters (and Why)
AI training is matrix multiplication at scale. A model learns by multiplying enormous grids of numbers together, millions of times. Three hardware specs determine how fast your machine can do this:
GPU / Apple Silicon GPU cores. GPUs have thousands of small cores designed for parallel math. A CPU has 8–16 fast cores; a GPU has thousands of slower ones. Training is embarrassingly parallel — thousands of multiplications that don’t depend on each other — so GPUs win by 10–100x over CPUs alone. On Apple Silicon, the GPU is built into the same chip as the CPU and shares the same memory pool. On PCs, the GPU is a separate card with its own dedicated memory (VRAM).
Memory capacity (VRAM or Unified Memory). The model’s weights, the training data batch, and the optimizer state all need to fit in GPU-accessible memory simultaneously. A 7B parameter model in 16-bit precision occupies ~14 GB just for the weights. Add optimizer states and gradients, and full fine-tuning needs ~56 GB. Techniques like LoRA and QLoRA dramatically reduce this — QLoRA fits a 7B model into ~8–10 GB — but memory capacity remains the hard ceiling on what model sizes you can work with. On Apple Silicon, unified memory means the GPU can access all system RAM (8–192 GB). On NVIDIA, the GPU can only use its own VRAM (10–24 GB on consumer cards), and that’s a hard wall.
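The arithmetic behind these figures is worth making explicit. A minimal sketch, assuming 16-bit weights and gradients plus ~4 bytes per parameter of optimizer state (the ~8 bytes/parameter accounting this guide uses; activations are excluded and vary with batch size):

```python
def full_finetune_memory_gb(params_billions: float) -> dict:
    """Rough full fine-tuning memory budget in GB (activations excluded)."""
    p = params_billions  # billions of params x 1 byte/param = 1 GB
    return {
        "weights": 2 * p,    # 16-bit weights: 2 bytes per parameter
        "gradients": 2 * p,  # gradients at the same precision
        "optimizer": 4 * p,  # assumed ~4 bytes/param of optimizer state
        "total": 8 * p,
    }

budget = full_finetune_memory_gb(7)
# weights: 14 GB, total: 56 GB, matching the figures above
```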
Memory bandwidth. Even if a model fits in memory, training speed depends on how fast data moves between memory and the GPU cores. This is measured in GB/s. An RTX 4090 has 1,008 GB/s of VRAM bandwidth. An M4 Pro has 273 GB/s. An M2 Ultra has 800 GB/s. Higher bandwidth means the GPU cores spend less time waiting for data and more time computing. For training workloads, bandwidth is often the bottleneck — not raw compute.
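A back-of-envelope model shows why. If a training step must stream the full weights from memory several times (roughly once forward and twice backward — a common rule of thumb and an assumption here), bandwidth alone sets a floor on step time:

```python
def memory_bound_step_ms(model_gb: float, bandwidth_gbps: float,
                         weight_passes: int = 3) -> float:
    """Lower bound on step time if weight traffic were the only cost.

    Ignores compute, caches, and activation traffic entirely; real steps
    are slower, but the ratio between devices is what matters.
    """
    return weight_passes * model_gb / bandwidth_gbps * 1000  # milliseconds

# Streaming 14 GB of weights three times per step:
# RTX 4090 at 1,008 GB/s -> ~42 ms floor
# M4 Pro at 273 GB/s     -> ~154 ms floor
```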
Why TFLOPS Aren’t Everything
Raw compute (TFLOPS) matters, but it’s misleading in isolation. An RTX 4090 has 82.6 FP32 TFLOPS vs. an M2 Ultra’s 27.2 TFLOPS — a 3x advantage on paper. But the RTX 4090 also has Tensor Cores optimized for the specific matrix math used in training (FP16/BF16), which push effective throughput much higher. Apple Silicon has no equivalent hardware acceleration for training-specific operations, which is why the real-world gap is closer to 3–5x for training workloads.
2. Hardware Tiers: What Can You Actually Train?
All throughput numbers below are for LoRA/QLoRA fine-tuning unless stated otherwise. “Tokens per second” (tok/s) measures training throughput — how fast the model processes training data. Higher is better. These are training speeds, not inference (generation) speeds.
Tier 1: Entry Level — MacBook Air M2/M3 (8–16 GB)
| Spec | Value |
|---|---|
| GPU cores | 8–10 |
| Memory bandwidth | 100 GB/s |
| FP32 TFLOPS | 3.5–3.6 |
| Power (training load) | 15–30 W |
What it can train: 0.5B–1B models comfortably. 3B models with QLoRA on 16 GB configurations. 7B models are technically possible on 16 GB but push into swap and slow to a crawl. The 8 GB configurations are limited to sub-3B models.
Training throughput: ~80–120 tok/s on a 0.5B model. ~40–60 tok/s on a 3B model (16 GB only).
Verdict: Viable for contributing to small model training tasks. The 8 GB configuration is marginal — 16 GB is the realistic minimum. Silent, fanless, sips power. Good for “set it and forget it” overnight training contributions.
Tier 2: Sweet Spot — Mac Mini M2 Pro / M4 Pro (16–32 GB)
| Spec | M2 Pro | M4 Pro |
|---|---|---|
| GPU cores | 19 | 20 |
| Memory bandwidth | 200 GB/s | 273 GB/s |
| FP32 TFLOPS | ~6.8 | ~8.3 |
| Power (training load) | 30–50 W | 30–50 W |
What it can train: 3B models comfortably. 7B models with QLoRA on 32 GB configurations. Batch size will be limited (1–2) for 7B.
Training throughput: ~150–200 tok/s on a 3B model. ~80–130 tok/s on a 7B model (32 GB).
Verdict: The best value node for a decentralized training network. The Mac Mini starts at $599 (base M4, 16 GB) / $799 (24 GB); M4 Pro configurations start at $1,399. Low power, silent or near-silent, always-on form factor. The 24–32 GB configurations hit the sweet spot of capability vs. cost.
Tier 3: Workhorse — Mac Studio M2 Ultra / M4 Ultra (64–192 GB)
| Spec | M2 Ultra | M4 Ultra (est.) |
|---|---|---|
| GPU cores | 76 | 80 |
| Memory bandwidth | 800 GB/s | ~800+ GB/s |
| FP32 TFLOPS | 27.2 | ~30+ |
| Max memory | 192 GB | 192 GB |
| Power (training load) | 60–120 W | 60–120 W |
What it can train: 7B models easily, with room for larger batch sizes. 13B models with LoRA. 30B+ models with QLoRA on 192 GB configurations. Can handle model sizes that don’t fit on any consumer NVIDIA GPU.
Training throughput: ~475 tok/s on Mistral-7B LoRA (M2 Ultra, benchmark from Apple’s MLX examples). ~200–300 tok/s estimated on 13B LoRA.
Verdict: The most capable single-node training machine available to consumers. The 192 GB unified memory pool is unmatched — an RTX 4090 tops out at 24 GB VRAM. For training models in the 13B–30B range, nothing in the consumer space competes on memory capacity. The tradeoff: these machines cost $4,000–$8,000+.
Tier 4: Raw Power — Gaming PC with RTX 4090 (24 GB VRAM)
| Spec | Value |
|---|---|
| CUDA cores | 16,384 |
| Tensor cores | 512 (4th gen) |
| VRAM bandwidth | 1,008 GB/s |
| FP32 TFLOPS | 82.6 |
| FP16 Tensor TFLOPS | 330 |
| Power (training load) | 300–450 W |
What it can train: 7B models with LoRA or full QLoRA comfortably. 13B models with QLoRA (tight). Hard wall at ~13B — 24 GB VRAM is the ceiling. Cannot train the larger models that a 192 GB Mac Studio handles.
Training throughput: ~500–628 tok/s on a 1.5B model (QLoRA, PagedAdamW). ~200–350 tok/s estimated on 7B QLoRA. Roughly 3x faster than an M2 Ultra on equivalent model sizes.
Verdict: Fastest consumer training hardware by raw throughput, but VRAM-limited. The king for 7B and under. Loud, hot, power-hungry. Needs a 850W+ PSU, generates significant heat, and the fans will be audible under training load.
Tier 5: Budget NVIDIA — RTX 3080 (10 GB) / RTX 3060 (12 GB)
| Spec | RTX 3080 | RTX 3060 |
|---|---|---|
| CUDA cores | 8,704 | 3,584 |
| VRAM | 10 GB | 12 GB |
| Bandwidth | 760 GB/s | 360 GB/s |
| FP32 TFLOPS | 29.8 | 12.7 |
| Power (training) | 250–320 W | 150–170 W |
What they can train: 7B models with QLoRA (batch size 1, gradient checkpointing). The RTX 3060 actually has 2 GB more VRAM than the 3080, which paradoxically makes it slightly more capable for fitting larger models despite being slower.
Training throughput: RTX 3080: ~200–300 tok/s on 7B QLoRA (estimated). RTX 3060: ~120–180 tok/s on 7B QLoRA (estimated, based on ~60% of 4060 benchmarks).
Verdict: Viable budget training nodes. Millions of these exist in gaming PCs worldwide. The 10–12 GB VRAM ceiling limits them to 7B models with aggressive quantization, but for a protocol targeting 0.5B–3B models, these cards are more than capable. Used RTX 3060 12GB cards go for $150–200 — arguably the best value CUDA training hardware available.
Not Viable: Old AMD Mining GPUs (RX 580, etc.)
AMD’s RX 580 and similar Polaris/GCN-era GPUs are not viable for AI training:
- No software support. AMD dropped ROCm support for pre-Vega architectures. The RX 580 (gfx803) doesn’t work with ROCm 4.0+, and current PyTorch requires ROCm 5.7+. Hacky workarounds exist but are fragile and unmaintained.
- No tensor operations. These GPUs lack any equivalent to NVIDIA’s Tensor Cores or even basic FP16 acceleration for matrix math.
- 8 GB VRAM with 256 GB/s bandwidth. Even if the software worked, performance would be worse than a modern CPU.
Newer AMD GPUs (RX 7900 XTX with 24 GB) are viable via ROCm, but the old mining cards are e-waste for this purpose.
Not Viable: Raspberry Pi / Low-End ARM
Raspberry Pi and similar ARM SBCs cannot train models at all:
- No GPU compute. The Pi’s VideoCore GPU is for display output, not general-purpose computing. All ML work runs on the CPU.
- 4–8 GB shared RAM at 32–50 GB/s bandwidth. Orders of magnitude too slow.
- CPU-only performance: A Pi 5 achieves ~0.5–3 FPS on MobileNet inference. Training is 10–100x more compute-intensive than inference. A single training step on even a tiny model would take minutes.
For perspective: a task that takes an RTX 4090 one second would take a Raspberry Pi roughly an hour. These devices are useful for inference on pre-trained tiny models, but training is completely out of reach.
3. MLX vs. PyTorch vs. CUDA: What Runs Where
| Framework | Hardware | Training Support | Maturity |
|---|---|---|---|
| MLX | Apple Silicon only | LoRA/QLoRA fine-tuning, full training | Young but rapidly improving |
| PyTorch + CUDA | NVIDIA GPUs | Full training, all techniques | Gold standard, 7+ years |
| PyTorch + MPS | Apple Silicon | Basic training | Second-class citizen |
| PyTorch + ROCm | AMD GPUs (RDNA3+) | Full training | Usable but less tested |
MLX is Apple’s native ML framework, built specifically for unified memory architecture. It eliminates the CPU-to-GPU data transfer overhead that plagues discrete GPUs. For LoRA fine-tuning on Apple Silicon, MLX is 20–30% faster than PyTorch’s MPS backend and the only serious option for training on Mac. Its main limitation: no multi-node distributed training and a smaller ecosystem of pre-built training recipes.
PyTorch + CUDA remains the industry standard. Every training technique, every optimization, every research paper — CUDA first. Tensor Cores give NVIDIA GPUs 2–4x throughput advantage for FP16/BF16 training operations. The ecosystem is unmatched: Hugging Face Transformers, DeepSpeed, bitsandbytes (QLoRA), Flash Attention — all CUDA-first.
PyTorch + MPS (Metal Performance Shaders) is Apple’s PyTorch GPU backend. It works but is slower than MLX, lacks Flash Attention, doesn’t support distributed training, and many operations silently fall back to CPU. Not recommended for training.
For a decentralized training protocol: The coordinator dispatches tasks appropriate to each peer’s framework. CUDA peers get compute-intensive gradient calculations. MLX peers get tasks sized for their bandwidth and memory. The protocol doesn’t care what framework computes the gradients — only that the gradients are correct.
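As a sketch, the dispatch decision reduces to a capability match. Everything here (the `Peer`/`Task` shapes, the 25% headroom margin) is illustrative, not part of any real protocol:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Peer:
    framework: str    # "cuda" or "mlx"; the coordinator is agnostic
    memory_gb: float  # VRAM or unified memory available for training

@dataclass
class Task:
    model_params_b: float
    memory_gb: float  # footprint of the assigned technique (e.g., QLoRA)

def assign(task: Task, peers: list[Peer]) -> Optional[Peer]:
    """Pick the smallest peer that fits the task with 25% headroom,
    keeping larger machines free for larger models."""
    fits = [p for p in peers if p.memory_gb >= task.memory_gb * 1.25]
    return min(fits, key=lambda p: p.memory_gb, default=None)

peers = [Peer("mlx", 16.0), Peer("cuda", 24.0), Peer("mlx", 192.0)]
assign(Task(3, 5.0), peers)    # -> the 16 GB peer
assign(Task(70, 160.0), peers) # -> None: no peer has 200 GB free
```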
Training Speed: MLX vs. CUDA (Same Model)
For a concrete comparison — Mistral-7B LoRA fine-tuning:
| Hardware | Framework | Training tok/s | Power |
|---|---|---|---|
| M1 Max 32 GB | MLX | ~260 | ~40 W |
| M2 Ultra 192 GB | MLX | ~475 | ~90 W |
| RTX 4090 24 GB | PyTorch/CUDA | ~500–628 | ~350 W |
| RTX 4060 8 GB | PyTorch/CUDA | ~500 | ~150 W |
| RTX 3060 12 GB | PyTorch/CUDA | ~150–200 | ~160 W |
The RTX 4090 is faster in absolute terms, but the M2 Ultra reaches ~80% of its throughput at 25% of the power draw. Per-watt efficiency strongly favors Apple Silicon.
4. Realistic Model Size Expectations
What can you actually fine-tune on your hardware? This table shows the minimum memory required and which hardware tiers qualify:
| Model Size | QLoRA Memory | LoRA Memory | Full Fine-Tune | Minimum Hardware |
|---|---|---|---|---|
| 0.5B | ~2 GB | ~4 GB | ~8 GB | Any Mac (8 GB+), any NVIDIA (8 GB+) |
| 1B | ~3 GB | ~6 GB | ~16 GB | Any Mac (8 GB+), any NVIDIA (8 GB+) |
| 3B | ~5 GB | ~10 GB | ~40 GB | Mac 16 GB+, RTX 3060+ |
| 7B | ~8 GB | ~18 GB | ~56 GB | Mac 16 GB+ (tight), RTX 3060+ (tight) |
| 13B | ~14 GB | ~32 GB | ~104 GB | Mac 32 GB+, no consumer NVIDIA |
| 30B | ~30 GB | ~70 GB | ~240 GB | Mac Studio 64 GB+ only |
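The table can also be read programmatically. A sketch encoding the rows above (the figures are the table’s estimates, not measurements):

```python
# Rows from the table above: model size (B params) -> (QLoRA, LoRA, full) GB
MEMORY_GB = {
    0.5: (2, 4, 8),
    1:   (3, 6, 16),
    3:   (5, 10, 40),
    7:   (8, 18, 56),
    13:  (14, 32, 104),
    30:  (30, 70, 240),
}

def viable_methods(model_b: float, device_gb: float) -> list[str]:
    """Fine-tuning methods that fit on a device, per the table's estimates."""
    qlora, lora, full = MEMORY_GB[model_b]
    return [name for name, need in
            (("qlora", qlora), ("lora", lora), ("full", full))
            if device_gb >= need]

viable_methods(7, 24)   # -> ['qlora', 'lora']: an RTX 4090 can LoRA a 7B
viable_methods(13, 24)  # -> ['qlora']: 13B QLoRA fits in 24 GB, just barely
```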
For a protocol targeting 0.5B–3B models: Nearly every modern Mac and most gaming PCs with discrete NVIDIA GPUs can participate. This is the accessible sweet spot — proving that the coordination mechanism works doesn’t require training competitive frontier models.
The memory advantage of Apple Silicon is real but nuanced. A 192 GB Mac Studio can fit models that no consumer GPU can touch. But for the 0.5B–7B range where most consumer hardware operates, NVIDIA’s faster compute and mature tooling make it faster per-token. The Mac’s advantage is silence, power efficiency, and the ability to run unattended.
5. Power, Heat, and Noise
Training is a sustained full-load workload. Unlike gaming (which fluctuates), training pegs the GPU at 95–100% utilization for hours or days. Power consumption and thermal output matter.
| Hardware | Training Power Draw | Heat Output | Noise Level | Annual Power Cost (24/7)* |
|---|---|---|---|---|
| MacBook Air M2/M3 | 15–30 W | Warm to touch | Silent (fanless) | $13–26 |
| Mac Mini M4 Pro | 30–50 W | Mild | Near-silent | $26–44 |
| Mac Studio M2 Ultra | 60–120 W | Moderate | Quiet fan hum | $53–105 |
| PC + RTX 3060 | 200–250 W (system) | Hot exhaust | Audible fans | $175–219 |
| PC + RTX 3080 | 300–400 W (system) | Hot exhaust | Loud fans | $263–350 |
| PC + RTX 4090 | 450–600 W (system) | Very hot | Loud fans | $394–526 |
*Estimated at US average $0.10/kWh. Actual costs vary by region.
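The cost column follows from watts alone. A sketch of the calculation (the $0.10/kWh rate is the table’s assumption):

```python
def annual_cost_usd(watts: float, usd_per_kwh: float = 0.10) -> float:
    """Electricity cost of running a load 24/7 for one year."""
    kwh_per_year = watts * 24 * 365 / 1000
    return kwh_per_year * usd_per_kwh

annual_cost_usd(40)   # ~$35/year: an M4 Pro Mac Mini under training load
annual_cost_usd(500)  # ~$438/year: an RTX 4090 system at the same duty cycle
```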
The efficiency gap is enormous. An M4 Pro Mac Mini doing training work at 40W produces roughly the same heat as a light bulb. An RTX 4090 system at 500W is a space heater. For always-on training nodes — the kind a decentralized protocol needs — power efficiency directly impacts profitability. If you’re earning sats for compute, every watt you burn cuts into your margin.
6. “Will It Slow Down My Computer?”
Short answer: it can, but a well-designed protocol won’t let it.
How Background Training Works
Training is a batch process — it processes chunks of data, computes gradients, and submits results. Between batches, the GPU is briefly idle. This creates natural scheduling opportunities:
On macOS:
- The `taskpolicy` utility can set processes to “background” QoS (Quality of Service). Background processes run on Efficiency cores, yield to any user activity, and throttle I/O.
- macOS natively schedules background GPU work at lower priority. A training process can use the GPU when you’re not actively using GPU-intensive apps.
- Unified memory means no VRAM contention — but a training process using 12 GB of a 16 GB system will leave limited headroom for other apps.
On Linux/Windows (NVIDIA):
- NVIDIA’s MPS (Multi-Process Service) allows GPU sharing between processes.
- `nice` and `ionice` control CPU and I/O priority. A training process at niceness +19 will yield to everything.
- CUDA processes can limit GPU utilization (e.g., cap at 50%) to leave headroom for display and other tasks.
- The GPU’s dedicated VRAM means training doesn’t compete with system RAM — but the GPU is effectively occupied. You can’t game while training.
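From inside a training process, the CPU side of this is a single call. A minimal sketch for Unix-like systems using the standard library (the GPU-utilization cap is framework-specific and not shown):

```python
import os

def enter_background_priority() -> int:
    """Drop this process to the lowest CPU scheduling priority.

    os.nice(inc) adds `inc` to the current niceness and returns the result;
    an unprivileged process can raise its niceness (lowering priority) but
    never lower it back. +19 is the conventional floor on Linux.
    """
    try:
        return os.nice(19)  # clamped to the OS maximum niceness
    except OSError:
        return os.nice(0)   # report the current value unchanged
```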
What Users Will Actually Experience
| Situation | Impact |
|---|---|
| Web browsing, email, office apps | No noticeable impact. Training uses GPU; these use CPU. |
| Video calls | Minimal impact. Video encoding uses dedicated hardware, not GPU compute cores. |
| Photo/video editing | Moderate impact on Mac (shared GPU). Minimal on PC (dedicated VRAM). |
| Gaming | Heavy impact on PC (GPU fully occupied). Not applicable on Mac. |
| Running other ML models | Significant impact. Two ML workloads compete for the same resources. |
| System responsiveness | Background priority means the OS always preempts training for user interaction. |
The Protocol’s Role
A well-designed coordinator assigns work appropriate to each peer’s available resources. A MacBook Air gets 0.5B model tasks. A Mac Studio gets 7B tasks. A gaming PC gets burst tasks when the user signals availability. Peers can pause, throttle, or resume based on local system load — the protocol treats this as normal, not as failure. Gradient submissions have deadlines, not uptime requirements.
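Peer-side throttling can be a simple check between batches. A sketch using only the standard library (the core-count threshold is an assumed heuristic, not a protocol rule):

```python
import os
from typing import Optional

def should_yield(load_threshold: Optional[float] = None) -> bool:
    """True when the 1-minute load average suggests the user needs the machine.

    A peer calls this between batches and sleeps instead of starting the
    next one; a deadline-based coordinator treats the pause as normal.
    """
    if load_threshold is None:
        load_threshold = float(os.cpu_count() or 1)
    one_minute_load, _, _ = os.getloadavg()
    return one_minute_load > load_threshold
```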
The goal: training should be like Bitcoin mining was in 2010 — something your computer does in the background, earning sats while you sleep.
Sources
- Benchmarking On-Device Machine Learning on Apple Silicon with MLX — Academic benchmarks of MLX inference on Apple Silicon
- Profiling Apple Silicon Performance for ML Training — Research paper on Apple Silicon training performance gaps
- Apple Silicon vs NVIDIA CUDA: AI Comparison 2025 — ResNet-50 training comparisons, power consumption data
- Apple Silicon GPU Architecture Explained — TFLOPS and bandwidth specs for all Apple Silicon variants
- Profiling LoRA/QLoRA Fine-Tuning on Consumer GPUs — RTX 4060 fine-tuning throughput benchmarks
- MLX LoRA Examples — Official Apple MLX fine-tuning benchmarks (475 tok/s M2 Ultra)
- NVIDIA RTX 4090 for AI and ML — RTX 4090 training capabilities and benchmarks
- GPU Requirements for Running AI Models — VRAM requirements by model size
- How Much VRAM for LLM Fine-Tuning — Memory calculator methodology
- ROCm for Old AMD GPUs — RX 580 ROCm compatibility issues
- Deep Learning with Raspberry Pi — Pi training viability analysis
- LLM Fine-Tuning on RTX 4090: 90% Performance at 55% Power — Power-limited training benchmarks
- Mac Mini M4 Pro Local AI Review — M4 Pro inference and training benchmarks
- Best Local LLMs for Mac 2026 — Model size recommendations by hardware