LLM hardware calculator
Pick any text-generation model on HuggingFace. See what hardware can run it, how much memory it needs, and a single-user decode tok/s estimate. Math is MoE-aware: active params drive speed, full params drive memory.
| hardware | memory | bandwidth | price | fit | ~tok/s (decode, batch=1) | notes |
|---|---|---|---|---|---|---|
how to read this
memory
The total memory figure is weights + KV cache + 10% overhead for OS and framework (PyTorch, CUDA, etc.). The split shows you what each part costs:
- Weights – model parameters multiplied by bytes-per-parameter at the chosen quantization. Q4_K_M is ~0.55 bytes/param for dense models, ~0.65 for MoE (MoE needs extra space for per-expert quantization scales). FP16 is 2.0 bytes/param. Pick Q4 unless you have a reason not to – it's ~1-2% perplexity loss vs FP16, and below Q4 reasoning quality drops fast.
- KV cache – keys and values cached for every token in your context window. Formula:
  2 × layers × kv_heads × head_dim × ctx × batch × dtype_bytes.
  This grows linearly with context length and batch size. At 128K context the KV cache often dwarfs the model weights themselves – that's why long-context use cases hit the memory wall hard (a worked sketch follows this list).
- KV cache is dense even for MoE models. Attention layers are not sparse – only the FFN is. A Qwen3-235B MoE eats the same KV bytes per token as a 235B dense model would. The "MoE saves memory" intuition only applies to weights, not cache.
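A minimal sketch of that memory math in Python. The example numbers (a hypothetical ~70B dense model at Q4, 80 layers, 8 KV heads of dim 128, 32K context) are illustrative rather than a specific model's specs, and applying the 10% overhead multiplicatively is an assumption of the sketch:

```python
def weights_gb(total_params: float, bytes_per_param: float) -> float:
    """Weight memory: parameters x bytes-per-parameter at the chosen quantization."""
    return total_params * bytes_per_param / 1e9

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx: int, batch: int = 1, dtype_bytes: int = 2) -> float:
    """KV cache: 2 (K and V) x layers x kv_heads x head_dim x ctx x batch x dtype_bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * batch * dtype_bytes / 1e9

def total_gb(weights: float, kv: float, overhead: float = 0.10) -> float:
    """Total = (weights + KV cache) plus ~10% OS/framework overhead."""
    return (weights + kv) * (1 + overhead)

# Illustrative: ~70B dense model at Q4_K_M (~0.55 bytes/param), 80 layers,
# 8 KV heads of dim 128 (GQA), FP16 KV cache, 32K context.
w = weights_gb(70e9, 0.55)
kv = kv_cache_gb(n_layers=80, n_kv_heads=8, head_dim=128, ctx=32_768)
print(f"weights {w:.1f} GB + KV {kv:.1f} GB -> total {total_gb(w, kv):.1f} GB")
```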
tok/s estimates
The number under ~tok/s (decode, batch=1) is a single-user, batch=1, short-context ceiling. The math:
tok/s ≈ (memory bandwidth ÷ active-param-bytes) × 0.6
Decode is bandwidth-bound – every token requires reading the active weights from memory once. The × 0.6 factor accounts for real-world overhead (attention reads, framework, kernel launch). MoE models substitute "active params" (the experts actually activated per token) for total params, which is why a 235B-A22B MoE benchmarks closer to a 22B dense model on single-stream decode.
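The same formula as a short sketch; the 273 GB/s bandwidth figure and the 22B-active example are illustrative assumptions, not a measured benchmark:

```python
def decode_tok_s(bandwidth_gb_s: float, active_params: float,
                 bytes_per_param: float, efficiency: float = 0.6) -> float:
    """Bandwidth-bound decode ceiling: bandwidth / active-weight-bytes, scaled by ~0.6."""
    active_bytes_gb = active_params * bytes_per_param / 1e9
    return bandwidth_gb_s / active_bytes_gb * efficiency

# Illustrative: a 235B-A22B MoE at Q4 (~0.65 bytes/param) reads only ~22B active params
# per token, so on ~273 GB/s of bandwidth it decodes like a ~22B dense model.
print(f"{decode_tok_s(bandwidth_gb_s=273, active_params=22e9, bytes_per_param=0.65):.0f} tok/s")
```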
Where the estimate breaks down:
- Long context – KV cache reads start dominating bandwidth alongside weight reads. Decode tok/s drops on the same hardware as context grows (sketched after this list).
- Batched serving – at batch > 1, MoE models route different tokens to different experts. Effective bytes-per-step climbs toward the dense equivalent. Aggregate throughput goes up, but the "active params" speedup erodes.
- Prefill (TTFT) – long prompts activate most experts, so prefill cost is closer to dense than to active params.
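One way to sketch the long-context effect under the same bandwidth-bound assumption (an extension of the formula above, not necessarily how the table's numbers are computed): add the KV-cache bytes each decode step has to re-read to the per-token memory traffic.

```python
def decode_tok_s_long_ctx(bandwidth_gb_s: float, active_weight_gb: float,
                          kv_cache_gb: float, efficiency: float = 0.6) -> float:
    """Decode ceiling when each step also re-reads the KV cache from memory."""
    return bandwidth_gb_s / (active_weight_gb + kv_cache_gb) * efficiency

# Same illustrative 22B-active model (~14.3 GB of active weights) with a ~10.7 GB KV cache:
# the short-context ceiling of ~11 tok/s drops to ~6-7 tok/s on the same 273 GB/s.
print(f"{decode_tok_s_long_ctx(273, active_weight_gb=14.3, kv_cache_gb=10.7):.1f} tok/s")
```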
multi-Spark factors
The DGX Spark rows aren't simple 1× → N× scaling because the interconnect and parallelism strategy matter:
- 2 Sparks (direct cable) – single QSFP cable between two units. Officially supported by NVIDIA's connect-two-sparks playbook. Single-stream decode speed is roughly the same as 1 Spark; the win is fitting bigger models (a fit-check sketch follows this list).
- 3 Sparks (switchless ring) – three cables forming a triangle, no switch. Officially supported by NVIDIA's connect-three-sparks playbook. But this configuration is forced into pipeline parallelism (PP=3) because tensor parallelism requires power-of-2 GPU counts, and PP=3 is sequential – slower than a single Spark on single-stream decode. Public benchmark: ~12-18 tok/s on Qwen3.5-397B-INT4, vs ~37 tok/s on a single Spark for the same model. Buy 3 Sparks for capacity, not speed.
- 4 Sparks (switched) – adds a 200GbE QSFP56-DD switch (MikroTik CRS812, ~$1,050, is the community pick). TP=4 works because 4 divides the attention head counts evenly. With RoCE transport (not TCP – the Grace ARM CPU's TCP stack tops out at ~2 GB/s), measured throughput hits 65 tok/s on Qwen3.5-397B-NVFP4 – 1.88× the single-Spark single-user number, and the breakpoint where clustering finally pays off. ht12's NVIDIA forum thread documents the working setup.
- 8 Sparks (dual-switch) – verified by ericlewis777 with two MikroTik CRS812 switches linked by a 400G uplink. Aggregate throughput dominates: GPT-OSS-120B runs at 5,342 tok/s with 100 concurrent users. Single-stream decode tok/s plateaus – the win at this scale is concurrency, not latency.
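Since clustering here is mostly about capacity, a crude fit check can be sketched as splitting the model's total memory footprint evenly across units. The 128 GB per-unit memory and 300 GB footprint below are illustrative, and real multi-unit splits are never perfectly even:

```python
def fits(total_model_gb: float, unit_memory_gb: float, n_units: int) -> bool:
    """True if an even split of the model's memory footprint fits on each unit."""
    return total_model_gb / n_units <= unit_memory_gb

# Illustrative: a ~300 GB footprint (weights + KV + overhead) on 128 GB units.
for n in (1, 2, 3, 4):
    print(f"{n} unit(s): {'fits' if fits(300, 128, n) else 'does not fit'}")
```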
data sources
- Model architecture – fetched live from each model's `config.json` on HuggingFace. Repos must be public and have a `config.json` on the main branch. Quantized release repos (GGUF, AWQ, GPTQ, MLX, etc.) often don't include the original config – pick the upstream FP16 repo from the same org instead. Gated repos (Meta's Llama models, etc.) require auth and won't load – try ungated mirrors like NousResearch.
- Total parameter count – from HuggingFace's `safetensors.total` field when available, otherwise computed from architecture (vocab × hidden + per-layer attention + per-expert FFN, summed across layers and experts).
- Active parameter count for MoE – computed from architecture (embedding + all attention + active-experts × per-expert FFN); a sketch of this count follows the list. Not authoritatively reported by HuggingFace. Cross-check with the model card if precision matters; this estimate is typically within ±10% of published "Ax" figures.
- Hardware bandwidth and memory – manufacturer specs (Apple, NVIDIA, AMD). The list is curated and not exhaustive. Multi-Spark configurations and RoCE behavior come from NVIDIA's official playbooks and public community benchmarks linked above.
- Quantization bytes-per-param – derived from llama.cpp / GGUF reference sizes (Q4_K_M is ~4.5 bits/param for dense, ~5.2 bits/param for MoE due to per-expert scale tables).
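A sketch of the architecture-based parameter counts described above. The config keys follow common HuggingFace conventions (hidden_size, num_hidden_layers, num_local_experts, ...) but vary across architectures, and the layer math (SwiGLU-style FFN, tied embeddings, no norms/biases) is a simplification, so treat it as illustrative rather than the calculator's exact formula:

```python
def param_counts(cfg: dict) -> tuple[int, int]:
    """Return (total_params, active_params) from a config.json-like dict."""
    h = cfg["hidden_size"]
    layers = cfg["num_hidden_layers"]
    vocab = cfg["vocab_size"]
    n_heads = cfg["num_attention_heads"]
    n_kv = cfg.get("num_key_value_heads", n_heads)
    head_dim = cfg.get("head_dim", h // n_heads)
    inter = cfg["intermediate_size"]
    n_experts = cfg.get("num_local_experts", 1)         # 1 means a plain dense FFN
    active_experts = cfg.get("num_experts_per_tok", n_experts)

    embed = vocab * h                                    # input embedding (tied lm_head assumed)
    # per layer: q/k/v projections ((n_heads + 2*n_kv)*head_dim*h) plus the o projection
    attn = layers * ((n_heads + 2 * n_kv) * head_dim * h + n_heads * head_dim * h)
    ffn_per_expert = 3 * h * inter                       # gate/up/down, SwiGLU-style
    total = embed + attn + layers * n_experts * ffn_per_expert
    active = embed + attn + layers * active_experts * ffn_per_expert
    return total, active

# Hypothetical MoE config: 48 layers, 8 of 64 experts active per token.
cfg = dict(hidden_size=4096, num_hidden_layers=48, vocab_size=150_000,
           num_attention_heads=32, num_key_value_heads=8, intermediate_size=1536,
           num_local_experts=64, num_experts_per_tok=8)
total, active = param_counts(cfg)
print(f"total ~{total / 1e9:.1f}B, active ~{active / 1e9:.1f}B")
```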
Got a chip we should add or a number that's wrong? PR welcome.