LLM hardware calculator
Pick any text-generation model on HuggingFace. See what hardware can run it, how much memory it needs, and a single-user decode tok/s estimate. Math is MoE-aware: active params drive speed, full params drive memory.
| hardware | memory | bandwidth | price | fit | ~tok/s (decode, batch=1) | notes |
|---|---|---|---|---|---|---|
how to read this
memory
The total memory figure is weights + KV cache + 10% overhead for OS and framework (PyTorch, CUDA, etc.). The split shows you what each part costs:
- Weights – model parameters multiplied by bytes-per-parameter at the chosen quantization. Q4_K_M is ~0.55 bytes/param for dense models, ~0.65 for MoE (MoE needs extra space for per-expert quantization scales). FP16 is 2.0 bytes/param. Pick Q4 unless you have a reason not to – it's ~1-2% perplexity loss vs FP16, and below Q4 reasoning quality drops fast.
- KV cache – keys and values cached for every token in your context window. Formula:
  2 × layers × kv_heads × head_dim × ctx × batch × dtype_bytes.
  This grows linearly with context length and batch size. At 128K context the KV cache often dwarfs the model weights themselves – that's why long-context use cases hit the memory wall hard (a worked sketch follows this list).
- KV cache is dense even for MoE models. Attention layers are not sparse – only the FFN is. A Qwen3-235B MoE eats the same KV bytes per token as a 235B dense model would. The "MoE saves memory" intuition only applies to weights, not cache.
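A minimal sketch of that memory math in Python. The example numbers (a hypothetical ~70B dense model at Q4, 80 layers, 8 KV heads of dim 128, 32K context) are illustrative rather than a specific model's specs, and applying the 10% overhead multiplicatively is an assumption of the sketch:

```python
def weights_gb(total_params: float, bytes_per_param: float) -> float:
    """Weight memory: parameters x bytes-per-parameter at the chosen quantization."""
    return total_params * bytes_per_param / 1e9

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx: int, batch: int = 1, dtype_bytes: int = 2) -> float:
    """KV cache: 2 (K and V) x layers x kv_heads x head_dim x ctx x batch x dtype_bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * batch * dtype_bytes / 1e9

def total_gb(weights: float, kv: float, overhead: float = 0.10) -> float:
    """Total = (weights + KV cache) plus ~10% OS/framework overhead."""
    return (weights + kv) * (1 + overhead)

# Illustrative: ~70B dense model at Q4_K_M (~0.55 bytes/param), 80 layers,
# 8 KV heads of dim 128 (GQA), FP16 KV cache, 32K context.
w = weights_gb(70e9, 0.55)
kv = kv_cache_gb(n_layers=80, n_kv_heads=8, head_dim=128, ctx=32_768)
print(f"weights {w:.1f} GB + KV {kv:.1f} GB -> total {total_gb(w, kv):.1f} GB")
```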
tok/s estimates
The number under ~tok/s (decode, batch=1) is a single-user, batch=1, short-context ceiling. The math:
tok/s ≈ (memory bandwidth ÷ active-param-bytes) × 0.6
Decode is bandwidth-bound – every token requires reading the active weights from memory once. The × 0.6 factor accounts for real-world overhead (attention reads, framework, kernel launch). MoE models substitute "active params" (the experts actually activated per token) for total params, which is why a 235B-A22B MoE benchmarks closer to a 22B dense model on single-stream decode.
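The same formula as a short sketch; the 273 GB/s bandwidth figure and the 22B-active example are illustrative assumptions, not a measured benchmark:

```python
def decode_tok_s(bandwidth_gb_s: float, active_params: float,
                 bytes_per_param: float, efficiency: float = 0.6) -> float:
    """Bandwidth-bound decode ceiling: bandwidth / active-weight-bytes, scaled by ~0.6."""
    active_bytes_gb = active_params * bytes_per_param / 1e9
    return bandwidth_gb_s / active_bytes_gb * efficiency

# Illustrative: a 235B-A22B MoE at Q4 (~0.65 bytes/param) reads only ~22B active params
# per token, so on ~273 GB/s of bandwidth it decodes like a ~22B dense model.
print(f"{decode_tok_s(bandwidth_gb_s=273, active_params=22e9, bytes_per_param=0.65):.0f} tok/s")
```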
Where the estimate breaks down:
- Long context – KV cache reads start dominating bandwidth alongside weight reads. Decode tok/s drops on the same hardware as context grows (sketched after this list).
- Batched serving – at batch > 1, MoE models route different tokens to different experts. Effective bytes-per-step climbs toward the dense equivalent. Aggregate throughput goes up, but the "active params" speedup erodes.
- Prefill (TTFT) – long prompts activate most experts, so prefill cost is closer to dense than to active params.
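One way to sketch the long-context effect under the same bandwidth-bound assumption (an extension of the formula above, not necessarily how the table's numbers are computed): add the KV-cache bytes each decode step has to re-read to the per-token memory traffic.

```python
def decode_tok_s_long_ctx(bandwidth_gb_s: float, active_weight_gb: float,
                          kv_cache_gb: float, efficiency: float = 0.6) -> float:
    """Decode ceiling when each step also re-reads the KV cache from memory."""
    return bandwidth_gb_s / (active_weight_gb + kv_cache_gb) * efficiency

# Same illustrative 22B-active model (~14.3 GB of active weights) with a ~10.7 GB KV cache:
# the short-context ceiling of ~11 tok/s drops to ~6-7 tok/s on the same 273 GB/s.
print(f"{decode_tok_s_long_ctx(273, active_weight_gb=14.3, kv_cache_gb=10.7):.1f} tok/s")
```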
multi-Spark factors
The DGX Spark rows aren't simple 1× → N× scaling because the interconnect and parallelism strategy matter:
- 2 Sparks (direct cable) – single QSFP cable between two units. Officially supported by NVIDIA's connect-two-sparks playbook. Single-stream decode speed is roughly the same as 1 Spark; the win is fitting bigger models (a fit-check sketch follows this list).
- 3 Sparks (switchless ring) – three cables forming a triangle, no switch. Officially supported by NVIDIA's connect-three-sparks playbook. But this configuration is forced into pipeline parallelism (PP=3) because tensor parallelism requires power-of-2 GPU counts, and PP=3 is sequential – slower than a single Spark on single-stream decode. Public benchmark: ~12-18 tok/s on Qwen3.5-397B-INT4, vs ~37 tok/s on a single Spark for the same model. Buy 3 Sparks for capacity, not speed.
- 4 Sparks (switched) – adds a 200GbE QSFP56-DD switch (MikroTik CRS812, ~$1,050, is the community pick). TP=4 works because 4 divides the attention head counts evenly. With RoCE transport (not TCP – the Grace ARM CPU's TCP stack tops out at ~2 GB/s), measured throughput hits 65 tok/s on Qwen3.5-397B-NVFP4 – 1.88× the single-Spark single-user number, and the breakpoint where clustering finally pays off. ht12's NVIDIA forum thread documents the working setup.
- 8 Sparks (dual-switch) – verified by ericlewis777 with two MikroTik CRS812 switches linked by a 400G uplink. Aggregate throughput dominates: GPT-OSS-120B runs at 5,342 tok/s with 100 concurrent users. Single-stream decode tok/s plateaus – the win at this scale is concurrency, not latency.
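Since clustering here is mostly about capacity, a crude fit check can be sketched as splitting the model's total memory footprint evenly across units. The 128 GB per-unit memory and 300 GB footprint below are illustrative, and real multi-unit splits are never perfectly even:

```python
def fits(total_model_gb: float, unit_memory_gb: float, n_units: int) -> bool:
    """True if an even split of the model's memory footprint fits on each unit."""
    return total_model_gb / n_units <= unit_memory_gb

# Illustrative: a ~300 GB footprint (weights + KV + overhead) on 128 GB units.
for n in (1, 2, 3, 4):
    print(f"{n} unit(s): {'fits' if fits(300, 128, n) else 'does not fit'}")
```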
data sources
- Model architecture – fetched live from each model's `config.json` on HuggingFace. Repos must be public and have a `config.json` on the main branch. Quantized release repos (GGUF, AWQ, GPTQ, MLX, etc.) often don't include the original config – pick the upstream FP16 repo from the same org instead. Gated repos (Meta's Llama models, etc.) require auth and won't load – try ungated mirrors like NousResearch.
- Total parameter count – from HuggingFace's `safetensors.total` field when available, otherwise computed from architecture (vocab × hidden + per-layer attention + per-expert FFN, summed across layers and experts).
- Active parameter count for MoE – computed from architecture (embedding + all attention + active-experts × per-expert FFN); a sketch of this count follows the list. Not authoritatively reported by HuggingFace. Cross-check with the model card if precision matters; this estimate is typically within ±10% of published "Ax" figures.
- Hardware bandwidth and memory – manufacturer specs (Apple, NVIDIA, AMD). The list is curated and not exhaustive. Multi-Spark configurations and RoCE behavior come from NVIDIA's official playbooks and public community benchmarks linked above.
- Quantization bytes-per-param – derived from llama.cpp / GGUF reference sizes (Q4_K_M is ~4.5 bits/param for dense, ~5.2 bits/param for MoE due to per-expert scale tables).
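A sketch of the architecture-based parameter counts described above. The config keys follow common HuggingFace conventions (hidden_size, num_hidden_layers, num_local_experts, ...) but vary across architectures, and the layer math (SwiGLU-style FFN, tied embeddings, no norms/biases) is a simplification, so treat it as illustrative rather than the calculator's exact formula:

```python
def param_counts(cfg: dict) -> tuple[int, int]:
    """Return (total_params, active_params) from a config.json-like dict."""
    h = cfg["hidden_size"]
    layers = cfg["num_hidden_layers"]
    vocab = cfg["vocab_size"]
    n_heads = cfg["num_attention_heads"]
    n_kv = cfg.get("num_key_value_heads", n_heads)
    head_dim = cfg.get("head_dim", h // n_heads)
    inter = cfg["intermediate_size"]
    n_experts = cfg.get("num_local_experts", 1)         # 1 means a plain dense FFN
    active_experts = cfg.get("num_experts_per_tok", n_experts)

    embed = vocab * h                                    # input embedding (tied lm_head assumed)
    # per layer: q/k/v projections ((n_heads + 2*n_kv)*head_dim*h) plus the o projection
    attn = layers * ((n_heads + 2 * n_kv) * head_dim * h + n_heads * head_dim * h)
    ffn_per_expert = 3 * h * inter                       # gate/up/down, SwiGLU-style
    total = embed + attn + layers * n_experts * ffn_per_expert
    active = embed + attn + layers * active_experts * ffn_per_expert
    return total, active

# Hypothetical MoE config: 48 layers, 8 of 64 experts active per token.
cfg = dict(hidden_size=4096, num_hidden_layers=48, vocab_size=150_000,
           num_attention_heads=32, num_key_value_heads=8, intermediate_size=1536,
           num_local_experts=64, num_experts_per_tok=8)
total, active = param_counts(cfg)
print(f"total ~{total / 1e9:.1f}B, active ~{active / 1e9:.1f}B")
```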
Got a chip we should add or a number that's wrong? PR welcome.