LLM hardware calculator

Pick any text-generation model on HuggingFace. See what hardware can run it, how much memory it needs, and a single-user decode tok/s estimate. Math is MoE-aware: active params drive speed, full params drive memory.

    Pick a model to populate the results table. Its columns: hardware, memory, bandwidth, price, fit, ~tok/s (decode, batch=1), and notes.

    how to read this

    memory

    The total memory figure is weights + KV cache + 10% overhead for the OS and framework (PyTorch, CUDA, etc.). The split shows what each part costs: weights scale with the full parameter count and quantization, the KV cache with context length.
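
    A sketch of that arithmetic (the function, its parameters, and the KV-cache formula are assumptions on my part, not the calculator's actual code; the KV term is the standard per-token K+V size for grouped-query attention):

    ```python
    def estimate_memory_gb(
        total_params_b: float,   # FULL parameter count in billions (MoE: all experts)
        bytes_per_param: float,  # 2.0 for fp16/bf16, ~0.5 for 4-bit quants
        n_layers: int,
        n_kv_heads: int,
        head_dim: int,
        context_len: int,
        kv_bytes: float = 2.0,   # fp16/bf16 KV cache
    ) -> float:
        weights_gb = total_params_b * bytes_per_param  # 1e9 params x bytes/param = GB
        kv_gb = 2 * n_layers * n_kv_heads * head_dim * context_len * kv_bytes / 1e9
        return (weights_gb + kv_gb) * 1.10             # +10% OS/framework overhead
    ```

    For a Llama-3-70B-shaped model (80 layers, 8 KV heads, head dim 128) at fp16 with an 8K context, that works out to 140 GB of weights plus ~2.7 GB of KV cache, or ≈ 157 GB after overhead.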

    tok/s estimates

    The number under ~tok/s (decode, batch=1) is a single-user, short-context ceiling, not a serving-throughput figure. The math:

    tok/s ≈ (memory bandwidth ÷ active-param-bytes) × 0.6

    Decode is bandwidth-bound: every token requires reading the active weights from memory once. The × 0.6 factor accounts for real-world overhead (attention reads, framework, kernel launches). MoE models substitute "active params" (the experts actually activated per token) for total params, which is why a 235B-A22B MoE benchmarks closer to a 22B dense model on single-stream decode.
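
    The same estimate as a sketch (names are illustrative; 273 GB/s in the example is DGX Spark's quoted memory bandwidth):

    ```python
    def decode_tok_s(bandwidth_gb_s: float,
                     active_params_b: float,   # MoE: ACTIVE params, not total
                     bytes_per_param: float,
                     efficiency: float = 0.6) -> float:
        """Single-user, batch=1, short-context decode ceiling."""
        gb_read_per_token = active_params_b * bytes_per_param
        return bandwidth_gb_s / gb_read_per_token * efficiency

    # 22B active params at 4-bit (~0.5 bytes/param) on 273 GB/s:
    # 273 / (22 * 0.5) * 0.6 = ~14.9 tok/s
    print(decode_tok_s(273, 22, 0.5))
    ```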

    Where the estimate breaks down: long contexts (every decoded token also reads the KV cache, which grows with sequence length), batching or multi-user serving (decode shifts toward compute-bound), and prefill (compute-bound from the start, so time-to-first-token tracks FLOPs, not bandwidth). The first case is sketched below.
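
    One way to model the long-context case, folding KV-cache reads into the per-token traffic (a rough extension of the formula above, not the calculator's math):

    ```python
    def decode_tok_s_at_context(bandwidth_gb_s: float,
                                active_params_b: float,
                                bytes_per_param: float,
                                kv_gb_per_token: float,  # 2*layers*kv_heads*head_dim*kv_bytes / 1e9
                                context_len: int,
                                efficiency: float = 0.6) -> float:
        weight_read_gb = active_params_b * bytes_per_param
        kv_read_gb = kv_gb_per_token * context_len  # the whole cache is read every token
        return bandwidth_gb_s / (weight_read_gb + kv_read_gb) * efficiency
    ```

    For the 70B shape above (~0.00033 GB of KV per token), the KV term is negligible at 8K context but adds ~43 GB of reads per token at 128K, and the ceiling drops accordingly.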

    multi-Spark factors

    The DGX Spark rows aren't simple 1× → N× scaling because the interconnect and parallelism strategy matter: sharding a model across boxes cuts per-device weight reads, but cross-device communication (e.g., the per-layer all-reduces of tensor parallelism) adds latency that doesn't shrink as devices are added (see the sketch below).
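
    A toy model of the tensor-parallel case (all numbers and names here are assumed for illustration; the calculator's actual multi-Spark factors may differ):

    ```python
    def tp_decode_tok_s(n_devices: int,
                        bandwidth_gb_s: float,
                        active_params_b: float,
                        bytes_per_param: float,
                        n_layers: int,
                        allreduce_s: float = 20e-6,  # ASSUMED per-layer all-reduce latency
                        efficiency: float = 0.6) -> float:
        # Weights are sharded: each device reads 1/N of the active params per token...
        read_s = (active_params_b * bytes_per_param / n_devices) / bandwidth_gb_s
        # ...but every layer adds at least one all-reduce whose cost doesn't shrink with N.
        comm_s = n_layers * allreduce_s
        return 1.0 / (read_s / efficiency + comm_s)
    ```

    Even a generous 20 µs per-layer all-reduce costs ~1.2 ms per token on a 60-layer model, which eats into the bandwidth win from sharding, and Ethernet-class interconnects typically sit higher than that.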

    data sources

    Got a chip we should add or a number that's wrong? PR welcome.