Local AI benchmarks
Dated local-model runs and cloud comparison profiles with hardware, runtime, model, quantization, context, pass rate, and next experiments.
Last updated
Local Inference Benchmarks
1. Current snapshot
This page tracks measured local AI runs. It is not a vendor leaderboard. Each row should keep the date, hardware, runtime, model, quantization, context, pass rate, and exact benchmark conditions together so progress is visible over time. Cloud profiles are tracked as comparison lanes, not local hardware lanes. Comparable result tables are sorted by best performer first.
| Lane / profile | Why | Status |
|---|---|---|
| qwen3-coder-30b.gguf on DGX Spark llama.cpp | Fastest measured Spark path; passed 14/14 across smoke, code, question, and wiki suites. | Default Spark profile target |
| qwen3-coder:30b on DGX Spark Ollama | Passed practical checks except long-prefill timeout, but is much slower than llama.cpp for the same model bits. | Model management / fallback |
| qwen3-coder-next:q4_K_M on DGX Spark Ollama | Passed all daily-agent tasks but is slower. Keep it for harder coding-quality tests. | Quality candidate |
| deepseek-v4-flash through ds4 on M5 Max | Good Apple-side engine, but weak strict JSON, citation, and abstention gates in this harness. | Side engine |
| gemini-fast and gemini-pro via M5 | Cloud comparison and escalation lanes for the same practical tasks. Track wall time, tool validity, context used, and cost separately. | Comparator, not local inference |
One open caveat: the Spark Ollama container reports size_vram=0 in /api/ps (see the long-context status section). Treat Ollama as the convenience/model-management path until Spark shell verification and direct llama.cpp/SGLang/vLLM runs cover the same models.
2. How to read the numbers
The benchmark separates three layers. Raw tokens per second is useful, but local coding agents should be judged on task success first.
| Layer | Measures | Use for |
|---|---|---|
| Engine | Prefill tok/s, generation tok/s, memory, context/KV limits | MLX, ds4, llama.cpp tuning |
| Server/API | End-to-end latency, TTFT, cache behavior, HTTP errors | Ollama, ds4-server, mlx_lm.server |
| Agent loop | Pass rate, wall time, tool validity, stalls, files/tests | Pi, Claude, Codex, OpenAI-compatible side profiles |
| Cloud comparator | Pass rate, wall time, TTFT, cost/task, tool validity, context used | Gemini side profiles measured beside local profiles |
Promote a profile only when it improves correctness, tool-call validity, and stall behavior. A fast chat response is not enough. Do not compare Gemini token/s directly against ds4 or Spark CUDA as if all three were local hardware runtimes.
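To make that rule concrete, here is a minimal promotion-gate sketch. The ProfileSummary fields are hypothetical aggregates rather than harness output; the point is that pass rate, tool-call validity, and stall behavior gate before latency is allowed to decide.

```python
from dataclasses import dataclass

@dataclass
class ProfileSummary:
    # Hypothetical aggregate fields; the real harness may name or compute these differently.
    name: str
    pass_rate: float           # fraction of tasks passed, e.g. 14/14 -> 1.0
    tool_call_validity: float  # fraction of tool calls that parsed and validated
    stall_count: int           # runs that hung or needed manual interruption
    median_wall_ms: float      # excludes the long-context stress task

def should_promote(candidate: ProfileSummary, incumbent: ProfileSummary) -> bool:
    """Promote only when correctness, tool validity, and stall behavior do not regress."""
    if candidate.pass_rate < incumbent.pass_rate:
        return False
    if candidate.tool_call_validity < incumbent.tool_call_validity:
        return False
    if candidate.stall_count > incumbent.stall_count:
        return False
    # Only after the quality gates hold does latency break the tie.
    return candidate.median_wall_ms < incumbent.median_wall_ms
```

Under this rule the Spark llama.cpp profile promotes over the Spark Ollama default: it holds the full pass rate first and only then wins on wall time.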
3. Run registry
Individual measured profile rows are ordered by current benchmark standing. Aggregate run rows stay last because they describe a task set, not one performer.
| Date | Hardware | Runtime | Model | Quant / params | Context | Notes |
|---|---|---|---|---|---|---|
| 2026-05-10 | DGX Spark | llama-server CUDA over SSH tunnel | qwen3-coder-30b.gguf | Ollama GGUF blob, thinking disabled | 32K | Fastest measured Spark path; long-context stress completed. |
| 2026-05-10 | DGX Spark | Ollama over SSH tunnel | qwen3-coder:30b | Q4_K_M, 30.5B | 32K | Best current Spark Ollama default, but not the fastest Spark path. |
| 2026-05-10 | DGX Spark | Ollama over SSH tunnel | qwen3-coder-next:q4_K_M | Q4_K_M, 79.7B, think=false | 32K | Passed all daily-agent tasks; slower than qwen3-coder:30b. |
| 2026-05-10 | DGX Spark | Ollama over SSH tunnel | qwen3.6:35b | Q4_K_M, 36.0B, think=false | 32K | Slower; strict JSON failed under 256-token cap. |
| 2026-05-10 | M5 Max 128GB | ds4-server | deepseek-v4-flash | q2 local side engine | server default, 100K advertised | Fast enough on long-prefill; strict JSON unreliable. |
| 2026-05-10 | All current local targets | practical suites | ds4, Spark Ollama, Spark llama.cpp | built-in smoke/code/question/wiki tasks | 32K where applicable | Spark llama.cpp passed 14/14 with 1.15 s median wall time. |
4. Daily-agent task results
Median wall time below excludes the long-context stress task. Rows are sorted by total successful daily-agent attempts, then lower wall time.
| Model / runtime | exact-ok | short-tech | JSON plan | code-debug | Verdict |
|---|---|---|---|---|---|
| qwen3-coder-30b.gguf / Spark llama.cpp | 114 ms, 3/3 | 1839 ms, 3/3 | 2632 ms, 3/3 | 1065 ms, 3/3 | Fastest measured Spark profile target. |
| qwen3-coder:30b / Spark Ollama | 302 ms, 3/3 | 8894 ms, 3/3 | 13162 ms, 3/3 | 5496 ms, 3/3 | Best current Spark Ollama default; slower than llama.cpp. |
| qwen3-coder-next:q4_K_M / Spark Ollama | 1093 ms, 3/3 | 13550 ms, 3/3 | 23970 ms, 3/3 | 12747 ms, 3/3 | Keep for quality tests, not latency. |
| qwen3.6:35b / Spark Ollama | 969 ms, 3/3 | 22143 ms, 3/3 | 24795 ms, 0/3 | 7019 ms, 3/3 | Not current coding-agent default. |
| deepseek-v4-flash / ds4 | 2049 ms, 3/3 | 11538 ms, 3/3 | 14585 ms, 0/3 | 14692 ms, 2/3 | Good Apple ds4 baseline; weak strict JSON. |
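The ordering of the table above can be expressed as a small sketch, assuming a hypothetical per-row mapping from task name to (median wall ms, successes out of 3). The long-context stress task is dropped before the median is taken, and rows sort by total successes first, then by lower median wall time.

```python
import statistics

# Hypothetical row shape; numbers copied from the Spark Ollama row above.
rows = {
    "qwen3-coder:30b / Spark Ollama": {
        "exact-ok": (302, 3),
        "short-tech": (8894, 3),
        "json-plan": (13162, 3),
        "code-debug": (5496, 3),
        "long-context": (240_000, 0),  # practical-run timeout; excluded from the daily-agent median
    },
}

def row_key(tasks: dict) -> tuple:
    daily = {name: v for name, v in tasks.items() if name != "long-context"}
    total_passes = sum(passes for _, passes in daily.values())
    median_wall = statistics.median(wall for wall, _ in daily.values())
    # More successes first, then lower median wall time.
    return (-total_passes, median_wall)

ordered = sorted(rows.items(), key=lambda item: row_key(item[1]))
```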
5. Practical suites
The expanded harness now runs four suites: smoke, code, question, and wiki. Rows are sorted by overall pass count, then lower median wall time.
| Stack | Model | Overall pass | Median wall | Total wall | Notes |
|---|---|---|---|---|---|
| Spark llama.cpp | qwen3-coder-30b.gguf | 14/14 | 1151 ms | 16.4 s | Fastest and cleanest current local profile target. |
| Spark Ollama | qwen3-coder:30b | 13/14 | 5471 ms | 337.9 s | Passed practical checks but timed out on long-prefill at 240 s. |
| ds4 | deepseek-v4-flash | 9/14 | 5853 ms | 81.3 s | Failed strict JSON, citation, and abstention-style gates. |
| Stack | Smoke | Code | Question | Wiki |
|---|---|---|---|---|
| Spark llama.cpp | 5/5 | 3/3 | 3/3 | 3/3 |
| Spark Ollama | 4/5 | 3/3 | 3/3 | 3/3 |
| ds4 | 3/5 | 3/3 | 1/3 | 2/3 |
6. Cloud comparison lane
Gemini belongs in this registry as an explicit cloud lane beside the local stack. Run it from the M5, keep Spark focused on local CUDA serving, and publish cloud metrics separately from local token-throughput metrics.
| Date added | Profile | Runtime path | Model | Thinking | Use |
|---|---|---|---|---|---|
| 2026-05-10 | gemini-fast | Gemini CLI or OpenAI-compatible API from M5 | gemini-3-flash-preview | low | Fast cloud coding/chat comparator. |
| 2026-05-10 | gemini-pro | Gemini CLI or OpenAI-compatible API from M5 | gemini-3.1-pro-preview-customtools | medium | Hard coding-agent tasks with custom tools. |
| 2026-05-10 | gemini-pro-deep | Gemini CLI or API from M5 | gemini-3.1-pro-preview | high | Deep repo review, architecture, and wiki synthesis. |
| 2026-05-10 | gemini-lite | Gemini CLI or API from M5 | gemini-3.1-flash-lite | minimal or low | Routing, extraction, smoke tests, and lightweight chat. |
| Metric | Publish for Gemini | Why separate |
|---|---|---|
| Task outcome | Pass/fail, grader reason, final-state correctness | Closest match to local practical-suite scoring. |
| Latency | Wall time and first-token latency | Network/API time is part of user experience. |
| Tool behavior | Tool-call validity, malformed calls, stalls, drops | Agent reliability matters more than raw chat speed. |
| Cost/context | Cost per task, input tokens, output tokens, context used | Cloud comparison needs budget data that local runs do not. |
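The cost-per-task column is simple token arithmetic. A sketch with placeholder per-million-token prices (not actual Gemini rates); the published row should carry the provider-billed numbers instead.

```python
def cost_per_task(input_tokens: int, output_tokens: int,
                  usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Cost of one task from token usage and per-million-token prices."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000

# Placeholder prices and token counts, for illustration only.
print(round(cost_per_task(12_000, 1_500, usd_per_m_input=0.50, usd_per_m_output=3.00), 4))
# 12,000 * 0.50/M = 0.006 plus 1,500 * 3.00/M = 0.0045, so 0.0105 USD for the task.
```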
7. Spark token rates
Rows in each runtime table are sorted by median generation rate. Ollama response metrics use qwen3-coder:30b with OLLAMA_NUM_CTX=32768.
| Task | Median prefill | Median generation |
|---|---|---|
| exact-ok | 295.2 t/s | 71.1 t/s |
| short-technical | 214.8 t/s | 19.2 t/s |
| code-debug | 156.5 t/s | 18.9 t/s |
| json-plan | 177.2 t/s | 18.0 t/s |
llama.cpp response metrics use the same model bits, copied into qwen3-coder-30b.gguf, with 32K context.
| Task | Median prefill | Median generation |
|---|---|---|
| exact-ok | 226.6 t/s | 157.1 t/s |
| short-technical | 353.9 t/s | 91.5 t/s |
| code-debug | 66.1 t/s | 90.2 t/s |
| json-plan | 535.0 t/s | 90.1 t/s |
| long-prefill-summary | 2753.8 t/s | 64.3 t/s |
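All of these rates reduce to tokens divided by phase time. A minimal sketch with placeholder argument names; llama-server and Ollama both return per-phase token counts and durations in their response metadata, though under different keys.

```python
def phase_rates(prompt_tokens: int, prompt_ms: float,
                generated_tokens: int, generation_ms: float) -> tuple[float, float]:
    """Return (prefill tok/s, generation tok/s) from per-phase token counts and times."""
    prefill = prompt_tokens / (prompt_ms / 1000.0)
    generation = generated_tokens / (generation_ms / 1000.0)
    return prefill, generation

# Purely illustrative numbers, not a measured run.
print(phase_rates(prompt_tokens=8_192, prompt_ms=4_000.0,
                  generated_tokens=256, generation_ms=4_000.0))
# (2048.0, 64.0)
```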
8. M5 Max 128GB reference data
The home AI wiki has a separate M5 Max 128GB thread. Keep these rows separate from the daily-agent registry because they mix local measurements, community MLX runs, and vendor claims.
Locally measured ds4 on M5 Max 128GB
Measured ds4 rows are sorted by generation rate; tiny-output overhead rows are last.
| Run | Context / flags | Prefill | Generation | Notes |
|---|---|---|---|---|
| 256-token technical answer | --ctx 32768 --nothink --temp 0 -n 256 | 61.21 t/s | 38.57 t/s | Best local ds4 generation baseline. |
| Same prompt, warm weights | --warm-weights | 76.92 t/s | 37.81 t/s | Warm weights helped prefill, not decode. |
| Short reply with ok | --ctx 32768 --nothink --temp 0 -n 16 | 28.96 t/s | 8.93 t/s | Tiny output; overhead dominates. |
| Layer | Observed warm latency | Interpretation |
|---|---|---|
| Raw ds4 Anthropic messages | ~121 ms | Endpoint-only tiny prompt. |
| Raw ds4 OpenAI chat | ~123 ms | Endpoint-only tiny prompt. |
| claude-ds4 bare/no-tools | ~2.0 s | CLI overhead plus local endpoint. |
| codex-ds4 coding profile | ~25-26 s | Full agent prompt plus tool schemas; tiny smoke prompt reported ~12,382 prompt tokens. |
Community MLX / GGUF data for M5 Max 128GB
Community rows are sorted by reported generation throughput. They differ in model, context, and source, so they should stay labeled rather than merged into the local harness ranking.
| Model / runtime | Context | Prefill | Generation | Peak memory | Source type |
|---|---|---|---|---|---|
| gpt-oss-120b-MXFP4-Q8 / MLX | 1K | - | 84.5 TG t/s | - | oMLX community |
| gpt-oss-120b-MXFP4-Q8 / MLX | 4K | - | 79.6 TG t/s | - | oMLX community |
| gpt-oss-120b-MXFP4-Q8 / MLX | 8K | - | 73.5 TG t/s | - | oMLX community |
| Qwen3-Coder-Next 4-bit / MLX | 32K | 2,434 PP t/s | 61.2 TG t/s | 46.5GB | oMLX community |
| MiniMax-M2.7-style 4-bit / MLX | 1K | - | 55-57 TG t/s | ~87GB | oMLX community |
| MiniMax-M2.7-style 4-bit / MLX | 8K | - | ~40 TG t/s | ~89GB | oMLX community |
| gpt-oss-120b-MXFP4-Q8 / MLX | 32K | 1,368 PP t/s | 39.0 TG t/s | 60.6GB | oMLX community |
| LLaMA 7B F16 / llama.cpp Metal | benchmark default | 1018.30 t/s | 37.58 t/s | - | llama.cpp community |
| MiniMax-M2.7-style 4-bit / MLX | 16K | - | 29-32 TG t/s | ~91GB | oMLX community |
| Qwen3-Coder-Next 4-bit / MLX | 64K | - | 28.7-29.9 TG t/s | - | oMLX community |
| MiniMax-M2.7-style 4-bit / MLX | 32K | - | 20-21 TG t/s | ~95GB | oMLX community |
| MiniMax-M2.7-style 4-bit / MLX | 64K | - | ~13.6 TG t/s | ~103GB | oMLX community |
Vendor / planning data
| Claim | Number | Use it for |
|---|---|---|
| Ollama MLX preview on Qwen3.5-35B-A3B | Prefill 1154 -> 1810 t/s, decode 58 -> 112 t/s; projected int4 1851 prefill / 134 decode | API-path target to beat with direct MLX. |
| Planning estimate: Qwen3.6-35B-A3B on M5 Max MLX | ~70-80 tok/s | Strategy estimate, not a local harness measurement. |
| Apple MLX 30B MoE on MacBook Pro | Sub-3-second TTFT | Interactive MoE expectation. |
| Planning estimate: 70B dense Q4 on M5 Max | ~12-20 tok/s | Usable chat, probably slow for coding-agent loops. |
| Apple MLX M5 vs M4 | Up to 4x faster TTFT; decode only 1.19-1.27x better | Explains why prefill improves more than generation. |
| M5 Max hardware baseline | 128GB unified memory, 614GB/s memory bandwidth | Capacity and decode-speed planning. |
External references for the non-local rows: Apple MLX M5 research, Ollama MLX preview, llama.cpp Apple Silicon benchmark thread, and oMLX community runs go1vd8aj, m1wd0ucw, r9m8lvr3.
9. Long-context status
Long-context behavior is not resolved for Spark Ollama yet. A long-prefill task timed out at 180 seconds with OLLAMA_NUM_CTX=32768; the expanded practical run timed out at 240 seconds. Before setting num_ctx, the model loaded with a 262144-token context and also stalled.
Spark llama.cpp completed the standalone long-prefill stress task in 5596 ms with 2753.8 prefill t/s. Apple ds4 completed the same stress task in the 20-second range, but only passed 1/3 because some outputs were too terse for the scoring rule. The Spark Ollama /api/ps entry behind the GPU-placement caveat reported:
{
"context_length": 32768,
"size": 21889382400,
"size_vram": 0
}
Verify Spark GPU placement directly before drawing conclusions from this result:
docker exec ollama ollama ps
nvidia-smi
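The same placement check can be scripted against the tunnel. A sketch assuming the SSH tunnel exposes the Spark Ollama API at localhost:11434; it flags any loaded model whose reported size_vram is 0, which is exactly the condition behind the caveat above.

```python
import json
import urllib.request

# Assumes the SSH tunnel maps the Spark Ollama port to localhost:11434.
OLLAMA_PS_URL = "http://localhost:11434/api/ps"

with urllib.request.urlopen(OLLAMA_PS_URL, timeout=10) as resp:
    running = json.load(resp)

for model in running.get("models", []):
    name = model.get("name", "?")
    size = model.get("size", 0)
    size_vram = model.get("size_vram", 0)
    print(f"{name}: size={size} size_vram={size_vram}")
    if size_vram == 0:
        print("  warning: Ollama reports no VRAM allocation; confirm placement with nvidia-smi")
```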
10. Next runs
| Priority | Platform | Runtime | Model | Goal |
|---|---|---|---|---|
| 1 | M5 Max | MLX direct | Qwen3-Coder-Next 4-bit / 8-bit | Run the same smoke/code/question/wiki suites against Apple MLX. |
| 2 | DGX Spark | SGLang | Qwen3-Coder-Next / Qwen3-Next NVFP4 | Test optimized agent/code serving. |
| 3 | DGX Spark | vLLM | Qwen3.6 FP8/NVFP4 | Test OpenAI-compatible serving with prefix cache and FP8 KV. |
| 4 | DGX Spark | llama-server CUDA | Qwen3.6 GGUF or Coder-Next GGUF | Check whether larger GGUF candidates beat the current 30B default. |
| 5 | Cloud via M5 | Gemini CLI/API | Gemini 3 Flash and Gemini 3.1 Pro profiles | Run the same smoke/code/question/wiki suites as a cloud comparison lane. |
| 6 | M5 Max | ds4-server | DeepSeek V4 Flash | Re-run with expanded structured-output and wiki tasks after settings changes. |
11. Public scripts
The public bundle includes the benchmark runner, wrappers, tunnel examples, and a redaction check. It intentionally omits raw run logs, SSH known-hosts files, LAN addresses, MAC addresses, hostnames, usernames, and private paths.
| File | Purpose |
|---|---|
| README.md | Setup, endpoint defaults, tunnel usage, and publishing hygiene. |
| bench.py | Benchmark runner for ds4, Spark Ollama, and llama.cpp. |
| run-spark.sh | Spark Ollama wrapper. |
| run-ds4.sh | Apple ds4 wrapper. |
| run-llama.sh | llama.cpp wrapper. |
| run-both.sh | ds4 plus Spark Ollama wrapper. |
| run-practical-all.sh | All built-in suites across ds4, Spark Ollama, and Spark llama.cpp. |
| pull-ollama.py | Pull an Ollama model through the configured local tunnel. |
| spark-tunnel.example.sh | Placeholder SSH tunnel for remote Ollama. |
| spark-llama-tunnel.example.sh | Placeholder SSH tunnel for remote llama.cpp. |
| spark-llama-cpp-server.sh | Host-side helper for building and launching llama-server. |
| redaction-check.sh | Scan generated files before publishing. |
| MANIFEST.txt | Bundle file list. |
bench.py redacts non-local endpoint hosts in runs.jsonl. Use --show-endpoints only for private notes.
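Not the published redaction-check.sh, but a sketch of the kind of pre-publish scan it implies: walk the generated results and flag strings that look like private LAN addresses, MAC addresses, or user home paths.

```python
import re
import sys
from pathlib import Path

# Illustrative patterns only; the real redaction-check.sh may scan differently.
PATTERNS = {
    "private IPv4": re.compile(r"\b(?:10|192\.168|172\.(?:1[6-9]|2\d|3[01]))(?:\.\d{1,3}){2,3}\b"),
    "MAC address": re.compile(r"\b(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b"),
    "user home path": re.compile(r"/(?:Users|home)/[A-Za-z0-9._-]+"),
}

def scan(directory: str) -> int:
    hits = 0
    for path in Path(directory).rglob("*"):
        if not path.is_file():
            continue
        text = path.read_text(errors="ignore")
        for label, pattern in PATTERNS.items():
            for match in pattern.finditer(text):
                print(f"{path}: {label}: {match.group(0)}")
                hits += 1
    return hits

if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "results"
    sys.exit(1 if scan(target) else 0)
```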
12. Reproduce
Download the public bundle into a bench-local-ai folder, then run the same dated task set. Keep the generated summary public and run the redaction check before publishing raw records.
mkdir -p bench-local-ai
cd bench-local-ai
base='https://learntoprompt.org/downloads/bench-local-ai'
for file in \
bench.py pull-ollama.py \
run-ds4.sh run-spark.sh run-llama.sh run-both.sh \
run-practical-all.sh \
spark-tunnel.example.sh spark-llama-tunnel.example.sh \
spark-llama-cpp-server.sh redaction-check.sh; do
curl -fsSLO "$base/$file"
done
chmod +x *.py *.sh
# Spark Ollama daily-agent run
SPARK_MODEL='qwen3-coder:30b' ./run-spark.sh 3 --exclude-kind long-context --timeout 90
# Spark candidate run
SPARK_MODEL='qwen3-coder-next:q4_K_M' ./run-spark.sh 3 --exclude-kind long-context --timeout 180
# Spark llama.cpp current default
LLAMA_MODEL='qwen3-coder-30b.gguf' ./run-llama.sh 3 --exclude-kind long-context --timeout 180
# Full practical suite
LLAMA_MODEL='qwen3-coder-30b.gguf' ./run-practical-all.sh 1 --timeout 240
# Apple ds4 run
./run-ds4.sh 3 --timeout 240
# Check generated files before publishing
./redaction-check.sh results
For every new run, publish the date, hardware, runtime, exact model tag, quantization, context, thinking mode, task set, pass rate, median wall time, and the reason the profile was promoted or rejected. For cloud comparison runs, also publish first-token latency, cost per task, and whether credentials were loaded only from the environment or through provider auth.
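A sketch of one such published record, filled from the Spark llama.cpp row above. The field names are this page's convention rather than any tool's schema; cloud-only fields stay empty for local runs.

```python
run_record = {
    "date": "2026-05-10",
    "hardware": "DGX Spark",
    "runtime": "llama-server CUDA over SSH tunnel",
    "model": "qwen3-coder-30b.gguf",
    "quantization": "Ollama GGUF blob (same bits as qwen3-coder:30b Q4_K_M)",
    "context": 32768,
    "thinking": False,
    "task_set": "smoke/code/question/wiki practical suites",
    "pass_rate": "14/14",
    "median_wall_ms": 1151,
    "decision": "promoted: fastest measured Spark path with a clean pass",
    # Cloud-comparison fields, left empty for local runs.
    "ttft_ms": None,
    "cost_per_task_usd": None,
    "credential_source": None,
}
```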