Local AI Benchmarks

Dated local-model runs and cloud comparison profiles with hardware, runtime, model, quantization, context, pass rate, and next experiments.

Local Inference Benchmarks

1. Current snapshot

This page tracks measured local AI runs. It is not a vendor leaderboard. Each row should keep the date, hardware, runtime, model, quantization, context, pass rate, and exact benchmark conditions together so progress is visible over time. Cloud profiles are tracked as comparison lanes, not local hardware lanes. Comparable result tables are sorted by best performer first.

| Lane / profile | Why | Status |
| --- | --- | --- |
| qwen3-coder-30b.gguf on DGX Spark llama.cpp | Fastest measured Spark path; passed 14/14 across smoke, code, question, and wiki suites. | Default Spark profile target |
| qwen3-coder:30b on DGX Spark Ollama | Passed practical checks except the long-prefill timeout, but is much slower than llama.cpp for the same model bits. | Model management / fallback |
| qwen3-coder-next:q4_K_M on DGX Spark Ollama | Passed all daily-agent tasks but is slower. Keep it for harder coding-quality tests. | Quality candidate |
| deepseek-v4-flash through ds4 on M5 Max | Good Apple-side engine, but weak strict JSON, citation, and abstention gates in this harness. | Side engine |
| gemini-fast and gemini-pro via M5 | Cloud comparison and escalation lanes for the same practical tasks. Track wall time, tool validity, context used, and cost separately. | Comparator, not local inference |

Current caveat: Spark Ollama reported size_vram=0 in /api/ps. Treat Ollama as the convenience/model-management path until Spark shell verification and direct llama.cpp/SGLang/vLLM runs cover the same models.

2. How to read the numbers

The benchmark separates three local layers plus a cloud comparator lane. Raw tokens per second is useful, but local coding agents should be judged on task success first.

| Layer | Measures | Use for |
| --- | --- | --- |
| Engine | Prefill tok/s, generation tok/s, memory, context/KV limits | MLX, ds4, llama.cpp tuning |
| Server/API | End-to-end latency, TTFT, cache behavior, HTTP errors | Ollama, ds4-server, mlx_lm.server |
| Agent loop | Pass rate, wall time, tool validity, stalls, files/tests | Pi, Claude, Codex, OpenAI-compatible side profiles |
| Cloud comparator | Pass rate, wall time, TTFT, cost/task, tool validity, context used | Gemini side profiles measured beside local profiles |

Promote a profile only when it improves correctness, tool-call validity, and stall behavior. A fast chat response is not enough. Do not compare Gemini token/s directly against ds4 or Spark CUDA as if all three were local hardware runtimes.

3. Run registry

Individual measured profile rows are ordered by current benchmark standing. Aggregate run rows stay last because they describe a task set, not one performer.

| Date | Hardware | Runtime | Model | Quant / params | Context | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| 2026-05-10 | DGX Spark | llama-server CUDA over SSH tunnel | qwen3-coder-30b.gguf | Ollama GGUF blob, thinking disabled | 32K | Fastest measured Spark path; long-context stress completed. |
| 2026-05-10 | DGX Spark | Ollama over SSH tunnel | qwen3-coder:30b | Q4_K_M, 30.5B | 32K | Best current Spark Ollama default, but not the fastest Spark path. |
| 2026-05-10 | DGX Spark | Ollama over SSH tunnel | qwen3-coder-next:q4_K_M | Q4_K_M, 79.7B, think=false | 32K | Passed all daily-agent tasks; slower than qwen3-coder:30b. |
| 2026-05-10 | DGX Spark | Ollama over SSH tunnel | qwen3.6:35b | Q4_K_M, 36.0B, think=false | 32K | Slower; strict JSON failed under 256-token cap. |
| 2026-05-10 | M5 Max 128GB | ds4-server | deepseek-v4-flash | q2 local side engine | server default, 100K advertised | Fast enough on long-prefill; strict JSON unreliable. |
| 2026-05-10 | All current local targets | practical suites | ds4, Spark Ollama, Spark llama.cpp | built-in smoke/code/question/wiki tasks | 32K where applicable | Spark llama.cpp passed 14/14 with 1.15 s median wall time. |

4. Daily-agent task results

Median wall time below excludes the long-context stress task. Rows are sorted by total successful daily-agent attempts, then lower wall time.

| Model / runtime | exact-ok | short-tech | JSON plan | code-debug | Verdict |
| --- | --- | --- | --- | --- | --- |
| qwen3-coder-30b.gguf / Spark llama.cpp | 114 ms, 3/3 | 1839 ms, 3/3 | 2632 ms, 3/3 | 1065 ms, 3/3 | Fastest measured Spark profile target. |
| qwen3-coder:30b / Spark Ollama | 302 ms, 3/3 | 8894 ms, 3/3 | 13162 ms, 3/3 | 5496 ms, 3/3 | Best current Spark Ollama default; slower than llama.cpp. |
| qwen3-coder-next:q4_K_M / Spark Ollama | 1093 ms, 3/3 | 13550 ms, 3/3 | 23970 ms, 3/3 | 12747 ms, 3/3 | Keep for quality tests, not latency. |
| qwen3.6:35b / Spark Ollama | 969 ms, 3/3 | 22143 ms, 3/3 | 24795 ms, 0/3 | 7019 ms, 3/3 | Not current coding-agent default. |
| deepseek-v4-flash / ds4 | 2049 ms, 3/3 | 11538 ms, 3/3 | 14585 ms, 0/3 | 14692 ms, 2/3 | Good Apple ds4 baseline; weak strict JSON. |

5. Practical suites

The expanded harness now runs four suites: smoke, code, question, and wiki. Rows are sorted by overall pass count, then lower median wall time.

| Stack | Model | Overall pass | Median wall | Total wall | Notes |
| --- | --- | --- | --- | --- | --- |
| Spark llama.cpp | qwen3-coder-30b.gguf | 14/14 | 1151 ms | 16.4 s | Fastest and cleanest current local profile target. |
| Spark Ollama | qwen3-coder:30b | 13/14 | 5471 ms | 337.9 s | Passed practical checks but timed out long-prefill at 240 s. |
| ds4 | deepseek-v4-flash | 9/14 | 5853 ms | 81.3 s | Failed strict JSON, citation, and abstention-style gates. |

| Stack | Smoke | Code | Question | Wiki |
| --- | --- | --- | --- | --- |
| Spark llama.cpp | 5/5 | 3/3 | 3/3 | 3/3 |
| Spark Ollama | 4/5 | 3/3 | 3/3 | 3/3 |
| ds4 | 3/5 | 3/3 | 1/3 | 2/3 |

Promotion rule: use pass rate first, then stall-free behavior, tool validity, p95 wall time, and token rates. Do not promote a profile from chat speed alone.
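As a rough illustration of that ordering, candidate profiles can be ranked with a composite sort key. This is a sketch, not code from bench.py; the ProfileResult shape, field names, and the p95 values are made up for the example, and the pass counts and token rates echo the tables above.

from dataclasses import dataclass

@dataclass
class ProfileResult:
    name: str
    passes: int              # tasks passed across all suites
    stalls: int              # runs that hung or hit the timeout
    invalid_tool_calls: int  # malformed or rejected tool calls
    p95_wall_ms: float       # 95th-percentile wall time
    gen_tps: float           # median generation tokens/s

def promotion_key(r: ProfileResult):
    # Pass rate first, then stall-free behavior and tool validity,
    # then p95 wall time; raw token rate is only the final tiebreaker.
    return (-r.passes, r.stalls, r.invalid_tool_calls, r.p95_wall_ms, -r.gen_tps)

# Illustrative values only.
results = [
    ProfileResult("spark-llama-cpp", 14, 0, 0, 5600.0, 90.0),
    ProfileResult("ds4-deepseek-v4-flash", 9, 0, 0, 15000.0, 38.0),
]
ranked = sorted(results, key=promotion_key)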

6. Cloud comparison lane

Gemini belongs in this registry as an explicit cloud lane beside the local stack. Run it from the M5, keep Spark focused on local CUDA serving, and publish cloud metrics separately from local token-throughput metrics.

| Date added | Profile | Runtime path | Model | Thinking | Use |
| --- | --- | --- | --- | --- | --- |
| 2026-05-10 | gemini-fast | Gemini CLI or OpenAI-compatible API from M5 | gemini-3-flash-preview | low | Fast cloud coding/chat comparator. |
| 2026-05-10 | gemini-pro | Gemini CLI or OpenAI-compatible API from M5 | gemini-3.1-pro-preview-customtools | medium | Hard coding-agent tasks with custom tools. |
| 2026-05-10 | gemini-pro-deep | Gemini CLI or API from M5 | gemini-3.1-pro-preview | high | Deep repo review, architecture, and wiki synthesis. |
| 2026-05-10 | gemini-lite | Gemini CLI or API from M5 | gemini-3.1-flash-lite | minimal or low | Routing, extraction, smoke tests, and lightweight chat. |

| Metric | Publish for Gemini | Why separate |
| --- | --- | --- |
| Task outcome | Pass/fail, grader reason, final-state correctness | Closest match to local practical-suite scoring. |
| Latency | Wall time and first-token latency | Network/API time is part of user experience. |
| Tool behavior | Tool-call validity, malformed calls, stalls, drops | Agent reliability matters more than raw chat speed. |
| Cost/context | Cost per task, input tokens, output tokens, context used | Cloud comparison needs budget data that local runs do not. |

Publishing rule: keep Gemini credentials in environment variables or provider auth only. Public pages and scripts should name the profile, model, thinking setting, and metrics, but never API keys, project IDs, account IDs, hostnames, LAN addresses, or private paths.
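A minimal sketch of that credential rule. The GEMINI_API_KEY variable name is an assumption for the example, not something the bundle defines; the point is that the key is read from the environment and never written into published records.

import os

# Assumption: the Gemini key is exported as GEMINI_API_KEY.
api_key = os.environ.get("GEMINI_API_KEY")
if not api_key:
    raise SystemExit("GEMINI_API_KEY is not set; refusing to run the cloud lane")

# Published records carry profile-level metadata and metrics only, never the key.
published = {
    "profile": "gemini-fast",
    "model": "gemini-3-flash-preview",
    "thinking": "low",
    # pass/fail, wall time, TTFT, cost per task, and token counts go here
}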

7. Spark token rates

Rows in each runtime table are sorted by median generation rate. Ollama response metrics use qwen3-coder:30b with OLLAMA_NUM_CTX=32768.

| Task | Median prefill | Median generation |
| --- | --- | --- |
| exact-ok | 295.2 t/s | 71.1 t/s |
| short-technical | 214.8 t/s | 19.2 t/s |
| code-debug | 156.5 t/s | 18.9 t/s |
| json-plan | 177.2 t/s | 18.0 t/s |
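
The Ollama rates above are derived from per-response timing fields. A minimal conversion sketch, assuming the standard /api/generate timing fields (Ollama reports durations in nanoseconds); the example numbers are illustrative, not a logged response:

def ollama_rates(resp: dict) -> tuple[float, float]:
    # prompt_eval_count/_duration cover prefill; eval_count/_duration cover generation.
    prefill_tps = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
    gen_tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)
    return prefill_tps, gen_tps

# Illustrative response: 1000 prompt tokens in 3.4 s, 256 output tokens in 3.6 s.
print(ollama_rates({
    "prompt_eval_count": 1000, "prompt_eval_duration": 3_400_000_000,
    "eval_count": 256, "eval_duration": 3_600_000_000,
}))  # roughly (294.1, 71.1)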

llama.cpp response metrics use the same model bits, copied from the Ollama blob into qwen3-coder-30b.gguf, with 32K context.

| Task | Median prefill | Median generation |
| --- | --- | --- |
| exact-ok | 226.6 t/s | 157.1 t/s |
| short-technical | 353.9 t/s | 91.5 t/s |
| code-debug | 66.1 t/s | 90.2 t/s |
| json-plan | 535.0 t/s | 90.1 t/s |
| long-prefill-summary | 2753.8 t/s | 64.3 t/s |

8. M5 Max 128GB reference data

The home AI wiki has a separate M5 Max 128GB thread. Keep these rows separate from the daily-agent registry because they mix local measurements, community MLX runs, and vendor claims.

Local measured ds4 on M5 Max 128GB

Measured ds4 rows are sorted by generation rate; tiny-output overhead rows are last.

| Run | Context / flags | Prefill | Generation | Notes |
| --- | --- | --- | --- | --- |
| 256-token technical answer | --ctx 32768 --nothink --temp 0 -n 256 | 61.21 t/s | 38.57 t/s | Best local ds4 generation baseline. |
| Same prompt, warm weights | --warm-weights | 76.92 t/s | 37.81 t/s | Warm weights helped prefill, not decode. |
| Short reply with ok | --ctx 32768 --nothink --temp 0 -n 16 | 28.96 t/s | 8.93 t/s | Tiny output; overhead dominates. |

| Layer | Observed warm latency | Interpretation |
| --- | --- | --- |
| Raw ds4 Anthropic messages | ~121 ms | Endpoint-only tiny prompt. |
| Raw ds4 OpenAI chat | ~123 ms | Endpoint-only tiny prompt. |
| claude-ds4 bare/no-tools | ~2.0 s | CLI overhead plus local endpoint. |
| codex-ds4 coding profile | ~25-26 s | Full agent prompt plus tool schemas; tiny smoke prompt reported ~12,382 prompt tokens. |

Community MLX / GGUF data for M5 Max 128GB

Community rows are sorted by reported generation throughput. They differ in model, context, and source, so they should stay labeled rather than merged into the local harness ranking.

| Model / runtime | Context | Prefill | Generation | Peak memory | Source type |
| --- | --- | --- | --- | --- | --- |
| gpt-oss-120b-MXFP4-Q8 / MLX | 1K | - | 84.5 TG t/s | - | oMLX community |
| gpt-oss-120b-MXFP4-Q8 / MLX | 4K | - | 79.6 TG t/s | - | oMLX community |
| gpt-oss-120b-MXFP4-Q8 / MLX | 8K | - | 73.5 TG t/s | - | oMLX community |
| Qwen3-Coder-Next 4-bit / MLX | 32K | 2,434 PP t/s | 61.2 TG t/s | 46.5GB | oMLX community |
| MiniMax-M2.7-style 4-bit / MLX | 1K | - | 55-57 TG t/s | ~87GB | oMLX community |
| MiniMax-M2.7-style 4-bit / MLX | 8K | - | ~40 TG t/s | ~89GB | oMLX community |
| gpt-oss-120b-MXFP4-Q8 / MLX | 32K | 1,368 PP t/s | 39.0 TG t/s | 60.6GB | oMLX community |
| LLaMA 7B F16 / llama.cpp Metal | benchmark default | 1018.30 t/s | 37.58 t/s | - | llama.cpp community |
| MiniMax-M2.7-style 4-bit / MLX | 16K | - | 29-32 TG t/s | ~91GB | oMLX community |
| Qwen3-Coder-Next 4-bit / MLX | 64K | - | 28.7-29.9 TG t/s | - | oMLX community |
| MiniMax-M2.7-style 4-bit / MLX | 32K | - | 20-21 TG t/s | ~95GB | oMLX community |
| MiniMax-M2.7-style 4-bit / MLX | 64K | - | ~13.6 TG t/s | ~103GB | oMLX community |

Vendor / planning data

| Claim | Number | Use it for |
| --- | --- | --- |
| Ollama MLX preview on Qwen3.5-35B-A3B | Prefill 1154 -> 1810 t/s, decode 58 -> 112 t/s; projected int4 1851 prefill / 134 decode | API-path target to beat with direct MLX. |
| Planning estimate: Qwen3.6-35B-A3B on M5 Max MLX | ~70-80 tok/s | Strategy estimate, not a local harness measurement. |
| Apple MLX 30B MoE on MacBook Pro | Sub-3-second TTFT | Interactive MoE expectation. |
| Planning estimate: 70B dense Q4 on M5 Max | ~12-20 tok/s | Usable chat, probably slow for coding-agent loops. |
| Apple MLX M5 vs M4 | Up to 4x TTFT; decode only 1.19-1.27x better | Explains why prefill improves more than generation. |
| M5 Max hardware baseline | 128GB unified memory, 614GB/s memory bandwidth | Capacity and decode-speed planning. |

Rule for public updates: label rows as local measured, community, vendor, or planning estimate. Do not mix them into one leaderboard.

External references for the non-local rows: Apple MLX M5 research, Ollama MLX preview, llama.cpp Apple Silicon benchmark thread, and oMLX community runs go1vd8aj, m1wd0ucw, r9m8lvr3.

9. Long-context status

Long-context behavior is not resolved for Spark Ollama yet. A long-prefill task timed out at 180 seconds with OLLAMA_NUM_CTX=32768; the expanded practical run timed out at 240 seconds. Before setting num_ctx, the model loaded with a 262144-token context and also stalled.

Spark llama.cpp completed the standalone long-prefill stress task in 5596 ms with 2753.8 prefill t/s. Apple ds4 completed the same stress task in the 20 second range, but only passed 1/3 because some outputs were too terse for the scoring rule.

The relevant fields from the Spark Ollama /api/ps output:

{
  "context_length": 32768,
  "size": 21889382400,
  "size_vram": 0
}

Verify Spark GPU placement directly before drawing conclusions from this result:

docker exec ollama ollama ps
nvidia-smi
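
A small scripted version of the same check, assuming the SSH tunnel exposes Spark Ollama at localhost:11434 (adjust the port for your tunnel):

import json
import urllib.request

# Query the running-models endpoint through the tunnel and flag any model
# whose reported VRAM allocation is zero (possible CPU placement).
with urllib.request.urlopen("http://localhost:11434/api/ps") as resp:
    status = json.load(resp)

for m in status.get("models", []):
    if m.get("size_vram", 0) == 0:
        print(f"warning: {m.get('name')} reports size_vram=0")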

10. Next runs

| Priority | Platform | Runtime | Model | Goal |
| --- | --- | --- | --- | --- |
| 1 | M5 Max | MLX direct | Qwen3-Coder-Next 4-bit / 8-bit | Run the same smoke/code/question/wiki suites against Apple MLX. |
| 2 | DGX Spark | SGLang | Qwen3-Coder-Next / Qwen3-Next NVFP4 | Test optimized agent/code serving. |
| 3 | DGX Spark | vLLM | Qwen3.6 FP8/NVFP4 | Test OpenAI-compatible serving with prefix cache and FP8 KV. |
| 4 | DGX Spark | llama-server CUDA | Qwen3.6 GGUF or Coder-Next GGUF | Check whether larger GGUF candidates beat the current 30B default. |
| 5 | Cloud via M5 | Gemini CLI/API | Gemini 3 Flash and Gemini 3.1 Pro profiles | Run the same smoke/code/question/wiki suites as a cloud comparison lane. |
| 6 | M5 Max | ds4-server | DeepSeek V4 Flash | Re-run with expanded structured-output and wiki tasks after settings changes. |

11. Public scripts

The public bundle includes the benchmark runner, wrappers, tunnel examples, and a redaction check. It intentionally omits raw run logs, SSH known-hosts files, LAN addresses, MAC addresses, hostnames, usernames, and private paths.

| File | Purpose |
| --- | --- |
| README.md | Setup, endpoint defaults, tunnel usage, and publishing hygiene. |
| bench.py | Benchmark runner for ds4, Spark Ollama, and llama.cpp. |
| run-spark.sh | Spark Ollama wrapper. |
| run-ds4.sh | Apple ds4 wrapper. |
| run-llama.sh | llama.cpp wrapper. |
| run-both.sh | ds4 plus Spark Ollama wrapper. |
| run-practical-all.sh | All built-in suites across ds4, Spark Ollama, and Spark llama.cpp. |
| pull-ollama.py | Pull an Ollama model through the configured local tunnel. |
| spark-tunnel.example.sh | Placeholder SSH tunnel for remote Ollama. |
| spark-llama-tunnel.example.sh | Placeholder SSH tunnel for remote llama.cpp. |
| spark-llama-cpp-server.sh | Host-side helper for building and launching llama-server. |
| redaction-check.sh | Scan generated files before publishing. |
| MANIFEST.txt | Bundle file list. |

Privacy default: bench.py redacts non-local endpoint hosts in runs.jsonl. Use --show-endpoints only for private notes.
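For intuition, a rough illustration of the kind of scan a redaction check performs; this is not the bundled redaction-check.sh, and the patterns are examples only:

import re
import sys
from pathlib import Path

# Example patterns: RFC 1918 private addresses and .local hostnames.
PATTERNS = [
    re.compile(r"\b10\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"),
    re.compile(r"\b192\.168\.\d{1,3}\.\d{1,3}\b"),
    re.compile(r"\b172\.(1[6-9]|2\d|3[01])\.\d{1,3}\.\d{1,3}\b"),
    re.compile(r"[A-Za-z0-9-]+\.local\b"),
]

hits = 0
for path in Path(sys.argv[1] if len(sys.argv) > 1 else "results").rglob("*"):
    if not path.is_file():
        continue
    text = path.read_text(errors="ignore")
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            hits += 1
            print(f"{path}: suspicious token {match.group(0)!r}")
sys.exit(1 if hits else 0)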

12. Reproduce

Download the public bundle into a bench-local-ai folder, then run the same dated task set. Keep the generated summary public and run the redaction check before publishing raw records.

mkdir -p bench-local-ai
cd bench-local-ai
base='https://learntoprompt.org/downloads/bench-local-ai'
for file in \
  bench.py pull-ollama.py \
  run-ds4.sh run-spark.sh run-llama.sh run-both.sh \
  run-practical-all.sh \
  spark-tunnel.example.sh spark-llama-tunnel.example.sh \
  spark-llama-cpp-server.sh redaction-check.sh; do
  curl -fsSLO "$base/$file"
done
chmod +x *.py *.sh
# Spark Ollama daily-agent run
SPARK_MODEL='qwen3-coder:30b' ./run-spark.sh 3 --exclude-kind long-context --timeout 90

# Spark candidate run
SPARK_MODEL='qwen3-coder-next:q4_K_M' ./run-spark.sh 3 --exclude-kind long-context --timeout 180

# Spark llama.cpp current default
LLAMA_MODEL='qwen3-coder-30b.gguf' ./run-llama.sh 3 --exclude-kind long-context --timeout 180

# Full practical suite
LLAMA_MODEL='qwen3-coder-30b.gguf' ./run-practical-all.sh 1 --timeout 240

# Apple ds4 run
./run-ds4.sh 3 --timeout 240

# Check generated files before publishing
./redaction-check.sh results

For every new run, publish the date, hardware, runtime, exact model tag, quantization, context, thinking mode, task set, pass rate, median wall time, and the reason the profile was promoted or rejected. For cloud comparison runs, also publish first-token latency, cost per task, and whether credentials were loaded only from the environment or provider auth.
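
One possible shape for such a published record, as a sketch; the field names are illustrative rather than a schema the bundle enforces, and the values come from the tables above:

# Hypothetical published record for the current default Spark profile.
run_record = {
    "date": "2026-05-10",
    "hardware": "DGX Spark",
    "runtime": "llama-server CUDA over SSH tunnel",
    "model": "qwen3-coder-30b.gguf",
    "quantization": "Ollama GGUF blob",
    "context": 32768,
    "thinking": False,
    "task_set": "smoke/code/question/wiki",
    "pass_rate": "14/14",
    "median_wall_ms": 1151,
    "decision": "promoted: fastest measured Spark path",
}

# Cloud comparison runs would add, at minimum: "ttft_ms", "cost_per_task_usd",
# and "credentials_source" ("environment" or "provider auth").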