Local AI benchmarks
Dated local-model runs and cloud comparison profiles with hardware, runtime, model, quantization, context, pass rate, and next experiments.
Last updated
Local Inference Benchmarks
1. Current snapshot
This page tracks measured local AI runs. It is not a vendor leaderboard. Each row should keep the date, hardware, runtime, model, quantization, context, pass rate, and exact benchmark conditions together so progress is visible over time. Cloud profiles are tracked as comparison lanes, not local hardware lanes. Comparable result tables are sorted by best performer first.
| Lane / profile | Why | Status |
|---|---|---|
| qwen3-coder-30b.gguf on DGX Spark llama.cpp | Fastest measured Spark path; passed 14/14 across smoke, code, question, and wiki suites. | Default Spark profile target |
| qwen3-coder:30b on DGX Spark Ollama | Passed practical checks except long-prefill timeout, but is much slower than llama.cpp for the same model bits. | Model management / fallback |
| qwen3-coder-next:q4_K_M on DGX Spark Ollama | Passed all daily-agent tasks but is slower. Keep it for harder coding-quality tests. | Quality candidate |
| deepseek-v4-flash through ds4 on M5 Max | Good Apple-side engine, but weak strict JSON, citation, and abstention gates in this harness. | Side engine |
| gemini-fast and gemini-pro via M5 | Cloud comparison and escalation lanes for the same practical tasks. Track wall time, tool validity, context used, and cost separately. | Comparator, not local inference |
One open caveat: the Spark Ollama container reports size_vram=0 in /api/ps (see the long-context status section). Treat Ollama as the convenience/model-management path until Spark shell verification and direct llama.cpp/SGLang/vLLM runs cover the same models.
2. How to read the numbers
The benchmark separates three layers. Raw tokens per second is useful, but local coding agents should be judged on task success first.
| Layer | Measures | Use for |
|---|---|---|
| Engine | Prefill tok/s, generation tok/s, memory, context/KV limits | MLX, ds4, llama.cpp tuning |
| Server/API | End-to-end latency, TTFT, cache behavior, HTTP errors | Ollama, ds4-server, mlx_lm.server |
| Agent loop | Pass rate, wall time, tool validity, stalls, files/tests | Pi, Claude, Codex, OpenAI-compatible side profiles |
| Cloud comparator | Pass rate, wall time, TTFT, cost/task, tool validity, context used | Gemini side profiles measured beside local profiles |
Promote a profile only when it improves correctness, tool-call validity, and stall behavior. A fast chat response is not enough. Do not compare Gemini token/s directly against ds4 or Spark CUDA as if all three were local hardware runtimes.
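To make that rule concrete, here is a minimal promotion-gate sketch. The ProfileSummary fields are hypothetical aggregates rather than harness output; the point is that pass rate, tool-call validity, and stall behavior gate before latency is allowed to decide.

```python
from dataclasses import dataclass

@dataclass
class ProfileSummary:
    # Hypothetical aggregate fields; the real harness may name or compute these differently.
    name: str
    pass_rate: float           # fraction of tasks passed, e.g. 14/14 -> 1.0
    tool_call_validity: float  # fraction of tool calls that parsed and validated
    stall_count: int           # runs that hung or needed manual interruption
    median_wall_ms: float      # excludes the long-context stress task

def should_promote(candidate: ProfileSummary, incumbent: ProfileSummary) -> bool:
    """Promote only when correctness, tool validity, and stall behavior do not regress."""
    if candidate.pass_rate < incumbent.pass_rate:
        return False
    if candidate.tool_call_validity < incumbent.tool_call_validity:
        return False
    if candidate.stall_count > incumbent.stall_count:
        return False
    # Only after the quality gates hold does latency break the tie.
    return candidate.median_wall_ms < incumbent.median_wall_ms
```

Under this rule the Spark llama.cpp profile promotes over the Spark Ollama default: it holds the full pass rate first and only then wins on wall time.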
3. Run registry
Individual measured profile rows are ordered by current benchmark standing. Aggregate run rows stay last because they describe a task set, not one performer.
| Date | Hardware | Runtime | Model | Quant / params | Context | Notes |
|---|---|---|---|---|---|---|
| 2026-05-10 | DGX Spark | llama-server CUDA over SSH tunnel | qwen3-coder-30b.gguf | Ollama GGUF blob, thinking disabled | 32K | Fastest measured Spark path; long-context stress completed. |
| 2026-05-10 | DGX Spark | Ollama over SSH tunnel | qwen3-coder:30b | Q4_K_M, 30.5B | 32K | Best current Spark Ollama default, but not the fastest Spark path. |
| 2026-05-10 | DGX Spark | Ollama over SSH tunnel | qwen3-coder-next:q4_K_M | Q4_K_M, 79.7B, think=false | 32K | Passed all daily-agent tasks; slower than qwen3-coder:30b. |
| 2026-05-10 | DGX Spark | Ollama over SSH tunnel | qwen3.6:35b | Q4_K_M, 36.0B, think=false | 32K | Slower; strict JSON failed under 256-token cap. |
| 2026-05-10 | M5 Max 128GB | ds4-server | deepseek-v4-flash | q2 local side engine | server default, 100K advertised | Fast enough on long-prefill; strict JSON unreliable. |
| 2026-05-10 | All current local targets | practical suites | ds4, Spark Ollama, Spark llama.cpp | built-in smoke/code/question/wiki tasks | 32K where applicable | Spark llama.cpp passed 14/14 with 1.15 s median wall time. |
4. Daily-agent task results
Median wall time below excludes the long-context stress task. Rows are sorted by total successful daily-agent attempts, then lower wall time.
| Model / runtime | exact-ok | short-tech | JSON plan | code-debug | Verdict |
|---|---|---|---|---|---|
| qwen3-coder-30b.gguf / Spark llama.cpp | 114 ms, 3/3 | 1839 ms, 3/3 | 2632 ms, 3/3 | 1065 ms, 3/3 | Fastest measured Spark profile target. |
| qwen3-coder:30b / Spark Ollama | 302 ms, 3/3 | 8894 ms, 3/3 | 13162 ms, 3/3 | 5496 ms, 3/3 | Best current Spark Ollama default; slower than llama.cpp. |
| qwen3-coder-next:q4_K_M / Spark Ollama | 1093 ms, 3/3 | 13550 ms, 3/3 | 23970 ms, 3/3 | 12747 ms, 3/3 | Keep for quality tests, not latency. |
| qwen3.6:35b / Spark Ollama | 969 ms, 3/3 | 22143 ms, 3/3 | 24795 ms, 0/3 | 7019 ms, 3/3 | Not current coding-agent default. |
| deepseek-v4-flash / ds4 | 2049 ms, 3/3 | 11538 ms, 3/3 | 14585 ms, 0/3 | 14692 ms, 2/3 | Good Apple ds4 baseline; weak strict JSON. |
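The ordering of the table above can be expressed as a small sketch, assuming a hypothetical per-row mapping from task name to (median wall ms, successes out of 3). The long-context stress task is dropped before the median is taken, and rows sort by total successes first, then by lower median wall time.

```python
import statistics

# Hypothetical row shape; numbers copied from the Spark Ollama row above.
rows = {
    "qwen3-coder:30b / Spark Ollama": {
        "exact-ok": (302, 3),
        "short-tech": (8894, 3),
        "json-plan": (13162, 3),
        "code-debug": (5496, 3),
        "long-context": (240_000, 0),  # practical-run timeout; excluded from the daily-agent median
    },
}

def row_key(tasks: dict) -> tuple:
    daily = {name: v for name, v in tasks.items() if name != "long-context"}
    total_passes = sum(passes for _, passes in daily.values())
    median_wall = statistics.median(wall for wall, _ in daily.values())
    # More successes first, then lower median wall time.
    return (-total_passes, median_wall)

ordered = sorted(rows.items(), key=lambda item: row_key(item[1]))
```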
5. Practical suites
The expanded harness now runs four suites: smoke, code, question, and wiki. Rows are sorted by overall pass count, then lower median wall time.
| Stack | Model | Overall pass | Median wall | Total wall | Notes |
|---|---|---|---|---|---|
| Spark llama.cpp | qwen3-coder-30b.gguf | 14/14 | 1151 ms | 16.4 s | Fastest and cleanest current local profile target. |
| Spark Ollama | qwen3-coder:30b | 13/14 | 5471 ms | 337.9 s | Passed practical checks but timed out on long-prefill at 240 s. |
| ds4 | deepseek-v4-flash | 9/14 | 5853 ms | 81.3 s | Failed strict JSON, citation, and abstention-style gates. |
| Stack | Smoke | Code | Question | Wiki |
|---|---|---|---|---|
| Spark llama.cpp | 5/5 | 3/3 | 3/3 | 3/3 |
| Spark Ollama | 4/5 | 3/3 | 3/3 | 3/3 |
| ds4 | 3/5 | 3/3 | 1/3 | 2/3 |
6. Cloud comparison lane
Gemini belongs in this registry as an explicit cloud lane beside the local stack. Run it from the M5, keep Spark focused on local CUDA serving, and publish cloud metrics separately from local token-throughput metrics.
| Date added | Profile | Runtime path | Model | Thinking | Use |
|---|---|---|---|---|---|
| 2026-05-10 | gemini-fast | Gemini CLI or OpenAI-compatible API from M5 | gemini-3-flash-preview | low | Fast cloud coding/chat comparator. |
| 2026-05-10 | gemini-pro | Gemini CLI or OpenAI-compatible API from M5 | gemini-3.1-pro-preview-customtools | medium | Hard coding-agent tasks with custom tools. |
| 2026-05-10 | gemini-pro-deep | Gemini CLI or API from M5 | gemini-3.1-pro-preview | high | Deep repo review, architecture, and wiki synthesis. |
| 2026-05-10 | gemini-lite | Gemini CLI or API from M5 | gemini-3.1-flash-lite | minimal or low | Routing, extraction, smoke tests, and lightweight chat. |
| Metric | Publish for Gemini | Why separate |
|---|---|---|
| Task outcome | Pass/fail, grader reason, final-state correctness | Closest match to local practical-suite scoring. |
| Latency | Wall time and first-token latency | Network/API time is part of user experience. |
| Tool behavior | Tool-call validity, malformed calls, stalls, drops | Agent reliability matters more than raw chat speed. |
| Cost/context | Cost per task, input tokens, output tokens, context used | Cloud comparison needs budget data that local runs do not. |
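The cost-per-task column is simple token arithmetic. A sketch with placeholder per-million-token prices (not actual Gemini rates); the published row should carry the provider-billed numbers instead.

```python
def cost_per_task(input_tokens: int, output_tokens: int,
                  usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Cost of one task from token usage and per-million-token prices."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000

# Placeholder prices and token counts, for illustration only.
print(round(cost_per_task(12_000, 1_500, usd_per_m_input=0.50, usd_per_m_output=3.00), 4))
# 12,000 * 0.50/M = 0.006 plus 1,500 * 3.00/M = 0.0045, so 0.0105 USD for the task.
```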
7. Spark token rates
Rows in each runtime table are sorted by median generation rate. Ollama response metrics use qwen3-coder:30b with OLLAMA_NUM_CTX=32768.
| Task | Median prefill | Median generation |
|---|---|---|
| exact-ok | 295.2 t/s | 71.1 t/s |
| short-technical | 214.8 t/s | 19.2 t/s |
| code-debug | 156.5 t/s | 18.9 t/s |
| json-plan | 177.2 t/s | 18.0 t/s |
llama.cpp response metrics use the same model bits, copied into qwen3-coder-30b.gguf, with 32K context.
| Task | Median prefill | Median generation |
|---|---|---|
| exact-ok | 226.6 t/s | 157.1 t/s |
| short-technical | 353.9 t/s | 91.5 t/s |
| code-debug | 66.1 t/s | 90.2 t/s |
| json-plan | 535.0 t/s | 90.1 t/s |
| long-prefill-summary | 2753.8 t/s | 64.3 t/s |
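All of these rates reduce to tokens divided by phase time. A minimal sketch with placeholder argument names; llama-server and Ollama both return per-phase token counts and durations in their response metadata, though under different keys.

```python
def phase_rates(prompt_tokens: int, prompt_ms: float,
                generated_tokens: int, generation_ms: float) -> tuple[float, float]:
    """Return (prefill tok/s, generation tok/s) from per-phase token counts and times."""
    prefill = prompt_tokens / (prompt_ms / 1000.0)
    generation = generated_tokens / (generation_ms / 1000.0)
    return prefill, generation

# Purely illustrative numbers, not a measured run.
print(phase_rates(prompt_tokens=8_192, prompt_ms=4_000.0,
                  generated_tokens=256, generation_ms=4_000.0))
# (2048.0, 64.0)
```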
8. M5 Max 128GB reference data
The home AI wiki has a separate M5 Max 128GB thread. Keep these rows separate from the daily-agent registry because they mix local measurements, community MLX runs, and vendor claims.
Locally measured ds4 on M5 Max 128GB
Measured ds4 rows are sorted by generation rate; tiny-output overhead rows are last.
| Run | Context / flags | Prefill | Generation | Notes |
|---|---|---|---|---|
| 256-token technical answer | --ctx 32768 --nothink --temp 0 -n 256 | 61.21 t/s | 38.57 t/s | Best local ds4 generation baseline. |
| Same prompt, warm weights | --warm-weights | 76.92 t/s | 37.81 t/s | Warm weights helped prefill, not decode. |
| Short reply with ok | --ctx 32768 --nothink --temp 0 -n 16 | 28.96 t/s | 8.93 t/s | Tiny output; overhead dominates. |
| Layer | Observed warm latency | Interpretation |
|---|---|---|
| Raw ds4 Anthropic messages | ~121 ms | Endpoint-only tiny prompt. |
| Raw ds4 OpenAI chat | ~123 ms | Endpoint-only tiny prompt. |
| claude-ds4 bare/no-tools | ~2.0 s | CLI overhead plus local endpoint. |
| codex-ds4 coding profile | ~25-26 s | Full agent prompt plus tool schemas; tiny smoke prompt reported ~12,382 prompt tokens. |
Community MLX / GGUF data for M5 Max 128GB
Community rows are sorted by reported generation throughput. They differ in model, context, and source, so they should stay labeled rather than merged into the local harness ranking.
| Model / runtime | Context | Prefill | Generation | Peak memory | Source type |
|---|---|---|---|---|---|
| gpt-oss-120b-MXFP4-Q8 / MLX | 1K | - | 84.5 TG t/s | - | oMLX community |
| gpt-oss-120b-MXFP4-Q8 / MLX | 4K | - | 79.6 TG t/s | - | oMLX community |
| gpt-oss-120b-MXFP4-Q8 / MLX | 8K | - | 73.5 TG t/s | - | oMLX community |
| Qwen3-Coder-Next 4-bit / MLX | 32K | 2,434 PP t/s | 61.2 TG t/s | 46.5GB | oMLX community |
| MiniMax-M2.7-style 4-bit / MLX | 1K | - | 55-57 TG t/s | ~87GB | oMLX community |
| MiniMax-M2.7-style 4-bit / MLX | 8K | - | ~40 TG t/s | ~89GB | oMLX community |
| gpt-oss-120b-MXFP4-Q8 / MLX | 32K | 1,368 PP t/s | 39.0 TG t/s | 60.6GB | oMLX community |
| LLaMA 7B F16 / llama.cpp Metal | benchmark default | 1018.30 t/s | 37.58 t/s | - | llama.cpp community |
| MiniMax-M2.7-style 4-bit / MLX | 16K | - | 29-32 TG t/s | ~91GB | oMLX community |
| Qwen3-Coder-Next 4-bit / MLX | 64K | - | 28.7-29.9 TG t/s | - | oMLX community |
| MiniMax-M2.7-style 4-bit / MLX | 32K | - | 20-21 TG t/s | ~95GB | oMLX community |
| MiniMax-M2.7-style 4-bit / MLX | 64K | - | ~13.6 TG t/s | ~103GB | oMLX community |
Vendor / planning data
| Claim | Number | Use it for |
|---|---|---|
| Ollama MLX preview on Qwen3.5-35B-A3B | Prefill 1154 -> 1810 t/s, decode 58 -> 112 t/s; projected int4 1851 prefill / 134 decode | API-path target to beat with direct MLX. |
| Planning estimate: Qwen3.6-35B-A3B on M5 Max MLX | ~70-80 tok/s | Strategy estimate, not a local harness measurement. |
| Apple MLX 30B MoE on MacBook Pro | Sub-3-second TTFT | Interactive MoE expectation. |
| Planning estimate: 70B dense Q4 on M5 Max | ~12-20 tok/s | Usable chat, probably slow for coding-agent loops. |
| Apple MLX M5 vs M4 | Up to 4x faster TTFT; decode only 1.19-1.27x better | Explains why prefill improves more than generation. |
| M5 Max hardware baseline | 128GB unified memory, 614GB/s memory bandwidth | Capacity and decode-speed planning. |
External references for the non-local rows: Apple MLX M5 research, Ollama MLX preview, llama.cpp Apple Silicon benchmark thread, and oMLX community runs go1vd8aj, m1wd0ucw, r9m8lvr3.
9. Long-context status
Long-context behavior is not resolved for Spark Ollama yet. A long-prefill task timed out at 180 seconds with OLLAMA_NUM_CTX=32768; the expanded practical run timed out at 240 seconds. Before setting num_ctx, the model loaded with a 262144-token context and also stalled.
Spark llama.cpp completed the standalone long-prefill stress task in 5596 ms with 2753.8 prefill t/s. Apple ds4 completed the same stress task in the 20-second range, but only passed 1/3 because some outputs were too terse for the scoring rule. The Spark Ollama /api/ps entry behind the GPU-placement caveat reported:
{
"context_length": 32768,
"size": 21889382400,
"size_vram": 0
}
Verify Spark GPU placement directly before drawing conclusions from this result:
docker exec ollama ollama ps
nvidia-smi
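The same placement check can be scripted against the tunnel. A sketch assuming the SSH tunnel exposes the Spark Ollama API at localhost:11434; it flags any loaded model whose reported size_vram is 0, which is exactly the condition behind the caveat above.

```python
import json
import urllib.request

# Assumes the SSH tunnel maps the Spark Ollama port to localhost:11434.
OLLAMA_PS_URL = "http://localhost:11434/api/ps"

with urllib.request.urlopen(OLLAMA_PS_URL, timeout=10) as resp:
    running = json.load(resp)

for model in running.get("models", []):
    name = model.get("name", "?")
    size = model.get("size", 0)
    size_vram = model.get("size_vram", 0)
    print(f"{name}: size={size} size_vram={size_vram}")
    if size_vram == 0:
        print("  warning: Ollama reports no VRAM allocation; confirm placement with nvidia-smi")
```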
10. Next runs
| Priority | Platform | Runtime | Model | Goal |
|---|---|---|---|---|
| 1 | M5 Max | MLX direct | Qwen3-Coder-Next 4-bit / 8-bit | Run the same smoke/code/question/wiki suites against Apple MLX. |
| 2 | DGX Spark | SGLang | Qwen3-Coder-Next / Qwen3-Next NVFP4 | Test optimized agent/code serving. |
| 3 | DGX Spark | vLLM | Qwen3.6 FP8/NVFP4 | Test OpenAI-compatible serving with prefix cache and FP8 KV. |
| 4 | DGX Spark | llama-server CUDA | Qwen3.6 GGUF or Coder-Next GGUF | Check whether larger GGUF candidates beat the current 30B default. |
| 5 | Cloud via M5 | Gemini CLI/API | Gemini 3 Flash and Gemini 3.1 Pro profiles | Run the same smoke/code/question/wiki suites as a cloud comparison lane. |
| 6 | M5 Max | ds4-server | DeepSeek V4 Flash | Re-run with expanded structured-output and wiki tasks after settings changes. |
11. Public scripts
The public bundle includes the benchmark runner, wrappers, tunnel examples, and a redaction check. It intentionally omits raw run logs, SSH known-hosts files, LAN addresses, MAC addresses, hostnames, usernames, and private paths.
| File | Purpose |
|---|---|
| README.md | Setup, endpoint defaults, tunnel usage, and publishing hygiene. |
| bench.py | Benchmark runner for ds4, Spark Ollama, and llama.cpp. |
| run-spark.sh | Spark Ollama wrapper. |
| run-ds4.sh | Apple ds4 wrapper. |
| run-llama.sh | llama.cpp wrapper. |
| run-both.sh | ds4 plus Spark Ollama wrapper. |
| run-practical-all.sh | All built-in suites across ds4, Spark Ollama, and Spark llama.cpp. |
| pull-ollama.py | Pull an Ollama model through the configured local tunnel. |
| spark-tunnel.example.sh | Placeholder SSH tunnel for remote Ollama. |
| spark-llama-tunnel.example.sh | Placeholder SSH tunnel for remote llama.cpp. |
| spark-llama-cpp-server.sh | Host-side helper for building and launching llama-server. |
| redaction-check.sh | Scan generated files before publishing. |
| MANIFEST.txt | Bundle file list. |
bench.py redacts non-local endpoint hosts in runs.jsonl. Use --show-endpoints only for private notes.
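Not the published redaction-check.sh, but a sketch of the kind of pre-publish scan it implies: walk the generated results and flag strings that look like private LAN addresses, MAC addresses, or user home paths.

```python
import re
import sys
from pathlib import Path

# Illustrative patterns only; the real redaction-check.sh may scan differently.
PATTERNS = {
    "private IPv4": re.compile(r"\b(?:10|192\.168|172\.(?:1[6-9]|2\d|3[01]))(?:\.\d{1,3}){2,3}\b"),
    "MAC address": re.compile(r"\b(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b"),
    "user home path": re.compile(r"/(?:Users|home)/[A-Za-z0-9._-]+"),
}

def scan(directory: str) -> int:
    hits = 0
    for path in Path(directory).rglob("*"):
        if not path.is_file():
            continue
        text = path.read_text(errors="ignore")
        for label, pattern in PATTERNS.items():
            for match in pattern.finditer(text):
                print(f"{path}: {label}: {match.group(0)}")
                hits += 1
    return hits

if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "results"
    sys.exit(1 if scan(target) else 0)
```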
12. Reproduce
Download the public bundle into a bench-local-ai folder, then run the same dated task set. Keep the generated summary public and run the redaction check before publishing raw records.
mkdir -p bench-local-ai
cd bench-local-ai
base='https://learntoprompt.org/downloads/bench-local-ai'
for file in \
bench.py pull-ollama.py \
run-ds4.sh run-spark.sh run-llama.sh run-both.sh \
run-practical-all.sh \
spark-tunnel.example.sh spark-llama-tunnel.example.sh \
spark-llama-cpp-server.sh redaction-check.sh; do
curl -fsSLO "$base/$file"
done
chmod +x *.py *.sh
# Spark Ollama daily-agent run
SPARK_MODEL='qwen3-coder:30b' ./run-spark.sh 3 --exclude-kind long-context --timeout 90
# Spark candidate run
SPARK_MODEL='qwen3-coder-next:q4_K_M' ./run-spark.sh 3 --exclude-kind long-context --timeout 180
# Spark llama.cpp current default
LLAMA_MODEL='qwen3-coder-30b.gguf' ./run-llama.sh 3 --exclude-kind long-context --timeout 180
# Full practical suite
LLAMA_MODEL='qwen3-coder-30b.gguf' ./run-practical-all.sh 1 --timeout 240
# Apple ds4 run
./run-ds4.sh 3 --timeout 240
# Check generated files before publishing
./redaction-check.sh results
For every new run, publish the date, hardware, runtime, exact model tag, quantization, context, thinking mode, task set, pass rate, median wall time, and the reason the profile was promoted or rejected. For cloud comparison runs, also publish first-token latency, cost per task, and whether credentials were loaded only from the environment or through provider auth.
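A sketch of one such published record, filled from the Spark llama.cpp row above. The field names are this page's convention rather than any tool's schema; cloud-only fields stay empty for local runs.

```python
run_record = {
    "date": "2026-05-10",
    "hardware": "DGX Spark",
    "runtime": "llama-server CUDA over SSH tunnel",
    "model": "qwen3-coder-30b.gguf",
    "quantization": "Ollama GGUF blob (same bits as qwen3-coder:30b Q4_K_M)",
    "context": 32768,
    "thinking": False,
    "task_set": "smoke/code/question/wiki practical suites",
    "pass_rate": "14/14",
    "median_wall_ms": 1151,
    "decision": "promoted: fastest measured Spark path with a clean pass",
    # Cloud-comparison fields, left empty for local runs.
    "ttft_ms": None,
    "cost_per_task_usd": None,
    "credential_source": None,
}
```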