ds4-agent vs Codex Frontier

1. Bottom line

I could not find a public apples-to-apples benchmark of native ds4-agent against Codex frontier on the same coding-agent tasks. The honest public claim is narrower: DeepSeek V4 Flash is a credible local quasi-frontier coding model, ds4 makes it fast enough to test seriously on Apple Silicon and DGX Spark, and native ds4-agent is interesting because it removes the HTTP API boundary.

That does not prove it beats Codex frontier. Codex frontier still has stronger published agentic coding evals. Treat ds4-agent as a benchmark candidate, not a replacement claim, until it passes the same repo-editing and wiki tasks with comparable tool reliability.

Practical read: use Codex frontier for hard ambiguous repo surgery. Use ds4 and ds4-agent for local/private high-volume loops once the task harness says they are stable.

2. Public evidence

Evidence	What it says	What it does not prove
antirez/ds4 README	`ds4-agent` is an alpha native coding agent. Inference, tool calls, and session state can stay inside the ds4 process, with on-disk KV state instead of an external API loop.	It does not publish a direct Codex frontier comparison.
ds4 issue #211	Headless benchmarking is not clean yet because `ds4-agent` expects a real TTY; a `--pipe` mode is requested.	Current automated comparisons need an expect/script harness or manual runs.
DeepSeek V4 Flash model cards and mirrors	Model-level scores are strong enough to justify local coding-agent trials: SWE-bench Verified up to 79.0, Terminal-Bench 2.0 56.9, Toolathlon 47.8, MCP Atlas 69.0.	Model evals are not the same as a native local coding-agent benchmark.
OpenAI GPT-5.5 / Codex page	OpenAI reports GPT-5.5 in Codex at Terminal-Bench 2.0 82.7%, SWE-Bench Pro public 58.6%, Expert-SWE internal 73.1%, Toolathlon 55.6%, and MCP Atlas 75.3%.	Those are not run on the same local harness as ds4.
Third-party app benchmark	AkitaOnRails reported GPT 5.4 xHigh Codex at 97/100, GPT 5.5 xHigh Codex at 96/100, and DeepSeek V4 Flash at 78/100, with DeepSeek cheaper and faster.	It compares app/model lanes, not native `ds4-agent` on local hardware.

3. Local baseline

Our current local data compares server-backed paths, not native ds4-agent. It still sets the floor that native ds4-agent has to beat.

Path	Model	Pass rate	Median wall time	Notes
Spark llama.cpp	`qwen3-coder-30b.gguf`	14/14	1151 ms	Fastest current local profile target across smoke, code, question, and wiki tasks.
Spark ds4 CUDA	DeepSeek V4 Flash q2-imatrix	13/14	~5007 ms	Non-thinking mode; only miss was an abstention wording gate.
Spark Ollama	`qwen3-coder:30b`	13/14	5471 ms	Useful fallback, slower than llama.cpp for the same class of tasks.
M5 ds4-server	`deepseek-v4-flash`	9/14	5853 ms	Good local side engine; weak strict JSON, citation, and abstention gates in this harness.

For raw M5 ds4 generation, the 256-token technical-answer baseline measured 61.21 tokens/s prefill and 38.57 tokens/s generation, with prefill rising to 76.92 tokens/s once weights were warm. That is useful engine data, but agent promotion still depends on correct edits and valid tool use.
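
A rough way to read those rates is to convert them into wall time: prompt tokens divided by the prefill rate, plus output tokens divided by the generation rate. The sketch below assumes that simple two-phase model and ignores model load and sampling overhead. `estimate_seconds` and the 1,000-token prompt are illustrative assumptions, not measurements from the benchmark.

```python
def estimate_seconds(prompt_tokens: int, gen_tokens: int,
                     prefill_tps: float, gen_tps: float) -> float:
    """Rough wall time: prefill phase plus generation phase."""
    return prompt_tokens / prefill_tps + gen_tokens / gen_tps


# Rates taken from the M5 ds4 baseline above (tokens per second).
GEN_TPS = 38.57
PREFILL_COLD_TPS = 61.21
PREFILL_WARM_TPS = 76.92

# Generation phase alone for the 256-token answer.
generation_only = estimate_seconds(0, 256, PREFILL_COLD_TPS, GEN_TPS)

# Hypothetical 1,000-token prompt (an assumption, not a benchmark value).
cold = estimate_seconds(1000, 256, PREFILL_COLD_TPS, GEN_TPS)
warm = estimate_seconds(1000, 256, PREFILL_WARM_TPS, GEN_TPS)

print(f"generation only: {generation_only:.1f} s")
print(f"cold prefill: {cold:.1f} s, warm prefill: {warm:.1f} s")
```

Under this model the 256 generated tokens alone cost about 6.6 seconds, so any prefill that an agent loop repeats is added on top of that on every turn.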

4. Fair test

The missing comparison is not a leaderboard row. It is a harness that runs the same tasks through the same filesystem state, with the same scoring rules, and records tool validity.

Lane	How to run	Record
Codex frontier	Run the hosted Codex profile against the smoke, code, question, and wiki fixtures.	Pass/fail, wall time, tool errors, context used, cost, and whether files/tests ended correct.
ds4-server	Run the existing OpenAI-compatible ds4 path through the same benchmark suite.	Pass/fail, median wall, prompt tokens, generation tokens, malformed JSON/tool calls.
Native `ds4-agent`	Run from disposable repos first. Until pipe mode exists, use a TTY wrapper and keep traces private.	Pass/fail, wall time, repeated-prefill behavior, stale edits, trace redaction status.
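
To keep the three lanes comparable, each run can be reduced to the same record shape before scoring. The following is a minimal sketch, not the actual bench-local-ai format: `RunRecord`, `summarize`, and the field names are hypothetical, chosen to mirror the "Record" column above.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class RunRecord:
    lane: str          # e.g. "codex", "ds4-server", "ds4-agent"
    task: str          # fixture name
    passed: bool       # did files/tests end correct under the shared scoring rules
    wall_ms: float     # wall time for the task
    tool_errors: int   # malformed JSON or invalid tool calls seen during the run


def summarize(records: list[RunRecord]) -> dict:
    """Collapse one lane's records into the numbers reported in the tables."""
    return {
        "passed": sum(r.passed for r in records),
        "total": len(records),
        "median_wall_ms": median(r.wall_ms for r in records),
        "tool_errors": sum(r.tool_errors for r in records),
    }


# Made-up example values, only to show the shape of the output.
example = [
    RunRecord("ds4-server", "smoke-1", True, 100.0, 0),
    RunRecord("ds4-server", "code-1", False, 300.0, 2),
    RunRecord("ds4-server", "wiki-1", True, 200.0, 0),
]
print(summarize(example))
```

Recording tool errors per run, rather than folding them into pass/fail, keeps "passed despite malformed calls" visible as its own signal.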

```sh
# server-backed ds4 baseline
./bench-local-ai/run-ds4.sh 3 --timeout 240

# native ds4-agent fixture, keep traces private
cd /tmp/ds4-agent-fixture
~/src-repo/ds4/ds4-agent \
  -m ~/src-repo/ds4/ds4flash.gguf \
  --ctx 100000 \
  --nothink \
  --trace /tmp/ds4-agent-fixture.trace
```
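
Because `ds4-agent` expects a real TTY, an automated run needs something that provides one. One possible wrapper, sketched here with Python's standard `pty` module, launches a command on a pseudo-terminal and captures what it prints. `run_in_pty` is a hypothetical helper, the demo command is a stand-in rather than an agent invocation, and whether a pseudo-terminal satisfies ds4-agent's TTY check is untested here. POSIX only.

```python
import os
import pty
import select
import subprocess
import sys
import time


def run_in_pty(argv: list[str], timeout_s: float) -> tuple[int, bytes]:
    """Run argv with stdin/stdout/stderr attached to a pseudo-terminal.

    Returns (exit_code, captured_output). The child is killed on timeout.
    """
    master, slave = pty.openpty()
    proc = subprocess.Popen(argv, stdin=slave, stdout=slave, stderr=slave)
    os.close(slave)  # the child holds its own copy of the slave end
    chunks = []
    deadline = time.monotonic() + timeout_s
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            proc.kill()
            break
        ready, _, _ = select.select([master], [], [], remaining)
        if not ready:
            continue  # timed out waiting; the next pass kills the child
        try:
            data = os.read(master, 4096)
        except OSError:
            break  # Linux reports EIO once the child has closed the terminal
        if not data:
            break  # other platforms report a plain EOF
        chunks.append(data)
    os.close(master)
    return proc.wait(), b"".join(chunks)


if __name__ == "__main__":
    # Stand-in command: confirms the child sees a terminal on stdout.
    check = "import sys; print(sys.stdout.isatty())"
    code, output = run_in_pty([sys.executable, "-c", check], timeout_s=10)
    print(code, output.decode(errors="replace").strip())
```

Driving an interactive session would also require writing prompts to the master side of the terminal, which this sketch leaves out. The captured bytes are the private trace material, so they should stay out of any published results.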

5. Recommendation

Do not market native ds4-agent as faster or better than Codex frontier yet. The defensible position is that it is the most interesting local DeepSeek V4 Flash agent path to benchmark next, because it may avoid adapter overhead and preserve KV/session state more directly than a normal API-backed coding client.

The next public benchmark update should add two lanes: hosted Codex frontier on the practical fixture suite, and native ds4-agent on the same suite. Promote the native path only if it matches the server path on correctness and beats it on passes per minute, with no stalls, malformed tool calls, or private trace leakage. Start with the ds4-agent setup guide before publishing comparison numbers.
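
The promotion rule can be written down as an explicit gate so it is applied the same way on every run. This is a sketch under stated assumptions: `LaneSummary`, `passes_per_minute`, and `should_promote` are hypothetical names, "matches on correctness" is read as at least as many passes on the same fixtures, and the numbers in the example are invented.

```python
from dataclasses import dataclass


@dataclass
class LaneSummary:
    passes: int                # tasks that ended correct
    total: int                 # tasks attempted
    total_wall_s: float        # summed wall time across all tasks
    stalls: int                # runs that hung or timed out
    malformed_tool_calls: int  # invalid JSON or tool invocations
    trace_leaks: int           # private trace content found in outputs


def passes_per_minute(lane: LaneSummary) -> float:
    return lane.passes / (lane.total_wall_s / 60.0)


def should_promote(native: LaneSummary, server: LaneSummary) -> bool:
    """Native must match the server on correctness, be faster per pass, and be clean."""
    same_suite = native.total == server.total
    correct = native.passes >= server.passes
    faster = passes_per_minute(native) > passes_per_minute(server)
    clean = (native.stalls == 0
             and native.malformed_tool_calls == 0
             and native.trace_leaks == 0)
    return same_suite and correct and faster and clean


# Invented numbers, only to exercise the gate.
server = LaneSummary(13, 14, 70.0, 0, 0, 0)
native_clean = LaneSummary(13, 14, 50.0, 0, 0, 0)
native_noisy = LaneSummary(13, 14, 50.0, 0, 1, 0)

print(should_promote(native_clean, server))
print(should_promote(native_noisy, server))
```

Keeping the cleanliness checks as hard zeros means a faster lane with even one malformed tool call stays unpromoted, which matches the conservative stance of the recommendation.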
