ds4-agent vs Codex frontier

What public benchmarks do and do not prove about native ds4-agent, local DeepSeek V4 Flash, and Codex frontier coding-agent performance.

1. Bottom line

I could not find a public apples-to-apples benchmark of native ds4-agent against Codex frontier on the same coding-agent tasks. The honest public claim is narrower: DeepSeek V4 Flash is a credible local quasi-frontier coding model, ds4 makes it fast enough to test seriously on Apple Silicon and DGX Spark, and native ds4-agent is interesting because it removes the HTTP API boundary.

That does not prove it beats Codex frontier. Codex frontier still has stronger published agentic coding evals. Treat ds4-agent as a benchmark candidate, not a replacement claim, until it passes the same repo-editing and wiki tasks with comparable tool reliability.

Practical read: use Codex frontier for hard, ambiguous repo surgery. Use ds4 and ds4-agent for local, private, high-volume loops once the task harness shows they are stable.

2. Public evidence

| Evidence | What it says | What it does not prove |
| --- | --- | --- |
| antirez/ds4 README | ds4-agent is an alpha native coding agent. Inference, tool calls, and session state can stay inside the ds4 process, with on-disk KV state instead of an external API loop. | It does not publish a direct Codex frontier comparison. |
| ds4 issue #211 | Headless benchmarking is not clean yet because ds4-agent expects a real TTY; a --pipe mode is requested. | Current automated comparisons need an expect/script harness or manual runs. |
| DeepSeek V4 Flash model cards and mirrors | Model-level scores are strong enough to justify local coding-agent trials: SWE Verified up to 79.0, Terminal-Bench 2.0 56.9, Toolathlon 47.8, MCPAtlas 69.0. | Model evals are not the same as a native local coding-agent benchmark. |
| OpenAI GPT-5.5 / Codex page | OpenAI reports GPT-5.5 in Codex at Terminal-Bench 2.0 82.7%, SWE-Bench Pro public 58.6%, Expert-SWE internal 73.1%, Toolathlon 55.6%, and MCP Atlas 75.3%. | Those are not run on the same local harness as ds4. |
| Third-party app benchmark | AkitaOnRails reported GPT 5.4 xHigh Codex at 97/100, GPT 5.5 xHigh Codex at 96/100, and DeepSeek V4 Flash at 78/100, with DeepSeek cheaper and faster. | It compares app/model lanes, not native ds4-agent on local hardware. |

3. Local baseline

Our current local data compares server-backed paths, not native ds4-agent. It still sets the floor that native ds4-agent has to beat.

| Path | Model | Pass rate | Median wall | Notes |
| --- | --- | --- | --- | --- |
| Spark llama.cpp | qwen3-coder-30b.gguf | 14/14 | 1151 ms | Fastest current local profile target across smoke, code, question, and wiki tasks. |
| Spark ds4 CUDA | DeepSeek V4 Flash q2-imatrix | 13/14 | ~5007 ms | Non-thinking mode; only miss was an abstention wording gate. |
| Spark Ollama | qwen3-coder:30b | 13/14 | 5471 ms | Useful fallback, slower than llama.cpp for the same class of tasks. |
| M5 ds4-server | deepseek-v4-flash | 9/14 | 5853 ms | Good local side engine; weak strict JSON, citation, and abstention gates in this harness. |

For raw M5 ds4 generation, the 256-token technical-answer baseline was 61.21 tokens/s prefill and 38.57 tokens/s generation, or 76.92 tokens/s prefill with warm weights. That is useful engine data, but agent promotion still depends on correct edits and valid tool use.
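The pass-rate and median-wall columns can be derived from per-task records. The following is a minimal sketch, assuming each task run is stored as a pass flag plus a wall time in milliseconds. The `TaskRun` type and the `summarize` helper are hypothetical names for illustration and are not the harness's actual format.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class TaskRun:
    """One task attempt in one lane (hypothetical record shape)."""
    passed: bool
    wall_ms: float


def summarize(runs):
    """Return pass count, total, pass rate, and median wall time for a lane."""
    if not runs:
        raise ValueError("need at least one task run")
    passes = sum(1 for r in runs if r.passed)
    return {
        "passed": passes,
        "total": len(runs),
        "pass_rate": passes / len(runs),
        "median_wall_ms": median(r.wall_ms for r in runs),
    }
```

The median is less sensitive than the mean to a single slow task, which matters when one stalled run would otherwise dominate a 14-task suite.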

4. Fair test

The missing comparison is not a leaderboard row. It is a harness that runs the same tasks through the same filesystem state, with the same scoring rules, and records tool validity.
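One way to give every lane the same filesystem state and the same scoring rule is to copy a pristine fixture into a fresh directory per run, then score with one shared check command. This is a sketch under those assumptions. `run_lane`, its parameters, and the returned keys are hypothetical, and the lane and check commands are whatever the real harness uses.

```python
import shutil
import subprocess
import tempfile
import time
from pathlib import Path


def run_lane(fixture_dir, lane_cmd, check_cmd, timeout_s=240.0):
    """Run one lane on a fresh copy of the fixture and score it with check_cmd."""
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp) / "repo"
        shutil.copytree(fixture_dir, work)  # the pristine fixture is never modified
        start = time.monotonic()
        timed_out = False
        try:
            subprocess.run(lane_cmd, cwd=work, capture_output=True,
                           text=True, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            timed_out = True  # a stall counts as a failure
        wall_ms = (time.monotonic() - start) * 1000.0
        # Same scoring rule for every lane: the check command's exit status.
        check = subprocess.run(check_cmd, cwd=work, capture_output=True, text=True)
        return {
            "passed": (not timed_out) and check.returncode == 0,
            "timed_out": timed_out,
            "wall_ms": wall_ms,
        }
```

Scoring on the final state of the working copy, rather than on the agent's own exit code, keeps the comparison about whether files and tests ended up correct.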

| Lane | How to run | Record |
| --- | --- | --- |
| Codex frontier | Run the hosted Codex profile against the smoke, code, question, and wiki fixtures. | Pass/fail, wall time, tool errors, context used, cost, and whether files/tests ended correct. |
| ds4-server | Run the existing OpenAI-compatible ds4 path through the same benchmark suite. | Pass/fail, median wall, prompt tokens, generation tokens, malformed JSON/tool calls. |
| Native ds4-agent | Run from disposable repos first. Until pipe mode exists, use a TTY wrapper and keep traces private. | Pass/fail, wall time, repeated-prefill behavior, stale edits, trace redaction status. |
```shell
# server-backed ds4 baseline
./bench-local-ai/run-ds4.sh 3 --timeout 240

# native ds4-agent fixture, keep traces private
cd /tmp/ds4-agent-fixture
~/src-repo/ds4/ds4-agent \
  -m ~/src-repo/ds4/ds4flash.gguf \
  --ctx 100000 \
  --nothink \
  --trace /tmp/ds4-agent-fixture.trace
```
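Until a pipe mode exists, a TTY wrapper can be approximated with a pseudo-terminal. The sketch below uses Python's standard `pty` module to give a child process a terminal on stdin, stdout, and stderr while capturing its output. `run_in_pty` is a hypothetical helper, and whether ds4-agent accepts a pseudo-terminal in place of a real TTY is an assumption that has not been verified here. It is POSIX-only and does not send any input to the child.

```python
import os
import pty
import select
import subprocess
import time


def run_in_pty(argv, cwd=None, timeout_s=240.0):
    """Run argv attached to a pseudo-terminal.

    Returns (returncode, captured_output, timed_out).
    """
    master_fd, slave_fd = pty.openpty()
    proc = subprocess.Popen(argv, cwd=cwd, stdin=slave_fd,
                            stdout=slave_fd, stderr=slave_fd)
    os.close(slave_fd)  # the parent keeps only the master side
    chunks = []
    timed_out = False
    deadline = time.monotonic() + timeout_s
    try:
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                timed_out = True
                proc.kill()
                break
            ready, _, _ = select.select([master_fd], [], [], remaining)
            if not ready:
                continue  # loop again so the deadline check fires
            try:
                data = os.read(master_fd, 4096)
            except OSError:
                break  # Linux raises EIO once the child closes the terminal
            if not data:
                break  # other platforms signal end of output with an empty read
            chunks.append(data)
    finally:
        os.close(master_fd)
    returncode = proc.wait()
    return returncode, b"".join(chunks).decode(errors="replace"), timed_out
```

The captured output should be treated like the trace file: keep it private, since it may contain repository contents.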

5. Recommendation

Do not market native ds4-agent as faster or better than Codex frontier yet. The defensible position is that it is the most interesting local DeepSeek V4 Flash agent path to benchmark next, because it may avoid adapter overhead and preserve KV/session state more directly than a normal API-backed coding client.

The next public benchmark update should add two lanes: hosted Codex frontier on the practical fixture suite, and native ds4-agent on the same suite. Promote the native path only if it matches the server path on correctness and beats it on pass-per-minute without stalls, malformed tools, or private trace leakage. Start with the ds4-agent setup guide before publishing comparison numbers.
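The promotion rule can be written down as a small gate. This is a sketch, assuming "matches on correctness" means at least as many passes on the same number of tasks and "pass-per-minute" means passes divided by total wall time in minutes. `LaneSummary`, `pass_per_minute`, and `should_promote` are hypothetical names, not part of any existing harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LaneSummary:
    """Aggregate results for one lane over the fixture suite (hypothetical)."""
    passed: int
    total: int
    total_wall_s: float
    stalls: int = 0
    malformed_tool_calls: int = 0
    trace_leaks: int = 0


def pass_per_minute(s):
    """Passing tasks per minute of total wall time."""
    if s.total_wall_s <= 0:
        raise ValueError("total_wall_s must be positive")
    return s.passed / (s.total_wall_s / 60.0)


def should_promote(native, server):
    """Promote only if native is clean, as correct as server, and faster per pass."""
    clean = (native.stalls == 0
             and native.malformed_tool_calls == 0
             and native.trace_leaks == 0)
    as_correct = native.total == server.total and native.passed >= server.passed
    faster = pass_per_minute(native) > pass_per_minute(server)
    return clean and as_correct and faster
```

Requiring all three conditions together means a faster native lane with even one malformed tool call or leaked trace stays unpromoted.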

6. Sources