ds4

DeepSeek V4 Flash on a 128GB Apple Silicon Mac, with ds4-agent experiments and side profiles for Claude Code, Codex, and Pi.

Local Inference Metal

1. Where ds4 fits

ds4 is a narrow inference engine for DeepSeek V4 Flash GGUF files, with Metal as the primary Mac path and CUDA support for Linux/DGX Spark. This page focuses on the practical 128GB Apple Silicon Mac target: large enough for the q2-imatrix model and fast enough to be worth comparing against smaller local Qwen models.

Use it as a named side engine first. Do not replace your default Claude, Codex, or Pi setup until it survives real tool-use and repo-editing tests.

Use it for | Avoid it for now
Private long-context local chat | Default day-to-day coding edits
DeepSeek V4 Flash quality experiments | Multi-user serving
Side-by-side local model tests | Replacing Spark llama.cpp before it wins practical benchmarks
Local wiki/research prompts | Anything with tracing enabled by accident

2. Build and download

Build from source:

mkdir -p ~/src-repo
git clone https://github.com/antirez/ds4.git ~/src-repo/ds4
cd ~/src-repo/ds4
make

Download the q2-imatrix model:

./download_model.sh q2-imatrix

Do not start with q4 on a 128GB Mac. The q4 file is too large once runtime overhead is included. Start with q2-imatrix; use legacy q2 only if you specifically need to compare against older runs.

Smoke test the CLI:

./ds4 --ctx 32768 --nothink --temp 0 -n 16 -p "reply with ok"

Known-good local smoke numbers from one earlier q2 setup:

Prompt | Observed result
Short ok prompt | Prefill 28.96 t/s, generation 8.93 t/s
256-token explanation | Prefill 61.21 t/s, generation 38.57 t/s
Warm weights | Prefill 76.92 t/s, generation 37.81 t/s

3. Start the server

mkdir -p ~/.cache/ds4-kv
cd ~/src-repo/ds4
./ds4-server \
  --ctx 100000 \
  --kv-disk-dir ~/.cache/ds4-kv \
  --kv-disk-space-mb 8192 \
  --host 127.0.0.1 \
  --port 8000

Setting | Start with | Why
Quant | q2-imatrix | The current preferred 96/128GB target
Server context | 100000 | Large enough for direct long-context experiments
Codex client context | 80000 | Keeps Codex below the observed Metal KV decode ceiling
Pi ds4 path | mitsuhiko/pi-ds4 with --no-tools | Stable chat/transport path; tool mode is still experimental
Thinking | Off for speed tests | Compare clean latency first
Disk KV | On | Helps repeated agent prompts
Trace | Off | Trace can persist prompts and outputs

Do not let coding-agent clients consume the full advertised 100k window by default unless the client harness has been tested there. On this setup, a Codex request around 97k prompt tokens completed prefill and then failed during decode with a Metal compressed KV cache capacity error. Use an 80k advertised client window and compact around 64k for Codex side profiles. For Pi, prefer the upstream mitsuhiko/pi-ds4 extension first, then benchmark real tasks before overriding its model metadata.
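As a rough illustration of the budget above, the sketch below classifies a prompt size against the 80k advertised window and the 64k compaction point. The function and variable names are hypothetical. Neither ds4 nor Codex exposes them, and a real client enforces these limits through its own configuration.

```shell
# Hypothetical helper: classify a prompt size against the budgets from the text.
# CLIENT_CTX and COMPACT_AT are illustrative names, not ds4 or Codex settings.
CLIENT_CTX=80000    # window advertised to the Codex side profile
COMPACT_AT=64000    # point at which the client should compact history

ctx_action() (
  tokens=$1
  if [ "$tokens" -ge "$CLIENT_CTX" ]; then
    echo "reject"     # at or above the advertised client window
  elif [ "$tokens" -ge "$COMPACT_AT" ]; then
    echo "compact"    # still fits, but history should be compacted first
  else
    echo "ok"
  fi
)
```

With these thresholds, the roughly 97k-token request that failed during decode would be classified as reject before it reached the server.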

Verify models:

curl -fsS http://127.0.0.1:8000/v1/models

Verify OpenAI-compatible chat:

curl -fsS http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "deepseek-v4-flash",
    "messages": [{"role": "user", "content": "reply with ok"}],
    "max_tokens": 8,
    "temperature": 0,
    "think": false,
    "stream": false
  }'

Verify Anthropic-compatible messages:

curl -fsS http://127.0.0.1:8000/v1/messages \
  -H 'Content-Type: application/json' \
  -H 'anthropic-version: 2023-06-01' \
  -H 'x-api-key: local-test-key' \
  -d '{
    "model": "deepseek-v4-flash",
    "messages": [{"role": "user", "content": "reply with ok"}],
    "max_tokens": 8,
    "temperature": 0,
    "thinking": {"type": "disabled"},
    "stream": false
  }'

4. Native ds4-agent

Current upstream antirez/ds4 also builds ds4-agent, an alpha native coding agent. It is different from ds4-server: inference, tool calls, and session state stay inside one process, with the on-disk KV cache acting as the session store. That removes the OpenAI/Anthropic/Responses API boundary, but it also means the agent is its own experimental harness.

For a setup-only walkthrough, see the separate ds4-agent setup guide. This section keeps to the shorter Mac-specific ds4 context and the benchmark ladder.

Use this as a benchmark target first. Keep ds4-serve, claude-ds4, codex-ds4, and pi-ds4 as the compatibility profiles until native ds4-agent wins real file-editing tests.

Build and inspect the native agent:

cd ~/src-repo/ds4
git pull --ff-only
make ds4-agent ds4-server
./ds4-agent --help

Run it from the ds4 checkout so the relative metal/*.metal shader files resolve, or export absolute Metal source paths in your wrapper. The agent defaults to ./ds4flash.gguf, so pass -m when your model lives elsewhere:

./ds4-agent \
  -m ./ds4flash.gguf \
  --ctx 100000 \
  --nothink \
  --warm-weights

Useful first-session commands:

Command | Use
/help | Show runtime commands.
/save | Save the current native-agent session.
/list | List saved sessions under ~/.ds4/kvcache.
/switch SHA | Switch to a saved session without a full prefill.
/new | Start a fresh session from the system prompt.

Native agent benchmark ladder

Do not benchmark ds4-agent only by tokens per second. The win condition is pass-per-minute on real work: valid tools, correct edits, no stale-context damage, and fewer repeated prefills.
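The text does not define pass-per-minute precisely. One plausible reading, assumed here, is passing tasks divided by total wall time in minutes, so failed attempts and retries still cost time but earn nothing. The helper name pass_per_minute is hypothetical.

```shell
# Hypothetical metric helper: passes per minute of total wall time.
# Assumes "pass" has already been decided by checking the final file state.
pass_per_minute() (
  passes=$1          # number of tasks that ended in a correct final state
  wall_seconds=$2    # total wall time across all attempts, including failures
  awk -v p="$passes" -v s="$wall_seconds" 'BEGIN {
    if (s <= 0) { print "n/a"; exit }
    printf "%.2f\n", p * 60 / s
  }'
)

# Example: 3 passing tasks in 120 seconds of total wall time.
pass_per_minute 3 120   # prints 1.50
```

A run that generates tokens quickly but needs repeated prefills or produces wrong edits scores lower on this measure than a slower run that passes on the first attempt.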

Step | Prompt | Record
Build smoke | ./ds4-agent --help | Commit, binary hash, wall time, and whether help exits cleanly.
Prompt smoke | ./ds4-agent --nothink -p "reply with exactly OK" | TTFT, wall time, output correctness, model path.
Tiny file edit | Create a one-file README in a disposable directory. | Final file state, wrong-file edits, invalid tool calls, retries.
Multi-file docs task | Add one section and update a local index in a throwaway docs repo. | Pass/fail, changed files, total wall time, tool rounds.
Fixture wiki task | Add one raw note, compile one article, append log in a disposable wiki. | Frontmatter validity, index update, log append, source grounding.

Keep native-agent numbers separate from server-client numbers. A clean --help or OK result proves installation, not coding-agent readiness.

5. Claude Code profile

Claude Code can call ds4 directly through the Anthropic-compatible endpoint. Keep it as a side command:

claude-ds4
cds4

The important environment is:

export ANTHROPIC_BASE_URL=http://127.0.0.1:8000
export ANTHROPIC_API_KEY=local-test-key
export ANTHROPIC_AUTH_TOKEN=local-test-key
export ANTHROPIC_CUSTOM_MODEL_OPTION=deepseek-v4-flash

Set both ANTHROPIC_API_KEY and ANTHROPIC_AUTH_TOKEN. Bare mode expects API-key style auth; the auth-token variable keeps compatibility with local-gateway patterns.
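The claude-ds4 wrapper itself is not shown on this page. A minimal sketch of such a side command, assuming it only needs to set the four variables above, could scope them to a subshell so the default claude command and the calling shell stay untouched. The function name claude_ds4_sketch is hypothetical.

```shell
# Hypothetical side-profile wrapper. The parentheses make the function body
# a subshell, so the exports never leak into the shell that calls it.
claude_ds4_sketch() (
  export ANTHROPIC_BASE_URL=http://127.0.0.1:8000
  export ANTHROPIC_API_KEY=local-test-key
  export ANTHROPIC_AUTH_TOKEN=local-test-key
  export ANTHROPIC_CUSTOM_MODEL_OPTION=deepseek-v4-flash
  claude "$@"   # forwards every argument to the normal claude command
)
```

Because the variables live only inside the subshell, running plain claude afterwards in the same terminal still uses the default configuration.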

Smoke test:

claude-ds4 --bare --no-session-persistence --tools '' \
  --system-prompt 'Reply only with the final answer.' \
  -p 'reply with exactly ok'

6. Codex profile

Current Codex custom providers expect the OpenAI Responses API. Since ds4 exposes Chat Completions and Anthropic Messages, the Codex side profile needs a small local adapter.

Codex -> local Responses adapter -> ds4 Chat Completions -> DeepSeek V4 Flash

Keep the adapter profile-local. Do not hide it inside global Codex config.

Smoke test:

codex-ds4 exec \
  --ephemeral \
  --skip-git-repo-check \
  -s read-only \
  --ignore-rules \
  'reply with exactly ok'

Tool schemas change the speed profile. Plain chat can be fast. A coding-capable Codex profile sends the agent contract and tool schemas, which can add thousands of prompt tokens before your actual prompt starts.
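To get a feel for what that overhead costs, a back-of-envelope estimate divides the extra prompt tokens by a prefill rate. The sketch below is illustrative only: the 2000-token overhead is a made-up figure, the rate is borrowed from the earlier smoke-test numbers, and the estimate ignores any KV cache reuse. The helper name prefill_seconds is hypothetical.

```shell
# Hypothetical estimate: seconds of cold prefill for a given token count.
prefill_seconds() (
  tokens=$1   # prompt tokens added by the agent contract and tool schemas
  rate=$2     # prefill speed in tokens per second
  awk -v t="$tokens" -v r="$rate" 'BEGIN {
    if (r <= 0) { print "n/a"; exit }
    printf "%.1f\n", t / r
  }'
)

# Example with an assumed 2000-token overhead at 61.21 t/s prefill.
prefill_seconds 2000 61.21   # prints 32.7
```

Measure the real overhead for your own profile before drawing conclusions; the point is only that a few thousand schema tokens can dominate a short request.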

7. pi-ds4 profile

Pi can call ds4 through the upstream mitsuhiko/pi-ds4 extension. Keep that extension installed in a separate agent directory so the default Pi profile and cloud provider state stay untouched. Use this as a local chat and transport profile first; keep file-editing agents on the Claude/Codex ds4 profiles until direct ds4 tool requests are reliable.

Pi -> ds4 OpenAI Chat Completions -> DeepSeek V4 Flash

Use a side command:

pi-ds4

The default public Pi path is Armin Ronacher's mitsuhiko/pi-ds4 extension. It registers ds4/deepseek-v4-flash, starts ds4-server on demand, downloads/builds the runtime if needed, keeps a per-Pi-process lease, and exposes /ds4 for logs.

pi install https://github.com/mitsuhiko/pi-ds4
pi --model ds4/deepseek-v4-flash --thinking off -p 'reply with OK'

The agent-stack-bootstrap wrapper keeps that upstream extension in an isolated Pi state directory, launches the public profile with --no-tools, and gives it explicit side commands:

pi-ds4-install
pi-ds4 -p 'reply with OK'

The older custom ds4-tools.ts scaffold-guard setup is not the default public path. It is preserved on the archive/pi-ds4-custom-guard branch for reliability experiments.

Smoke test:

pi-ds4 -p 'reply with OK'

In print mode, current upstream pi-ds4 can show DeepSeek-style reasoning text before the final answer even when Pi is launched with --thinking off. Treat this as a transport smoke test: the important signal is that the command reaches the local ds4 server and the output ends with OK. In local testing, adding OpenAI-style tools to direct ds4 requests stalled the server, so the public wrapper keeps tools disabled by default.

For benchmark runs where the sandbox wrapper itself is being measured, keep an explicit no-nono bypass profile rather than weakening the normal profile:

pi-ds4-rawdog -p 'reply with OK'
pi-ds4-bench

8. Benchmarks

Benchmark the raw endpoint, a minimal CLI call, and a full coding-agent call separately.

For the current public comparison against frontier coding agents, see ds4-agent vs Codex frontier. The short version: native ds4-agent is worth benchmarking, but there is not yet a public apples-to-apples result that proves it beats Codex frontier.

CPU benchmark caveat: a Python Sieve of Eratosthenes sanity check favors the M5 Max over DGX Spark, but it measures scalar CPU/Python speed, not model inference. In a repeated naive list-sieve run to 1,000,000, the M5 measured 0.018429 s median versus Spark at 0.030137 s median. In an optimized bytearray variant, the M5 measured 0.002352 s versus Spark at 0.003930 s.

Path | Warm latency | What it measures
Raw ds4 OpenAI chat | ~123 ms | Model endpoint only
Raw ds4 Anthropic messages | ~121 ms | Model endpoint only
ds4-agent --help | ~200 ms | Native-agent install smoke only; no model loaded
claude-ds4 bare/no-tools | ~2.0 s | Claude CLI plus local endpoint
pi-ds4-bench no-tools smoke test | ~3-4 s | Pi CLI plus upstream extension, no nono, no tools
Direct ds4 tool request | Stalled over 30 s | OpenAI-style tools request to ds4-server; keep disabled by default
codex-ds4 coding profile | ~25-26 s | Full Codex agent prompt plus tool schemas
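How the warm-latency figures above were collected is not stated. One way to gather comparable numbers, sketched here as an assumption, is to time several warm requests and report the median rather than a single run. The helper name median is hypothetical; curl's -w '%{time_total}' write-out variable supplies the per-request timing.

```shell
# Hypothetical helper: median of one number per line on stdin.
median() {
  LC_ALL=C sort -n | awk '{ v[NR] = $1 } END {
    if (NR == 0) exit 1
    if (NR % 2) print v[(NR + 1) / 2]
    else print (v[NR / 2] + v[NR / 2 + 1]) / 2
  }'
}

# Usage sketch (needs a running ds4-server, so it is left commented out):
#   for i in 1 2 3 4 5; do
#     curl -s -o /dev/null -w '%{time_total}\n' http://127.0.0.1:8000/v1/models
#   done | median
```

Run one untimed request first so the measured samples reflect warm behavior, and keep the samples for each path (raw endpoint, minimal CLI, full agent) in separate files.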

Legacy manual Pi context sweep

These numbers are from the earlier manual models.json + custom extension profile, not the upstream mitsuhiko/pi-ds4 path. They are useful as a baseline for Pi harness overhead, but the default public setup now starts with the upstream extension.

Setting | Pass | Avg | P50 | OK avg | Read avg | Edit avg
16k ctx / 2048 max | 6/6 | 13.8 s | 9.7 s | 8.6 s | 9.7 s | 23.2 s
32k ctx / 2048 max | 6/6 | 16.4 s | 12.0 s | 9.3 s | 12.0 s | 27.7 s
32k ctx / 4096 max | 6/6 | 19.9 s | 15.5 s | 11.8 s | 15.5 s | 32.4 s
64k ctx / 4096 max | 6/6 | 21.0 s | 16.7 s | 13.4 s | 16.7 s | 32.8 s

Legacy benchmark summary: 16k/2048 was the fastest fully passing manual profile. 32k/2048 also passed, but averaged about 18% slower. 32k/4096 and 64k/4096 were slower with no reliability gain. The raw endpoint is fast; coding-agent calls are slower because they carry the agent contract and tool schemas.
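The relative slowdown can be recomputed from the Avg column. The sketch below uses the rounded table values, so the result can differ from the summary figure in the last digit. The helper name pct_slower is hypothetical.

```shell
# Hypothetical helper: how much slower a candidate average is than a baseline,
# as a percentage of the baseline.
pct_slower() (
  baseline=$1    # e.g. the 16k/2048 average in seconds
  candidate=$2   # e.g. the 32k/2048 average in seconds
  awk -v b="$baseline" -v c="$candidate" 'BEGIN {
    if (b <= 0) { print "n/a"; exit }
    printf "%.1f\n", (c - b) / b * 100
  }'
)

# 32k/2048 (16.4 s) against 16k/2048 (13.8 s), using the rounded averages.
pct_slower 13.8 16.4   # prints 18.8
```

The same call with the 4096-max rows shows why they were dropped as defaults: they cost noticeably more wall time while the pass rate stayed at 6/6.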

How to compare ds4-agent fairly

Use the same tasks for every path and record pass rate before speed. A useful comparison row includes runtime, model, context, quant, prompt kind, pass/fail, median wall time, total wall time, and whether the final file or wiki state is correct.

# server path
./bench-local-ai/run-ds4.sh 3 --timeout 240

# native path, run from a disposable repo first
cd /tmp/ds4-agent-fixture
~/src-repo/ds4/ds4-agent \
  -m ~/src-repo/ds4/ds4flash.gguf \
  --ctx 100000 \
  --nothink \
  --trace /tmp/ds4-agent-fixture.trace

Trace files are useful for debugging malformed tools and stale edits, but they can contain prompts, file contents, tool outputs, and generated text. Keep them out of public repos.
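To keep comparison rows consistent across paths, each task can be wrapped in a small recorder that emits one tab-separated line. This is a sketch under stated assumptions: bench_row is a hypothetical helper, and the wrapped command is expected to run the task and then verify the final file or wiki state itself, exiting 0 only when that state is correct.

```shell
# Hypothetical recorder: prints "label<TAB>pass|fail<TAB>wall_seconds".
# Pass/fail comes from the exit status of the wrapped command, which should
# include its own check of the final state, not just the agent's exit code.
bench_row() (
  label=$1
  shift
  start=$(date +%s)
  if "$@" >/dev/null 2>&1; then status=pass; else status=fail; fi
  end=$(date +%s)
  printf '%s\t%s\t%s\n' "$label" "$status" "$((end - start))"
)

# Usage sketch with a hypothetical task script:
#   bench_row "native/tiny-file-edit" ./run-and-verify-tiny-edit.sh >> results.tsv
```

Whole-second resolution from date is coarse for the raw endpoint but adequate for coding-agent tasks that take tens of seconds. Add columns for runtime, model, context, and quant by extending the printf line.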

9. Privacy and gotchas

  • Keep ds4-server bound to 127.0.0.1 unless you have deliberate authentication and firewall rules.
  • Leave trace logging off by default; traces can persist prompts, outputs, and tool calls.
  • Use named side profiles. Do not override claude, codex, or pi.
  • Do not publish local usernames, hostnames, LAN IPs, known-host fingerprints, or absolute machine paths in guide examples.
  • Do not run installers in parallel if they append to the same shell startup file.

Shell name -> side profile -> ds4 localhost server -> local model
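The loopback rule from the list above can be checked in a launcher before the server starts. The sketch below is a minimal guard, assuming the launcher receives the bind host as a plain string; is_loopback is a hypothetical helper, not part of ds4.

```shell
# Hypothetical guard: succeed only for explicit loopback hosts.
is_loopback() {
  case "$1" in
    127.0.0.1|localhost|::1) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage sketch inside a launcher script (HOST is an illustrative variable):
#   HOST=${HOST:-127.0.0.1}
#   is_loopback "$HOST" || { echo "refusing to bind $HOST" >&2; exit 1; }
#   ./ds4-server --host "$HOST" --port 8000
```

The allow-list is deliberately strict: anything else, including 0.0.0.0 and LAN addresses, is refused until authentication and firewall rules are in place.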