ds4
DeepSeek V4 Flash on a 128GB Apple Silicon Mac, with ds4-agent experiments and side profiles for Claude Code, Codex, and Pi.
1. Where ds4 fits
ds4 is a narrow inference engine for DeepSeek V4 Flash GGUF files, with Metal as the primary Mac path and CUDA support for Linux/DGX Spark. This page focuses on the practical 128GB Apple Silicon Mac target: large enough for the q2-imatrix model and fast enough to be worth comparing against smaller local Qwen models.
Use it as a named side engine first. Do not replace your default Claude, Codex, or Pi setup until it survives real tool-use and repo-editing tests.
| Use it for | Avoid it for now |
|---|---|
| Private long-context local chat | Default day-to-day coding edits |
| DeepSeek V4 Flash quality experiments | Multi-user serving |
| Side-by-side local model tests | Replacing Spark llama.cpp before it wins practical benchmarks |
| Local wiki/research prompts | Anything with tracing enabled by accident |
2. Build and download
Build from source:
mkdir -p ~/src-repo
git clone https://github.com/antirez/ds4.git ~/src-repo/ds4
cd ~/src-repo/ds4
make
Download the q2-imatrix model:
./download_model.sh q2-imatrix
Smoke test the CLI:
./ds4 --ctx 32768 --nothink --temp 0 -n 16 -p "reply with ok"
Known-good local smoke numbers from one earlier q2 setup:
| Prompt | Observed result |
|---|---|
| Short ok prompt | Prefill 28.96 t/s, generation 8.93 t/s |
| 256-token explanation | Prefill 61.21 t/s, generation 38.57 t/s |
| Warm weights | Prefill 76.92 t/s, generation 37.81 t/s |
3. Start the server
mkdir -p ~/.cache/ds4-kv
cd ~/src-repo/ds4
./ds4-server \
--ctx 100000 \
--kv-disk-dir ~/.cache/ds4-kv \
--kv-disk-space-mb 8192 \
--host 127.0.0.1 \
--port 8000
| Setting | Start with | Why |
|---|---|---|
| Quant | q2-imatrix | The current preferred 96/128GB target |
| Server context | 100000 | Large enough for direct long-context experiments |
| Codex client context | 80000 | Keeps Codex below the observed Metal KV decode ceiling |
| Pi ds4 path | mitsuhiko/pi-ds4 with --no-tools | Stable chat/transport path; tool mode is still experimental |
| Thinking | Off for speed tests | Compare clean latency first |
| Disk KV | On | Helps repeated agent prompts |
| Trace | Off | Trace can persist prompts and outputs |
Do not let coding-agent clients consume the full advertised 100k window by default unless the client harness has been tested there. On this setup, a Codex request around 97k prompt tokens completed prefill and then failed during decode with a Metal compressed KV cache capacity error. Use an 80k advertised client window and compact around 64k for Codex side profiles. For Pi, prefer the upstream mitsuhiko/pi-ds4 extension first, then benchmark real tasks before overriding its model metadata.
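The budget rule above can be sketched as a small shell check. This is only an illustration: the function name `ds4_ctx_check` is hypothetical, and the defaults simply mirror the 80k advertised window and 64k compaction point suggested in this guide for Codex side profiles.

```shell
# Hypothetical helper: classify a prompt size against the limits in this guide.
# Usage: ds4_ctx_check TOKENS [WINDOW] [COMPACT_AT]
# Defaults are this guide's Codex suggestions, not values read from ds4 itself.
ds4_ctx_check() (
  tokens=$1
  window=${2:-80000}
  compact_at=${3:-64000}
  if [ "$tokens" -ge "$window" ]; then
    echo over        # at or past the advertised client window
  elif [ "$tokens" -ge "$compact_at" ]; then
    echo compact     # inside the window, but time to compact
  else
    echo ok
  fi
)
```

For example, `ds4_ctx_check 97000` reports `over`, which matches the request size that failed during decode on this setup.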
Verify models:
curl -fsS http://127.0.0.1:8000/v1/models
Verify OpenAI-compatible chat:
curl -fsS http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "deepseek-v4-flash",
"messages": [{"role": "user", "content": "reply with ok"}],
"max_tokens": 8,
"temperature": 0,
"think": false,
"stream": false
}'
Verify Anthropic-compatible messages:
curl -fsS http://127.0.0.1:8000/v1/messages \
-H 'Content-Type: application/json' \
-H 'anthropic-version: 2023-06-01' \
-H 'x-api-key: local-test-key' \
-d '{
"model": "deepseek-v4-flash",
"messages": [{"role": "user", "content": "reply with ok"}],
"max_tokens": 8,
"temperature": 0,
"thinking": {"type": "disabled"},
"stream": false
}'
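Because the server can take a while to load weights, it may help to poll the models endpoint before running the checks above. The following is a minimal sketch; `wait_for_ds4` is a hypothetical name, and the attempt count and delay are arbitrary assumptions rather than recommended values.

```shell
# Hypothetical helper: poll /v1/models until ds4-server answers.
# Usage: wait_for_ds4 [BASE_URL] [ATTEMPTS] [DELAY_SECONDS]
wait_for_ds4() {
  wfd_url=${1:-http://127.0.0.1:8000}
  wfd_attempts=${2:-30}
  wfd_delay=${3:-2}
  wfd_i=1
  while [ "$wfd_i" -le "$wfd_attempts" ]; do
    # -f fails on HTTP errors; --max-time keeps a hung server from blocking the loop
    if curl -fsS --max-time 5 "$wfd_url/v1/models" >/dev/null 2>&1; then
      return 0
    fi
    # Sleep between attempts, but not after the last one
    [ "$wfd_i" -lt "$wfd_attempts" ] && sleep "$wfd_delay"
    wfd_i=$((wfd_i + 1))
  done
  return 1
}
```

A typical use would be `wait_for_ds4 && curl -fsS http://127.0.0.1:8000/v1/models`.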
4. Native ds4-agent
Current upstream antirez/ds4 also builds ds4-agent, an alpha native coding agent. It is different from ds4-server: inference, tool calls, and session state stay inside one process, with the on-disk KV cache acting as the session store. That removes the OpenAI/Anthropic/Responses API boundary, but it also means the agent is its own experimental harness.
For a setup-only walkthrough, use ds4-agent setup. This section keeps the shorter Mac ds4 context and benchmark ladder.
Keep ds4-serve, claude-ds4, codex-ds4, and pi-ds4 as the compatibility profiles until native ds4-agent wins real file-editing tests.
Build and inspect the native agent:
cd ~/src-repo/ds4
git pull --ff-only
make ds4-agent ds4-server
./ds4-agent --help
Run it from the ds4 checkout so the relative metal/*.metal shader files resolve, or export absolute Metal source paths in your wrapper. The agent defaults to ./ds4flash.gguf, so pass -m when your model lives elsewhere:
./ds4-agent \
-m ./ds4flash.gguf \
--ctx 100000 \
--nothink \
--warm-weights
Useful first-session commands:
| Command | Use |
|---|---|
| /help | Show runtime commands. |
| /save | Save the current native-agent session. |
| /list | List saved sessions under ~/.ds4/kvcache. |
| /switch SHA | Switch to a saved session without a full prefill. |
| /new | Start a fresh session from the system prompt. |
Native agent benchmark ladder
Do not benchmark ds4-agent only by tokens per second. The win condition is pass-per-minute on real work: valid tools, correct edits, no stale-context damage, and fewer repeated prefills.
| Step | Prompt | Record |
|---|---|---|
| Build smoke | ./ds4-agent --help | Commit, binary hash, wall time, and whether help exits cleanly. |
| Prompt smoke | ./ds4-agent --nothink -p "reply with exactly OK" | TTFT, wall time, output correctness, model path. |
| Tiny file edit | Create a one-file README in a disposable directory. | Final file state, wrong-file edits, invalid tool calls, retries. |
| Multi-file docs task | Add one section and update a local index in a throwaway docs repo. | Pass/fail, changed files, total wall time, tool rounds. |
| Fixture wiki task | Add one raw note, compile one article, append log in a disposable wiki. | Frontmatter validity, index update, log append, source grounding. |
Keep native-agent numbers separate from server-client numbers. A clean --help or OK result proves installation, not coding-agent readiness.
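To record wall time and pass/fail per ladder step consistently, a small wrapper can append one CSV row per run. This is a sketch under assumptions: `bench_step` and the `BENCH_LOG` variable are hypothetical names, and `date +%s` gives only whole-second resolution, which is coarse but works with the stock macOS tools.

```shell
# Hypothetical helper: run one ladder step and append "label,exit_status,wall_seconds".
# Usage: bench_step LABEL COMMAND [ARGS...]
# BENCH_LOG is an assumed variable; it defaults to ./ds4-bench.csv.
bench_step() {
  bench_label=$1
  shift
  bench_start=$(date +%s)
  "$@"                      # the step itself; its output is left untouched
  bench_status=$?
  bench_end=$(date +%s)
  printf '%s,%s,%s\n' "$bench_label" "$bench_status" "$((bench_end - bench_start))" \
    >> "${BENCH_LOG:-./ds4-bench.csv}"
  return "$bench_status"    # preserve the step's exit status for the caller
}
```

For instance, `bench_step build-smoke ./ds4-agent --help` logs the first row of the ladder. Exit status alone does not show whether an edit was correct, so the final file state still needs a separate check.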
5. Claude Code profile
Claude Code can call ds4 directly through the Anthropic-compatible endpoint. Keep it as a side command:
claude-ds4
cds4
The important environment is:
export ANTHROPIC_BASE_URL=http://127.0.0.1:8000
export ANTHROPIC_API_KEY=local-test-key
export ANTHROPIC_AUTH_TOKEN=local-test-key
export ANTHROPIC_CUSTOM_MODEL_OPTION=deepseek-v4-flash
Set both ANTHROPIC_API_KEY and ANTHROPIC_AUTH_TOKEN. Bare mode expects API-key style auth; the auth-token variable keeps compatibility with local-gateway patterns.
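One way to keep this environment out of the default Claude setup is to scope it to a single command instead of exporting it in the shell startup file. The wrapper below is a hypothetical sketch, not the actual definition of claude-ds4; it only reuses the four variables listed above.

```shell
# Hypothetical wrapper: run any command with the ds4 Anthropic environment.
# env(1) sets the variables for the child process only, so the parent shell
# and the default claude command keep their normal configuration.
with_ds4_anthropic_env() {
  env \
    ANTHROPIC_BASE_URL=http://127.0.0.1:8000 \
    ANTHROPIC_API_KEY=local-test-key \
    ANTHROPIC_AUTH_TOKEN=local-test-key \
    ANTHROPIC_CUSTOM_MODEL_OPTION=deepseek-v4-flash \
    "$@"
}
```

A side command could then be a thin alias along the lines of `with_ds4_anthropic_env claude "$@"`, assuming `claude` is on the PATH.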
Smoke test:
claude-ds4 --bare --no-session-persistence --tools '' \
--system-prompt 'Reply only with the final answer.' \
-p 'reply with exactly ok'
6. Codex profile
Current Codex custom providers expect the OpenAI Responses API. Since ds4 exposes Chat Completions and Anthropic Messages, the Codex side profile needs a small local adapter.
Keep the adapter profile-local. Do not hide it inside global Codex config.
Smoke test:
codex-ds4 exec \
--ephemeral \
--skip-git-repo-check \
-s read-only \
--ignore-rules \
'reply with exactly ok'
7. pi-ds4 profile
Pi can call ds4 through the upstream mitsuhiko/pi-ds4 extension. Keep that extension installed in a separate agent directory so the default Pi profile and cloud provider state stay untouched. Use this as a local chat and transport profile first; keep file-editing agents on the Claude/Codex ds4 profiles until direct ds4 tool requests are reliable.
Use a side command:
pi-ds4
The default public Pi path is Armin Ronacher's mitsuhiko/pi-ds4 extension. It registers ds4/deepseek-v4-flash, starts ds4-server on demand, downloads/builds the runtime if needed, keeps a per-Pi-process lease, and exposes /ds4 for logs.
pi install https://github.com/mitsuhiko/pi-ds4
pi --model ds4/deepseek-v4-flash --thinking off -p 'reply with OK'
The agent-stack-bootstrap wrapper keeps that upstream extension in an isolated Pi state directory, launches the public profile with --no-tools, and gives it explicit side commands:
- install.sh installs the profile snippets and sandbox profile.
- profiles/pi-ds4/aliases.zsh exposes pi-ds4-install, pi-ds4, pi-ds4-rawdog, pi-ds4-direct, and pi-ds4-bench.
- bondage.conf.template shows the normal sandboxed profile and the explicit benchmark bypass profile.
pi-ds4-install
pi-ds4 -p 'reply with OK'
The older custom ds4-tools.ts scaffold-guard setup is not the default public path. It is preserved on the archive/pi-ds4-custom-guard branch for reliability experiments.
Smoke test:
pi-ds4 -p 'reply with OK'
In print mode, current upstream pi-ds4 can show DeepSeek-style reasoning text before the final answer even when Pi is launched with --thinking off. Treat this as a transport smoke test: the important signal is that the command reaches the local ds4 server and the output ends with OK. In local testing, adding OpenAI-style tools to direct ds4 requests stalled the server, so the public wrapper keeps tools disabled by default.
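Since reasoning text may precede the answer, a smoke check should look only at the last non-blank line of the output. The filter below is a minimal sketch of that idea; `last_line_is_ok` is a hypothetical name.

```shell
# Hypothetical check: succeed when the last non-blank line of stdin is exactly OK.
# Earlier lines (for example reasoning text) are ignored.
last_line_is_ok() {
  awk '
    NF { last = $0 }                         # remember the latest non-blank line
    END {
      gsub(/^[ \t\r]+|[ \t\r]+$/, "", last)  # trim surrounding whitespace
      if (last == "OK") exit 0
      exit 1
    }'
}
```

It can be chained onto the smoke test, for example `pi-ds4 -p 'reply with OK' | last_line_is_ok && echo transport-ok`.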
For benchmark runs where the sandbox wrapper itself is being measured, keep an explicit no-nono bypass profile rather than weakening the normal profile:
pi-ds4-rawdog -p 'reply with OK'
pi-ds4-bench
8. Benchmarks
Benchmark the raw endpoint, a minimal CLI call, and a full coding-agent call separately.
For the current public comparison against frontier coding agents, see ds4-agent vs Codex frontier. The short version: native ds4-agent is worth benchmarking, but there is not yet a public apples-to-apples result that proves it beats Codex frontier.
| Path | Warm latency | What it measures |
|---|---|---|
| Raw ds4 OpenAI chat | ~123 ms | Model endpoint only |
| Raw ds4 Anthropic messages | ~121 ms | Model endpoint only |
| ds4-agent --help | ~200 ms | Native-agent install smoke only; no model loaded |
| claude-ds4 bare/no-tools | ~2.0 s | Claude CLI plus local endpoint |
| pi-ds4-bench no-tools smoke test | ~3-4 s | Pi CLI plus upstream extension, no nono, no tools |
| Direct ds4 tool request | stalled over 30 s | OpenAI-style tools request to ds4-server; keep disabled by default |
| codex-ds4 coding profile | ~25-26 s | Full Codex agent prompt plus tool schemas |
Legacy manual Pi context sweep
These numbers are from the earlier manual models.json + custom extension profile, not the upstream mitsuhiko/pi-ds4 path. They are useful as a baseline for Pi harness overhead, but the default public setup now starts with the upstream extension.
| Setting | Pass | Avg | P50 | OK avg | Read avg | Edit avg |
|---|---|---|---|---|---|---|
| 16k ctx / 2048 max | 6/6 | 13.8 s | 9.7 s | 8.6 s | 9.7 s | 23.2 s |
| 32k ctx / 2048 max | 6/6 | 16.4 s | 12.0 s | 9.3 s | 12.0 s | 27.7 s |
| 32k ctx / 4096 max | 6/6 | 19.9 s | 15.5 s | 11.8 s | 15.5 s | 32.4 s |
| 64k ctx / 4096 max | 6/6 | 21.0 s | 16.7 s | 13.4 s | 16.7 s | 32.8 s |
Legacy benchmark summary: 16k/2048 was the fastest fully passing manual profile. 32k/2048 also passed, but averaged about 19% slower. 32k/4096 and 64k/4096 were slower with no reliability gain. The raw endpoint is fast; coding-agent calls are slower because they carry the agent contract and tool schemas.
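The relative slowdowns can be recomputed from the Avg column of the table, taking the 16k/2048 average of 13.8 s as the baseline. The helper name `slowdown_pct` is hypothetical; the arithmetic is just (avg / baseline - 1) × 100.

```shell
# Hypothetical helper: percentage slowdown of AVG relative to BASE.
# Usage: slowdown_pct BASE AVG
slowdown_pct() {
  awk -v base="$1" -v avg="$2" 'BEGIN { printf "%.1f\n", (avg / base - 1) * 100 }'
}

slowdown_pct 13.8 16.4   # 32k ctx / 2048 max -> 18.8
slowdown_pct 13.8 19.9   # 32k ctx / 4096 max -> 44.2
slowdown_pct 13.8 21.0   # 64k ctx / 4096 max -> 52.2
```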
How to compare ds4-agent fairly
Use the same tasks for every path and record pass rate before speed. A useful comparison row includes runtime, model, context, quant, prompt kind, pass/fail, median wall time, total wall time, and whether the final file or wiki state is correct.
# server path
./bench-local-ai/run-ds4.sh 3 --timeout 240
# native path, run from a disposable repo first
cd /tmp/ds4-agent-fixture
~/src-repo/ds4/ds4-agent \
-m ~/src-repo/ds4/ds4flash.gguf \
--ctx 100000 \
--nothink \
--trace /tmp/ds4-agent-fixture.trace
Trace files are useful for debugging malformed tools and stale edits, but they can contain prompts, file contents, tool outputs, and generated text. Keep them out of public repos.
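For the median wall time in a comparison row, a small helper avoids eyeballing the numbers. This is a sketch only; `median` is a hypothetical function that takes wall times in seconds as arguments and prints the median with two decimals.

```shell
# Hypothetical helper: print the median of the numeric arguments.
# Usage: median 9.7 12.0 15.5
median() {
  [ "$#" -gt 0 ] || return 1
  printf '%s\n' "$@" | LC_ALL=C sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR % 2) printf "%.2f\n", v[(NR + 1) / 2]          # odd count: middle value
      else printf "%.2f\n", (v[NR / 2] + v[NR / 2 + 1]) / 2  # even count: mean of the two middle values
    }'
}
```

With three runs per path, as in the server-path command above, the median is simply the middle run, which makes it less sensitive to one cold-start outlier than the average.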
9. Privacy and gotchas
- Keep ds4-server bound to 127.0.0.1 unless you have deliberate authentication and firewall rules.
- Leave trace logging off by default; traces can persist prompts, outputs, and tool calls.
- Use named side profiles. Do not override claude, codex, or pi.
- Do not publish local usernames, hostnames, LAN IPs, known-host fingerprints, or absolute machine paths in guide examples.
- Do not run installers in parallel if they append to the same shell startup file.
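A launcher script can enforce the loopback rule before it starts the server. The guard below is a hypothetical sketch: `is_loopback_host` is not part of ds4, and it accepts only the three literal loopback spellings shown.

```shell
# Hypothetical guard: succeed only for literal loopback bind addresses.
# Usage: is_loopback_host HOST
is_loopback_host() {
  case $1 in
    127.0.0.1|localhost|::1) return 0 ;;
    *) return 1 ;;
  esac
}
```

A wrapper might call `is_loopback_host "$host" || { echo "refusing non-loopback bind: $host" >&2; exit 1; }` before passing the value to --host.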