# ds4
DeepSeek V4 Flash on a 128GB Apple Silicon Mac, with side profiles for Claude Code, Codex, and Pi.
## 1. Where ds4 fits
ds4 is a Metal-only inference engine for DeepSeek V4 Flash GGUF files. The practical target is a 128GB Apple Silicon Mac: large enough for the q2 model and fast enough to be worth comparing against smaller local Qwen models.
Use it as a named side engine first. Do not replace your default Claude, Codex, or Pi setup until it survives real tool-use and repo-editing tests.
| Use it for | Avoid it for now |
|---|---|
| Private long-context local chat | Default day-to-day coding edits |
| DeepSeek V4 Flash quality experiments | Multi-user serving |
| Side-by-side local model tests | CUDA or DGX Spark workflows |
| Local wiki/research prompts | Anything with tracing enabled by accident |
## 2. Build and download
Build from source:
```shell
mkdir -p ~/src-repo
git clone https://github.com/antirez/ds4.git ~/src-repo/ds4
cd ~/src-repo/ds4
make
```
Download the q2 model:
```shell
./download_model.sh q2
```
Smoke test the CLI:
```shell
./ds4 --ctx 32768 --nothink --temp 0 -n 16 -p "reply with ok"
```
Known-good local smoke numbers from one setup:
| Prompt | Observed result |
|---|---|
| Short ok prompt | Prefill 28.96 t/s, generation 8.93 t/s |
| 256-token explanation | Prefill 61.21 t/s, generation 38.57 t/s |
| Warm weights | Prefill 76.92 t/s, generation 37.81 t/s |
## 3. Start the server
```shell
mkdir -p ~/.cache/ds4-kv
cd ~/src-repo/ds4
./ds4-server \
  --ctx 100000 \
  --kv-disk-dir ~/.cache/ds4-kv \
  --kv-disk-space-mb 8192 \
  --host 127.0.0.1 \
  --port 8000
```
| Setting | Start with | Why |
|---|---|---|
| Quant | q2 | The practical 128GB target |
| Server context | 100000 | Large enough for direct long-context experiments |
| Codex client context | 80000 | Keeps Codex below the observed Metal KV decode ceiling |
| Pi ds4 path | mitsuhiko/pi-ds4 with --no-tools | Stable chat/transport path; tool mode is still experimental |
| Thinking | Off for speed tests | Compare clean latency first |
| Disk KV | On | Helps repeated agent prompts |
| Trace | Off | Trace can persist prompts and outputs |
Do not let coding-agent clients consume the full advertised 100k window by default unless the client harness has been tested there. On this setup, a Codex request around 97k prompt tokens completed prefill and then failed during decode with a Metal compressed KV cache capacity error. Use an 80k advertised client window and compact around 64k for Codex side profiles. For Pi, prefer the upstream mitsuhiko/pi-ds4 extension first, then benchmark real tasks before overriding its model metadata.
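The window math above can be sanity-checked with a small script. This is a sketch using the numbers from this guide (100k server context, 80k Codex window, 64k compaction point, the ~97k observed failure); they are observations from one setup, not universal limits:

```shell
# Sketch: confirm the client window and compaction threshold leave
# headroom below the server context and the observed decode failure.
SERVER_CTX=100000     # advertised by ds4-server
CLIENT_CTX=80000      # advertised to the Codex client
COMPACT_AT=64000      # compact Codex history around here
OBSERVED_FAIL=97000   # prompt size that hit the Metal KV decode error
if [ "$CLIENT_CTX" -lt "$OBSERVED_FAIL" ] \
   && [ "$CLIENT_CTX" -lt "$SERVER_CTX" ] \
   && [ "$COMPACT_AT" -lt "$CLIENT_CTX" ]; then
  echo "ctx budget ok"
else
  echo "ctx budget too tight"
fi
```

Rerun the check whenever you raise a client window, and keep the client well under the largest prompt size you have actually seen decode cleanly.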
Verify models:
```shell
curl -fsS http://127.0.0.1:8000/v1/models
```
Verify OpenAI-compatible chat:
```shell
curl -fsS http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "deepseek-v4-flash",
    "messages": [{"role": "user", "content": "reply with ok"}],
    "max_tokens": 8,
    "temperature": 0,
    "think": false,
    "stream": false
  }'
```
Verify Anthropic-compatible messages:
```shell
curl -fsS http://127.0.0.1:8000/v1/messages \
  -H 'Content-Type: application/json' \
  -H 'anthropic-version: 2023-06-01' \
  -H 'x-api-key: local-test-key' \
  -d '{
    "model": "deepseek-v4-flash",
    "messages": [{"role": "user", "content": "reply with ok"}],
    "max_tokens": 8,
    "temperature": 0,
    "thinking": {"type": "disabled"},
    "stream": false
  }'
```
## 4. Claude Code profile
Claude Code can call ds4 directly through the Anthropic-compatible endpoint. Keep it as a side command:
```shell
claude-ds4
cds4
```
The important environment variables are:
```shell
export ANTHROPIC_BASE_URL=http://127.0.0.1:8000
export ANTHROPIC_API_KEY=local-test-key
export ANTHROPIC_AUTH_TOKEN=local-test-key
export ANTHROPIC_CUSTOM_MODEL_OPTION=deepseek-v4-flash
```
Set both ANTHROPIC_API_KEY and ANTHROPIC_AUTH_TOKEN. Bare mode expects API-key style auth; the auth-token variable keeps compatibility with local-gateway patterns.
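One way to wire the side command without touching the default `claude` setup is a shell function that scopes the variables to a single invocation. A sketch, assuming a bash/zsh setup and the variable values listed above:

```shell
# Hypothetical wrapper: run the claude CLI against the local ds4 server.
# The assignments apply only to this call, not to the surrounding shell.
claude-ds4() {
  ANTHROPIC_BASE_URL=http://127.0.0.1:8000 \
  ANTHROPIC_API_KEY=local-test-key \
  ANTHROPIC_AUTH_TOKEN=local-test-key \
  ANTHROPIC_CUSTOM_MODEL_OPTION=deepseek-v4-flash \
  claude "$@"
}
```

A `cds4` alias can then point at `claude-ds4`, keeping the plain `claude` command untouched.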
Smoke test:
```shell
claude-ds4 --bare --no-session-persistence --tools '' \
  --system-prompt 'Reply only with the final answer.' \
  -p 'reply with exactly ok'
```
## 5. Codex profile
Current Codex custom providers expect the OpenAI Responses API. Since ds4 exposes Chat Completions and Anthropic Messages, the Codex side profile needs a small local adapter.
Keep the adapter profile-local. Do not hide it inside global Codex config.
Smoke test:
```shell
codex-ds4 exec \
  --ephemeral \
  --skip-git-repo-check \
  -s read-only \
  --ignore-rules \
  'reply with exactly ok'
```
## 6. pi-ds4 profile
Pi can call ds4 through the upstream mitsuhiko/pi-ds4 extension. Keep that extension installed in a separate agent directory so the default Pi profile and cloud provider state stay untouched. Use this as a local chat and transport profile first; keep file-editing agents on the Claude/Codex ds4 profiles until direct ds4 tool requests are reliable.
Use a side command:
```shell
pi-ds4
```
The default public Pi path is Armin Ronacher's mitsuhiko/pi-ds4 extension. It registers ds4/deepseek-v4-flash, starts ds4-server on demand, downloads/builds the runtime if needed, keeps a per-Pi-process lease, and exposes /ds4 for logs.
```shell
pi install https://github.com/mitsuhiko/pi-ds4
pi --model ds4/deepseek-v4-flash --thinking off -p 'reply with OK'
```
The agent-stack-bootstrap wrapper keeps that upstream extension in an isolated Pi state directory, launches the public profile with --no-tools, and gives it explicit side commands:
- `install.sh` installs the profile snippets and sandbox profile.
- `profiles/pi-ds4/aliases.zsh` exposes `pi-ds4-install`, `pi-ds4`, `pi-ds4-rawdog`, `pi-ds4-direct`, and `pi-ds4-bench`.
- `bondage.conf.template` shows the normal sandboxed profile and the explicit benchmark bypass profile.
```shell
pi-ds4-install
pi-ds4 -p 'reply with OK'
```
The older custom ds4-tools.ts scaffold-guard setup is not the default public path. It is preserved on the archive/pi-ds4-custom-guard branch for reliability experiments.
Smoke test:
```shell
pi-ds4 -p 'reply with OK'
```
In print mode, current upstream pi-ds4 can show DeepSeek-style reasoning text before the final answer even when Pi is launched with --thinking off. Treat this as a transport smoke test: the important signal is that the command reaches the local ds4 server and the output ends with OK. In local testing, adding OpenAI-style tools to direct ds4 requests stalled the server, so the public wrapper keeps tools disabled by default.
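That pass condition can be made mechanical with a small filter. A sketch; `check_transport` is a name invented here, not part of the pi-ds4 tooling:

```shell
# Pass when the last non-empty line of stdin is exactly OK, so leaked
# reasoning text before the final answer does not fail the smoke test.
check_transport() {
  awk 'NF { line = $0 } END { exit (line == "OK") ? 0 : 1 }'
}
```

Usage: `pi-ds4 -p 'reply with OK' | check_transport && echo transport-ok`.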
For benchmark runs where the sandbox wrapper itself is being measured, keep an explicit no-nono bypass profile rather than weakening the normal profile:
```shell
pi-ds4-rawdog -p 'reply with OK'
pi-ds4-bench
```
## 7. Benchmarks
Benchmark the raw endpoint, a minimal CLI call, and a full coding-agent call separately.
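A minimal warm-latency probe for any of these paths looks roughly like the sketch below. `bench` is a name invented here, and `python3` is only used as a portable millisecond clock:

```shell
# Run a command N times (default 3) and print the average wall-clock ms.
# Point it at whichever path you are measuring: a raw curl, a CLI smoke
# test, or a full agent invocation.
bench() {
  cmd=$1
  runs=${2:-3}
  total=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    start=$(python3 -c 'import time; print(int(time.time() * 1000))')
    sh -c "$cmd" >/dev/null 2>&1
    end=$(python3 -c 'import time; print(int(time.time() * 1000))')
    total=$((total + end - start))
    i=$((i + 1))
  done
  echo "avg_ms=$((total / runs))"
}
```

For example: `bench 'curl -fsS http://127.0.0.1:8000/v1/models' 5`. Run one throwaway warm-up call first so model load time does not pollute the average.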
| Path | Warm latency | What it measures |
|---|---|---|
| Raw ds4 OpenAI chat | ~123 ms | Model endpoint only |
| Raw ds4 Anthropic messages | ~121 ms | Model endpoint only |
| claude-ds4 bare/no-tools | ~2.0 s | Claude CLI plus local endpoint |
| pi-ds4-bench no-tools smoke test | ~3-4 s | Pi CLI plus upstream extension, no nono, no tools |
| Direct ds4 tool request | stalled over 30 s | OpenAI-style tools request to ds4-server; keep disabled by default |
| codex-ds4 coding profile | ~25-26 s | Full Codex agent prompt plus tool schemas |
### Legacy manual Pi context sweep
These numbers are from the earlier manual models.json + custom extension profile, not the upstream mitsuhiko/pi-ds4 path. They are useful as a baseline for Pi harness overhead, but the default public setup now starts with the upstream extension.
| Setting | Pass | Avg | P50 | OK avg | Read avg | Edit avg |
|---|---|---|---|---|---|---|
| 16k ctx / 2048 max | 6/6 | 13.8 s | 9.7 s | 8.6 s | 9.7 s | 23.2 s |
| 32k ctx / 2048 max | 6/6 | 16.4 s | 12.0 s | 9.3 s | 12.0 s | 27.7 s |
| 32k ctx / 4096 max | 6/6 | 19.9 s | 15.5 s | 11.8 s | 15.5 s | 32.4 s |
| 64k ctx / 4096 max | 6/6 | 21.0 s | 16.7 s | 13.4 s | 16.7 s | 32.8 s |
Legacy benchmark summary: 16k/2048 was the fastest fully passing manual profile. 32k/2048 also passed, but averaged about 18% slower. 32k/4096 and 64k/4096 were slower with no reliability gain. The raw endpoint is fast; coding-agent calls are slower because they carry the agent contract and tool schemas.
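The slowdown figure follows directly from the sweep averages; a quick arithmetic check, not new data:

```shell
# 32k/2048 average vs 16k/2048 average from the sweep table:
# 16.4 s over 13.8 s, as a percentage slowdown.
python3 -c 'print(f"{(16.4 / 13.8 - 1) * 100:.1f}% slower")'
```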
## 8. Privacy and gotchas
- Keep `ds4-server` bound to `127.0.0.1` unless you have deliberate authentication and firewall rules.
- Leave trace logging off by default; traces can persist prompts, outputs, and tool calls.
- Use named side profiles. Do not override `claude`, `codex`, or `pi`.
- Do not publish local usernames, hostnames, LAN IPs, known-host fingerprints, or absolute machine paths in guide examples.
- Do not run installers in parallel if they append to the same shell startup file.