ds4

DeepSeek V4 Flash on a 128GB Apple Silicon Mac, with side profiles for Claude Code, Codex, and Pi.

1. Where ds4 fits

ds4 is a Metal-only inference engine for DeepSeek V4 Flash GGUF files. The practical target is a 128GB Apple Silicon Mac: large enough for the q2 model and fast enough to be worth comparing against smaller local Qwen models.

Use it as a named side engine first. Do not replace your default Claude, Codex, or Pi setup until it survives real tool-use and repo-editing tests.

Use it for                            | Avoid it for now
Private long-context local chat       | Default day-to-day coding edits
DeepSeek V4 Flash quality experiments | Multi-user serving
Side-by-side local model tests        | CUDA or DGX Spark workflows
Local wiki/research prompts           | Anything with tracing enabled by accident

2. Build and download

Build from source:

mkdir -p ~/src-repo
git clone https://github.com/antirez/ds4.git ~/src-repo/ds4
cd ~/src-repo/ds4
make

Download the q2 model:

./download_model.sh q2

Do not start with q4 on a 128GB Mac. The q4 file is too large once runtime overhead is included. Start with q2.

Smoke test the CLI:

./ds4 --ctx 32768 --nothink --temp 0 -n 16 -p "reply with ok"

Known-good local smoke numbers from one setup:

Prompt                | Observed result
Short ok prompt       | Prefill 28.96 t/s, generation 8.93 t/s
256-token explanation | Prefill 61.21 t/s, generation 38.57 t/s
Warm weights          | Prefill 76.92 t/s, generation 37.81 t/s
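Throughput here is just tokens divided by wall-clock seconds, so you can turn your own timings into comparable t/s numbers with a one-line helper. The 16-token / 1.79 s inputs below are illustrative, not measured values from this table:

```shell
# Tokens-per-second helper: tps <token_count> <seconds>
tps() { python3 -c "print(round($1 / $2, 2))"; }

tps 16 1.79   # illustrative inputs, not a measurement from the table above
```

Run it against the token counts and timings your own ds4 build reports to compare setups.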

3. Start the server

mkdir -p ~/.cache/ds4-kv
cd ~/src-repo/ds4
./ds4-server \
  --ctx 100000 \
  --kv-disk-dir ~/.cache/ds4-kv \
  --kv-disk-space-mb 8192 \
  --host 127.0.0.1 \
  --port 8000
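ds4-server takes a moment to map the weights before it answers requests, so a small readiness poll is useful before pointing clients at it. This is a generic sketch (the `wait_ready` helper is not part of ds4); the curl probe matches the /v1/models check used later in this section:

```shell
# Retry a probe command once per second until it succeeds or the
# attempt budget runs out. Returns 0 on success, 1 on timeout.
wait_ready() {
  cmd=$1; tries=${2:-30}
  for i in $(seq "$tries"); do
    if $cmd >/dev/null 2>&1; then return 0; fi
    sleep 1
  done
  return 1
}

# Usage once ds4-server has been launched:
# wait_ready 'curl -fsS http://127.0.0.1:8000/v1/models' 30 && echo ready
```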

Recommended starting settings:

Setting              | Start with                       | Why
Quant                | q2                               | The practical 128GB target
Server context       | 100000                           | Large enough for direct long-context experiments
Codex client context | 80000                            | Keeps Codex below the observed Metal KV decode ceiling
Pi ds4 path          | mitsuhiko/pi-ds4 with --no-tools | Stable chat/transport path; tool mode is still experimental
Thinking             | Off for speed tests              | Compare clean latency first
Disk KV              | On                               | Helps repeated agent prompts
Trace                | Off                              | Trace can persist prompts and outputs

Do not let coding-agent clients consume the full advertised 100k window by default unless the client harness has been tested there. On this setup, a Codex request around 97k prompt tokens completed prefill and then failed during decode with a Metal compressed KV cache capacity error. Use an 80k advertised client window and compact around 64k for Codex side profiles. For Pi, prefer the upstream mitsuhiko/pi-ds4 extension first, then benchmark real tasks before overriding its model metadata.
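The 80k/64k split above amounts to a simple budget rule: advertise a window safely under the server context, and compact at roughly 80% of it. A sketch, assuming that 80% ratio (the `compact_at` helper is hypothetical, not a ds4 or Codex flag):

```shell
# Given an advertised client window, compute a compaction threshold at 80%.
compact_at() { python3 -c "print(int($1 * 0.8))"; }

compact_at 80000   # 80k advertised window -> compact around 64k
```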

Verify models:

curl -fsS http://127.0.0.1:8000/v1/models

Verify OpenAI-compatible chat:

curl -fsS http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "deepseek-v4-flash",
    "messages": [{"role": "user", "content": "reply with ok"}],
    "max_tokens": 8,
    "temperature": 0,
    "think": false,
    "stream": false
  }'
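To pull the assistant text out of a Chat Completions-shaped reply without installing jq, the Python stdlib works fine. The sample JSON below is a hand-written stand-in for a real ds4 response, trimmed to the fields being read:

```shell
# Stand-in response with the Chat Completions shape (choices -> message -> content).
RESPONSE='{"choices":[{"message":{"role":"assistant","content":"ok"}}]}'

printf '%s' "$RESPONSE" | python3 -c \
  'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
```

Pipe the real curl output through the same python3 one-liner when scripting smoke tests.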

Verify Anthropic-compatible messages:

curl -fsS http://127.0.0.1:8000/v1/messages \
  -H 'Content-Type: application/json' \
  -H 'anthropic-version: 2023-06-01' \
  -H 'x-api-key: local-test-key' \
  -d '{
    "model": "deepseek-v4-flash",
    "messages": [{"role": "user", "content": "reply with ok"}],
    "max_tokens": 8,
    "temperature": 0,
    "thinking": {"type": "disabled"},
    "stream": false
  }'
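The Anthropic-compatible reply has a different shape: content arrives as a list of typed blocks rather than a single string. Again, the sample JSON is a hand-written stand-in trimmed to the fields being read:

```shell
# Stand-in response with the Anthropic Messages shape (content -> list of blocks).
RESPONSE='{"content":[{"type":"text","text":"ok"}]}'

printf '%s' "$RESPONSE" | python3 -c \
  'import json, sys; print(json.load(sys.stdin)["content"][0]["text"])'
```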

4. Claude Code profile

Claude Code can call ds4 directly through the Anthropic-compatible endpoint. Keep it as a side command:

claude-ds4
cds4

The important environment variables are:

export ANTHROPIC_BASE_URL=http://127.0.0.1:8000
export ANTHROPIC_API_KEY=local-test-key
export ANTHROPIC_AUTH_TOKEN=local-test-key
export ANTHROPIC_CUSTOM_MODEL_OPTION=deepseek-v4-flash

Set both ANTHROPIC_API_KEY and ANTHROPIC_AUTH_TOKEN. Bare mode expects API-key style auth; the auth-token variable keeps compatibility with local-gateway patterns.
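One way to package this as a named side command is a shell function that scopes the environment to a single invocation instead of exporting it globally. This is a sketch for a bash/zsh rc file, assuming a `claude` binary on PATH (hyphenated function names work in bash and zsh but are not POSIX):

```shell
# Side command: run Claude Code against the local ds4 server without
# touching the default claude environment.
claude-ds4() {
  ANTHROPIC_BASE_URL=http://127.0.0.1:8000 \
  ANTHROPIC_API_KEY=local-test-key \
  ANTHROPIC_AUTH_TOKEN=local-test-key \
  ANTHROPIC_CUSTOM_MODEL_OPTION=deepseek-v4-flash \
  command claude "$@"
}
alias cds4=claude-ds4
```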

Smoke test:

claude-ds4 --bare --no-session-persistence --tools '' \
  --system-prompt 'Reply only with the final answer.' \
  -p 'reply with exactly ok'

5. Codex profile

Current Codex custom providers expect the OpenAI Responses API. Since ds4 exposes Chat Completions and Anthropic Messages, the Codex side profile needs a small local adapter.

Codex -> local Responses adapter -> ds4 Chat Completions -> DeepSeek V4 Flash

Keep the adapter profile-local. Do not hide it inside global Codex config.

Smoke test:

codex-ds4 exec \
  --ephemeral \
  --skip-git-repo-check \
  -s read-only \
  --ignore-rules \
  'reply with exactly ok'

Tool schemas change the speed profile. Plain chat can be fast. A coding-capable Codex profile sends the agent contract and tool schemas, which can add thousands of prompt tokens before your actual prompt starts.
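To get a rough feel for that overhead, a chars/4 estimate over the tool-schema JSON your profile sends is usually close enough. The 4-characters-per-token ratio is a common rough heuristic, not the actual DeepSeek tokenizer, and the sample schema below is a made-up fragment:

```shell
# Rough token estimate: characters on stdin divided by 4.
est_tokens() { python3 -c 'import sys; print(len(sys.stdin.read()) // 4)'; }

# Made-up tool-schema fragment; pipe your profile's real schema payload instead.
printf '%s' '{"tools":[{"name":"read_file","description":"Read a file"}]}' | est_tokens
```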

6. pi-ds4 profile

Pi can call ds4 through the upstream mitsuhiko/pi-ds4 extension. Keep that extension installed in a separate agent directory so the default Pi profile and cloud provider state stay untouched. Use this as a local chat and transport profile first; keep file-editing agents on the Claude/Codex ds4 profiles until direct ds4 tool requests are reliable.

Pi -> ds4 OpenAI Chat Completions -> DeepSeek V4 Flash

Use a side command:

pi-ds4

The default public Pi path is Armin Ronacher's mitsuhiko/pi-ds4 extension. It registers ds4/deepseek-v4-flash, starts ds4-server on demand, downloads/builds the runtime if needed, keeps a per-Pi-process lease, and exposes /ds4 for logs.

pi install https://github.com/mitsuhiko/pi-ds4
pi --model ds4/deepseek-v4-flash --thinking off -p 'reply with OK'

The agent-stack-bootstrap wrapper keeps that upstream extension in an isolated Pi state directory, launches the public profile with --no-tools, and gives it explicit side commands:

pi-ds4-install
pi-ds4 -p 'reply with OK'

The older custom ds4-tools.ts scaffold-guard setup is not the default public path. It is preserved on the archive/pi-ds4-custom-guard branch for reliability experiments.

Smoke test:

pi-ds4 -p 'reply with OK'

In print mode, current upstream pi-ds4 can show DeepSeek-style reasoning text before the final answer even when Pi is launched with --thinking off. Treat this as a transport smoke test: the important signal is that the command reaches the local ds4 server and the output ends with OK. In local testing, adding OpenAI-style tools to direct ds4 requests stalled the server, so the public wrapper keeps tools disabled by default.

For benchmark runs where the sandbox wrapper itself is being measured, keep an explicit no-nono bypass profile rather than weakening the normal profile:

pi-ds4-rawdog -p 'reply with OK'
pi-ds4-bench

7. Benchmarks

Benchmark the raw endpoint, a minimal CLI call, and a full coding-agent call separately.
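The three paths can share one timing harness. This is a sketch, not part of ds4: `python3` supplies sub-second timestamps because macOS `date` lacks `%N`, and each measurement includes a few tens of milliseconds of interpreter startup, which is negligible for multi-second agent calls:

```shell
# Run a command BENCH_RUNS times (default 3) and print the average
# wall-clock seconds. Replace `true` with the command under test,
# e.g. a claude-ds4, pi-ds4, or codex-ds4 smoke prompt.
bench() {
  n=${BENCH_RUNS:-3}; total=0
  for i in $(seq "$n"); do
    t0=$(python3 -c 'import time; print(time.time())')
    "$@" >/dev/null 2>&1
    t1=$(python3 -c 'import time; print(time.time())')
    total=$(python3 -c "print($total + $t1 - $t0)")
  done
  python3 -c "print(round($total / $n, 3))"
}

bench true
```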

Path                             | Warm latency      | What it measures
Raw ds4 OpenAI chat              | ~123 ms           | Model endpoint only
Raw ds4 Anthropic messages       | ~121 ms           | Model endpoint only
claude-ds4 bare/no-tools         | ~2.0 s            | Claude CLI plus local endpoint
pi-ds4-bench no-tools smoke test | ~3-4 s            | Pi CLI plus upstream extension, no nono, no tools
Direct ds4 tool request          | stalled over 30 s | OpenAI-style tools request to ds4-server; keep disabled by default
codex-ds4 coding profile         | ~25-26 s          | Full Codex agent prompt plus tool schemas

Legacy manual Pi context sweep

These numbers are from the earlier manual models.json + custom extension profile, not the upstream mitsuhiko/pi-ds4 path. They are useful as a baseline for Pi harness overhead, but the default public setup now starts with the upstream extension.

Setting            | Pass | Avg    | P50    | OK avg | Read avg | Edit avg
16k ctx / 2048 max | 6/6  | 13.8 s | 9.7 s  | 8.6 s  | 9.7 s    | 23.2 s
32k ctx / 2048 max | 6/6  | 16.4 s | 12.0 s | 9.3 s  | 12.0 s   | 27.7 s
32k ctx / 4096 max | 6/6  | 19.9 s | 15.5 s | 11.8 s | 15.5 s   | 32.4 s
64k ctx / 4096 max | 6/6  | 21.0 s | 16.7 s | 13.4 s | 16.7 s   | 32.8 s

Legacy benchmark summary: 16k/2048 was the fastest fully passing manual profile. 32k/2048 also passed, but averaged about 18% slower. 32k/4096 and 64k/4096 were slower with no reliability gain. The raw endpoint is fast; coding-agent calls are slower because they carry the agent contract and tool schemas.
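The "about 18% slower" figure is just the ratio of the two fully passing averages from the table, which you can check directly:

```shell
# Percent slowdown of the 32k/2048 profile (16.4 s avg) relative to
# the 16k/2048 profile (13.8 s avg), using the legacy table above.
python3 -c "print(round((16.4 / 13.8 - 1) * 100, 1))"
```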

8. Privacy and gotchas

  • Keep ds4-server bound to 127.0.0.1 unless you have deliberate authentication and firewall rules.
  • Leave trace logging off by default; traces can persist prompts, outputs, and tool calls.
  • Use named side profiles. Do not override claude, codex, or pi.
  • Do not publish local usernames, hostnames, LAN IPs, known-host fingerprints, or absolute machine paths in guide examples.
  • Do not run installers in parallel if they append to the same shell startup file.

Shell name -> side profile -> ds4 localhost server -> local model