ds4

DeepSeek V4 Flash on a 128GB Apple Silicon Mac, with side profiles for Claude Code, Codex, and Pi.

1. Where ds4 fits

ds4 is a Metal-only inference engine for DeepSeek V4 Flash GGUF files. The practical target is a 128GB Apple Silicon Mac: large enough for the q2 model and fast enough to be worth comparing against smaller local Qwen models.

Use it as a named side engine first. Do not replace your default Claude, Codex, or Pi setup until it survives real tool-use and repo-editing tests.

Use it for                            | Avoid it for now
Private long-context local chat       | Default day-to-day coding edits
DeepSeek V4 Flash quality experiments | Multi-user serving
Side-by-side local model tests        | CUDA or DGX Spark workflows
Local wiki/research prompts           | Anything with tracing enabled by accident

2. Build and download

Build from source:

mkdir -p ~/src-repo
git clone https://github.com/antirez/ds4.git ~/src-repo/ds4
cd ~/src-repo/ds4
make

Download the q2 model:

./download_model.sh q2

Do not start with q4 on a 128GB Mac. The q4 file is too large once runtime overhead is included. Start with q2.

Smoke test the CLI:

./ds4 --ctx 32768 --nothink --temp 0 -n 16 -p "reply with ok"

Known-good local smoke numbers from one setup:

Prompt                | Observed result
Short ok prompt       | Prefill 28.96 t/s, generation 8.93 t/s
256-token explanation | Prefill 61.21 t/s, generation 38.57 t/s
Warm weights          | Prefill 76.92 t/s, generation 37.81 t/s
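Throughput here is just tokens divided by wall-clock seconds, so you can turn your own timings into comparable t/s numbers with a one-line helper. The 16-token / 1.79 s inputs below are illustrative, not measured values from this table:

```shell
# Tokens-per-second helper: tps <token_count> <seconds>
tps() { python3 -c "print(round($1 / $2, 2))"; }

tps 16 1.79   # illustrative inputs, not a measurement from the table above
```

Run it against the token counts and timings your own ds4 build reports to compare setups.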

3. Start the server

mkdir -p ~/.cache/ds4-kv
cd ~/src-repo/ds4
./ds4-server \
  --ctx 100000 \
  --kv-disk-dir ~/.cache/ds4-kv \
  --kv-disk-space-mb 8192 \
  --host 127.0.0.1 \
  --port 8000
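ds4-server takes a moment to map the weights before it answers requests, so a small readiness poll is useful before pointing clients at it. This is a generic sketch (the `wait_ready` helper is not part of ds4); the curl probe matches the /v1/models check used later in this section:

```shell
# Retry a probe command once per second until it succeeds or the
# attempt budget runs out. Returns 0 on success, 1 on timeout.
wait_ready() {
  cmd=$1; tries=${2:-30}
  for i in $(seq "$tries"); do
    if $cmd >/dev/null 2>&1; then return 0; fi
    sleep 1
  done
  return 1
}

# Usage once ds4-server has been launched:
# wait_ready 'curl -fsS http://127.0.0.1:8000/v1/models' 30 && echo ready
```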

Recommended starting settings:

Setting              | Start with                       | Why
Quant                | q2                               | The practical 128GB target
Server context       | 100000                           | Large enough for direct long-context experiments
Codex client context | 80000                            | Keeps Codex below the observed Metal KV decode ceiling
Pi ds4 path          | mitsuhiko/pi-ds4 with --no-tools | Stable chat/transport path; tool mode is still experimental
Thinking             | Off for speed tests              | Compare clean latency first
Disk KV              | On                               | Helps repeated agent prompts
Trace                | Off                              | Trace can persist prompts and outputs

Do not let coding-agent clients consume the full advertised 100k window by default unless the client harness has been tested there. On this setup, a Codex request around 97k prompt tokens completed prefill and then failed during decode with a Metal compressed KV cache capacity error. Use an 80k advertised client window and compact around 64k for Codex side profiles. For Pi, prefer the upstream mitsuhiko/pi-ds4 extension first, then benchmark real tasks before overriding its model metadata.
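The 80k/64k split above amounts to a simple budget rule: advertise a window safely under the server context, and compact at roughly 80% of it. A sketch, assuming that 80% ratio (the `compact_at` helper is hypothetical, not a ds4 or Codex flag):

```shell
# Given an advertised client window, compute a compaction threshold at 80%.
compact_at() { python3 -c "print(int($1 * 0.8))"; }

compact_at 80000   # 80k advertised window -> compact around 64k
```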

Verify models:

curl -fsS http://127.0.0.1:8000/v1/models

Verify OpenAI-compatible chat:

curl -fsS http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "deepseek-v4-flash",
    "messages": [{"role": "user", "content": "reply with ok"}],
    "max_tokens": 8,
    "temperature": 0,
    "think": false,
    "stream": false
  }'
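To pull the assistant text out of a Chat Completions-shaped reply without installing jq, the Python stdlib works fine. The sample JSON below is a hand-written stand-in for a real ds4 response, trimmed to the fields being read:

```shell
# Stand-in response with the Chat Completions shape (choices -> message -> content).
RESPONSE='{"choices":[{"message":{"role":"assistant","content":"ok"}}]}'

printf '%s' "$RESPONSE" | python3 -c \
  'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
```

Pipe the real curl output through the same python3 one-liner when scripting smoke tests.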

Verify Anthropic-compatible messages:

curl -fsS http://127.0.0.1:8000/v1/messages \
  -H 'Content-Type: application/json' \
  -H 'anthropic-version: 2023-06-01' \
  -H 'x-api-key: local-test-key' \
  -d '{
    "model": "deepseek-v4-flash",
    "messages": [{"role": "user", "content": "reply with ok"}],
    "max_tokens": 8,
    "temperature": 0,
    "thinking": {"type": "disabled"},
    "stream": false
  }'
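The Anthropic-compatible reply has a different shape: content arrives as a list of typed blocks rather than a single string. Again, the sample JSON is a hand-written stand-in trimmed to the fields being read:

```shell
# Stand-in response with the Anthropic Messages shape (content -> list of blocks).
RESPONSE='{"content":[{"type":"text","text":"ok"}]}'

printf '%s' "$RESPONSE" | python3 -c \
  'import json, sys; print(json.load(sys.stdin)["content"][0]["text"])'
```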

4. Claude Code profile

Claude Code can call ds4 directly through the Anthropic-compatible endpoint. Keep it as a side command:

claude-ds4
cds4

The important environment variables are:

export ANTHROPIC_BASE_URL=http://127.0.0.1:8000
export ANTHROPIC_API_KEY=local-test-key
export ANTHROPIC_AUTH_TOKEN=local-test-key
export ANTHROPIC_CUSTOM_MODEL_OPTION=deepseek-v4-flash

Set both ANTHROPIC_API_KEY and ANTHROPIC_AUTH_TOKEN. Bare mode expects API-key style auth; the auth-token variable keeps compatibility with local-gateway patterns.
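One way to package this as a named side command is a shell function that scopes the environment to a single invocation instead of exporting it globally. This is a sketch for a bash/zsh rc file, assuming a `claude` binary on PATH (hyphenated function names work in bash and zsh but are not POSIX):

```shell
# Side command: run Claude Code against the local ds4 server without
# touching the default claude environment.
claude-ds4() {
  ANTHROPIC_BASE_URL=http://127.0.0.1:8000 \
  ANTHROPIC_API_KEY=local-test-key \
  ANTHROPIC_AUTH_TOKEN=local-test-key \
  ANTHROPIC_CUSTOM_MODEL_OPTION=deepseek-v4-flash \
  command claude "$@"
}
alias cds4=claude-ds4
```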

Smoke test:

claude-ds4 --bare --no-session-persistence --tools '' \
  --system-prompt 'Reply only with the final answer.' \
  -p 'reply with exactly ok'

5. Codex profile

Current Codex custom providers expect the OpenAI Responses API. Since ds4 exposes Chat Completions and Anthropic Messages, the Codex side profile needs a small local adapter.

Codex -> local Responses adapter -> ds4 Chat Completions -> DeepSeek V4 Flash

Keep the adapter profile-local. Do not hide it inside global Codex config.

Smoke test:

codex-ds4 exec \
  --ephemeral \
  --skip-git-repo-check \
  -s read-only \
  --ignore-rules \
  'reply with exactly ok'

Tool schemas change the speed profile. Plain chat can be fast. A coding-capable Codex profile sends the agent contract and tool schemas, which can add thousands of prompt tokens before your actual prompt starts.
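To get a rough feel for that overhead, a chars/4 estimate over the tool-schema JSON your profile sends is usually close enough. The 4-characters-per-token ratio is a common rough heuristic, not the actual DeepSeek tokenizer, and the sample schema below is a made-up fragment:

```shell
# Rough token estimate: characters on stdin divided by 4.
est_tokens() { python3 -c 'import sys; print(len(sys.stdin.read()) // 4)'; }

# Made-up tool-schema fragment; pipe your profile's real schema payload instead.
printf '%s' '{"tools":[{"name":"read_file","description":"Read a file"}]}' | est_tokens
```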

6. pi-ds4 profile

Pi can call ds4 through the upstream mitsuhiko/pi-ds4 extension. Keep that extension installed in a separate agent directory so the default Pi profile and cloud provider state stay untouched. Use this as a local chat and transport profile first; keep file-editing agents on the Claude/Codex ds4 profiles until direct ds4 tool requests are reliable.

Pi -> ds4 OpenAI Chat Completions -> DeepSeek V4 Flash

Use a side command:

pi-ds4

The default public Pi path is Armin Ronacher's mitsuhiko/pi-ds4 extension. It registers ds4/deepseek-v4-flash, starts ds4-server on demand, downloads/builds the runtime if needed, keeps a per-Pi-process lease, and exposes /ds4 for logs.

pi install https://github.com/mitsuhiko/pi-ds4
pi --model ds4/deepseek-v4-flash --thinking off -p 'reply with OK'

The agent-stack-bootstrap wrapper keeps that upstream extension in an isolated Pi state directory, launches the public profile with --no-tools, and gives it explicit side commands:

pi-ds4-install
pi-ds4 -p 'reply with OK'

The older custom ds4-tools.ts scaffold-guard setup is not the default public path. It is preserved on the archive/pi-ds4-custom-guard branch for reliability experiments.

Smoke test:

pi-ds4 -p 'reply with OK'

In print mode, current upstream pi-ds4 can show DeepSeek-style reasoning text before the final answer even when Pi is launched with --thinking off. Treat this as a transport smoke test: the important signal is that the command reaches the local ds4 server and the output ends with OK. In local testing, adding OpenAI-style tools to direct ds4 requests stalled the server, so the public wrapper keeps tools disabled by default.

For benchmark runs where the sandbox wrapper itself is being measured, keep an explicit no-nono bypass profile rather than weakening the normal profile:

pi-ds4-rawdog -p 'reply with OK'
pi-ds4-bench

7. Benchmarks

Benchmark the raw endpoint, a minimal CLI call, and a full coding-agent call separately.
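The three paths can share one timing harness. This is a sketch, not part of ds4: `python3` supplies sub-second timestamps because macOS `date` lacks `%N`, and each measurement includes a few tens of milliseconds of interpreter startup, which is negligible for multi-second agent calls:

```shell
# Run a command BENCH_RUNS times (default 3) and print the average
# wall-clock seconds. Replace `true` with the command under test,
# e.g. a claude-ds4, pi-ds4, or codex-ds4 smoke prompt.
bench() {
  n=${BENCH_RUNS:-3}; total=0
  for i in $(seq "$n"); do
    t0=$(python3 -c 'import time; print(time.time())')
    "$@" >/dev/null 2>&1
    t1=$(python3 -c 'import time; print(time.time())')
    total=$(python3 -c "print($total + $t1 - $t0)")
  done
  python3 -c "print(round($total / $n, 3))"
}

bench true
```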

Path                             | Warm latency      | What it measures
Raw ds4 OpenAI chat              | ~123 ms           | Model endpoint only
Raw ds4 Anthropic messages       | ~121 ms           | Model endpoint only
claude-ds4 bare/no-tools         | ~2.0 s            | Claude CLI plus local endpoint
pi-ds4-bench no-tools smoke test | ~3-4 s            | Pi CLI plus upstream extension, no nono, no tools
Direct ds4 tool request          | stalled over 30 s | OpenAI-style tools request to ds4-server; keep disabled by default
codex-ds4 coding profile         | ~25-26 s          | Full Codex agent prompt plus tool schemas

Legacy manual Pi context sweep

These numbers are from the earlier manual models.json + custom extension profile, not the upstream mitsuhiko/pi-ds4 path. They are useful as a baseline for Pi harness overhead, but the default public setup now starts with the upstream extension.

Setting            | Pass | Avg    | P50    | OK avg | Read avg | Edit avg
16k ctx / 2048 max | 6/6  | 13.8 s | 9.7 s  | 8.6 s  | 9.7 s    | 23.2 s
32k ctx / 2048 max | 6/6  | 16.4 s | 12.0 s | 9.3 s  | 12.0 s   | 27.7 s
32k ctx / 4096 max | 6/6  | 19.9 s | 15.5 s | 11.8 s | 15.5 s   | 32.4 s
64k ctx / 4096 max | 6/6  | 21.0 s | 16.7 s | 13.4 s | 16.7 s   | 32.8 s

Legacy benchmark summary: 16k/2048 was the fastest fully passing manual profile. 32k/2048 also passed, but averaged about 18% slower. 32k/4096 and 64k/4096 were slower with no reliability gain. The raw endpoint is fast; coding-agent calls are slower because they carry the agent contract and tool schemas.
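The "about 18% slower" figure is just the ratio of the two fully passing averages from the table, which you can check directly:

```shell
# Percent slowdown of the 32k/2048 profile (16.4 s avg) relative to
# the 16k/2048 profile (13.8 s avg), using the legacy table above.
python3 -c "print(round((16.4 / 13.8 - 1) * 100, 1))"
```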

8. Privacy and gotchas

  • Keep ds4-server bound to 127.0.0.1 unless you have deliberate authentication and firewall rules.
  • Leave trace logging off by default; traces can persist prompts, outputs, and tool calls.
  • Use named side profiles. Do not override claude, codex, or pi.
  • Do not publish local usernames, hostnames, LAN IPs, known-host fingerprints, or absolute machine paths in guide examples.
  • Do not run installers in parallel if they append to the same shell startup file.

Shell name -> side profile -> ds4 localhost server -> local model