ds4 on DGX Spark
DeepSeek V4 Flash on NVIDIA GB10 with antirez/ds4 CUDA, q2-imatrix, MTP, localhost serving, SSH tunnels, and ds4 agent profiles.
1. Where this fits
ds4 is a small native inference engine for DeepSeek V4 Flash. The Mac guide covers the Apple Silicon Metal path. This page covers the Spark path: upstream antirez/ds4, built with the CUDA Spark target and served from the DGX Spark over localhost.
| Use it for | Do not confuse it with |
|---|---|
| DeepSeek V4 Flash quality tests on Spark | The generic Spark Ollama setup |
| Private long-context local chat behind SSH | Spark llama.cpp profiles on port 18080 |
| Side-by-side agent reliability tests | The default local coding profile |
| q2-imatrix + MTP experiments | The older plain q2 download path |
An "OK" smoke test is not enough. Use deepseek-chat for non-thinking practical tests, then judge it on code, JSON, citation, abstention, long-prefill, and wiki tasks.
2. Build ds4
On Spark, define placeholders for examples and verify CUDA before building:
export SPARK_HOST=<spark-ip-or-hostname>
export SPARK_USER=<spark-user>
ssh -F none "$SPARK_USER@$SPARK_HOST" 'nvidia-smi'
Clone and build the CUDA Spark target:
ssh -F none "$SPARK_USER@$SPARK_HOST"
mkdir -p ~/src-repo
git clone https://github.com/antirez/ds4.git ~/src-repo/ds4
cd ~/src-repo/ds4
make cuda-spark -j"$(nproc)"
Confirm the server binary exists and exposes the expected options:
./ds4-server --help | sed -n '1,80p'
You should see model flags, --cuda, --ctx, --mtp, HTTP host/port flags, and disk KV cache flags.
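If you script this check, a small helper can fail loudly when an expected flag is missing. A minimal sketch; check_flags is my own name, and the flag list mirrors the ones above:

```shell
check_flags() {
  # check_flags HELP_TEXT FLAG...: fail if any FLAG is absent from HELP_TEXT.
  local help_text="$1" flag missing=0
  shift
  for flag in "$@"; do
    if ! grep -qF -- "$flag" <<<"$help_text"; then
      echo "missing flag: $flag" >&2
      missing=1
    fi
  done
  return "$missing"
}

# On Spark, against the real binary:
# check_flags "$(./ds4-server --help)" --cuda --ctx --mtp --host --port
```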
3. Download and verify
Use the official antirez q2-imatrix file plus the optional MTP support file:
cd ~/src-repo/ds4
./download_model.sh q2-imatrix
./download_model.sh mtp
The main GGUF is roughly 81 GB. If the Hugging Face transfer stalls, rerun the same command; the script downloads into a .part file and resumes with curl.
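The same resume behaviour is easy to reproduce for ad-hoc fetches. A hedged sketch (fetch_resume is my own helper, not part of the repo) that skips completed files and resumes partial ones with curl's -C -:

```shell
fetch_resume() {
  # fetch_resume URL DEST: skip if DEST already exists, else resume
  # an interrupted transfer into DEST.part and rename on success.
  local url="$1" dest="$2"
  if [ -s "$dest" ]; then
    echo "already downloaded: $dest"
    return 0
  fi
  # -C - asks curl to continue from the current size of DEST.part.
  curl -fL --retry 5 -C - -o "$dest.part" "$url" && mv "$dest.part" "$dest"
}

# fetch_resume "https://huggingface.co/<repo>/resolve/main/<file>.gguf" model.gguf
```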
Verify the exact hashes before serving:
sha256sum \
gguf/DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf \
gguf/DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf
Expected:
efc7ed607ff27076e3e501fc3fefefa33c0ed8cf1eff483a2b7fdc0c2e616668 DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf
afd481ee689dce9037f70f39085fcdae5a5b096d521cdad43b19fa52bf8f4083 DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf
A hash mismatch usually means a truncated .part, a proxy interruption, or a silent resume mistake; delete the file and download it again before serving.
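Instead of comparing digests by eye, sha256sum -c can do the match. verify_sha256 is my own wrapper; feed it the hashes and paths listed above:

```shell
verify_sha256() {
  # verify_sha256 HASH FILE: succeed only if FILE hashes to HASH.
  # Note the two spaces: that is the format "sha256sum -c" expects.
  printf '%s  %s\n' "$1" "$2" | sha256sum -c --quiet -
}

# On Spark (hash from the table above, path shortened here):
# verify_sha256 efc7ed60... gguf/DeepSeek-V4-Flash-IQ2XXS-...gguf
```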
4. Start ds4-server
Create a repeatable launcher on Spark:
cat > ~/ds4-server-cuda.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
DS4_DIR="${DS4_DIR:-$HOME/src-repo/ds4}"
HOST="${DS4_HOST:-127.0.0.1}"
PORT="${DS4_PORT:-8000}"
CTX="${DS4_CTX:-100000}"
KV_DIR="${DS4_KV_DIR:-$HOME/.cache/ds4-kv}"
KV_MB="${DS4_KV_MB:-8192}"
MODEL="${DS4_MODEL:-$DS4_DIR/ds4flash.gguf}"
MTP="${DS4_MTP:-$DS4_DIR/gguf/DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf}"
mkdir -p "$KV_DIR"
cd "$DS4_DIR"
args=(./ds4-server --cuda -m "$MODEL" --ctx "$CTX" \
--kv-disk-dir "$KV_DIR" --kv-disk-space-mb "$KV_MB" \
--host "$HOST" --port "$PORT")
if [[ -s "$MTP" ]]; then
args+=(--mtp "$MTP")
fi
exec "${args[@]}" "$@"
EOF
chmod +x ~/ds4-server-cuda.sh
Start it from tmux, screen, or nohup:
cd ~/src-repo/ds4
mkdir -p logs
nohup ~/ds4-server-cuda.sh > logs/ds4-server.log 2>&1 < /dev/null &
echo $! > logs/ds4-server.pid
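A matching stop helper for the pid file written above; stop_ds4 is my own name, and plain SIGTERM is assumed to be enough:

```shell
stop_ds4() {
  # stop_ds4 PIDFILE: terminate the server recorded in PIDFILE, if any.
  local pidfile="$1" pid
  [ -f "$pidfile" ] || { echo "no pid file: $pidfile"; return 0; }
  pid="$(cat "$pidfile")"
  if kill -0 "$pid" 2>/dev/null; then
    kill "$pid"   # SIGTERM; escalate to kill -9 manually if it hangs
  fi
  rm -f "$pidfile"
}

# stop_ds4 ~/src-repo/ds4/logs/ds4-server.pid
```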
Watch the startup log:
tail -f ~/src-repo/ds4/logs/ds4-server.log
Good signs:
ds4: MTP support model loaded
ds4: CUDA backend initialized on NVIDIA GB10
ds4-server: context buffers ... (ctx=100000, backend=cuda)
The line "CUDA host registration skipped: operation not supported" can appear in this setup and is not fatal.
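Rather than watching tail by hand, a launcher can block until a marker line appears. wait_for_line is my own helper; the exact log wording may vary between ds4 builds:

```shell
wait_for_line() {
  # wait_for_line FILE PATTERN TIMEOUT_S: poll FILE once per second
  # until PATTERN appears; fail after TIMEOUT_S seconds.
  local file="$1" pattern="$2" timeout="${3:-60}" i=0
  while [ "$i" -lt "$timeout" ]; do
    if grep -q "$pattern" "$file" 2>/dev/null; then
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  return 1
}

# wait_for_line logs/ds4-server.log 'CUDA backend initialized' 120
```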
5. Smoke test the API
From Spark, verify that the server is listening:
curl -fsS http://127.0.0.1:8000/v1/models
Run the smallest deterministic chat request:
curl -sS http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "deepseek-chat",
"messages": [{"role": "user", "content": "Reply with exactly: OK"}],
"max_tokens": 8,
"temperature": 0
}'
A good response returns OK and a tiny usage block. If this fails, do not debug the agent profile yet; debug the server first.
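To script the pass/fail decision without installing jq, a grep on the raw JSON is enough for this fixed prompt. check_ok is my own helper and assumes the standard OpenAI-style response shape:

```shell
check_ok() {
  # check_ok JSON: succeed if a message content field is exactly "OK".
  grep -Eq '"content"[[:space:]]*:[[:space:]]*"OK"' <<<"$1"
}

# resp="$(curl -sS http://127.0.0.1:8000/v1/chat/completions \
#   -H 'Content-Type: application/json' -d @req.json)"
# check_ok "$resp" && echo "smoke test passed"
```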
6. Tunnel to a client
The safe default is a Spark-local server plus an SSH tunnel from your Mac or workstation:
ssh -F none -N -L 8000:127.0.0.1:8000 "$SPARK_USER@$SPARK_HOST"
Read that as: the 127.0.0.1 after -L 8000: is resolved on Spark after SSH connects. It is not asking your Mac to connect to itself.
If your client already has something on local port 8000, use another local port:
ssh -F none -N -L 18000:127.0.0.1:8000 "$SPARK_USER@$SPARK_HOST"
Generic OpenAI-compatible clients should then use:
http://127.0.0.1:18000/v1
Fixed local ds4 profiles that expect 127.0.0.1:8000 either need the 8000 tunnel or a profile-specific base URL override.
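When picking the local side of the tunnel, a quick occupancy probe avoids the bind error. These helpers are my own sketch and rely on bash's /dev/tcp pseudo-device:

```shell
port_in_use() {
  # port_in_use PORT: succeed if something accepts connections on
  # 127.0.0.1:PORT (opens and immediately closes a TCP connection).
  (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

pick_local_port() {
  # pick_local_port PORT...: echo the first port that is not in use.
  local p
  for p in "$@"; do
    if ! port_in_use "$p"; then
      echo "$p"
      return 0
    fi
  done
  return 1
}

# lp="$(pick_local_port 8000 18000)" &&
#   ssh -F none -N -L "$lp:127.0.0.1:8000" "$SPARK_USER@$SPARK_HOST"
```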
7. Agent profiles
Start with transport, then test tool use:
pi-ds4-bench
pi-ds4-direct
pi-ds4-bench is a smoke test. pi-ds4-direct is the first useful test because it exercises the ds4 path as a coding agent without adding extra sandbox friction.
| Profile | Use for |
|---|---|
| pi-ds4-bench | Fast endpoint and profile smoke test |
| pi-ds4-direct | First real ds4 coding-agent test |
| claude-ds4 | Follow-up Anthropic-compatible client test |
| codex-ds4 | Follow-up Responses-adapter test |
pi-spark-llama, csllama, and xsllama point at Spark llama.cpp on local port 18080. They do not test the ds4 server.
8. Benchmarks
Measured on May 13, 2026 with upstream antirez/ds4, the CUDA Spark build, q2-imatrix, MTP enabled, ctx=100000, and a localhost SSH tunnel to a Spark-local server.
| Prompt / output | Prefill | Generation | Peak generation | TTFR |
|---|---|---|---|---|
| 512 / 64 | 405.91 tok/s | 29.96 tok/s | 34.67 tok/s | 1.45 s |
| 2048 / 64 | 400.76 tok/s | 28.29 tok/s | 31.67 tok/s | 5.34 s |
| 4096 / 256 | 386.82 tok/s | 22.42 tok/s | 34.67 tok/s | 11.00 s |
Against the earlier Mac ds4 pp4096/tg256 run, Spark ds4 was +29.7% on prefill, -18.7% on generation, +20.9% on peak generation, and 23.4% lower on TTFR. Against the M3 screenshot baseline, Spark ds4 was +55.6% on prefill, +12.3% on generation, and 30.9% lower on TTFR.
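As a sanity check on the table, TTFR should sit close to prompt tokens divided by prefill rate; the remainder is first-token generation plus overhead. A small awk sketch (estimate_ttfr is my own helper) using the numbers above:

```shell
estimate_ttfr() {
  # estimate_ttfr PROMPT_TOKENS PREFILL_TOK_PER_S: prefill-only TTFR estimate.
  awk -v n="$1" -v r="$2" 'BEGIN { printf "%.2f\n", n / r }'
}

estimate_ttfr 512 405.91    # 1.26 s estimated vs 1.45 s measured
estimate_ttfr 4096 386.82   # 10.59 s estimated vs 11.00 s measured
```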
| Practical mode | Model field | Pass | Median wall | Total wall | Notes |
|---|---|---|---|---|---|
| Spark ds4 non-thinking | deepseek-chat | 13/14 | 5007 ms | 80.0 s | Main agent-style setting. |
| Spark ds4 default/thinking | deepseek-v4-flash | 7/14 | 12202 ms | 186.4 s | Hidden thinking spent tight output budgets. |
| Prior Mac ds4 | deepseek-v4-flash | 9/14 | 5853 ms | 81.3 s | Earlier Apple-side practical baseline. |
The one non-thinking practical miss was the missing-context abstention wording gate. The answer was substantively correct, but it did not include one of the benchmark's accepted phrases.
Spark llama.cpp remains the cleanest current default because it passed 14/14 with much lower median wall time. Spark ds4 non-thinking is now a strong side profile candidate worth repeating with full agent tool-loop tests.
9. Privacy and gotchas
- Keep ds4-server bound to 127.0.0.1 on Spark by default.
- Prefer SSH tunnels over DS4_HOST=0.0.0.0.
- Verify model hashes after download.
- Leave --trace off unless you intentionally want prompts, outputs, and tool calls written to disk.
- Use placeholders in public docs: no real LAN IPs, hostnames, usernames, SSH fingerprints, or absolute private paths.
- Start or tunnel the ds4 server before launching client profiles; the profiles are clients, not server launchers.