ds4 on DGX Spark
DeepSeek V4 Flash on NVIDIA GB10 with antirez/ds4 CUDA, q2-imatrix, MTP, localhost serving, SSH tunnels, and ds4 agent profiles.
1. Where this fits
ds4 is a small native inference engine for DeepSeek V4 Flash. The Mac guide covers the Apple Silicon Metal path. This page covers the Spark path: upstream antirez/ds4, built with the CUDA Spark target and served from the DGX Spark over localhost.
| Use it for | Do not confuse it with |
|---|---|
| DeepSeek V4 Flash quality tests on Spark | The generic Spark Ollama setup |
| Private long-context local chat behind SSH | Spark llama.cpp profiles on port 18080 |
| Side-by-side agent reliability tests | The default local coding profile |
| q2-imatrix + MTP experiments | The older plain q2 download path |
An "OK" smoke test is not enough. Use deepseek-chat for non-thinking practical tests, then judge it on code, JSON, citation, abstention, long-prefill, and wiki tasks.
2. Build ds4
On Spark, define placeholders for examples and verify CUDA before building:
export SPARK_HOST=<spark-ip-or-hostname>
export SPARK_USER=<spark-user>
ssh -F none "$SPARK_USER@$SPARK_HOST" 'nvidia-smi'
Clone and build the CUDA Spark target:
ssh -F none "$SPARK_USER@$SPARK_HOST"
mkdir -p ~/src-repo
git clone https://github.com/antirez/ds4.git ~/src-repo/ds4
cd ~/src-repo/ds4
make cuda-spark -j"$(nproc)"
Confirm the server binary exists and exposes the expected options:
./ds4-server --help | sed -n '1,80p'
You should see model flags, --cuda, --ctx, --mtp, HTTP host/port flags, and disk KV cache flags.
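If you script this check, a small helper can fail loudly when an expected flag is missing. A minimal sketch; check_flags is my own name, and the flag list mirrors the ones above:

```shell
check_flags() {
  # check_flags HELP_TEXT FLAG...: fail if any FLAG is absent from HELP_TEXT.
  local help_text="$1" flag missing=0
  shift
  for flag in "$@"; do
    if ! grep -qF -- "$flag" <<<"$help_text"; then
      echo "missing flag: $flag" >&2
      missing=1
    fi
  done
  return "$missing"
}

# On Spark, against the real binary:
# check_flags "$(./ds4-server --help)" --cuda --ctx --mtp --host --port
```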
3. Download and verify
Use the official antirez q2-imatrix file plus the optional MTP support file:
cd ~/src-repo/ds4
./download_model.sh q2-imatrix
./download_model.sh mtp
The main GGUF is roughly 81 GB. If the Hugging Face transfer stalls, rerun the same command; the script downloads into a .part file and resumes with curl.
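The same resume behaviour is easy to reproduce for ad-hoc fetches. A hedged sketch (fetch_resume is my own helper, not part of the repo) that skips completed files and resumes partial ones with curl's -C -:

```shell
fetch_resume() {
  # fetch_resume URL DEST: skip if DEST already exists, else resume
  # an interrupted transfer into DEST.part and rename on success.
  local url="$1" dest="$2"
  if [ -s "$dest" ]; then
    echo "already downloaded: $dest"
    return 0
  fi
  # -C - asks curl to continue from the current size of DEST.part.
  curl -fL --retry 5 -C - -o "$dest.part" "$url" && mv "$dest.part" "$dest"
}

# fetch_resume "https://huggingface.co/<repo>/resolve/main/<file>.gguf" model.gguf
```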
Verify the exact hashes before serving:
sha256sum \
gguf/DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf \
gguf/DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf
Expected:
efc7ed607ff27076e3e501fc3fefefa33c0ed8cf1eff483a2b7fdc0c2e616668 DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf
afd481ee689dce9037f70f39085fcdae5a5b096d521cdad43b19fa52bf8f4083 DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf
A hash mismatch usually means a truncated .part, a proxy interruption, or a silent resume mistake; delete the file and download it again before serving.
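Instead of comparing digests by eye, sha256sum -c can do the match. verify_sha256 is my own wrapper; feed it the hashes and paths listed above:

```shell
verify_sha256() {
  # verify_sha256 HASH FILE: succeed only if FILE hashes to HASH.
  # Note the two spaces: that is the format "sha256sum -c" expects.
  printf '%s  %s\n' "$1" "$2" | sha256sum -c --quiet -
}

# On Spark (hash from the table above, path shortened here):
# verify_sha256 efc7ed60... gguf/DeepSeek-V4-Flash-IQ2XXS-...gguf
```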
4. Start ds4-server
Create a repeatable launcher on Spark:
cat > ~/ds4-server-cuda.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
DS4_DIR="${DS4_DIR:-$HOME/src-repo/ds4}"
HOST="${DS4_HOST:-127.0.0.1}"
PORT="${DS4_PORT:-8000}"
CTX="${DS4_CTX:-100000}"
KV_DIR="${DS4_KV_DIR:-$HOME/.cache/ds4-kv}"
KV_MB="${DS4_KV_MB:-8192}"
MODEL="${DS4_MODEL:-$DS4_DIR/ds4flash.gguf}"
MTP="${DS4_MTP:-$DS4_DIR/gguf/DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf}"
mkdir -p "$KV_DIR"
cd "$DS4_DIR"
args=(./ds4-server --cuda -m "$MODEL" --ctx "$CTX" \
--kv-disk-dir "$KV_DIR" --kv-disk-space-mb "$KV_MB" \
--host "$HOST" --port "$PORT")
if [[ -s "$MTP" ]]; then
args+=(--mtp "$MTP")
fi
exec "${args[@]}" "$@"
EOF
chmod +x ~/ds4-server-cuda.sh
Start it from tmux, screen, or nohup:
cd ~/src-repo/ds4
mkdir -p logs
nohup ~/ds4-server-cuda.sh > logs/ds4-server.log 2>&1 < /dev/null &
echo $! > logs/ds4-server.pid
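A matching stop helper for the pid file written above; stop_ds4 is my own name, and plain SIGTERM is assumed to be enough:

```shell
stop_ds4() {
  # stop_ds4 PIDFILE: terminate the server recorded in PIDFILE, if any.
  local pidfile="$1" pid
  [ -f "$pidfile" ] || { echo "no pid file: $pidfile"; return 0; }
  pid="$(cat "$pidfile")"
  if kill -0 "$pid" 2>/dev/null; then
    kill "$pid"   # SIGTERM; escalate to kill -9 manually if it hangs
  fi
  rm -f "$pidfile"
}

# stop_ds4 ~/src-repo/ds4/logs/ds4-server.pid
```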
Watch the startup log:
tail -f ~/src-repo/ds4/logs/ds4-server.log
Good signs:
ds4: MTP support model loaded
ds4: CUDA backend initialized on NVIDIA GB10
ds4-server: context buffers ... (ctx=100000, backend=cuda)
The line "CUDA host registration skipped: operation not supported" can appear in this setup and is not fatal.
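Rather than watching tail by hand, a launcher can block until a marker line appears. wait_for_line is my own helper; the exact log wording may vary between ds4 builds:

```shell
wait_for_line() {
  # wait_for_line FILE PATTERN TIMEOUT_S: poll FILE once per second
  # until PATTERN appears; fail after TIMEOUT_S seconds.
  local file="$1" pattern="$2" timeout="${3:-60}" i=0
  while [ "$i" -lt "$timeout" ]; do
    if grep -q "$pattern" "$file" 2>/dev/null; then
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  return 1
}

# wait_for_line logs/ds4-server.log 'CUDA backend initialized' 120
```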
5. Smoke test the API
From Spark, verify that the server is listening:
curl -fsS http://127.0.0.1:8000/v1/models
Run the smallest deterministic chat request:
curl -sS http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "deepseek-chat",
"messages": [{"role": "user", "content": "Reply with exactly: OK"}],
"max_tokens": 8,
"temperature": 0
}'
A good response returns OK and a tiny usage block. If this fails, do not debug the agent profile yet; debug the server first.
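To script the pass/fail decision without installing jq, a grep on the raw JSON is enough for this fixed prompt. check_ok is my own helper and assumes the standard OpenAI-style response shape:

```shell
check_ok() {
  # check_ok JSON: succeed if a message content field is exactly "OK".
  grep -Eq '"content"[[:space:]]*:[[:space:]]*"OK"' <<<"$1"
}

# resp="$(curl -sS http://127.0.0.1:8000/v1/chat/completions \
#   -H 'Content-Type: application/json' -d @req.json)"
# check_ok "$resp" && echo "smoke test passed"
```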
6. Tunnel to a client
The safe default is a Spark-local server plus an SSH tunnel from your Mac or workstation:
ssh -F none -N -L 8000:127.0.0.1:8000 "$SPARK_USER@$SPARK_HOST"
Read that as: the 127.0.0.1 after -L 8000: is resolved on Spark after SSH connects. It is not asking your Mac to connect to itself.
If your client already has something on local port 8000, use another local port:
ssh -F none -N -L 18000:127.0.0.1:8000 "$SPARK_USER@$SPARK_HOST"
Generic OpenAI-compatible clients should then use:
http://127.0.0.1:18000/v1
Fixed local ds4 profiles that expect 127.0.0.1:8000 either need the 8000 tunnel or a profile-specific base URL override.
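When picking the local side of the tunnel, a quick occupancy probe avoids the bind error. These helpers are my own sketch and rely on bash's /dev/tcp pseudo-device:

```shell
port_in_use() {
  # port_in_use PORT: succeed if something accepts connections on
  # 127.0.0.1:PORT (opens and immediately closes a TCP connection).
  (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

pick_local_port() {
  # pick_local_port PORT...: echo the first port that is not in use.
  local p
  for p in "$@"; do
    if ! port_in_use "$p"; then
      echo "$p"
      return 0
    fi
  done
  return 1
}

# lp="$(pick_local_port 8000 18000)" &&
#   ssh -F none -N -L "$lp:127.0.0.1:8000" "$SPARK_USER@$SPARK_HOST"
```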
7. Agent profiles
Start with transport, then test tool use:
pi-ds4-bench
pi-ds4-direct
pi-ds4-bench is a smoke test. pi-ds4-direct is the first useful test because it exercises the ds4 path as a coding agent without adding extra sandbox friction.
| Profile | Use for |
|---|---|
| pi-ds4-bench | Fast endpoint and profile smoke test |
| pi-ds4-direct | First real ds4 coding-agent test |
| claude-ds4 | Follow-up Anthropic-compatible client test |
| codex-ds4 | Follow-up Responses-adapter test |
pi-spark-llama, csllama, and xsllama point at Spark llama.cpp on local port 18080. They do not test the ds4 server.
8. Benchmarks
Measured on May 13, 2026 with upstream antirez/ds4, the CUDA Spark build, q2-imatrix, MTP enabled, ctx=100000, and a localhost SSH tunnel to a Spark-local server.
| Prompt / output | Prefill | Generation | Peak generation | TTFR |
|---|---|---|---|---|
| 512 / 64 | 405.91 tok/s | 29.96 tok/s | 34.67 tok/s | 1.45 s |
| 2048 / 64 | 400.76 tok/s | 28.29 tok/s | 31.67 tok/s | 5.34 s |
| 4096 / 256 | 386.82 tok/s | 22.42 tok/s | 34.67 tok/s | 11.00 s |
Against the earlier Mac ds4 pp4096/tg256 run, Spark ds4 was +29.7% on prefill, -18.7% on generation, +20.9% on peak generation, and 23.4% lower on TTFR. Against the M3 screenshot baseline, Spark ds4 was +55.6% on prefill, +12.3% on generation, and 30.9% lower on TTFR.
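As a sanity check on the table, TTFR should sit close to prompt tokens divided by prefill rate; the remainder is first-token generation plus overhead. A small awk sketch (estimate_ttfr is my own helper) using the numbers above:

```shell
estimate_ttfr() {
  # estimate_ttfr PROMPT_TOKENS PREFILL_TOK_PER_S: prefill-only TTFR estimate.
  awk -v n="$1" -v r="$2" 'BEGIN { printf "%.2f\n", n / r }'
}

estimate_ttfr 512 405.91    # 1.26 s estimated vs 1.45 s measured
estimate_ttfr 4096 386.82   # 10.59 s estimated vs 11.00 s measured
```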
| Practical mode | Model field | Pass | Median wall | Total wall | Notes |
|---|---|---|---|---|---|
| Spark ds4 non-thinking | deepseek-chat | 13/14 | 5007 ms | 80.0 s | Main agent-style setting. |
| Spark ds4 default/thinking | deepseek-v4-flash | 7/14 | 12202 ms | 186.4 s | Hidden thinking spent tight output budgets. |
| Prior Mac ds4 | deepseek-v4-flash | 9/14 | 5853 ms | 81.3 s | Earlier Apple-side practical baseline. |
The one non-thinking practical miss was the missing-context abstention wording gate. The answer was substantively correct, but it did not include one of the benchmark's accepted phrases.
Spark llama.cpp remains the cleanest current default because it passed 14/14 with much lower median wall time. Spark ds4 non-thinking is now a strong side profile candidate worth repeating with full agent tool-loop tests.
9. Privacy and gotchas
- Keep ds4-server bound to 127.0.0.1 on Spark by default.
- Prefer SSH tunnels over DS4_HOST=0.0.0.0.
- Verify model hashes after download.
- Leave --trace off unless you intentionally want prompts, outputs, and tool calls written to disk.
- Use placeholders in public docs: no real LAN IPs, hostnames, usernames, SSH fingerprints, or absolute private paths.
- Start or tunnel the ds4 server before launching client profiles; the profiles are clients, not server launchers.