llama.cpp
Local LLM inference with Metal GPU acceleration, sandboxed model paths.
1. Installation
Prerequisites
- macOS 13+ (Metal GPU) or Linux (kernel 5.13+ for Landlock)
- GGUF model files (download separately)
Install via Homebrew
brew install llama.cpp
This installs llama-cli (interactive chat), llama-server (HTTP API), and 30+ other binaries. Metal GPU acceleration is enabled by default on macOS.
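Before adding the sandbox, a quick unsandboxed smoke test confirms the install works. The Hugging Face repo below is only an example; substitute any GGUF repo you can access:
# Pulls the model into the local cache on first run, then answers one prompt:
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF -p "Say hello in one sentence." -n 32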
Or build from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j 8
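Built binaries land in build/bin/, and the Metal backend is compiled in by default on Apple silicon. If you want a CPU-only build for comparison, the relevant CMake option is GGML_METAL in recent llama.cpp versions (treat the exact name as something to verify against your checkout):
# CPU-only configuration (option name can differ between llama.cpp releases):
cmake -B build -DGGML_METAL=OFF
cmake --build build --config Release -j 8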
Install the preferred stack
brew tap nvk/tap
brew install nvk/tap/agent-bondage
brew install nono
# Optional, for gated model pulls or local server auth:
brew install nvk/tap/envchain-xtra
Verify
llama-cli --version
llama-server --version
bondage --help
nono --version
2. nono Profile
llama.cpp is fundamentally different from cloud API tools. There are no mandatory API keys — it runs entirely on your hardware. In the preferred setup, bondage decides which exact binary and mode you are launching, while nono restricts filesystem access to the model directory and controls network exposure for server mode.
What the sandbox allows
| Resource | Access | Why |
|---|---|---|
| Model directory (e.g., ~/models/) | Read | Load GGUF weights |
| ~/Library/Caches/llama.cpp/ | Read + Write | HuggingFace download cache |
| Metal/Accelerate frameworks | Read | GPU compute |
| localhost:8080 (server mode) | Network | OpenAI-compatible API |
| huggingface.co | Network | Model downloads (optional) |
What the sandbox blocks
| Resource | Why blocked |
|---|---|
| ~/.ssh/, ~/.aws/, ~/.gnupg/ | Credentials |
| ~/Documents/, ~/Desktop/ | Personal files |
| All other network | No lateral movement |
Set LLAMA_OFFLINE=1 to disable model downloads once your weights are on disk; the sandbox can then block all outbound traffic.
Runtime configuration is done through LLAMA_ARG_* environment variables and CLI flags, so beyond the model weights there is nothing sensitive on disk to protect.
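A minimal offline invocation, assuming the weights are already on disk and that your build honors LLAMA_OFFLINE as described above:
# -m points at a local GGUF, so no download is attempted and the nono profile
# can deny all outbound network for this run:
LLAMA_OFFLINE=1 llama-cli -m ~/models/llama.cpp/your-model.gguf -p "hello" -n 16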
3. Optional envchain-xtra
envchain-xtra is optional for llama.cpp — there are no mandatory API keys. Use it for two scenarios:
HuggingFace token (gated model downloads)
Some models (Llama 3, Gemma, etc.) require a HuggingFace access token:
envchain --set huggingface HF_TOKEN
Server API key (protect your local endpoint)
If running llama-server, you can require an API key for incoming requests:
envchain --set llama LLAMA_API_KEY
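At run time, the stored secrets are injected only into the wrapped process. A sketch of the two cases outside bondage, assuming llama-cli reads HF_TOKEN from its environment for gated downloads and llama-server takes the key via --api-key:
# HF_TOKEN is visible only to this llama-cli process:
envchain huggingface llama-cli -hf your-org/your-gated-model-GGUF -p "hi" -n 16
# The variable expands inside the child shell, never in your login shell:
envchain llama sh -c 'llama-server -m ~/models/llama.cpp/your-model.gguf --api-key "$LLAMA_API_KEY"'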
4. bondage Wrapper
For interactive chat (llama-cli)
llama-chat() {
  bondage exec llama-chat ~/.config/bondage/bondage.conf -- "$@"
}
Use the profile config to pin the exact llama-cli target and the model-root access policy.
Sample stack snippets
Assuming your shared [global] block already exists in ~/.config/bondage/bondage.conf, this is a minimal chat profile to adapt:
# ~/.config/bondage/bondage.conf
[profile "llama-chat"]
use_envchain = false
use_nono = true
nono_profile = llama-chat
touch_policy = none
target_kind = native
target = /absolute/path/to/llama-cli
target_fp = sha256:replace-me
nono_allow_cwd = true
nono_allow_file = /dev/tty
nono_allow_file = /dev/null
nono_read_file = /dev/urandom
The matching nono profile, referenced above by nono_profile = llama-chat:
{
  "extends": "default",
  "meta": {
    "name": "llama-chat",
    "description": "llama.cpp chat with read-only model access"
  },
  "policy": {
    "add_deny_access": ["/Volumes"],
    "add_allow_read": [
      "$HOME/models/llama.cpp"
    ]
  },
  "workdir": {
    "access": "readwrite"
  }
}
For server mode or gated-model download profiles, duplicate the same shape with a different profile name and add only the extra model or token paths you actually need.
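For example, a server-mode profile sketch that reuses only the keys shown above (the target path and fingerprint are placeholders; set use_envchain = true only if the endpoint is protected with LLAMA_API_KEY):
# ~/.config/bondage/bondage.conf
[profile "llama-serve"]
use_envchain = false
use_nono = true
nono_profile = llama-serve
touch_policy = none
target_kind = native
target = /absolute/path/to/llama-server
target_fp = sha256:replace-me
nono_allow_cwd = true
nono_allow_file = /dev/null
nono_read_file = /dev/urandom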
For server mode (llama-server)
llama-serve() {
  bondage exec llama-serve ~/.config/bondage/bondage.conf -- "$@"
}
If you use server auth or gated downloads, the relevant profile can add envchain-xtra without changing the shell wrapper.
For downloading gated models
llama-pull() {
  bondage exec llama-pull ~/.config/bondage/bondage.conf -- "$@"
}
The shell stays thin; the real policy lives in the profile config.
Add the wrapper functions to ~/.zshrc (or your shell's rc file), then reload your shell:
source ~/.zshrc
5. Verification
Test interactive chat
bondage verify llama-chat ~/.config/bondage/bondage.conf
bondage chain llama-chat ~/.config/bondage/bondage.conf -- --help
Test server mode
# Terminal 1: start server
llama-serve -m your-model.gguf
# Terminal 2: test API
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"local","messages":[{"role":"user","content":"hello"}]}'
Confirm Metal GPU is active
# Look for "Metal" in the startup output:
# ggml_metal_init: found device: Apple M1/M2/M3/M4
# ggml_metal_init: GPU layers: 33
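To produce that output, run any model once through the wrapper and filter the startup log, which goes to stderr (the model path is a placeholder):
llama-chat -m your-model.gguf -p "hi" -n 1 2>&1 | grep -i metal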
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| "Failed to open file" | nono blocking model path | Check --read-file points to correct directory |
| No GPU acceleration | Metal not available in sandbox | nono allows Metal by default on macOS |
| "401 Unauthorized" on HF download | HF_TOKEN not set or expired | envchain --set huggingface HF_TOKEN |