llama.cpp

Local LLM inference with Metal GPU acceleration, sandboxed model paths.


1. Installation

Prerequisites

  • macOS 13+ (Metal GPU) or Linux (kernel 5.13+ for Landlock)
  • GGUF model files (download separately)

Install via Homebrew

brew install llama.cpp

This installs llama-cli (interactive chat), llama-server (HTTP API), and 30+ other binaries. Metal GPU is enabled by default on macOS.
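As a quick smoke test, a one-shot prompt is enough (the model path below is a placeholder — point it at any GGUF you have downloaded; `-ngl` offloads layers to the GPU and `-n` caps output tokens):

```shell
# Hypothetical model path — adjust to your layout.
MODEL="$HOME/models/qwen2.5-3b-instruct-q4_k_m.gguf"

# One-shot prompt; -ngl 99 offloads all layers to Metal on macOS.
llama-cli -m "$MODEL" -p "Explain GGUF in one sentence." -n 128 -ngl 99
```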

Or build from source

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j 8

Install the preferred stack

brew tap nvk/tap
brew install nvk/tap/agent-bondage
brew install nono

# Optional, for gated model pulls or local server auth:
brew install nvk/tap/envchain-xtra

Verify

llama-cli --version
llama-server --version
bondage --help
nono --version

2. nono Profile

llama.cpp is fundamentally different from cloud API tools. There are no mandatory API keys — it runs entirely on your hardware. In the preferred setup, bondage decides which exact binary and mode you are launching, while nono restricts filesystem access to the model directory and controls network exposure for server mode.

What the sandbox allows

| Resource | Access | Why |
| --- | --- | --- |
| Model directory (e.g., ~/models/) | Read | Load GGUF weights |
| ~/Library/Caches/llama.cpp/ | Read + Write | HuggingFace download cache |
| Metal/Accelerate frameworks | Read | GPU compute |
| localhost:8080 (server mode) | Network | OpenAI-compatible API |
| huggingface.co | Network | Model downloads (optional) |
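For example, recent llama.cpp builds can pull a model straight from Hugging Face into that cache via the `-hf` flag (the repo name below is illustrative; check `llama-cli --help` for the exact spelling in your build):

```shell
# Downloads into ~/Library/Caches/llama.cpp/ on macOS, then starts a chat.
# Repo name is illustrative — substitute any GGUF repo you have access to.
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF -p "hello"
```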

What the sandbox blocks

| Resource | Why blocked |
| --- | --- |
| ~/.ssh/, ~/.aws/, ~/.gnupg/ | Credentials |
| ~/Documents/, ~/Desktop/ | Personal files |
| All other network | No lateral movement |

Air-gapped mode: Set LLAMA_OFFLINE=1 to disable all network access after models are downloaded. The sandbox can then block all outbound traffic.

No config files: llama.cpp has no config files. Everything is CLI flags or LLAMA_ARG_* env vars. Nothing to protect on disk beyond model weights.
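Because everything is flags or env vars, a server setup can be captured as a handful of exports instead of a config file. A sketch, assuming the usual `LLAMA_ARG_*` naming (the model path is a placeholder):

```shell
# Each LLAMA_ARG_* variable mirrors a llama-server CLI flag.
export LLAMA_ARG_MODEL="$HOME/models/your-model.gguf"  # placeholder path
export LLAMA_ARG_HOST=127.0.0.1   # keep the API loopback-only
export LLAMA_ARG_PORT=8080
llama-server
```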

3. Optional envchain-xtra

envchain-xtra is optional for llama.cpp — there are no mandatory API keys. Use it for two scenarios:

HuggingFace token (gated model downloads)

Some models (Llama 3, Gemma, etc.) require a HuggingFace access token:

envchain --set huggingface HF_TOKEN

Server API key (protect your local endpoint)

If running llama-server, you can require an API key for incoming requests:

envchain --set llama LLAMA_API_KEY
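With the key stored, one way to launch the server under that namespace — the `sh -c` indirection is only there so the variable envchain injects gets expanded in the child shell, and `--api-key` is llama-server's flag for requiring a bearer token:

```shell
# envchain injects LLAMA_API_KEY into the child environment only.
# Model path is a placeholder.
envchain llama sh -c \
  'llama-server -m "$HOME/models/your-model.gguf" --api-key "$LLAMA_API_KEY"'
```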

When you don't need envchain: If you only use freely available models (Qwen, Phi, etc.) and don't expose the server endpoint, skip envchain entirely.

4. bondage Wrapper

For interactive chat (llama-cli)

llama-chat() {
  bondage exec llama-chat ~/.config/bondage/bondage.conf -- "$@"
}

Use the profile config to pin the exact llama-cli target and the model-root access policy.

Sample stack snippets

Assuming your shared [global] block already exists in ~/.config/bondage/bondage.conf, this is a minimal chat profile to adapt:

# ~/.config/bondage/bondage.conf
[profile "llama-chat"]
use_envchain = false
use_nono = true
nono_profile = llama-chat
touch_policy = none
target_kind = native
target = /absolute/path/to/llama-cli
target_fp = sha256:replace-me
nono_allow_cwd = true
nono_allow_file = /dev/tty
nono_allow_file = /dev/null
nono_read_file = /dev/urandom

And the matching nono profile — the JSON referenced by nono_profile = llama-chat, saved wherever your nono setup looks for profiles:

{
  "extends": "default",
  "meta": {
    "name": "llama-chat",
    "description": "llama.cpp chat with read-only model access"
  },
  "policy": {
    "add_deny_access": ["/Volumes"],
    "add_allow_read": [
      "$HOME/models/llama.cpp"
    ]
  },
  "workdir": {
    "access": "readwrite"
  }
}

For server mode or gated-model download profiles, duplicate the same shape with a different profile name and add only the extra model or token paths you actually need.
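For instance, a server profile might look like the following, reusing only the keys shown in the chat profile above (the target path and fingerprint are placeholders; set use_envchain = true only if you protect the endpoint with LLAMA_API_KEY):

```ini
[profile "llama-serve"]
use_envchain = true
use_nono = true
nono_profile = llama-serve
touch_policy = none
target_kind = native
target = /absolute/path/to/llama-server
target_fp = sha256:replace-me
nono_allow_cwd = true
```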

For server mode (llama-server)

llama-serve() {
  bondage exec llama-serve ~/.config/bondage/bondage.conf -- "$@"
}

If you use server auth or gated downloads, the relevant profile can add envchain-xtra without changing the shell wrapper.

For downloading gated models

llama-pull() {
  bondage exec llama-pull ~/.config/bondage/bondage.conf -- "$@"
}

The shell stays thin; the real policy lives in the profile config.

Reload your shell:

source ~/.zshrc

5. Verification

Test interactive chat

bondage verify llama-chat ~/.config/bondage/bondage.conf
bondage chain llama-chat ~/.config/bondage/bondage.conf -- --help

Test server mode

# Terminal 1: start server
llama-serve -m your-model.gguf

# Terminal 2: test API
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"local","messages":[{"role":"user","content":"hello"}]}'
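If you script against the endpoint, a tiny helper keeps the request body consistent. The function name is ours, and the body matches the curl example above; note that printf does no JSON escaping, so this is only safe for simple prompts:

```shell
# Build the JSON body for /v1/chat/completions.
# NOTE: no JSON escaping — fine for plain ASCII prompts only.
chat_body() {
  printf '{"model":"local","messages":[{"role":"user","content":"%s"}]}' "$1"
}

chat_body "hello"
```

Then pass it to curl with `-d "$(chat_body "hello")"`.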

Confirm Metal GPU is active

# Look for "Metal" in the startup output:
# ggml_metal_init: found device: Apple M1/M2/M3/M4
# ggml_metal_init: GPU layers: 33

Troubleshooting

| Symptom | Cause | Fix |
| --- | --- | --- |
| "Failed to open file" | nono blocking the model path | Check --read-file points to the correct directory |
| No GPU acceleration | Metal not available in sandbox | nono allows Metal by default on macOS |
| "401 Unauthorized" on HF download | HF_TOKEN not set or expired | envchain --set huggingface HF_TOKEN |
How the pieces line up:

Shell name → bondage → [envchain-xtra] → nono → llama.cpp
(convenience) (launch policy) (optional secrets) (kernel sandbox) (local inference)