Local LLM models — Kerno Docs

Run Kerno fully offline against Ollama, llama.cpp, or vLLM. No data ever leaves your machine.

Kerno's LLM provider is OpenAI-compatible. Anything that speaks /v1/chat/completions works — including local runtimes like Ollama, llama.cpp, vLLM, LM Studio, and LocalAI.

Why bother: zero data egress, zero per-token cost, works offline. Trade-off: smaller models give lower-quality replies, slower than Mistral / Anthropic, and no native vision unless you pick a multimodal model.

Ollama (recommended starting point)

Easiest path. Single binary, model registry, OpenAI-compatible endpoint baked in.

Install + pull a model

# macOS / Windows: download from ollama.com and run.
# Linux:
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model. Pick by hardware:
ollama pull llama3.1:8b      # 8B params  — laptops with 16+ GB RAM
ollama pull llama3.1:70b     # 70B params — desktops with 64+ GB RAM (or M2 Ultra/M3 Max)
ollama pull qwen2.5:32b      # strong on code, smaller than 70B

Point Kerno at it

/admin → LLM section, fill in:

Field	Value
LLM API Key	`ollama` (anything works — Ollama doesn't validate)
LLM Base URL	`http://host.docker.internal:11434/v1` (Mac/Windows) or `http://172.17.0.1:11434/v1` (Linux Docker)
Main Model	`llama3.1:70b` (or whatever you pulled)
Fast Model	`llama3.1:8b`

Docker networking note: Kerno runs in a container, Ollama runs on the host. localhost from inside the container points at the container itself. Use host.docker.internal (Mac/Windows) or your Docker bridge IP (172.17.0.1 is the default on Linux). Verify: docker exec kerno curl http://host.docker.internal:11434.

Save → restart container → Kerno is now fully local.

llama.cpp (lighter, faster on Apple Silicon)

# Build from source
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
cmake -B build && cmake --build build -j

# Download a GGUF quantization
huggingface-cli download bartowski/Meta-Llama-3.1-70B-Instruct-GGUF \
  Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf --local-dir models/

# Start the OpenAI-compatible server
./build/bin/llama-server \
  -m models/Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf \
  --port 8080 \
  --host 0.0.0.0 \
  -c 16384      # context length

Then in /admin:

Field	Value
LLM Base URL	`http://host.docker.internal:8080/v1`
Main Model	(anything — llama.cpp ignores the model name)

vLLM (production-grade, multi-GPU)

pip install vllm
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
  --port 8080 --host 0.0.0.0 \
  --tensor-parallel-size 2   # use 2 GPUs

Same config as llama.cpp. vLLM gives you batched inference + paged attention — best throughput for multi-user setups.

Embeddings caveat

Semantic memory needs an embedding model that returns 1024-dim vectors (the pgvector column is fixed at 1024). Most local runtimes don't ship a hosted embedding endpoint.

Two options:

Option A — keep Mistral for embeddings (simplest). Get a free Mistral key at console.mistral.ai. In /admin → LLM:

Field	Value
Embeddings API Key	`<your Mistral key>`
Embeddings Base URL	`https://api.mistral.ai/v1`
Embeddings Model	`mistral-embed`

Only embed-time data (chat history, daily memory chunks) ever leaves your network. The chat content itself stays local.

Option B — embeddings via Ollama (fully offline). Pull an embedding model and configure it:

ollama pull nomic-embed-text:v1.5  # 768-dim — won't work
ollama pull mxbai-embed-large       # 1024-dim — works

Then in /admin → LLM:

Field	Value
Embeddings Base URL	`http://host.docker.internal:11434/v1`
Embeddings Model	`mxbai-embed-large`

Embeddings are now local too.

Performance expectations

Benchmarks vary by hardware. Rough guide for chat-style queries:

Setup	Tokens/sec	First-token latency
Ollama llama3.1:8b on M2 Pro	~30	~1s
Ollama llama3.1:70b on M3 Max	~6	~3s
llama.cpp Q4_K_M 70B on M2 Ultra	~10	~2s
vLLM 70B on 2× A100	~80	~0.5s

Mistral / OpenAI cloud ~80–150 tok/s with sub-second first-token. If you care about responsiveness over privacy, hybrid is fine — local for chat, cloud for embeddings.

Caveats

Tool use quality — smaller models (8B–13B) often hallucinate tool args or skip tool calls entirely. 70B models are noticeably better. Llama 3.1 70B and Qwen 2.5 32B are the current sweet spots.
No vision out of the box — image upload + Mistral multimodal won't work unless you swap to a vision-capable local model (LLaVA, Llama 3.2 Vision).
Streaming works against any compliant /v1/chat/completions server.
Voice TTS / STT still uses Mistral Voxtral by default. To go fully offline, swap to whisper-cpp (already bundled) and piper (also bundled). That's planned to be configurable in a future release.

Troubleshooting

Symptom	Likely cause
`ECONNREFUSED 127.0.0.1:11434`	Used `localhost` instead of `host.docker.internal`. Inside Docker, localhost is the container.
`model not found`	Pulled a different name than what you configured. `ollama list` to see what's there.
First reply takes 60+ seconds	Cold start — the model is loading into VRAM. Subsequent replies are fast.
Replies cut off mid-sentence	Context length too small. Increase `--ctx-size` (llama.cpp) or `OLLAMA_NUM_CTX` (Ollama).
Embedding fails with `dimensions: 1024` error	Local embedding model returns a different dimension. Pick one that natively returns 1024 (`mxbai-embed-large`) or stick with Mistral.