Kerno's LLM provider is OpenAI-compatible. Anything that speaks /v1/chat/completions works — including local runtimes like Ollama, llama.cpp, vLLM, LM Studio, and LocalAI.
Why bother: zero data egress, zero per-token cost, works offline. Trade-off: smaller models give lower-quality replies, slower than Mistral / Anthropic, and no native vision unless you pick a multimodal model.
Ollama (recommended starting point)
Easiest path. Single binary, model registry, OpenAI-compatible endpoint baked in.
Install + pull a model
# macOS / Windows: download from ollama.com and run.
# Linux:
curl -fsSL https://ollama.com/install.sh | sh
# Pull a model. Pick by hardware:
ollama pull llama3.1:8b # 8B params — laptops with 16+ GB RAM
ollama pull llama3.1:70b # 70B params — desktops with 64+ GB RAM (or M2 Ultra/M3 Max)
ollama pull qwen2.5:32b # strong on code, smaller than 70B
Point Kerno at it
/admin → LLM section, fill in:
| Field | Value |
|---|---|
| LLM API Key | ollama (anything works — Ollama doesn't validate) |
| LLM Base URL | http://host.docker.internal:11434/v1 (Mac/Windows) or http://172.17.0.1:11434/v1 (Linux Docker) |
| Main Model | llama3.1:70b (or whatever you pulled) |
| Fast Model | llama3.1:8b |
Docker networking note: Kerno runs in a container, Ollama runs on the host.
localhostfrom inside the container points at the container itself. Usehost.docker.internal(Mac/Windows) or your Docker bridge IP (172.17.0.1is the default on Linux). Verify:docker exec kerno curl http://host.docker.internal:11434.
Save → restart container → Kerno is now fully local.
llama.cpp (lighter, faster on Apple Silicon)
# Build from source
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
cmake -B build && cmake --build build -j
# Download a GGUF quantization
huggingface-cli download bartowski/Meta-Llama-3.1-70B-Instruct-GGUF \
Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf --local-dir models/
# Start the OpenAI-compatible server
./build/bin/llama-server \
-m models/Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf \
--port 8080 \
--host 0.0.0.0 \
-c 16384 # context length
Then in /admin:
| Field | Value |
|---|---|
| LLM Base URL | http://host.docker.internal:8080/v1 |
| Main Model | (anything — llama.cpp ignores the model name) |
vLLM (production-grade, multi-GPU)
pip install vllm
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
--port 8080 --host 0.0.0.0 \
--tensor-parallel-size 2 # use 2 GPUs
Same config as llama.cpp. vLLM gives you batched inference + paged attention — best throughput for multi-user setups.
Embeddings caveat
Semantic memory needs an embedding model that returns 1024-dim vectors (the pgvector column is fixed at 1024). Most local runtimes don't ship a hosted embedding endpoint.
Two options:
Option A — keep Mistral for embeddings (simplest). Get a free Mistral key at console.mistral.ai. In /admin → LLM:
| Field | Value |
|---|---|
| Embeddings API Key | <your Mistral key> |
| Embeddings Base URL | https://api.mistral.ai/v1 |
| Embeddings Model | mistral-embed |
Only embed-time data (chat history, daily memory chunks) ever leaves your network. The chat content itself stays local.
Option B — embeddings via Ollama (fully offline). Pull an embedding model and configure it:
ollama pull nomic-embed-text:v1.5 # 768-dim — won't work
ollama pull mxbai-embed-large # 1024-dim — works
Then in /admin → LLM:
| Field | Value |
|---|---|
| Embeddings Base URL | http://host.docker.internal:11434/v1 |
| Embeddings Model | mxbai-embed-large |
Embeddings are now local too.
Performance expectations
Benchmarks vary by hardware. Rough guide for chat-style queries:
| Setup | Tokens/sec | First-token latency |
|---|---|---|
| Ollama llama3.1:8b on M2 Pro | ~30 | ~1s |
| Ollama llama3.1:70b on M3 Max | ~6 | ~3s |
| llama.cpp Q4_K_M 70B on M2 Ultra | ~10 | ~2s |
| vLLM 70B on 2× A100 | ~80 | ~0.5s |
Mistral / OpenAI cloud ~80–150 tok/s with sub-second first-token. If you care about responsiveness over privacy, hybrid is fine — local for chat, cloud for embeddings.
Caveats
- Tool use quality — smaller models (8B–13B) often hallucinate tool args or skip tool calls entirely. 70B models are noticeably better. Llama 3.1 70B and Qwen 2.5 32B are the current sweet spots.
- No vision out of the box — image upload + Mistral multimodal won't work unless you swap to a vision-capable local model (LLaVA, Llama 3.2 Vision).
- Streaming works against any compliant
/v1/chat/completionsserver. - Voice TTS / STT still uses Mistral Voxtral by default. To go fully offline, swap to
whisper-cpp(already bundled) andpiper(also bundled). That's planned to be configurable in a future release.
Troubleshooting
| Symptom | Likely cause |
|---|---|
ECONNREFUSED 127.0.0.1:11434 | Used localhost instead of host.docker.internal. Inside Docker, localhost is the container. |
model not found | Pulled a different name than what you configured. ollama list to see what's there. |
| First reply takes 60+ seconds | Cold start — the model is loading into VRAM. Subsequent replies are fast. |
| Replies cut off mid-sentence | Context length too small. Increase --ctx-size (llama.cpp) or OLLAMA_NUM_CTX (Ollama). |
Embedding fails with dimensions: 1024 error | Local embedding model returns a different dimension. Pick one that natively returns 1024 (mxbai-embed-large) or stick with Mistral. |