K
KernoDocs
Docs/Local LLM models

Local LLM models

Run Kerno fully offline against Ollama, llama.cpp, or vLLM. No data ever leaves your machine.

Kerno's LLM provider is OpenAI-compatible. Anything that speaks /v1/chat/completions works — including local runtimes like Ollama, llama.cpp, vLLM, LM Studio, and LocalAI.

Why bother: zero data egress, zero per-token cost, works offline. Trade-off: smaller models give lower-quality replies, slower than Mistral / Anthropic, and no native vision unless you pick a multimodal model.

Easiest path. Single binary, model registry, OpenAI-compatible endpoint baked in.

Install + pull a model

# macOS / Windows: download from ollama.com and run.
# Linux:
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model. Pick by hardware:
ollama pull llama3.1:8b      # 8B params  — laptops with 16+ GB RAM
ollama pull llama3.1:70b     # 70B params — desktops with 64+ GB RAM (or M2 Ultra/M3 Max)
ollama pull qwen2.5:32b      # strong on code, smaller than 70B

Point Kerno at it

/admin → LLM section, fill in:

FieldValue
LLM API Keyollama (anything works — Ollama doesn't validate)
LLM Base URLhttp://host.docker.internal:11434/v1 (Mac/Windows) or http://172.17.0.1:11434/v1 (Linux Docker)
Main Modelllama3.1:70b (or whatever you pulled)
Fast Modelllama3.1:8b

Docker networking note: Kerno runs in a container, Ollama runs on the host. localhost from inside the container points at the container itself. Use host.docker.internal (Mac/Windows) or your Docker bridge IP (172.17.0.1 is the default on Linux). Verify: docker exec kerno curl http://host.docker.internal:11434.

Save → restart container → Kerno is now fully local.

llama.cpp (lighter, faster on Apple Silicon)

# Build from source
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
cmake -B build && cmake --build build -j

# Download a GGUF quantization
huggingface-cli download bartowski/Meta-Llama-3.1-70B-Instruct-GGUF \
  Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf --local-dir models/

# Start the OpenAI-compatible server
./build/bin/llama-server \
  -m models/Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf \
  --port 8080 \
  --host 0.0.0.0 \
  -c 16384      # context length

Then in /admin:

FieldValue
LLM Base URLhttp://host.docker.internal:8080/v1
Main Model(anything — llama.cpp ignores the model name)

vLLM (production-grade, multi-GPU)

pip install vllm
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
  --port 8080 --host 0.0.0.0 \
  --tensor-parallel-size 2   # use 2 GPUs

Same config as llama.cpp. vLLM gives you batched inference + paged attention — best throughput for multi-user setups.

Embeddings caveat

Semantic memory needs an embedding model that returns 1024-dim vectors (the pgvector column is fixed at 1024). Most local runtimes don't ship a hosted embedding endpoint.

Two options:

Option A — keep Mistral for embeddings (simplest). Get a free Mistral key at console.mistral.ai. In /admin → LLM:

FieldValue
Embeddings API Key<your Mistral key>
Embeddings Base URLhttps://api.mistral.ai/v1
Embeddings Modelmistral-embed

Only embed-time data (chat history, daily memory chunks) ever leaves your network. The chat content itself stays local.

Option B — embeddings via Ollama (fully offline). Pull an embedding model and configure it:

ollama pull nomic-embed-text:v1.5  # 768-dim — won't work
ollama pull mxbai-embed-large       # 1024-dim — works

Then in /admin → LLM:

FieldValue
Embeddings Base URLhttp://host.docker.internal:11434/v1
Embeddings Modelmxbai-embed-large

Embeddings are now local too.

Performance expectations

Benchmarks vary by hardware. Rough guide for chat-style queries:

SetupTokens/secFirst-token latency
Ollama llama3.1:8b on M2 Pro~30~1s
Ollama llama3.1:70b on M3 Max~6~3s
llama.cpp Q4_K_M 70B on M2 Ultra~10~2s
vLLM 70B on 2× A100~80~0.5s

Mistral / OpenAI cloud ~80–150 tok/s with sub-second first-token. If you care about responsiveness over privacy, hybrid is fine — local for chat, cloud for embeddings.

Caveats

  • Tool use quality — smaller models (8B–13B) often hallucinate tool args or skip tool calls entirely. 70B models are noticeably better. Llama 3.1 70B and Qwen 2.5 32B are the current sweet spots.
  • No vision out of the box — image upload + Mistral multimodal won't work unless you swap to a vision-capable local model (LLaVA, Llama 3.2 Vision).
  • Streaming works against any compliant /v1/chat/completions server.
  • Voice TTS / STT still uses Mistral Voxtral by default. To go fully offline, swap to whisper-cpp (already bundled) and piper (also bundled). That's planned to be configurable in a future release.

Troubleshooting

SymptomLikely cause
ECONNREFUSED 127.0.0.1:11434Used localhost instead of host.docker.internal. Inside Docker, localhost is the container.
model not foundPulled a different name than what you configured. ollama list to see what's there.
First reply takes 60+ secondsCold start — the model is loading into VRAM. Subsequent replies are fast.
Replies cut off mid-sentenceContext length too small. Increase --ctx-size (llama.cpp) or OLLAMA_NUM_CTX (Ollama).
Embedding fails with dimensions: 1024 errorLocal embedding model returns a different dimension. Pick one that natively returns 1024 (mxbai-embed-large) or stick with Mistral.