Carole

How

Carole runs end-to-end on one CPU box. No GPU runtime, no third-party inference API, no ongoing compute bill. Here’s the request flow.

Frontend (Next.js · static) → POST /api/chat → FastAPI (auth · RAG · orchestration) → ChromaDB (1,732 chunks · MiniLM-L6) → llama.cpp (llama-carole · Q4_K_M · CPU) → Sentence Buffer (SSE · ND-friendly cadence) → event: sentence

Base model

Llama 3.1 8B Instruct. Open weights, a license permissive enough to fine-tune and redistribute (with naming attribution), and the right size: large enough to sustain a deliberate persona, small enough to quantize down to ~4.5GB and run on CPU at usable speeds.

Fine-tune

QLoRA on RunPod. Training data was 50 hand-written golden examples expanded synthetically to ~1,500 conversations, all demonstrating the validate-then-redirect pattern with source citations. Best checkpoint landed at eval_loss 1.40 after epoch 2 — the loss curve had a clean U-shape, classic signal that epoch 2 was the right stop.
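
The exact training script isn't shown here, but a minimal QLoRA sketch in the standard Hugging Face stack (transformers + peft + trl) looks roughly like this. The dataset path, LoRA rank, and hyperparameters are illustrative assumptions, not the project's actual values:

```python
# Hedged sketch of a QLoRA run, assuming a JSONL file of chat-formatted
# conversations ("carole_sft.jsonl" is a made-up name). Only the low-rank
# adapters train; the 4-bit base model stays frozen.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

base = "meta-llama/Llama-3.1-8B-Instruct"

# 4-bit NF4 quantization of the frozen base: the "Q" in QLoRA.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb)

# Low-rank adapters on the attention projections (rank is a guess here).
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])

trainer = SFTTrainer(
    model=model,
    train_dataset=load_dataset("json", data_files="carole_sft.jsonl")["train"],
    peft_config=lora,
    args=SFTConfig(output_dir="llama-carole", num_train_epochs=3,
                   per_device_train_batch_size=2, learning_rate=2e-4),
)
trainer.train()  # then keep the epoch-2 checkpoint, per the eval_loss curve
```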

RAG corpus

1,732 chunks in ChromaDB, embedded with all-MiniLM-L6-v2 (384-dim cosine). 957 chunks from 98 curated Wikipedia articles spanning ADHD, autism, RSD, anxiety, depression, executive function, NVC, CBT/DBT, mindfulness. 775 chunks from five reference works: Robbins’ Awaken the Giant Within, Carnegie’s How to Win Friends, Rosenberg’s Nonviolent Communication, plus two ND-specific articles on RSD and ADHD communication. Curated, not crawled.
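
Ingesting a corpus like this into ChromaDB takes only a few lines. A minimal sketch, assuming the chromadb Python client; the collection name, chunk IDs, and metadata fields are invented for illustration:

```python
# Build the vector store: ChromaDB embeds each document with MiniLM and
# indexes the 384-dim vectors under cosine similarity.
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

client = chromadb.PersistentClient(path="./chroma")
collection = client.get_or_create_collection(
    name="carole_corpus",  # hypothetical name
    embedding_function=SentenceTransformerEmbeddingFunction(
        model_name="all-MiniLM-L6-v2"),
    metadata={"hnsw:space": "cosine"},  # cosine, matching the corpus spec
)

# Each of the 1,732 chunks carries its text plus a source tag for citations.
chunks = [
    {"id": "wiki-adhd-0001",
     "text": "Executive function covers planning, working memory, ...",
     "source": "Wikipedia: ADHD"},
    # ... 1,731 more curated chunks
]
collection.add(
    ids=[c["id"] for c in chunks],
    documents=[c["text"] for c in chunks],
    metadatas=[{"source": c["source"]} for c in chunks],
)
```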

Inference

Quantized to Q4_K_M (~4.5GB) and served by llama.cpp on the box. Roughly 26 tokens/second on CPU. FastAPI orchestrates: it embeds the query, retrieves top-K chunks from ChromaDB, builds the RAG prompt, and streams the model’s response back over SSE.
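
A condensed sketch of that orchestration path, assuming llama.cpp runs as llama-server with its OpenAI-compatible endpoint on localhost:8080 and reusing the collection from the ingestion sketch above. The route shape matches the diagram's POST /api/chat; prompt wording and top-K are assumptions:

```python
# FastAPI endpoint: retrieve, build the RAG prompt, relay the model's SSE
# stream. The sentence buffering from the next section sits in between.
import chromadb
import httpx
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()
# The collection built in the ingestion sketch above.
collection = chromadb.PersistentClient(path="./chroma").get_collection("carole_corpus")

class ChatRequest(BaseModel):
    message: str

@app.post("/api/chat")
async def chat(req: ChatRequest):
    # ChromaDB embeds the query with the collection's MiniLM function and
    # returns the top-K chunks by cosine similarity.
    hits = collection.query(query_texts=[req.message], n_results=5)
    context = "\n\n".join(hits["documents"][0])
    prompt = (f"Ground your answer in these sources:\n\n{context}\n\n"
              f"USER: {req.message}")

    async def stream():
        async with httpx.AsyncClient(timeout=None) as http:
            async with http.stream(
                "POST", "http://localhost:8080/v1/chat/completions",
                json={"messages": [{"role": "user", "content": prompt}],
                      "stream": True},
            ) as resp:
                async for line in resp.aiter_lines():
                    yield line + "\n"  # relay llama.cpp's SSE lines downstream

    return StreamingResponse(stream(), media_type="text/event-stream")
```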

Streaming

Tokens are buffered into whole sentences before they’re sent to the frontend. The cadence matters: token-by-token streaming is jittery, especially on CPU; sentence-by-sentence reads as a person thinking, then speaking. For a chatbot built around the validate-then-redirect pattern, the pause is the point.
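
One way to implement that buffer is a small async generator between the token stream and the SSE response: accumulate tokens, flush on sentence-final punctuation. The regex and framing below are assumptions, though the event name matches the diagram's event: sentence:

```python
# Sentence buffer: turn a jittery token stream into whole-sentence SSE events.
import re
from typing import AsyncIterator

SENTENCE_END = re.compile(r"[.!?]\s*$")  # naive: flush on ., !, ? at buffer end

async def sentence_events(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    buf = ""
    async for tok in tokens:
        buf += tok
        if SENTENCE_END.search(buf):
            yield f"event: sentence\ndata: {buf.strip()}\n\n"  # one sentence per event
            buf = ""
    if buf.strip():
        yield f"event: sentence\ndata: {buf.strip()}\n\n"  # flush the tail
```

Wired into the endpoint above (with the relay first parsing llama.cpp's streamed deltas into plain text tokens), `StreamingResponse(sentence_events(token_stream), media_type="text/event-stream")` would replace the raw line relay.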

Sources