01 · Section
The default answer is RAG
RAG (retrieval-augmented generation) lets you use a frontier model — Claude, GPT, Gemini — with your private data fetched at query time. You get fresh answers, citations, and the ability to swap models without retraining.
Fine-tuning bakes knowledge into model weights. It is slower to iterate, more expensive, and the moment your data changes the model is stale. For 90% of business use cases, RAG wins on the axes that matter most: cost, speed of iteration, freshness, observability.
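To make "fetched at query time" concrete, here is a minimal sketch of the retrieval half of RAG. Everything in it is an assumption for illustration: the DOCS dictionary, the file names, and the word-overlap scoring, which stands in for the embedding search a real system would likely use. The model call itself is left out.

```python
from collections import Counter

# Hypothetical private documents; a real system would likely use a vector store.
DOCS = {
    "refunds.md": "Refunds are issued within 14 days of a return request.",
    "shipping.md": "Standard shipping takes 3 to 5 business days.",
    "warranty.md": "Hardware is covered by a 2 year limited warranty.",
}


def tokenize(text):
    """Lowercase the text and strip basic punctuation from each word."""
    return [word.strip(".,?!").lower() for word in text.split()]


def retrieve(query, docs, k=2):
    """Return up to k document names ranked by word overlap with the query.

    Word overlap is a toy stand-in for embedding similarity.
    """
    query_counts = Counter(tokenize(query))
    scored = []
    for name, text in docs.items():
        overlap = sum((query_counts & Counter(tokenize(text))).values())
        scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for overlap, name in scored[:k] if overlap > 0]


def build_prompt(query, docs, k=2):
    """Assemble a prompt that carries the retrieved sources, so answers can cite them."""
    sources = retrieve(query, docs, k)
    context = "\n".join(f"[{name}] {docs[name]}" for name in sources)
    prompt = (
        "Answer using only the sources below and cite them by name.\n"
        f"{context}\n\nQuestion: {query}"
    )
    return prompt, sources


prompt, sources = build_prompt("How long does shipping take?", DOCS)
```

Because the documents are looked up on every request, editing DOCS changes the next answer immediately, and the same prompt can be sent to whichever model you choose.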
02 · Section
When fine-tuning is genuinely the right call
Style. If you need the model to write in a very specific voice — your brand, a regulated tone, a domain dialect — fine-tuning teaches that more reliably than a long system prompt.
Latency. If you need sub-200ms responses on a small, focused task and cannot afford retrieval overhead, a fine-tuned small model can be the only viable option.
Privacy. If your data legally cannot leave a dedicated environment, fine-tuning an open-weight model (Llama, Mistral, Qwen) inside your own VPC lets you keep everything in-house.
03 · Section
A simple decision flow
Does the answer depend on data that changes more often than monthly? → RAG.
Do you need source citations in the answer? → RAG.
Is the task primarily about style or format, not knowledge? → Fine-tune.
Do you have hard latency or privacy constraints? → Fine-tune (small open-weight model).
Most projects answer "yes" to the first two. Build RAG first, measure, and only add fine-tuning when you hit a wall it cannot solve.
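The four questions above can be sketched as a small function. This is one possible reading, not a definitive rule: the Project fields and the recommend name are hypothetical, the questions are assumed to be checked in the order listed with the first "yes" deciding, and the fallback to RAG reflects the "build RAG first" advice.

```python
from dataclasses import dataclass


@dataclass
class Project:
    """Hypothetical answers to the four questions in the decision flow."""
    data_changes_more_often_than_monthly: bool
    needs_citations: bool
    mainly_style_or_format: bool
    hard_latency_or_privacy_limits: bool


def recommend(project):
    """Walk the questions in order; the first 'yes' decides (an assumption)."""
    if project.data_changes_more_often_than_monthly:
        return "RAG"
    if project.needs_citations:
        return "RAG"
    if project.mainly_style_or_format:
        return "Fine-tune"
    if project.hard_latency_or_privacy_limits:
        return "Fine-tune (small open-weight model)"
    # No question answered "yes": fall back to the default of building RAG first.
    return "RAG"


support_bot = Project(
    data_changes_more_often_than_monthly=True,
    needs_citations=True,
    mainly_style_or_format=False,
    hard_latency_or_privacy_limits=False,
)
brand_copywriter = Project(
    data_changes_more_often_than_monthly=False,
    needs_citations=False,
    mainly_style_or_format=True,
    hard_latency_or_privacy_limits=False,
)

print(recommend(support_bot))
print(recommend(brand_copywriter))
```

A project that answers "yes" to both a RAG question and a fine-tuning question gets RAG here only because of the ordering assumption; in practice that is the case where you would measure before deciding.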
Key takeaways
- RAG should be the default for any knowledge-based use case.
- Fine-tune only for style, latency or privacy constraints RAG cannot meet.
- Build RAG first, add fine-tuning as a second-stage optimisation if needed.
- Measure with a golden eval set before and after every architectural change.
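As a rough illustration of the last takeaway, a golden eval set can be as simple as a fixed list of questions with a fact each answer must contain. The cases, the substring check, and the two stand-in systems below are all assumptions made for the sketch; real evals usually need more cases and a less brittle grading method.

```python
# Hypothetical golden set: fixed questions plus a fact the answer must contain.
GOLDEN_SET = [
    {"question": "How long do refunds take?", "must_contain": "14 days"},
    {"question": "How long is the warranty?", "must_contain": "2 year"},
]


def pass_rate(answer_fn, golden_set):
    """Return the fraction of golden cases whose answer contains the expected fact."""
    passed = 0
    for case in golden_set:
        answer = answer_fn(case["question"])
        if case["must_contain"].lower() in answer.lower():
            passed += 1
    return passed / len(golden_set)


def baseline(question):
    """Stand-in for the system before a change: always gives the same answer."""
    return "Refunds are issued within 14 days."


def candidate(question):
    """Stand-in for the system after a change: also handles warranty questions."""
    if "warranty" in question.lower():
        return "Hardware has a 2 year limited warranty."
    return "Refunds are issued within 14 days."


before = pass_rate(baseline, GOLDEN_SET)
after = pass_rate(candidate, GOLDEN_SET)
print(f"before: {before:.0%}, after: {after:.0%}")
```

Running the same set before and after each change turns "did this help?" into a comparison of two numbers.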
Written by
Hassan Ali
9 min read · Posted in AI/ML