RAG · AI CONCEPT
Retrieval-Augmented Generation (RAG): how AI answers from your own documents
RAG couples a language model to a searchable knowledge base. Answers come with source attribution instead of being invented from training memory.
Researched & fact-checked by: DuneDive LLC · As of: 2026-05
What is RAG?
Retrieval-Augmented Generation, or RAG, is an architecture pattern that supplies a language model with relevant text passages from a private document library at answer time. Instead of letting the model speak from training memory, a retriever first searches a vector database for passages that match the query. These passages enter the prompt as additional context. The model then answers based on those passages – and can cite them.
The term originated in a 2020 paper from Facebook AI Research, now Meta AI (Lewis et al.). Since late 2023, RAG has been the standard pattern for in-company AI assistants in fields where the answer must be provable: law, fiduciary services, medicine, insurance. As of May 2026, RAG is production-ready: vector databases (Qdrant, Weaviate, Milvus) run stably on your own hardware, and embedding models (OpenAI text-embedding-3, Cohere embed-multilingual-v3, BGE-large) are cheap and multilingual.
Why it matters
A language model without RAG hallucinates plausibly but incorrectly. For any question whose answer lives in a client contract, an internal manual, or an industry regulation, "plausible but wrong" is unacceptable. RAG closes that gap on three levels.
First: verifiable source. Every answer can point to the exact document and page from which it came. This is not just convenience – it is a precondition for audit-ready AI use under Art. 957a CO (bookkeeping) and for any work bound by professional secrecy (Art. 321 SCC).
Second: data sovereignty. In a correct implementation, the original document never leaves your hosting. Only the passage relevant to the query goes – encrypted – to the language model. Sensitive client data can be held to "EU model only" or "local only" (see Multi-LLM Routing).
Third: current knowledge. Models have a training cutoff (January 2026 for the current top Claude model, for example). RAG sidesteps that: new documents are indexed and immediately retrievable, with no model retraining required.
How it works
A RAG pipeline has five stations: ingestion, chunking, embedding, retrieval, generation. Each station is swappable – that is what makes the architecture robust.
Ingestion: Documents (PDF, Word, email, HTML, OCR scans) are collected from their source systems. Tools like unstructured.io, LlamaIndex, or custom adapters convert to plain text and preserve metadata (client, date, confidentiality).
Chunking: Long documents are sliced into manageable pieces, typically 300–800 tokens with 50–100 token overlap. Structure-aware chunking (by paragraph or chapter) gives better retrieval quality than blind token-counting.
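The overlap idea can be illustrated with a minimal sketch. It assumes whitespace-separated words as a rough stand-in for tokens (a real pipeline would count tokens with the embedding model's tokenizer), and the function name `chunk_words` and its defaults are illustrative only. It shows the blind sliding-window variant; structure-aware chunking would first split on headings or paragraphs and apply a window like this only inside oversized sections.

```python
def chunk_words(text: str, size: int = 500, overlap: int = 75) -> list[str]:
    """Split text into chunks of at most `size` words.

    Each chunk repeats the last `overlap` words of its predecessor, so a
    sentence cut at a boundary still appears whole in one of the chunks.
    Words are an assumed approximation of tokens.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

With `size=4` and `overlap=1`, ten words yield three chunks, and each chunk starts with the last word of the previous one.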
Embedding: Each chunk is converted by an embedding model into a vector – a list of 384, 768, 1024, or 3072 numbers. Semantically similar texts land near each other in the space. OpenAI text-embedding-3-small is cheap (~CHF 0.02 per 1M tokens) and good for German.
Retrieval: The vector database (Qdrant local, Pinecone hosted) finds the k chunks most similar to the query, typically k = 4 to 10. Optionally, reranking with a cross-encoder (Cohere Rerank 3, BGE-reranker) can improve the quality of the top hits by 15–30%.
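What "most similar" means can be sketched in a few lines. The example below is a brute-force cosine-similarity search in plain Python over toy vectors; it is not how Qdrant or Pinecone work internally (they use approximate nearest-neighbour indexes to stay fast at scale), and the names `cosine` and `top_k` are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 for the same direction, 0.0 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 8):
    """Return the k (chunk_id, score) pairs with the highest similarity to the query."""
    scored = [(cosine(query_vec, vec), chunk_id) for chunk_id, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(chunk_id, score) for score, chunk_id in scored[:k]]


# Toy 2-dimensional "embeddings"; real ones have hundreds of dimensions.
index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
hits = top_k([1.0, 0.0], index, k=2)
```

A reranking step, if used, would take these hits and re-score each (query, chunk text) pair with a cross-encoder before passing the best ones on.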
Generation: The original question plus the retrieved chunks go to the language model as a prompt. With a clear instruction ("Answer only from the given sources. If the answer is not there, say not in the material"), the model produces a grounded answer with citations.
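As a sketch of the generation step, the function below assembles such a prompt from the question and the retrieved chunks. The dictionary keys (`source`, `page`, `text`) and the function name `build_prompt` are assumptions for illustration; the resulting string would be sent to whichever language model the pipeline uses.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Number the retrieved chunks and place them before the question.

    Each chunk is assumed to be a dict with 'source', 'page' and 'text' keys.
    The numbering lets the model cite passages as [1], [2], ...
    """
    sources = "\n\n".join(
        f"[{i}] ({c['source']}, p. {c['page']})\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer only from the given sources. Cite as [1], [2]. "
        "If the answer is not there, say not in the material.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}"
    )
```

Keeping the document name and page next to each numbered passage is what later allows an answer's citation to be traced back to the exact source.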
RAG workflow in 6 steps
- 01 Inventory source systems: which documents, in what format, with what confidentiality tier?
- 02 Choose a chunking strategy: structure-aware (Markdown headers, PDF bookmarks) instead of blind 500-token slicing.
- 03 Choose an embedding model: OpenAI text-embedding-3-small for default, BGE-large-en/de for local-only, Cohere embed-multilingual for DE/FR/IT.
- 04 Set up the vector database: Qdrant on-prem for revDSG, Pinecone for low-overhead setups.
- 05 Build the retrieval logic: top-k = 8, optional cross-encoder rerank for the top 3.
- 06 Define the answer prompt: "Answer only from the given sources. Cite as [1], [2]. If the answer is not there, say so."
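The decisions from steps 02 to 06 can be captured in one settings object so they are reviewed in a single place. The sketch below is hypothetical: the class `RagConfig` and its field names are not from any library, and `rerank_keep` reflects one reading of step 05, namely that reranking keeps the 3 best of the 8 retrieved chunks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RagConfig:
    """Hypothetical settings object; field names are illustrative only."""

    chunking: str = "structure-aware"                # step 02
    embedding_model: str = "text-embedding-3-small"  # step 03
    vector_db: str = "qdrant-on-prem"                # step 04
    top_k: int = 8                                   # step 05
    rerank_keep: int = 3                             # step 05; 0 disables reranking
    answer_prompt: str = (                           # step 06
        "Answer only from the given sources. Cite as [1], [2]. "
        "If the answer is not there, say so."
    )

    def __post_init__(self) -> None:
        # Reject combinations that cannot work before the pipeline starts.
        if self.top_k < 1:
            raise ValueError("top_k must be at least 1")
        if not 0 <= self.rerank_keep <= self.top_k:
            raise ValueError("rerank_keep must be between 0 and top_k")
```

A local-only variant would then be a one-line change, for example `RagConfig(embedding_model="BGE-large")`.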
When to use RAG
RAG is the right choice when (a) the answer lives in existing internal documents, (b) you must prove the source, and (c) the data volume is too large to copy into every prompt.
Real Swiss use cases: client FAQ from 5 years of correspondence, federal tax authority guidelines as searchable knowledge, an internal precedent library at a law firm, manuals and SOPs of an SME. Fiduciary offices use RAG for client onboarding (which documents do we need for an inheritance case in Zug?), for VAT preparation (which receipts are missing?), for collections (what does the last correspondence with this client say?).
The size of the knowledge base is surprisingly flexible. Qdrant indexes millions of chunks on commodity hardware; even a 500-person firm rarely exceeds 10 million chunks. Answer time stays under two seconds – even with large corpora.
When not to use
RAG is the wrong choice when the answer comes from general world knowledge rather than from documents ("What does a Big Mac cost in Geneva?"); the language model alone is enough. RAG is also wrong when the data is small enough to fit in a single prompt: modern models have context windows of 200k to 2M tokens, so a 30-page guideline fits in its entirety, which is faster and simpler than RAG.
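The "does it fit in a single prompt" check is simple arithmetic. The sketch below uses assumed rules of thumb (about 500 words per page, about 1.3 tokens per word, and a quarter of the window reserved for instructions and the answer); actual ratios vary by language and tokenizer, so the numbers are estimates, not measurements.

```python
def estimated_tokens(pages: int, words_per_page: int = 500,
                     tokens_per_word: float = 1.3) -> int:
    """Rough token estimate for a document; both ratios are assumptions."""
    return round(pages * words_per_page * tokens_per_word)


def fits_in_context(pages: int, context_window: int = 200_000,
                    reserve: float = 0.25) -> bool:
    """True if the whole document fits while keeping `reserve` of the window free."""
    budget = context_window * (1 - reserve)
    return estimated_tokens(pages) <= budget
```

Under these assumptions a 30-page guideline comes to roughly 19,500 tokens and fits easily into a 200k window, while a 1,000-page archive at about 650,000 tokens does not.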
Other pitfalls: if the original documents are not digital, RAG is not the first step; OCR and format conversion come first (see AI document recognition). If documents change frequently, the pipeline must trigger re-indexing automatically, otherwise RAG answers with stale data. If answers should be creative ("write me a rental contract from scratch"), RAG is restrictive: it grounds the answer in existing material and suppresses originality.
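One common way to trigger re-indexing is to compare content hashes: store a hash per document at indexing time and re-embed only what has changed. The sketch below assumes documents are available as a mapping from ID to extracted text; the function names are hypothetical, and scheduling (cron job, file watcher, webhook) is left out.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's extracted text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(documents: dict[str, str],
                 indexed_hashes: dict[str, str]) -> tuple[list[str], list[str]]:
    """Compare current documents with the hashes stored at indexing time.

    Returns (to_index, to_delete): IDs that are new or changed, and IDs
    whose chunks should be removed because the document no longer exists.
    """
    to_index = [
        doc_id for doc_id, text in documents.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    ]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in documents]
    return to_index, to_delete
```

Deleting the chunks of removed documents matters as much as adding new ones; otherwise the retriever keeps citing material that has been withdrawn.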
Trade-offs
STRENGTHS
- Answers come with source citations – verifiable, audit-ready
- Current knowledge without model retraining
- Data stays in your own infrastructure (with a local vector DB)
- Scalable: millions of documents without prompt-limit issues
WEAKNESSES
- More moving parts: ingestion, chunking, embedding, retrieval – every station can break
- Initial setup effort: 3–10 days depending on data variety
- Retrieval quality is only as good as the chunking – poorly sliced documents return poor answers
- Model still hallucinates when the retriever comes back empty and the prompt does not instruct it to refuse
FAQ
What does a RAG pipeline cost for 10,000 documents?
One-time embedding setup: roughly CHF 15–40 (text-embedding-3-small). Qdrant storage: < CHF 5/month. Per query: about CHF 0.002 (embedding the question + retrieval) plus model costs. Total for a 5-person fiduciary office with 200 queries/month: < CHF 20/month running cloud cost, plus the setup.
Does RAG still hallucinate?
Less, not zero. If the retriever finds no fitting source, the model may still invent something, unless the prompt explicitly requires it to state that the answer is not in the material. Two countermeasures: (a) a clear refusal instruction in the system prompt, (b) a citation-check pipeline that verifies, after the answer, that the cited passages actually appeared in the retrieval result.
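Countermeasure (b) can start very small. The sketch below assumes answers cite sources as [1], [2], numbered in the order the chunks were passed to the model, and checks only that every cited number points at a chunk that was actually retrieved. It does not verify that the passage supports the claim; that would need a further check. The function name and return shape are illustrative.

```python
import re


def check_citations(answer: str, retrieved_count: int) -> dict:
    """Check that every [n] in the answer refers to one of the retrieved chunks."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    invalid = sorted(n for n in cited if not 1 <= n <= retrieved_count)
    return {
        "cited": sorted(cited),
        "invalid": invalid,
        # Grounded here means: at least one citation, and none out of range.
        "grounded": bool(cited) and not invalid,
    }
```

An answer that cites [3] when only two chunks were retrieved is flagged as not grounded, as is an answer with no citation at all. A legitimate "not in the material" refusal also carries no citation, so it would need to be recognised separately before applying this check.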
Do I need GPU hardware?
No. Pure RAG retrieval runs on CPUs, and Qdrant runs on commodity servers. A GPU becomes relevant only if you want to run an embedding model or the language model itself locally (e.g. Llama 3.1 8B + BGE-large on-prem). For standard setups with a cloud LLM provider, no GPU is required.