RAG ON YOUR OWN KNOWLEDGE · SERVICE
RAG on your own knowledge: answers from your documents – with sources, not made up
Searchable knowledge base with chat. PII redaction, citation check, DE/FR/IT/EN. Pilot up to 10,000 docs CHF 3,500, Production from CHF 8,500.
Researched & fact-checked by: DuneDive LLC · As of: 2026-05
What does the service include?
We build you an AI that answers from your own documents – contracts, manuals, client files, guidelines, correspondence. With a source citation per answer, not invented. PII redaction before the model call and citation check after the answer. Languages: German, French, Italian, English. Hosting: on your server, with Qdrant as the vector database.
The service is the production implementation of the Retrieval-Augmented Generation architecture pattern (see the linked concept topic for the theory). We deliver the service view here: what you get, how long it takes, what it costs, and what the practical levers are.
Variants: Pilot up to 10,000 documents (CHF 3,500) as a 2- to 3-week implementation with one data source, a chat UI and basic logging. Production over 100,000 documents (from CHF 8,500, custom) with multiple data sources, automatic re-indexing on changes, hybrid retrieval (vector plus BM25 plus reranker), citation guards and an audit trail under Art. 957a CO.
What we do not do: pure AI consulting without implementation, client data migration from paper files (that is digitisation, not RAG), or a pure cloud RAG solution in a US hyperscaler. The stack runs on your server.
Why this service
The productive RAG implementation solves three problems.
Hallucination. A language model without RAG answers plausibly but incorrectly – especially on industry-specific knowledge. "Which guideline applies to 2026 expense flat rates?" GPT-4 answers with imaginary paragraphs. RAG grounds the answer on the original document, gives the page number, and says "not in the material" when the source is missing.
Data location. Your client correspondence, your contract storage, your manuals – these should not wander to ChatGPT "to improve the service". With Qdrant on-premise (on your Hetzner server) the material stays in the EU. Only the passage relevant to the query wanders – encrypted – to the chosen language model, and even that can be restricted to EU-only or local (see Multi-LLM Gateway).
Compliance fitness. Bookkeeping-relevant processes require an audit trail under Art. 957a CO. If an AI suggestion lands in a reminder, an invoice or a contract, the source behind it must be traceable. RAG with citation logging delivers that automatically – every answer is linked to a source hash and retrieval snapshot.
In 2026 RAG is the standard pattern for Swiss fiduciary and law-firm AI. The OWASP LLM Top 10 (2026 edition) explicitly lists "Sensitive Information Disclosure" and "Vector and Embedding Weaknesses" – RAG with PII redaction and citation check is the direct answer.
How the service is delivered
Delivery runs in six phases over 2 to 4 weeks (Pilot) or 6 to 12 weeks (Production).
Phase 1 – Source systems: We inventory where documents come from – file share, SharePoint, mail attachments, CRM attachments, scan archive. We classify confidentiality (public / internal / confidential / professional secrecy) and define which classes go into the index at all.
Phase 2 – Ingestion and chunking: Tools like unstructured.io or LlamaIndex convert PDF, Word, HTML and OCR scans into plain text. Structure-aware chunking by Markdown headers or PDF bookmarks beats blind token slicing. Typical chunk size: 300 to 800 tokens with 50 to 100 token overlap.
Phase 3 – PII redaction: Before embedding, a redaction pipeline (Microsoft Presidio or a regex-plus-NER pipeline) runs over the text. It replaces names, social-security numbers, addresses, IBANs and phone numbers with tokens (`[PERSON_1]`, `[IBAN_3]`) stored in a separate map. The original document stays untouched – only the index sees the redacted version. On output the redaction can optionally be reversed for internal users with the right permissions.
Phase 4 – Embedding and index: We recommend OpenAI text-embedding-3-large (3072 dimensions, multilingual, ~CHF 0.10 per 1M tokens) or Cohere embed-multilingual-v3 (1024 dimensions, EU-friendly, slightly cheaper). The vectors land in Qdrant – with metadata for client, date, confidentiality and source URL. Qdrant runs as a Docker container on your server.
Phase 5 – Retrieval and generation: At query time the question is embedded first, then the top-k chunks are retrieved through Qdrant (typically k=8). The Production variant adds hybrid retrieval (Qdrant vector plus BM25 full text) and a cross-encoder reranker (Cohere Rerank 3, BGE-reranker) that surfaces the top 3. The question plus the chunks go with a clear refusal instruction to the language model – routed via the multi-LLM gateway, with a model matching the data class.
Phase 6 – Citation check and audit log: After the answer a citation pipeline verifies that the cited passages actually appeared in the retrieval result. Hallucinated citations are filtered and marked as "source not in the material". Every request enters the audit log: prompt hash, retrieval snapshot, chosen model, token counts, answer, citation-check outcome. In the Production variant this is secured by hash chains and laid out for Art. 957a CO.
RAG service workflow in 7 steps
- 01Inventory source systems: which documents, in what format, with what confidentiality tier?
- 02Build the ingestion pipeline: convert PDF/Word/mail/HTML to clean text with metadata.
- 03Wire up PII redaction: tokenise names, social security numbers, IBANs, addresses before embedding.
- 04Embedding and indexing: text-embedding-3-large or Cohere multilingual, vectors into Qdrant.
- 05Retrieval logic: top-k = 8, optional hybrid with BM25 plus cross-encoder rerank.
- 06Answer prompt with refusal instruction and citation check before output.
- 07Audit log and re-indexing: log every request, re-index changed documents automatically.
When the service is worth it
The service is worth it when (a) answers live in your documents and not in general world knowledge, (b) you must prove the source – whether for bookkeeping duty, professional secrecy or audit trail, (c) you have too much material to simply paste into a prompt (typically from 50 to 100 documents up), and (d) the material is digital (PDF, Word, email – no paper archive without OCR).
Real Swiss use cases: a law firm with 8 lawyers indexes 15 years of client correspondence – a new client asks, the lawyer gets a 5-second summary with citations from the relevant files. A fiduciary with 80 mandates indexes the cantonal tax guidelines – on VAT questions the correct clause comes back with paragraph and date. An industrial SME indexes 4,000 SOPs, machine manuals and repair logs – service technicians ask via WhatsApp bot and get the right passage.
The Pilot is the right choice when you have a single data source, one department and a clearly scoped question. Production becomes worthwhile when several data sources combine, automatic re-indexing is needed (client files change), or when the answers feed into regulated processes (tax advice, legal information, compliance).
When not
The service is the wrong choice when (a) your documents are not digital – then digitisation with OCR comes first, a separate service not part of this module, (b) you have too little material (under 30 to 50 documents fit into the context of a modern language model, RAG is overhead), (c) you want creative text from scratch (sales copy, new contract drafts – RAG grounds on existing material and suppresses originality), or (d) your data is too sensitive to be indexed at all.
The last point matters: even with a local vector DB the embedding vector remains a representation of your text. In rare cases fragments of the original can be reconstructed from embeddings (see "Embedding Inversion Attacks", arXiv 2024). For extremely sensitive data (internal investigations, criminal mandates, M&A preparations) RAG is not the first step – but rather an isolated sub-database per mandate with an encrypted index.
And: RAG does not solve a workflow problem. It returns answers – not actions. Anyone who wants to auto-book invoices, auto-enter calendar items or auto-reply to mail needs workflow automation in addition (see n8n module). RAG is the knowledge layer; n8n is the action layer.
Trade-offs
STRENGTHS
- Answer with source citation – verifiable, audit-fit under Art. 957a CO
- Data stays in your own infrastructure, Qdrant on-premise
- PII redaction before the model call – even local vectors do not see plaintext names
- DE/FR/IT/EN multilingual – Cohere embed-multilingual covers Swiss languages
- Pilot delivered in 2 to 3 weeks – not 6 months of project time
WEAKNESSES
- Moving parts: ingestion, chunking, embedding, retrieval – every station can break
- Non-digital material needs OCR digitisation first – extra effort
- Retrieval quality is only as good as the chunking – poorly sliced documents yield poor answers
- For extremely sensitive data (M&A, criminal law) embedding inversion remains a residual risk
- Does not solve a workflow problem – actions need n8n in addition
FAQ
How long does a pilot really take?
With pre-prepared digital documents and a single data source: 2 calendar weeks, of which 4 to 6 person-days of effort. With multiple sources or scan PDFs needing OCR: 3 to 4 weeks. We give an honest re-estimate at the end of week one – if the material is harder than expected, you see that early.
What does ongoing operation cost?
For a 5-person fiduciary office with 200 queries per month and 10,000 documents: Qdrant container costs zero (runs on existing server), one-time embedding setup CHF 15 to 40, per query about CHF 0.002 plus language model cost. Total under CHF 30 per month in running cloud cost. With 100,000 documents and 2,000 queries per month: CHF 100 to 250.
What if the material sits in OneDrive or SharePoint?
Connectable. We have connectors for Microsoft Graph API, Google Drive API, Nextcloud, local file shares and IMAP mail. For OneDrive or SharePoint an incremental sync runs via webhook so changed documents re-index within minutes. Microsoft 365 access rights are honoured – a case worker only sees answers from documents they have access to.
How do you prevent the model from hallucinating?
Three layers. First: a clear refusal instruction in the system prompt – "answer only from the given sources. If the answer is not there, say so." Second: a citation check after the answer that compares each citation against the retrieval result. Third: for critical use cases (tax advice, legal information) a human in the loop – the AI proposes, a human approves.
Related topics
Sources
- Lewis et al. – Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Meta AI) · 2020-05
- Qdrant – Production vector search engine documentation · 2026-05
- Microsoft Presidio – PII detection and anonymisation · 2026-03
- OWASP – Top 10 for LLM Applications 2026 · 2026-02
- Kiteworks – RAG Pipeline Security Best Practices 2026 · 2026-03
- OpenAI – Embeddings guide (text-embedding-3 family) · 2026-04