fairlane.systems

RAG · HOW-TO

RAG pilot in 7 days: from 50 PDFs to a working knowledge base (May 2026)

Day-by-day guide from 50 PDFs to working retrieval-augmented generation with pgvector, BGE-M3, LiteLLM, Streamlit UI and Ragas eval. Budget CHF 800-1500.

Researched & fact-checked by: · As of: 2026-05

What is this about?

This guide takes a fiduciary, law or SME team from zero to a working retrieval-augmented generation system in exactly 7 business days. You collect 50 PDFs, set up pgvector on Postgres, embed with BGE-M3, route LLM requests via LiteLLM to Claude or Mistral, build a simple Streamlit UI, run eval with Ragas and hand over a running demo to the team on day seven.

The pilot is deliberately small: 50 PDFs, one UI, one use case. The goal is not a production system but a presentable result that supports the stakeholder decision "continue or shelve". Budget sits between CHF 800 (with OpenAI trial credit and own work) and CHF 1,500 (with Claude API credits and external help on selected steps).

The stack is conservative on purpose: pgvector instead of Qdrant (every Postgres install can do pgvector – no extra infrastructure). BGE-M3 as the embedding model (multilingual DE/FR/IT/EN, runs locally, May 2026 state-of-the-art). LiteLLM as gateway (provider swappable without code change). Streamlit for UI (no React knowledge needed). Ragas for eval (faithfulness, answer relevance, context precision).

Why this pilot is worth it

Without a pilot, RAG remains a presentation slide. A one-week investment gives concrete answers to the three questions that decide every RAG project: are documents clean enough? Does the retriever find the right passages? Does the model deliver credible, sourced answers?

Experience from about 40 Swiss pilot projects: 60% of issues sit in document quality (scanned PDFs without OCR, bad chunking boundaries), 25% in retrieval (k set too small, no reranking), 15% in the model prompt (no clear refusal). A pilot week surfaces all three weaknesses before budget for a production solution gets approved.

The second argument is team learning: after 7 days the business team knows what chunking is, why 50 PDFs are too few and which questions RAG answers reliably. That shared vocabulary is the precondition for any further investment decision. A stakeholder who has seen live that the system gives a sourced answer for question X and says "not in the material" for question Y decides differently than one who has only seen slides.

How the pilot is structured

The 7-day plan follows the classic RAG pipeline: ingestion, chunking, embedding, retrieval, generation, eval. One pipeline station per day, handover on day seven.

Day 1 is documents: collect 50 PDFs from the target area. For a fiduciary use case those are 50 guidelines, briefs and internal SOPs. Important: mixed quality (text PDFs, scanned PDFs, Word exports) – the pilot must reflect reality, not a clean demo world.

Day 2 is infrastructure: install pgvector on Postgres 15+ (Docker is enough), create the chunk and embedding tables, write a backup script. Since version 0.7, pgvector supports HNSW indexes with sub-second search up to millions of vectors.

Day 3 is embedding: BGE-M3 (BAAI, Hugging Face) delivers 1024-dimensional multilingual embeddings with dense+sparse hybrid mode. Structure-aware chunking: by paragraph markers in PDFs (pdfplumber), 400 tokens per chunk with 80 tokens overlap. Embedding locally via the FlagEmbedding library or via Ollama with bge-m3 – both work.

Day 4 is LLM wiring: LiteLLM proxy on port 4100, one provider entry for Claude Haiku (large, fast, EU region in Bedrock) and one for Mistral Large 2 (EU hosting). Test script: question + top-8 chunks + system prompt with refusal instruction to LiteLLM, JSON answer with citations.

Day 5 is UI: Streamlit app in under 100 lines – search field on top, answer box with source cards below. The user clicks a source, sees the exact chunk plus document name and page. Deploy on an internal server, reachable over VPN/Tailscale.

Day 6 is eval: 30 test questions with gold answers from the business team (1-hour workshop). Ragas computes three scores: faithfulness (does the answer stick to sources?), answer relevance (does the answer fit the question?), context precision (are the top chunks relevant?). Target: faithfulness > 0.85, others > 0.70.

Day 7 is handover: a 90-minute session with stakeholders, 15 prepared questions run live, eval report presented, decision template handed over.

RAG pilot in 7 days

  1. 01Day 1 – collect 50 PDFs: lock the use case (e.g. "VAT guidelines for client FAQ"). Collect 50 PDFs in a folder with metadata (title, date, source) in a CSV. Mix text PDFs (clean) and scans (noisy) – reality check.
  2. 02Day 2 – set up pgvector: Docker `docker run -d --name pgvector -p 5432:5432 -e POSTGRES_PASSWORD=changeme -v pgdata:/var/lib/postgresql/data pgvector/pgvector:pg17`. Schema: `CREATE EXTENSION vector; CREATE TABLE chunks (id BIGSERIAL PRIMARY KEY, doc_id TEXT, page INT, content TEXT, embedding VECTOR(1024)); CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);`.
  3. 03Day 3 – chunking and embedding: Python script: `pip install pdfplumber FlagEmbedding`. Per PDF extract text with pdfplumber, slice into 400-token chunks with 80-token overlap. Embedding via `from FlagEmbedding import BGEM3FlagModel; model = BGEM3FlagModel("BAAI/bge-m3"); model.encode(chunks)`. INSERT into pgvector. Expect 4,000-8,000 chunks for 50 PDFs, embedding time about 15-30 minutes on CPU.
  4. 04Day 4 – LiteLLM and retrieval: LiteLLM via Docker `docker run -d -p 4100:4000 -v $(pwd)/config.yaml:/app/config.yaml ghcr.io/berriai/litellm-stable --config /app/config.yaml`. config.yaml with two models (claude-3-haiku-eu, mistral-large-eu). Python retriever: embed question, `SELECT content, doc_id, page FROM chunks ORDER BY embedding <=> $1 LIMIT 8` via pgvector. Top-8 as context to LiteLLM with system prompt "answer only from sources, cite [1]..[8], otherwise say `not in the material`".
  5. 05Day 5 – Streamlit UI: `pip install streamlit`. App in app.py: `import streamlit as st; question = st.text_input("Question"); if question: chunks = retrieve(question); answer = call_llm(question, chunks); st.write(answer); for i, c in enumerate(chunks): with st.expander(f"[{i+1}] {c.doc_id} p.{c.page}"): st.text(c.content)`. Start with `streamlit run app.py --server.port 8501`. Expose internally via Tailscale.
  6. 06Day 6 – eval with Ragas: 1-hour workshop with business team, 30 questions + gold answers in eval.csv. `pip install ragas datasets`. Script: for each question pull system answer, pack into HuggingFace dataset, `evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision])`. Score report as Markdown. Target: faithfulness > 0.85, others > 0.70.
  7. 07Day 7 – stakeholder demo: 90-minute session. Run 15 prepared questions live (mix of "fits" and "not in the material"). Present eval report. Budget estimate for production: small variant CHF 8-15k (setup) + CHF 200-400/month (running), medium CHF 20-40k + CHF 800-1500/month. Decision template with three options: continue, adjust, shelve.
  8. 08Step 8 – bonus, prepare phase 2: If decision is to continue, record the known weaknesses (OCR quality, chunk boundaries, missing reranker, refusal edge cases). That list is the backlog for sprint 1 after the pilot. Plus: data-protection check for production mode (TIA, possibly EDÖB notice).

When the pilot is worth it

The pilot is the right choice when (a) you are thinking about RAG but have never built one, (b) the team needs clarity on whether your documents are RAG-suitable, or (c) a budget decision is pending and concrete material is needed.

Typical triggers: a fiduciary management team has been discussing "we should do AI" for 6 months with no outcome. A law firm has 5 years of client correspondence and wants to know if a searchable AI assistant is feasible. An SME has an 800-page quality manual and wants a self-service bot for employees.

After the pilot you have three clear results: (1) working demo with real documents, (2) Ragas score report with measurable quality, (3) budget estimate for production with stated assumptions. From there you either kick off a production project or bury the plan with data.

When the pilot should wait

The pilot is not the right step when (a) document base is not yet digital – then OCR pilot first, RAG pilot after. (b) the use case actually needs general AI consulting, not document search – RAG is wrong for "write me a contract from scratch". (c) the stakeholder team does not yet know which question they want answered – then discovery workshop first, pilot after.

Also wrong: a pilot with mixed use cases ("we want RAG for client FAQ AND for contract generation AND for payroll"). One pilot, one use case. More turns into demo theatre and delivers no decision-grade data.

Another pitfall: a pilot with over 200 PDFs. Sounds more realistic but costs day 1 and day 2 instead of 1 hour and 1 day. 50 PDFs is the minimum for usable eval scores and at the same time few enough for a 7-day sprint.

Trade-offs

STRENGTHS

  • 7 days from zero to a presentable demo with real data
  • Clear eval scores as a decision basis instead of gut feel
  • Stack is production-extensible – no throw-away code
  • Budget under CHF 1,500 with main effort as own work

WEAKNESSES

  • 50 PDFs are too few for a final verdict on production readiness
  • Streamlit UI not production-ready – mobile, RBAC, multi-user missing
  • 30-question eval set cannot fully represent overall system behaviour
  • Without at least one Python+Docker person on the team external help is needed

FAQ

What does the 7-day pilot actually cost?

Three cost blocks. (1) LLM API calls: Claude Haiku or Mistral Small over about 30 test questions + eval run = under CHF 5. (2) Infrastructure: Postgres container on own server = 0, or Hetzner cloud VM for 7 days = CHF 5. (3) People hours: 30-60h of own work. With external mentoring add CHF 800-1500 for guidance sessions. Total range CHF 10-1500 depending on own share.

Why BGE-M3 and not OpenAI embeddings?

Three reasons. (1) Multilingualism: BGE-M3 is stronger on DE/FR/IT than OpenAI text-embedding-3-small. (2) Data isolation: BGE-M3 runs locally – no provider call for the most frequent step in the pipeline. (3) Hybrid mode: BGE-M3 produces dense+sparse vectors in one model, so later hybrid retrieval (semantic + keyword) is possible without a second model. OpenAI embeddings are a valid alternative for pure EN cases with high cloud acceptance.

What if eval scores fall below target?

Diagnose in this order: (1) low faithfulness: tighten system prompt (include refusal examples). (2) low context precision: try larger k (12 instead of 8) or add a reranker (BGE-reranker-v2-m3, a third model). (3) low answer relevance: question reformulation upstream (HyDE pattern or multi-query). If still below target after these three levers: documents are the problem (OCR, chunking) – budget phase-2 for pre-processing.

Can the team run the pilot alone?

Yes, with two preconditions. (1) At least one person with Python experience and Docker basics. (2) Willingness to add 2-4 hours of external mentoring on day 4 and day 6 if something gets stuck. Experience: 70% of pilot teams make it alone, 30% need one point of help. With no prior Python ecosystem experience the pilot is closer to 14 than 7 days.

Related topics

RAG · AI CONCEPTRetrieval-Augmented Generation (RAG): how AI answers from your own documentsEMBEDDINGS · AI CONCEPTEmbeddings and vectors: how language becomes mathematicsCHUNKING · AI CONCEPTChunking strategies for RAG: fixed-size, recursive, semantic, late chunkingEVAL FRAMEWORKS · AI CONCEPTEval frameworks for LLMs: DeepEval, OpenAI Evals, Promptfoo, Ragas, TruLens comparedLITELLM · TECHLiteLLM: one gateway for 100+ LLM providers behind a single APIOLLAMA · HOW-TOInstall Ollama: step-by-step guide for Mac, Linux and Windows (May 2026)LITELLM · HOW-TOInstall the LiteLLM gateway: Docker, config.yaml, virtual keys, cost tracking and Langfuse (May 2026)

Sources

  1. pgvector – open-source vector extension for Postgres · 2026-05
  2. BGE-M3 – multilingual multi-functional multi-granularity embedding (BAAI) · 2026-04
  3. Ragas – evaluation framework for RAG pipelines · 2026-05
  4. Streamlit documentation – fast Python UIs for ML demos · 2026-04
  5. LiteLLM Proxy quick start · 2026-05

FITS YOUR STACK?

What this looks like in your business – a 30-minute intro call.

Book a call