fairlane.systems

METADATA · AI CONCEPT

Metadata and filters in RAG: pre-filter vs post-filter, Qdrant payload index, pgvector WHERE

How structured metadata makes client, date, confidentiality, language and source filterable per chunk: pre- vs post-filter, Qdrant payload index, pgvector WHERE and time-aware retrieval.

Researched & fact-checked by: · As of: 2026-05

What is metadata in RAG?

Metadata in RAG are structured key/value pairs attached to every chunk alongside text and embedding: client ID, document type, creation date, confidentiality tier, language, source system, tags, author, MIME type, hash. They are the scaffolding that turns a semantic search system into a production-grade RAG tool.

Without metadata, retrieval finds the semantically closest chunks regardless of context. A question about client Bachmann pulls in answers about client Hartmann. A question about VAT 2024 surfaces answers from 2021. A payroll bookkeeper's query accesses confidential legal data they must not see. All semantically correct, all contextually wrong.

As of May 2026 metadata filtering is a mandatory building block in any production RAG pipeline. The technique is natively available in the major vector databases. Qdrant supports payload indices on arbitrary fields (keyword index, range index for numbers and dates, full-text index). Pinecone uses metadata filters expressed in a JSON filter language modelled on a subset of MongoDB query operators. pgvector combines vector search with standard SQL WHERE clauses (HNSW index on the vector column, B-tree indices on the filtered columns). Weaviate has a schema-based filter system.

The design decision is: what gets stored in the metadata payload, and which fields are indexed. Indexing costs memory; filtering on non-indexed fields is slow. A well-designed metadata schema balances the two. Typical mandatory fields: client ID, creation date, confidentiality tier, source, language. Optional: author, tags, document type, keywords.

Why it matters

Five effects make metadata a decisive factor in production RAG systems.

First: tenant isolation. In every fiduciary office and law firm, every client is entitled to having their data accessible only to the assigned staff. Without metadata filters a semantic search crosses tenants. With client ID as a mandatory filter on every query, search stays within client scope. This is not just convenience but a precondition for professional secrecy (Art. 321 SCC) and the client relationship.

Second: currency filtering. A query "standard VAT rate 2024" must not return a 2021 source. Time-aware retrieval filters sources to a relevant period, optionally weighted (newer sources scored higher). With legal changes this is mandatory: old guidelines must be marked and dropped on "current" queries.

Third: confidentiality tiers. A firm with paralegals must differentiate: public knowledge (legal text, guidelines), internal knowledge (practice notes), client knowledge (files), special confidentiality (compliance, disciplinary cases). Metadata tier plus role-based filters is the clean solution. A payroll clerk must not see attorney files.

Fourth: source audit. For audit-ready AI use (Art. 957a CO, FINMA circular 2024/4) every answer must cite sources. Metadata carry source information (file, page, creation date, author) that is embedded in the citation. Without metadata the answer is untraceable.

Fifth: performance. A pre-filter (the corpus is restricted before vector search) often reduces the searched index area by 90 percent or more. A search across 100,000 chunks becomes a search across 5,000: faster and more precise.

How it works

Pre-filter (before vector search): the filter condition is applied first, the index is restricted to matching chunks, then vector similarity is computed on what remains. Qdrant does this efficiently via payload indices (keyword index for exact matches, range index for numbers and dates). Pre-filtering pays off under strict, selective filters (client=Bachmann reduces the search from 1M to 5,000 chunks).

Post-filter (after vector search): first fetch the top 100 hits with vector search, then apply the filter, finally keep the top 5. This works when the filter is not very selective. The risk: under a highly selective filter, all 100 candidates may be discarded and the retriever returns zero results even though matching chunks exist.
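The failure mode described above can be reproduced on a toy corpus. The following is a minimal sketch in plain Python, not a vector database: the chunk layout, the field names and the two helper functions are assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def pre_filter_search(chunks, query_vec, client_id, k):
    """Restrict to the client's chunks first, then rank by similarity."""
    candidates = [c for c in chunks if c["client_id"] == client_id]
    ranked = sorted(candidates, key=lambda c: cosine(c["vec"], query_vec), reverse=True)
    return ranked[:k]

def post_filter_search(chunks, query_vec, client_id, k, fetch_k):
    """Rank everything, keep the top fetch_k, filter afterwards, cut to k."""
    ranked = sorted(chunks, key=lambda c: cosine(c["vec"], query_vec), reverse=True)
    survivors = [c for c in ranked[:fetch_k] if c["client_id"] == client_id]
    return survivors[:k]

# Toy corpus: 100 Hartmann chunks sit very close to the query,
# 3 Bachmann chunks are slightly further away.
query = [1.0, 0.0]
chunks = [{"id": f"h{i}", "client_id": "hartmann", "vec": [1.0, 0.001 * i]} for i in range(100)]
chunks += [{"id": f"b{i}", "client_id": "bachmann", "vec": [1.0, 0.5 + 0.1 * i]} for i in range(3)]

pre = pre_filter_search(chunks, query, "bachmann", k=5)
post = post_filter_search(chunks, query, "bachmann", k=5, fetch_k=100)

print([c["id"] for c in pre])   # the three Bachmann chunks
print(post)                     # empty: all 100 fetched candidates belonged to Hartmann
```

The post-filter variant returns nothing although three matching chunks exist, because the filter only ever sees the 100 nearest neighbours.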

Qdrant payload index: created per field via REST or client SDK. Keyword index for string fields (client_id, document_type). Integer/float range index for numbers (year, page_number). Datetime index for ISO-8601 dates. Full-text index for free-text fields (title, summary). Indexing costs memory: about 30 to 60 bytes per entry per indexed field. With 1M chunks and 5 indexed fields: 150 to 300 MB memory.
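As a rough illustration, the index definitions and the memory estimate can be written down as follows. This is a sketch under assumptions: the collection name `chunks` and the field list are hypothetical, the request shape follows Qdrant's REST endpoint for payload indices (`PUT /collections/{name}/index`), and the code only builds the requests without sending them.

```python
import json

# Hypothetical field list: field name -> Qdrant field_schema type.
INDEXED_FIELDS = {
    "client_id": "keyword",
    "document_type": "keyword",
    "year": "integer",
    "created_at": "datetime",
    "title": "text",
}

def index_requests(collection):
    """Return one (method, path, JSON body) triple per indexed field."""
    path = f"/collections/{collection}/index"
    return [
        ("PUT", path, json.dumps({"field_name": name, "field_schema": schema}))
        for name, schema in INDEXED_FIELDS.items()
    ]

def index_memory_mb(n_chunks, n_fields, bytes_per_entry):
    """Estimate index memory in MB from the per-entry figure quoted in the text."""
    return n_chunks * n_fields * bytes_per_entry / 1_000_000

low = index_memory_mb(1_000_000, 5, 30)
high = index_memory_mb(1_000_000, 5, 60)
print(low, high)  # 150.0 300.0

for method, path, body in index_requests("chunks"):
    print(method, path, body)
```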

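pgvector with WHERE: in Postgres, vector search runs as ORDER BY embedding <-> query_vector LIMIT k, combined with a standard WHERE clause. B-tree indices on the filtered fields let the planner apply a selective filter first and compute exact distances on the remaining rows. The HNSW index on the vector column accelerates proximity search, but when the planner uses it, the WHERE condition is applied to the candidates the index returns, so a selective filter can yield fewer than k rows. From pgvector 0.8.0, iterative index scans address this: the index keeps being scanned until enough rows pass the filter or a configured limit is reached.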
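A filtered nearest-neighbour query might look like the sketch below. The table name `chunks`, its columns and the helper functions are assumptions for illustration; the SQL uses psycopg-style named parameters and is only assembled here, not executed. The optional `hnsw.iterative_scan` setting requires a pgvector version that ships iterative index scans.

```python
# Hypothetical table: chunks(id, content, client_id, created_at, embedding vector).
DDL = [
    "CREATE INDEX ON chunks USING hnsw (embedding vector_l2_ops)",  # proximity search
    "CREATE INDEX ON chunks (client_id)",                           # B-tree for the filter
    "CREATE INDEX ON chunks (created_at)",                          # B-tree for the date range
]

# Optional session setting: keep scanning the HNSW index until enough rows pass the filter.
ITERATIVE_SCAN = "SET hnsw.iterative_scan = strict_order"

FILTERED_KNN_SQL = """
SELECT id, content, embedding <-> %(query_vec)s::vector AS distance
FROM chunks
WHERE client_id = %(client_id)s
  AND created_at >= %(valid_from)s
ORDER BY embedding <-> %(query_vec)s::vector
LIMIT %(k)s
"""

def to_vector_literal(vec):
    """Format a Python sequence as a pgvector text literal, e.g. [0.1,0.2,0.3]."""
    return "[" + ",".join(str(float(x)) for x in vec) + "]"

def build_params(query_vec, client_id, valid_from, k):
    """Collect the named parameters expected by FILTERED_KNN_SQL."""
    return {
        "query_vec": to_vector_literal(query_vec),
        "client_id": client_id,
        "valid_from": valid_from,
        "k": k,
    }

params = build_params([0.1, 0.2, 0.3], "bachmann", "2024-01-01", 5)
print(params["query_vec"])  # [0.1,0.2,0.3]
```

With a database connection, the statement would be run as `cursor.execute(FILTERED_KNN_SQL, params)`; passing values as parameters rather than formatting them into the string keeps the client filter safe from injection.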

Time-aware retrieval: two patterns. (a) hard date filter: only chunks in the defined period qualify. (b) soft date weighting: chunks outside the period are score-penalised but not excluded. Pattern (b) is more robust on sparse data, pattern (a) clearer for legal applications.
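Both patterns can be sketched in a few lines of plain Python. The hit structure and the penalty factor of 0.8 are assumptions chosen for illustration, not recommended values.

```python
from datetime import date

def hard_date_filter(hits, start, end):
    """Pattern (a): only chunks created inside the period qualify."""
    return [h for h in hits if start <= h["created_at"] <= end]

def soft_date_weight(hits, start, end, penalty=0.8):
    """Pattern (b): chunks outside the period keep their place in the list but lose score."""
    rescored = []
    for h in hits:
        in_period = start <= h["created_at"] <= end
        score = h["score"] if in_period else h["score"] * penalty
        rescored.append({**h, "score": score})
    return sorted(rescored, key=lambda h: h["score"], reverse=True)

hits = [
    {"id": "vat-2021", "score": 0.90, "created_at": date(2021, 3, 1)},
    {"id": "vat-2024", "score": 0.85, "created_at": date(2024, 1, 15)},
]
period = (date(2024, 1, 1), date(2024, 12, 31))

hard = hard_date_filter(hits, *period)
soft = soft_date_weight(hits, *period)

print([h["id"] for h in hard])  # ['vat-2024']
print([h["id"] for h in soft])  # ['vat-2024', 'vat-2021']
```

The hard filter drops the 2021 source entirely; the soft weighting keeps it as a fallback but ranks it below the 2024 source.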

Schema design best practices: 5 to 10 mandatory fields per chunk, no more. Client ID always indexed. Creation date as ISO-8601 always indexed. Confidentiality tier as enum (public/internal/client/restricted) always indexed. Language as ISO-639 code always indexed. Source system as string indexed. Tags as array if relevant. Custom fields optional, not indexed. Schema changes require re-indexing.
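One way to keep such a schema disciplined is to validate it at ingestion time. The class below is a minimal sketch: the class and method names are hypothetical, and the two-letter language check assumes ISO 639-1 codes.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum

class Tier(str, Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    CLIENT = "client"
    RESTRICTED = "restricted"

@dataclass(frozen=True)
class ChunkMetadata:
    client_id: str
    document_type: str
    confidentiality_tier: str   # must be one of the Tier values
    language: str               # assumed ISO 639-1, e.g. "de"
    created_at: str             # ISO-8601 timestamp
    source_system: str
    tags: tuple = ()
    schema_version: int = 1

    def __post_init__(self):
        if not self.client_id:
            raise ValueError("client_id is mandatory")
        if len(self.language) != 2 or not self.language.isalpha():
            raise ValueError(f"not a two-letter language code: {self.language!r}")
        datetime.fromisoformat(self.created_at)  # raises ValueError if malformed
        Tier(self.confidentiality_tier)          # raises ValueError if unknown

    def to_payload(self):
        """Flatten to a JSON-serialisable dict for the vector store payload."""
        return {
            "client_id": self.client_id,
            "document_type": self.document_type,
            "confidentiality_tier": Tier(self.confidentiality_tier).value,
            "language": self.language.lower(),
            "created_at": self.created_at,
            "source_system": self.source_system,
            "tags": list(self.tags),
            "schema_version": self.schema_version,
        }

meta = ChunkMetadata(
    client_id="bachmann",
    document_type="vat_return",
    confidentiality_tier="client",
    language="de",
    created_at="2024-01-15T09:30:00+00:00",
    source_system="dms",
    tags=("vat", "2024"),
)
print(meta.to_payload())
```

Carrying a `schema_version` in every payload makes it possible to tell which chunks still need re-indexing after a schema change.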

Client erasure rights: for an erasure request under the Swiss revDSG, all chunks with client_id=X can be removed in one query. Precondition: client_id is payload-indexed, otherwise deletion becomes a full scan.
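Such a deletion can be expressed as a single filter-based request. The sketch below builds the request for Qdrant's delete-points REST endpoint without sending it; the collection name `chunks` and the helper name are assumptions.

```python
import json

def erasure_request(collection, client_id):
    """Build a delete-by-filter request that removes every chunk of one client."""
    if not client_id:
        # Guard against an empty value so the filter is never sent without a real client.
        raise ValueError("client_id must not be empty")
    path = f"/collections/{collection}/points/delete"
    body = {"filter": {"must": [{"key": "client_id", "match": {"value": client_id}}]}}
    return "POST", path, json.dumps(body)

method, path, body = erasure_request("chunks", "bachmann")
print(method, path)
print(body)
```

Logging the request together with the number of deleted points gives the erasure its own audit record.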

Metadata workflow in 6 steps

  1. Define mandatory fields: client_id, document_type, confidentiality_tier, language, created_at, source_system. Version the schema.
  2. Plan indexing: all mandatory fields payload-indexed, optional fields only when provably selective.
  3. Ingestion pipeline: extract metadata at chunking time from source files (file system, headers, mail headers etc.) or enrich via LLM classification.
  4. Pre-filter by default: every RAG query carries at least client_id and a role-based confidentiality_tier in its filter clause.
  5. Time-aware: equip time-sensitive fields with a hard date filter (legal) or soft weighting (research).
  6. Audit trail: per chunk and per query log the filter clause and hit list. On Swiss revDSG erasure, remove all chunks with client_id=X.
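Steps 4 and 6 can be sketched together: a function that always builds the mandatory filter, and a function that records what was queried. The role names, the role-to-tier mapping and the log format are assumptions for illustration; the filter dictionary follows Qdrant's JSON filter syntax.

```python
import json
from datetime import datetime, timezone

# Hypothetical mapping from staff role to the tiers that role may read.
ROLE_TIERS = {
    "payroll_clerk": ["public", "internal"],
    "client_advisor": ["public", "internal", "client"],
    "compliance_officer": ["public", "internal", "client", "restricted"],
}

def build_filter(client_id, role):
    """Return the mandatory filter clause; fails closed on missing client or unknown role."""
    if not client_id:
        raise ValueError("every query needs a client_id")
    tiers = ROLE_TIERS[role]  # KeyError for unknown roles, so no filter is ever built
    return {
        "must": [
            {"key": "client_id", "match": {"value": client_id}},
            {"key": "confidentiality_tier", "match": {"any": tiers}},
        ]
    }

def audit_record(user, query_filter, hit_ids):
    """Serialise one log line with timestamp, user, filter clause and hit list."""
    return json.dumps(
        {
            "ts": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "filter": query_filter,
            "hits": hit_ids,
        },
        sort_keys=True,
    )

clerk_filter = build_filter("bachmann", "payroll_clerk")
print(clerk_filter)
print(audit_record("m.keller", clerk_filter, ["chunk-17", "chunk-42"]))
```

Building the filter in one place, rather than in each calling component, makes it harder for a query to reach the vector store without client scope.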

When to use it

Pre-filter on mandatory fields (client, confidentiality, language): always. No sensible production RAG setup exists without these.

Date filter: on time-sensitive corpora (legal texts, market data, client correspondence). Optional on static corpora (encyclopaedias, finished guideline collections).

Tags filter: on thematically grouped corpora (contract types, industry segments). Adds little schema complexity, clearly improves hit quality.

Full-text index on title/summary: when metadata quality is good and users often search by document title or summary terms.

Post-filter: only when the filter is not very selective (e.g. language=German in a predominantly German corpus). For selective filters always pre-filter.

In fiduciary/legal contexts we recommend at least these indexed fields: client_id, document_type, confidentiality_tier, language, created_at, source_system, role_required.

When to hold back

Very small corpora under 1,000 chunks: filter indices are overkill. A full scan is faster than building and maintaining an index.

Corpora without structured metadata sources: when only plaintext documents exist, metadata must first be extracted (LLM-based or heuristic). On small volumes the effort is not worth it.

Fields with very high cardinality and little filter use: do not index. Costs memory without benefit.

Fields with very low selectivity (e.g. language=German in 99 percent of chunks): an index does not help because the pre-filter barely reduces the index.

Caveat on schema changes: every new indexed column requires re-indexing. On large corpora this can take hours to days. Make schema decisions early and stable.

Caveat on filter logic inside the query itself: using the answer LLM to generate filter clauses from the user question can be useful but is hallucination-prone. Prefer structured UI filters (client, date, type as dropdown).
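Whether filter values come from dropdowns or from an LLM, they can be checked against an allow-list before they reach the retriever. The sketch below is one possible guard; the field list and the function name are assumptions.

```python
# Hypothetical allow-list: filter field -> expected Python type.
ALLOWED_FIELDS = {
    "client_id": str,
    "document_type": str,
    "language": str,
    "year": int,
}

def validate_filter(raw):
    """Accept only known fields with the expected type; reject anything else."""
    clean = {}
    for key, value in raw.items():
        expected = ALLOWED_FIELDS.get(key)
        if expected is None:
            raise ValueError(f"unknown filter field: {key}")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"wrong type for {key}: {type(value).__name__}")
        clean[key] = value
    return clean

ok = validate_filter({"client_id": "bachmann", "year": 2024})
print(ok)

try:
    validate_filter({"client_id": "bachmann", "case_number": "A-17"})
except ValueError as err:
    print("rejected:", err)
```

A hallucinated field name or a value of the wrong type is rejected before the query runs, instead of silently producing an empty or wrong result set.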

Trade-offs

STRENGTHS

  • Tenant isolation per client, a precondition for professional secrecy
  • Time-aware retrieval prevents outdated answers
  • Pre-filter cuts index load 70 to 95 percent on selective queries
  • Source audit for Art. 957a CO and FINMA-compliant AI use

WEAKNESSES

  • Schema design requires discipline and forward planning
  • Indexed fields cost memory: 30 to 60 bytes per chunk per field
  • Schema change forces re-indexing, hours to days on large corpora
  • Metadata extraction from unstructured sources requires LLM enrichment

FAQ

How many fields should I index?

Rule of thumb: 5 to 8. More indexed fields cost memory; fewer make many queries slow. In a fiduciary context: client_id, document_type, confidentiality_tier, language, created_at, source_system, role_required. Use case-specific additions like industry or case_status.

Pre-filter or post-filter?

Pre-filter for selective filters (the filter reduces the index by more than 30 percent). Post-filter only for filters that are not very selective (language in a mostly single-language corpus). Pre-filter is also the safe default for client isolation.

How do I handle time-aware retrieval?

Two paths. For legal sources with clear validity dates: hard filter (created_at <= query_date AND valid_until >= query_date). For client correspondence with relative relevance: soft weighting in the score (newer chunks get e.g. a 1.2x score bonus). The hard filter maps directly to a Qdrant payload filter; the soft weighting is applied to the scores of the returned hits.
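The two paths might be sketched as follows. The validity condition is written in Qdrant's JSON filter syntax with range conditions on datetime fields; the field name `valid_until` comes from the answer above, while the helper names, the hit structure and the cutoff date are assumptions.

```python
from datetime import date

def validity_filter(query_date):
    """Hard filter: created_at <= query_date AND valid_until >= query_date."""
    day = query_date.isoformat()
    return {
        "must": [
            {"key": "created_at", "range": {"lte": day}},
            {"key": "valid_until", "range": {"gte": day}},
        ]
    }

def apply_recency_bonus(hits, newer_than, bonus=1.2):
    """Soft weighting: multiply the score of chunks created on or after newer_than."""
    boosted = [
        {**h, "score": h["score"] * bonus if h["created_at"] >= newer_than else h["score"]}
        for h in hits
    ]
    return sorted(boosted, key=lambda h: h["score"], reverse=True)

print(validity_filter(date(2024, 6, 30)))

hits = [
    {"id": "letter-2022", "score": 0.80, "created_at": date(2022, 5, 1)},
    {"id": "letter-2025", "score": 0.70, "created_at": date(2025, 2, 1)},
]
reranked = apply_recency_bonus(hits, newer_than=date(2025, 1, 1))
print([h["id"] for h in reranked])  # ['letter-2025', 'letter-2022']
```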

What does metadata indexing cost?

At 1M chunks and 7 indexed fields: about 200 to 400 MB extra RAM on the Qdrant instance. Disk storage adds 10 to 20 percent. Performance gain: 5x to 50x speed-up on selective queries. ROI clearly positive.

Related topics

  • RAG · AI CONCEPT: Retrieval-Augmented Generation (RAG): how AI answers from your own documents
  • QDRANT · TECH: Qdrant: production vector database for RAG and semantic search
  • VECTOR DB · AI CONCEPT: Vector databases compared: Qdrant, Weaviate, Milvus, Pinecone, Chroma, pgvector
  • CHUNKING · AI CONCEPT: Chunking strategies for RAG: fixed-size, recursive, semantic, late chunking
  • HYBRID SEARCH · AI CONCEPT: Hybrid search: BM25 plus vectors with reciprocal rank fusion in Elasticsearch, Qdrant, OpenSearch
  • ANONYMISATION · AI CONCEPT: Anonymisation and pseudonymisation: Presidio, Privacera, k-anonymity, differential privacy
  • AUDIT TRAIL · AI CONCEPT: AI audit trail design: what to log so an AI answer stays audit-ready

Sources

  1. Qdrant - Payload and indexing documentation · 2026-05
  2. pgvector - iterative index scans and HNSW filtering (0.8+) · 2026-05
  3. Pinecone - metadata filtering and best practices · 2026-05
  4. Weaviate - filter operators and where filters · 2026-05
