fairlane.systems

QUERY EXPANSION · AI CONCEPT

Query expansion and rewriting: HyDE, decomposition, multi-query, step-back prompting

How to rewrite short user questions so RAG finds the right sources: HyDE, query decomposition, multi-query, step-back prompting. When it helps, when it does not.

As of: 2026-05

What is query expansion?

Query expansion and query rewriting insert a reformulation between the user question and the retrieval step. The assumption: users ask briefly, imprecisely and ambiguously ("notice period?"). Documents are long, precise and contextualised ("Ordinary notice period of a tenancy under Art. 266a CO"). Embedding search and BM25 cannot always bridge the semantic gap between the two. A better-phrased query solves many retrieval problems.

Four techniques have proved effective as of May 2026.

HyDE (hypothetical document embeddings, Gao et al. 2022): the LLM is given the user question and writes a fictional ideal answer text. That fictional text is then embedded and searched against the index. The idea: a fictional document sits closer in vector space to real answer documents than the short question itself. Gao et al. and later reproductions report that HyDE improves retrieval recall by roughly 5 to 20 percent on short, imprecise queries.

Query decomposition: the LLM splits a complex query into sub-queries. "How did our clients' VAT balance evolve between 2023 and 2025, and which industries gained the most?" becomes three sub-queries retrieved individually. Results are merged.

Multi-query (diversification): the LLM generates 3 to 5 different paraphrases of the same question. Each is retrieved separately, and the results are merged via reciprocal rank fusion (RRF). This makes retrieval more robust against lucky or unlucky phrasing.

Step-back prompting (Zheng et al. 2023, Google DeepMind): before the real query, a more abstract pre-question is asked ("What is the general rule for notice periods in Swiss tenancy law?"), which retrieves a broader context. Then the specific question is answered using that context.

All four techniques put an LLM between the user and the retriever, so they cost tokens and latency. Blindly applied, they burn money for no benefit. Selectively applied (based on query complexity), they noticeably improve answer quality.

Why it matters

In fiduciary and legal offices we see a recurring pattern: 30 to 50 percent of RAG queries are short, telegraphic and ambiguous. "VAT 2024?", "Bachmann contract?", "reminder?". Without rewriting, the retriever often finds the wrong thing because the vector distance between a two-word query and an 800-token chunk is large.

The consequence: more missed queries, frustrated users, eroding trust in the system. We measure this via retrieval recall@5 on an eval set of real questions. Before rewriting, values often sit at 50 to 65 percent. After selective HyDE plus multi-query, good setups reach 75 to 85 percent. That is the difference between "unusable" and "standard tool".

Second advantage: hard questions become answerable at all. A query like "which of our clients will move below the Art. 957a CO threshold in 2026?" contains an implicit knowledge requirement (what does Art. 957a CO say? Which thresholds? Which clients?). Decomposition splits this into answerable parts.

Cost vs benefit is not trivial. Every rewriting step is an extra LLM call. Naively enabled (always multi-query plus HyDE), per-query cost triples. The art lies in routing: a lightweight routing LLM (Claude Haiku, Mistral Small) decides per query whether rewriting is needed and, if so, which technique to use. Simple queries ("tax rate canton Zug?") need no rewriting. Complex ones do.

For Swiss firms, data sovereignty is an additional layer: the rewriting LLM sees the original query, which often contains client references. In sensitive setups the rewriting LLM runs locally (Ollama with Llama 3.3 or Mistral Small), not in the cloud.

How it works

HyDE implementation: before retrieval the pipeline calls a lightweight LLM with the prompt "please answer the following question in 3 to 5 sentences as if it were an internal knowledge document: {question}". The generated text is embedded and searched against the vector index. The trick: the LLM may hallucinate, because the text serves only as a vector probe, not as the final answer. If the hallucination lies "near" the real answer, the retriever finds the real source.
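The HyDE step can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `generate` and `embed` are hypothetical callables standing in for whatever LLM and embedding model the pipeline uses, and the in-memory `index` dictionary is an assumption that replaces a real vector store.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]

# Prompt wording taken from the paragraph above.
HYDE_PROMPT = (
    "Please answer the following question in 3 to 5 sentences "
    "as if it were an internal knowledge document: {question}"
)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hyde_search(
    question: str,
    generate: Callable[[str], str],   # hypothetical: prompt -> LLM text
    embed: Callable[[str], Vector],   # hypothetical: text -> embedding
    index: dict[str, Vector],         # assumed: chunk id -> chunk embedding
    top_k: int = 5,
) -> list[str]:
    """Search with the embedding of a generated pseudo-answer, not the question."""
    probe_text = generate(HYDE_PROMPT.format(question=question))
    probe = embed(probe_text)  # only a vector probe, never shown to the user
    ranked = sorted(index, key=lambda cid: cosine(probe, index[cid]), reverse=True)
    return ranked[:top_k]
```

Passing the model calls in as functions keeps the sketch independent of any provider, so the same code works with a local or a cloud model.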

Query decomposition: the routing LLM analyses the question and decides whether it is atomic or compound. For compound it produces an ordered list of sub-questions: "[1] What is the VAT threshold in 2024? [2] Which of our clients exceeded it in 2023?". Each sub-question goes through the full RAG pipeline; results are merged into the final answer prompt.
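A rough sketch of the decomposition step, assuming the LLM returns its sub-questions in the "[1] ... [2] ..." form shown above. The prompt text, `generate` and `retrieve` are hypothetical placeholders; the parser is deliberately simple and would misread a question that itself contains a bracketed number.

```python
import re
from typing import Callable

# Hypothetical prompt; the exact wording is an assumption.
DECOMPOSE_PROMPT = (
    "Decompose the query into atomic sub-questions. "
    "Return them as a list in the form [1] ... [2] ...\n\nQuery: {question}"
)

# One item: a bracketed number, then text up to the next bracketed number or the end.
_ITEM = re.compile(r"\[\d+\]\s*(.+?)(?=\s*\[\d+\]|\Z)", re.DOTALL)


def parse_sub_questions(raw: str) -> list[str]:
    """Extract the sub-questions from an '[1] ... [2] ...' style LLM answer."""
    parts = [match.strip() for match in _ITEM.findall(raw)]
    return [part for part in parts if part]


def decompose_and_retrieve(
    question: str,
    generate: Callable[[str], str],        # hypothetical: prompt -> LLM text
    retrieve: Callable[[str], list[str]],  # hypothetical: query -> chunk ids
) -> list[tuple[str, list[str]]]:
    """Retrieve per sub-question, keeping the order the LLM proposed."""
    raw = generate(DECOMPOSE_PROMPT.format(question=question))
    # Fall back to the original question if the answer could not be parsed.
    sub_questions = parse_sub_questions(raw) or [question]
    return [(sub, retrieve(sub)) for sub in sub_questions]
```

The fallback to the original question means a malformed LLM answer degrades to plain retrieval instead of returning nothing.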

Multi-query: the LLM receives the prompt "produce 4 different paraphrases of this question". Each paraphrase is retrieved (in practice top-k=5 per paraphrase). The 20 chunks are merged via RRF and deduplicated; the top-k=8 enter the answer prompt.
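The RRF merge is small enough to sketch directly. Each chunk scores the sum of 1 / (k + rank) over all paraphrase rankings in which it appears, which also deduplicates chunks found by several paraphrases. The smoothing constant k = 60 is a commonly used default and an assumption here, as are the function names and the `retrieve` callable.

```python
from collections import defaultdict
from typing import Callable, Iterable


def rrf_merge(rankings: Iterable[list[str]], k: int = 60, top_n: int = 8) -> list[str]:
    """Merge ranked lists of chunk ids with reciprocal rank fusion.

    Assumes each individual ranking contains a chunk id at most once.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken by chunk id to keep the order stable.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in ordered[:top_n]]


def multi_query_retrieve(
    paraphrases: list[str],
    retrieve: Callable[[str], list[str]],  # hypothetical: query -> ranked chunk ids
    per_query_k: int = 5,
    top_n: int = 8,
) -> list[str]:
    """Retrieve top-k per paraphrase, then fuse into one top-n list."""
    rankings = [retrieve(p)[:per_query_k] for p in paraphrases]
    return rrf_merge(rankings, top_n=top_n)
```

Because RRF uses only ranks, not raw similarity scores, the same merge also works when paraphrases are served by different retrievers.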

Step-back prompting: two sequential retrieval steps. First the abstract question ("what is the general rule for X in Swiss law?") retrieves context knowledge. Then the specific question retrieves concrete application cases. Both result sets enter the answer prompt.
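The two retrieval passes might look like the following sketch. The prompt wording, the `generate` and `retrieve` callables and the two cut-off values are all assumptions for illustration; the original text does not specify them.

```python
from typing import Callable

# Hypothetical prompt; the exact wording is an assumption.
STEP_BACK_PROMPT = (
    "Rewrite the question as a more general question about the underlying "
    "rule or principle. Return only the rewritten question.\n\n"
    "Question: {question}"
)


def step_back_retrieve(
    question: str,
    generate: Callable[[str], str],        # hypothetical: prompt -> LLM text
    retrieve: Callable[[str], list[str]],  # hypothetical: query -> ranked chunk ids
    k_general: int = 3,
    k_specific: int = 5,
) -> dict[str, object]:
    """Run the abstract question first, then the specific one."""
    abstract_question = generate(STEP_BACK_PROMPT.format(question=question)).strip()
    general = retrieve(abstract_question)[:k_general]
    specific = retrieve(question)[:k_specific]
    # Avoid passing the same chunk to the answer prompt twice.
    already_seen = set(general)
    specific = [chunk for chunk in specific if chunk not in already_seen]
    return {
        "step_back_question": abstract_question,
        "general": general,
        "specific": specific,
    }
```

Keeping the two result sets separate lets the answer prompt present the general rule before the concrete cases.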

Selective activation: a router LLM (Claude Haiku, about 0.0005 USD per query) decides per input: "this question is atomic, no rewriting" or "this question is multi-step, activate decomposition" or "this question is very short and imprecise, activate HyDE".

In practice we recommend the following stack: a router LLM decides per query. Default: no rewriting (saves tokens and latency). On short/imprecise: HyDE. On complex/multi-step: decomposition. Multi-query as a special tool for critical research where robustness matters more than token frugality. Step-back for legal queries where the general rule clarifies the specific case.
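A minimal routing sketch, assuming the router LLM is asked to answer with exactly one label. The label names, the prompt and the `classify` callable are hypothetical. Anything the router returns that is not a known label falls back to "no rewriting", which matches the default recommended above.

```python
from enum import Enum
from typing import Callable


class Technique(str, Enum):
    NONE = "none"
    HYDE = "hyde"
    DECOMPOSITION = "decomposition"
    MULTI_QUERY = "multi_query"
    STEP_BACK = "step_back"


# Hypothetical prompt; labels and wording are assumptions.
ROUTER_PROMPT = (
    "Classify the question. Answer with exactly one label.\n"
    "none: atomic and precise, no rewriting\n"
    "hyde: very short or imprecise\n"
    "decomposition: compound or multi-step\n"
    "multi_query: critical research, robustness first\n"
    "step_back: legal question where the general rule clarifies the case\n\n"
    "Question: {question}"
)


def route(question: str, classify: Callable[[str], str]) -> Technique:
    """Map the router LLM's answer to a technique, defaulting to no rewriting."""
    label = classify(ROUTER_PROMPT.format(question=question)).strip().lower()
    try:
        return Technique(label)
    except ValueError:
        # Unknown or chatty answer: take the cheap default.
        return Technique.NONE
```

Defaulting to `Technique.NONE` on unexpected output keeps a misbehaving router from silently triggering the expensive paths.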

LLM choice for rewriting: Claude Haiku or Mistral Small (local). Both are fast, cheap and good enough for paraphrasing. A full model (Claude Sonnet) is overkill.

Query rewriting workflow in 6 steps

  1. Build an eval set: 50 real queries with manually marked target chunks. Measure recall@5 before rewriting as baseline.
  2. Define a router LLM: Claude Haiku or Mistral Small decides per query which technique (or none) applies.
  3. Sharpen the HyDE prompt: "answer as 3 to 5 sentences, factual, no speculation". Model choice: small fast model, not the answer model.
  4. Decomposition prompt: "decompose the query into atomic sub-questions, return a list". Each sub-question goes through full RAG.
  5. Multi-query: "produce 4 different paraphrases". Top-5 per paraphrase, RRF merge, top-8 into the answer.
  6. Measure on eval: recall@5 with and without rewriting per technique. Document latency and token cost. Iteratively improve routing.
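The baseline and follow-up measurements in steps 1 and 6 can be computed with a few lines of Python. This sketch defines recall@k as the share of a query's marked target chunks that appear in the top k results; some teams instead count a query as a hit if any target chunk appears, so the definition is an assumption to agree on before comparing numbers. The eval-set shape and the `retrieve` callable are hypothetical.

```python
from typing import Callable


def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Share of the relevant chunk ids that appear in the top-k results."""
    if not relevant:
        raise ValueError("each eval query needs at least one target chunk")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(
    eval_set: list[tuple[str, set[str]]],  # assumed: (query, target chunk ids)
    retrieve: Callable[[str], list[str]],  # hypothetical: query -> ranked chunk ids
    k: int = 5,
) -> float:
    """Average recall@k over all eval queries."""
    scores = [recall_at_k(retrieve(query), targets, k) for query, targets in eval_set]
    return sum(scores) / len(scores)
```

Running `mean_recall_at_k` once with the plain retriever and once per rewriting technique gives the per-technique comparison that step 6 asks for.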

When to use what

HyDE: on short, imprecise questions (1 to 4 words), on term-only queries without context, on concept searches in embedding space. Especially effective for legal and tax queries.

Decomposition: on compound queries with "and", time comparisons, aggregations. Also on implicit multi-questions ("who is responsible and since when?").

Multi-query: on critical research where a hit absolutely must not be missed. Compliance audits, forensic queries. Expensive (3 to 5x token cost) but robust.

Step-back: on legal queries where interpretation of the general rule clarifies the specific case. On medical or tax questions with hierarchical knowledge structure.

Selective routing: always. A router LLM deciding per query saves 60 to 80 percent of rewriting cost compared to blanket activation.

When not to use

Atomic, precise queries ("which tax rate applies in 2024 in canton Zug?"): no rewriting needed. BM25 plus vector finds this directly.

Latency-critical applications with a budget under 200 ms: rewriting adds 200 to 600 ms per step, which is not viable under hard SLAs.

Cost pressure below a few cents per query: rewriting doubles to triples LLM cost per query. Matters at high volumes.

Corpora with very clear structure and controlled language (e.g. internal glossaries with standardised terms): user queries hit sources directly, rewriting adds little.

If an eval set shows rewriting hurts precision: a sign that the rewriting LLM paraphrases too creatively. Tighten the prompt or pick a smaller model.

Caveat on HyDE in sensitive contexts: the rewriting LLM produces potentially hallucinated content. It is used only as a vector probe but still appears in the audit log. In professional-secrecy contexts, store the HyDE output encrypted as well and flag it as disposable.

Trade-offs

STRENGTHS

  • Retrieval recall@5 typically improves 10 to 25 percent on short/imprecise queries
  • Makes compound and multi-step queries answerable at all
  • Selective routing keeps extra token cost in single-digit percent range
  • Step-back prompting improves legal queries with hierarchical knowledge

WEAKNESSES

  • Extra latency of 200 to 600 ms per rewriting step
  • Token cost overhead of 50 to 200 percent without selective routing
  • HyDE can hurt precision on long queries
  • A cloud-based rewriting LLM widens the data-protection exposure under the Swiss DSG

FAQ

Is HyDE worth it on every query?

No. HyDE helps mainly on short, imprecise queries. On detailed questions with 15+ words the retriever finds the right chunks anyway; HyDE then adds only marginal gain or even worsens the ranking. Selective routing via a router LLM is the right strategy.

When is decomposition better than multi-query?

Decomposition is better for compound queries with clear sub-questions. Multi-query is better when a single question must be robustly answered against phrasing variance. The two are not exclusive: on highly critical queries both can run in sequence.

What does query rewriting cost?

Per query about 0.0005 USD for the router (Haiku), 0.001 USD for HyDE or decomposition, 0.003 USD for multi-query (5 paraphrases). At 1000 queries per day with rewriting always on: about CHF 50 to 80 per month extra. With selective routing this drops to CHF 15 to 25.
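For a quick order-of-magnitude check, the per-query prices above can be plugged into a small estimator. The unit prices come from the paragraph above; the share of queries that triggers each technique is a hypothetical input that has to come from your own routing statistics, and the result is in USD, so the exchange rate to CHF is a further assumption.

```python
# Per-query prices in USD as quoted above.
UNIT_COST_USD = {"hyde": 0.001, "decomposition": 0.001, "multi_query": 0.003}
ROUTER_COST_USD = 0.0005


def monthly_rewriting_cost_usd(
    queries_per_day: int,
    technique_share: dict[str, float],  # assumed: fraction of queries per technique
    use_router: bool,
    days: int = 30,
) -> float:
    """Estimate the monthly rewriting cost for a given mix of techniques."""
    # The router, if used, runs on every query.
    per_query = ROUTER_COST_USD if use_router else 0.0
    per_query += sum(
        UNIT_COST_USD[name] * share for name, share in technique_share.items()
    )
    return queries_per_day * days * per_query


# Hypothetical mix: 20 % HyDE, 10 % decomposition, 5 % multi-query, router on.
selective = monthly_rewriting_cost_usd(
    1000, {"hyde": 0.2, "decomposition": 0.1, "multi_query": 0.05}, use_router=True
)
```

The monthly total depends heavily on the assumed mix, so the estimator is most useful for comparing routing policies against each other rather than for budgeting to the franc.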

Is rewriting compliant with Swiss DSG?

If the rewriting LLM is a US provider and the original question references clients, the same applies as for the answer LLM: data processing agreement, transfer impact assessment, possibly consent. Under professional secrecy we recommend running the rewriting LLM locally (Ollama with Llama 3.3, Mistral Small, or Apertus-CH from 2026).

Related topics

  • Retrieval-Augmented Generation (RAG): how AI answers from your own documents
  • Hybrid search: BM25 plus vectors with reciprocal rank fusion in Elasticsearch, Qdrant, OpenSearch
  • Embeddings and vectors: how language becomes mathematics
  • Chunking strategies for RAG: fixed-size, recursive, semantic, late chunking
  • Multi-LLM routing: which model when, for how much
  • Metadata and filters in RAG: pre-filter vs post-filter, Qdrant payload index, pgvector WHERE
  • Limiting hallucinations: five countermeasures against fabricated AI answers

Sources

  1. Gao et al. - Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE) · 2026-05
  2. Zheng et al. - Take a Step Back: Evoking Reasoning via Abstraction in LLMs · 2026-05
  3. LangChain - Query Transformations cookbook · 2026-05
  4. Anthropic - Contextual Retrieval and pre-query rewriting · 2026-05
