PROMPTING · AI CONCEPT

Prompt engineering: foundations, patterns, anti-patterns

System prompt, few-shot, structured outputs, refusal patterns. What still matters in May 2026 – and what models now handle internally.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What is prompt engineering?

Prompt engineering is the craft of describing a task to a language model so the answer is reliably useful. In May 2026, this is a different discipline than in 2023. Models like the current top Claude model, GPT-4o, Gemini 2.5, and Mistral Large 2 handle many earlier tricks internally. Chain-of-Thought, Self-Consistency, and Tree-of-Thoughts are today less prompt patterns than model features – the models decide on their own when to reason step by step.

What remains: a clearly structured prompt yields faster, cheaper, and more reproducible answers. A 50-word prompt gets a 50-word answer plus hallucinations. A 500-word prompt with system role, examples, format spec, and refusal clause gets a 50-word answer plus citations plus a clear "I do not know" when data is missing.

For fiduciary and law offices, prompt engineering is no playground. It is part of AI governance: which prompts run in production? Who reviewed them? How is prompt leaking prevented? Those questions belong in the AI audit-trail design (see related article).

Why it matters

The prompt is the interface between human and language model. Bad prompts cost threefold: higher token bills (more back-and-forth before a usable answer), worse quality (hallucinations from unclear instructions), legal risk (no refusal pattern → the model invents legal advice).

Costs are real. OpenAI bills per token. A non-optimised prompt for a VAT question can produce 4,000 input tokens + 2,000 output tokens – on GPT-4o that is roughly USD 0.03. An optimised prompt delivers the same answer with 1,500 input + 500 output = USD 0.008. At 5,000 queries per month, that is a difference of USD 110 vs. USD 40 – per client.

The legal dimension weighs more. A Swiss SME fiduciary who offers an AI system to clients as an "answer assistant" is liable for misinformation. If the system runs without a clear refusal pattern and emits an invented tax rule, liability sits not with the model vendor but with the system operator (see Art. 41 CO, duty of disclosure).

The building blocks of a production prompt

System prompt vs. user prompt: the system prompt defines role, behaviour, format. It is set once and stays for the whole conversation. The user prompt holds the concrete question. Separation matters: at OpenAI and Anthropic, system prompts are weighted differently by the model than user prompts and are less prone to prompt injection.

Few-shot vs. zero-shot: zero-shot means the model gets only the task. Few-shot means the prompt contains 2–5 examples ("input X → output Y"). Few-shot clearly improves precision on structured outputs. As of May 2026, it mostly helps with classification and extraction tasks; for free-form answers, zero-shot with a good instruction is often equally good.

Chain-of-Thought (CoT): explicit instruction "think step by step" or "reason before answering". Once essential, now mostly redundant – the current top Claude model and GPT-4o do this on their own when the task needs reasoning. Where CoT still helps: explicit audit logs for tax or legal questions where the reasoning path must be documented.

Structured outputs: JSON mode at OpenAI (response_format: {type: "json_schema"}), tool/function calling at Anthropic. Both force the model to answer in a defined schema. Mandatory for anything processed downstream – receipt recognition, email classification, VAT triage. Without structured output, you have to parse the answer, which often fails on free text.

Refusal pattern: explicit instruction "if the answer is not in the given sources, respond with -not in the material-" – safety net against hallucinations. Mandatory in any RAG setup under professional secrecy. Anthropic models respond especially well to such refusal clauses.

Citation required: "cite every statement with [1], [2] and list the sources at the end" – makes the answer audit-ready. Works on all modern models, but is no silver bullet: a citation-check step after generation verifies that cited sources actually appeared in the retrieval result.

Production prompt in 6 steps

01State the task precisely: what is input, what is the expected output format, which refusal cases exist?
02Build a test set: 20–50 real examples with expected results. Without a test set, everything that follows is faith, not measurement.
03Write a baseline prompt: system prompt (role, behaviour, format) + user-prompt template (variable slot). Force structured output (JSON schema) where downstream processing happens.
04Add a refusal clause: "if the answer is not in the given sources, respond exactly with 'not in the material'."
05Add few-shot examples if the baseline lands below 80% accuracy. 3–5 examples are usually enough; more burns tokens with no gain.
06Iterate against the test set: change, measure, repeat. Only deploy to production at stable > 90%. Version prompts like code (git).

When to invest in prompt engineering

On every production prompt. Production means: the prompt runs more than 100 times a day, is seen by clients, or feeds decisions. For one-off research prompts ("explain the AMLA revision briefly to me"), the investment is excessive – just ask.

For a Swiss fiduciary SME, three prompt categories typically deserve optimisation: (a) the RAG answer prompt for client questions (runs hundreds of times a day), (b) the email classification prompt (payroll, tax, dunning, other), (c) the receipt extraction prompt (date, amount, VAT rate, supplier). These three together usually account for 80% of token consumption.

Good approach: first define a measurable task ("extract all four fields correctly on 100 test receipts"), then iterate the prompt and validate against the test set. Without a test set, prompt engineering is superstition.

When prompt engineering is overkill

When the task is easy and the model solves it without special handling, any extra prompt complexity is counterproductive. 2026 models are so good that a clear one-sentence prompt suffices for many tasks. Telling the model "you are a senior tax expert with 30 years of experience" hurts more than it helps – such role tricks mattered in 2023 and are noise today.

Also, not every problem is a prompt problem. If a model emits wrong VAT rates, no prompt fixes it – the answer is RAG with the current VAT regulation. If a model performs poorly in multilingual settings, no instruction fixes it – the answer is a better embedding model or a stronger generator (Claude Opus instead of Haiku). Before every prompt tuning sits the question: is this even the right tool?

Trade-offs

STRENGTHS

Better prompts save 50–70% token cost at equal quality
Structured outputs enable direct downstream processing without parsing
Refusal patterns measurably reduce hallucinations
Versioned prompts are audit- and compliance-ready (ISO 42001)

WEAKNESSES

Model-specific: a Claude prompt is not 1:1 transferable to GPT
Model updates can break prompts – continuous test sets required
Overprompting (too many instructions) hurts quality – trade-off non-trivial
Prompt engineering alone solves no data problems (RAG, embedding, model choice)

FAQ

What is prompt leaking and how do I protect myself?

Prompt leaking means an end user tricks the model into emitting the system prompt verbatim. That exposes internal instructions, refusal rules, sometimes business logic. Three measures: (a) put no secrets in the prompt (passwords, internal margin calculations), (b) mark system-prompt contents as "do not emit, do not quote", (c) a separate output-filter stage that blocks responses containing system-prompt fragments. Leaking cannot be fully eliminated.

Should I use Markdown or XML in prompts?

Anthropic models respond strongly to XML tags (`<context>...</context>`, `<question>...</question>`) – Anthropic explicitly recommends this in its prompting guide. OpenAI models are less XML-sensitive; there, Markdown (## headers, **bold**, lists) works equally well. Practical tip: XML for Claude, Markdown for GPT. Both tolerate the other without significant quality loss.

How do I version prompts in production?

Like code. Prompts belong in Git, not in a database column. A proven structure: a prompts/ directory with one file per prompt (e.g. prompts/vat_triage_v3.md), changes via pull request, every production version tagged with semver. Tools like Langfuse, PromptLayer, or Helicone additionally track per-version performance metrics. That makes it auditable which version ran when and how it evolved.

How long may a prompt be?

Technically models allow 200k to 2M tokens of context (Gemini 2.5 leads). Practically, answer quality drops past 32k tokens – the "lost in the middle" phenomenon: content in the prompt middle is processed less reliably. Rule of thumb: system prompt below 1,500 tokens, total prompt with retrieval context below 16,000 tokens. If you need more, improve retrieval first, do not lengthen the prompt.

Sources

OpenAI – Prompt Engineering Guide (official) · 2026-03
Anthropic – Claude Prompting Documentation · 2026-04
Microsoft – Prompt flow documentation · 2026-02
Liu et al., Lost in the Middle: How Language Models Use Long Contexts (TACL) · 2023-07

FITS YOUR STACK?

What this looks like in your business – a 30-minute intro call.

Book a call