Embeddings and vectors: how language becomes mathematics
Embeddings are numerical representations of text, images, or audio. They are the foundation of every semantic search and every RAG system.
Researched & fact-checked by: DuneDive LLC · As of: 2026-05
What are embeddings?
An embedding is a list of numbers, a vector, that places a text, image, audio snippet, or code fragment in a high-dimensional space. The list is not random: similar items land close together, dissimilar items farther apart. The sentence "client pays late" sits near "customer is in arrears" even though the two share no words. That property is what makes semantic search work.
Technically, an embedding model produces the vectors. Common models in May 2026 emit vectors with 384, 768, 1024, or 3072 dimensions. More dimensions mean finer distinctions but also more storage and slower search. For German, BGE-large-de (1024 dim, runs locally), OpenAI text-embedding-3-small (1536 dim, USD 0.02 per 1M tokens), and Cohere embed-multilingual-v3 (1024 dim, 100+ languages) are the three dominant options.
The vectors live in a vector database. A query is embedded by the same model and compared against the corpus via a distance metric. Three metrics are common: cosine similarity (default, scale-invariant), dot product (faster, magnitude-sensitive), and Euclidean distance (rare, for geometric data). With normalised embeddings, cosine similarity and dot product give identical scores, so many production setups choose dot product because it is faster.
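The lookup described above can be sketched as a brute-force search. This is a minimal illustration, not how a vector database works internally: the three-dimensional vectors are made up, and the helper names `normalise` and `top_k` are hypothetical. Real vectors would come from the embedding model.

```python
import math

def normalise(v):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in v]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def top_k(query_vec, corpus, k=3):
    """Return the k best (score, doc_id) pairs, highest score first.

    corpus maps doc_id -> vector. Everything is normalised here,
    so the dot product is the cosine similarity.
    """
    q = normalise(query_vec)
    scored = [(dot(q, normalise(vec)), doc_id) for doc_id, vec in corpus.items()]
    scored.sort(reverse=True)
    return scored[:k]

# Toy vectors, invented for illustration only.
corpus = {
    "customer is in arrears": [0.9, 0.1, 0.0],
    "holiday schedule": [0.0, 0.2, 0.9],
    "invoice overdue notice": [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]  # stands in for the embedding of "client pays late"
results = top_k(query, corpus, k=2)
```

A vector database replaces the linear scan with an approximate index, but the scoring step is the same idea.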
Why it matters
Without embeddings there is no RAG, no semantic search, no usable AI assistant on your own documents. Classic full-text search (Lucene, Postgres tsvector, Elasticsearch) matches keywords. Searching for "default" misses "late payment" – even though the answer sits there. Embeddings close that gap.
For fiduciary and law offices, language variety is decisive. Clients write in German, English, occasionally French, with some business records in Italian. A well-chosen multilingual model (Cohere embed-multilingual-v3 or BGE-multilingual) enables cross-lingual retrieval: a German question finds an English contract because semantic proximity works across languages.
Costs have dropped dramatically in the past 18 months. text-embedding-3-small costs USD 0.02 per 1M tokens (as of May 2026, source: OpenAI pricing). A 10,000-document knowledge base averaging 5,000 tokens per document comes to 50M tokens, which is a one-time embedding cost of USD 1. Storage and re-indexing stay under CHF 5 per month. The economic barrier has disappeared.
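The arithmetic behind that figure is easy to check. The function below is a small sketch with a hypothetical name; the price is the one quoted above and should be verified against current provider pricing before budgeting.

```python
def embedding_cost_usd(num_docs, avg_tokens_per_doc, usd_per_million_tokens):
    """One-time cost of embedding a corpus, given a per-million-token price."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * usd_per_million_tokens

# 10,000 documents x 5,000 tokens = 50M tokens at USD 0.02 per 1M tokens.
cost = embedding_cost_usd(10_000, 5_000, 0.02)
```

The same function covers re-embedding after a model switch: the whole corpus is paid for again.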
How it works
An embedding is produced by a neural network trained on millions of texts. The last layers emit the vector. For transformer-based models (all modern embedding models), the CLS token or a mean-pool over all token embeddings is typically used as the final representation.
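The two pooling strategies can be illustrated without a real model. In this sketch the per-token vectors and the attention mask are invented inputs, and the function names are hypothetical; an actual model would supply both from its final layer.

```python
def cls_pool(token_vectors):
    """Use the vector at the first position (the CLS token) as the embedding."""
    return list(token_vectors[0])

def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens; a mask value of 0 marks padding."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, x in enumerate(vec):
                total[i] += x
    if count == 0:
        raise ValueError("attention mask excludes every token")
    return [t / count for t in total]

# Four token positions with two dimensions each; the last one is padding.
token_vectors = [
    [1.0, 0.0],  # CLS position
    [0.0, 2.0],
    [2.0, 4.0],
    [9.0, 9.0],  # padding, ignored by mean_pool
]
mask = [1, 1, 1, 0]

cls_vector = cls_pool(token_vectors)
mean_vector = mean_pool(token_vectors, mask)
```

Which strategy applies is fixed by how the model was trained, so it is a property to look up, not a free choice.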
Dimensions: 384 (smallest reasonable value, e.g. all-MiniLM-L6-v2, very fast) up to 3072 (text-embedding-3-large, best quality). Rule of thumb: 768 or 1024 is the sweet spot for most SME use-cases. 3072 only pays off for very fine distinctions – for example two similar contract clauses whose difference matters legally.
Distance metrics: cosine similarity measures the angle between two vectors (1.0 = same direction, 0 = orthogonal, -1 = opposite). Dot product equals cosine similarity when both vectors are normalised (length 1). Euclidean distance measures the straight-line distance but becomes less discriminative in high dimensions (curse of dimensionality).
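A short sketch makes the difference between the three metrics concrete. The vectors are toy values chosen so the results are easy to verify by hand.

```python
import math

def dot_product(a, b):
    """Sum of element-wise products; grows with vector magnitude."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product divided by both lengths; depends only on the angle."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot_product(a, b) / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length

same_direction = cosine_similarity(a, b)           # scale-invariant
magnitude_sensitive = dot_product(a, b)            # changes with length
straight_line = euclidean_distance(a, b)
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
```

Here `a` and `b` point the same way, so cosine similarity treats them as a perfect match while dot product and Euclidean distance still register the difference in length.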
Model selection in Switzerland, as of May 2026:
- BGE-large-de (BAAI, open-source, 1024 dim): runs locally on Hetzner CPU, no data egress, top MTEB rank for German. Ideal for strict revDSG mandates.
- OpenAI text-embedding-3-small (1536 dim, truncatable to 256–1536): default if US hosting is acceptable, the low-cost model in the OpenAI family (text-embedding-3-large scores higher on multilingual benchmarks). Truncation ("Matryoshka") saves storage with little quality loss.
- Cohere embed-multilingual-v3 (1024 dim): the strongest choice when DE/FR/IT/EN are mixed; EU hosting via Cohere-EU available.
- Mistral Embed (1024 dim, EU-hosted): young, good for EU data residency.
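Matryoshka-style truncation, mentioned for text-embedding-3-small above, amounts to keeping the leading dimensions and re-normalising. The sketch below assumes the model was trained to support this; on a model that was not, cutting dimensions can hurt quality badly. The function name is hypothetical, and hosted APIs may offer the same thing as a request parameter instead.

```python
import math

def truncate_embedding(vec, dims):
    """Keep the first `dims` values and rescale the result to unit length."""
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and the vector length")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("truncated vector has zero length")
    return [x / norm for x in head]

# Toy 3-dimensional vector cut down to 2 dimensions.
short = truncate_embedding([3.0, 4.0, 12.0], 2)
```

Re-normalising matters because a truncated vector no longer has length 1, which would break the equivalence of dot product and cosine similarity.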
MTEB leaderboard (Massive Text Embedding Benchmark, Muennighoff et al. 2022): the running reference for embedding quality. Anyone making a choice should check the top 20 for the target language.
Embedding workflow in 6 steps
- 01 Clarify language and volume: mainly German? multilingual DE/FR/IT/EN? how many documents? That determines the model.
- 02 Choose the model: BGE-large-de for German local, text-embedding-3-small for cloud default, embed-multilingual-v3 for multilingual – validate against MTEB leaderboard for the target language.
- 03 Decide the distance metric: cosine is default; with normalised vectors, dot product for a performance gain.
- 04 Decide the dimensions trade-off: 1024 or 1536 as default; 384 only for very large corpora with latency pressure; 3072 only for very fine distinctions.
- 05 Implement the embedding pipeline: cut documents into chunks (300–800 tokens), push through the model, store vectors with metadata in a vector DB (Qdrant).
- 06 Measure quality: maintain 30–50 real example questions with expected results, monitor Recall@5 and MRR over time, trigger re-embedding on drift.
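The quality check in step 06 can be sketched in a few lines. The code below is one plausible way to compute Recall@5 and MRR, assuming each test question comes with a set of expected document ids and that `search` is whatever function returns a ranked list of ids. All names and the tiny test set are hypothetical.

```python
def recall_at_k(retrieved, relevant, k=5):
    """Share of the expected documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first expected document, or 0.0 if none was found."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def evaluate(test_set, search, k=5):
    """Average Recall@k and MRR over (question, expected_ids) pairs."""
    if not test_set:
        raise ValueError("test set is empty")
    recalls, ranks = [], []
    for question, expected in test_set:
        ranked = search(question)
        recalls.append(recall_at_k(ranked, expected, k))
        ranks.append(reciprocal_rank(ranked, expected))
    return {"recall_at_k": sum(recalls) / len(recalls), "mrr": sum(ranks) / len(ranks)}

# Canned results standing in for a real search backend.
fake_results = {
    "late payment rules": ["doc_b", "doc_a", "doc_c"],
    "moving away from Zurich": ["doc_x", "doc_z"],
}
test_set = [
    ("late payment rules", {"doc_a"}),
    ("moving away from Zurich", {"doc_x", "doc_y"}),
]
report = evaluate(test_set, lambda q: fake_results[q], k=5)
```

Running the same test set after every model or chunking change gives a comparable number over time, which is what makes drift visible.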
When to use embeddings
Embeddings are needed by any system that must understand text semantically: search, classification, clustering, duplicate detection, recommendation. In practice: a RAG system for client inquiries, a classifier for incoming email (payroll, tax, dunning), a duplicate check for receipts, similarity search for legal precedents.
Embeddings become especially useful when queries are vague or paraphrased. A client asks: "What do I need to consider when moving away from Zurich?" – the answer lives in a document titled "Change of residence and tax consequences Canton Zurich". Full-text search misses that; a decent embedding model finds it. For multilingual corpora (Swiss-typical DE/FR/IT/EN) embeddings beat the classic index.
When not to use
Not every search case needs embeddings. For structured data with clean fields (client number, date, amount) SQL stays the right choice – faster, cheaper, exact. For code search with exact symbols (function names, classes) ripgrep or Sourcegraph delivers more precise hits than any embedding model.
Embeddings fail when semantic proximity is misleading. An embedding sees "dissolve contract" and "draft contract" as very close – yet legally the difference is maximal. Such cases need either a cross-encoder reranker (BGE-reranker, Cohere Rerank 3) or extra filter logic. Embeddings alone are not precise enough for hard decisions.
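The two-stage pattern can be sketched as follows. The pair scorer here is a deliberately crude word-overlap count that only stands in for a cross-encoder; it is not how BGE-reranker or Cohere Rerank work, and all names and scores are hypothetical.

```python
def overlap_score(query, text):
    """Toy stand-in for a cross-encoder: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def retrieve_then_rerank(query, candidates, score_pair, shortlist=10, top_n=3):
    """Shortlist by embedding similarity, then re-order with a pair scorer."""
    # Stage 1: cheap shortlist using the precomputed embedding score.
    shortlisted = sorted(
        candidates, key=lambda c: c["embedding_score"], reverse=True
    )[:shortlist]
    # Stage 2: the slower scorer looks at query and text together.
    reranked = sorted(
        shortlisted, key=lambda c: score_pair(query, c["text"]), reverse=True
    )
    return reranked[:top_n]

# Invented scores: the embedding ranks the wrong document slightly higher.
candidates = [
    {"text": "how to draft a contract", "embedding_score": 0.93},
    {"text": "how to dissolve a contract", "embedding_score": 0.91},
    {"text": "office opening hours", "embedding_score": 0.12},
]
best = retrieve_then_rerank("dissolve contract", candidates, overlap_score, top_n=1)
```

The shortlist keeps the expensive scorer off most of the corpus; the reranker only has to separate candidates that the embedding already considers close.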
Trade-offs
STRENGTHS
- Semantic search instead of keyword match – also catches paraphrases
- Multilingual: a German question finds an English hit
- Has become very cheap (< CHF 5 per month for a typical SME corpus)
- Model choice by data residency: local (BGE) or cloud (OpenAI, Cohere, Mistral)
WEAKNESSES
- Model switch = re-embedding the entire corpus (no in-place upgrade)
- High dimensions cost storage and latency – trade-off non-trivial
- Embedding quality is language-dependent – not every model is good for German
- Semantically close is not legally correct – cross-encoder rerank or filtering required
FAQ
How do I switch embedding models without re-indexing everything?
You cannot. Embeddings are model-specific – vectors from model A are not compatible with model B because they live in different spaces. A model switch always means re-embedding the entire corpus. For 10,000 documents, that costs roughly USD 1 with text-embedding-3-small and takes less than an hour on a standard server. Tip: before production rollout, compare 2–3 models on a 200-document sample, then pick the right one.
How large are the vectors really on disk?
A 1024-dimensional float32 vector takes 4 KB (1024 values × 4 bytes). One million vectors = roughly 4 GB raw. With quantisation (Qdrant scalar quantisation, binary quantisation), this drops to 1 GB or less with surprisingly little recall loss (see Qdrant benchmarks). Index overhead comes on top (HNSW needs roughly 50% extra). Rule of thumb: budget twice the bare vector size.
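These numbers follow from simple multiplication. The sketch below uses the 50% index overhead quoted above as a default; that figure is a rough estimate, not a measured constant, and the function name is hypothetical.

```python
def vector_storage_bytes(num_vectors, dims, bytes_per_value=4, index_overhead=0.5):
    """Return (raw_bytes, bytes_including_index) for a vector collection."""
    raw = num_vectors * dims * bytes_per_value
    with_index = int(raw * (1 + index_overhead))
    return raw, with_index

# One million 1024-dimensional vectors as float32 (4 bytes per value).
raw_f32, total_f32 = vector_storage_bytes(1_000_000, 1024)

# The same vectors stored with 1 byte per value, as scalar quantisation to int8 would.
raw_int8, _ = vector_storage_bytes(1_000_000, 1024, bytes_per_value=1)
```

Budgeting twice the raw size leaves headroom beyond the index for metadata payloads and temporary copies during re-indexing.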
Can I train embeddings myself?
In theory yes, in practice rarely useful. General models have become so good that training your own only yields a measurable advantage in very special domains (patent texts, recipe databases, medical code). For 95% of SME cases, fine-tuning or cross-encoder reranking on a standard model is the better lever: less effort, more gain.
What does one million embeddings cost in May 2026?
With text-embedding-3-small (USD 0.02 per 1M tokens) and an average 200 tokens per embedding, that is 200M tokens = USD 4. With BGE-large locally: zero cloud cost, just the one-off server time (roughly 10–20 hours on a CPU machine). Re-embedding is the main cost line – monthly re-indexing for a mid-size corpus runs about USD 50–100 per year.
Sources
- Muennighoff et al., MTEB: Massive Text Embedding Benchmark (arXiv) · 2022-10
- OpenAI Embeddings – Models & Pricing (text-embedding-3-small/large) · 2026-04
- Cohere Embed v3 – multilingual model docs · 2026-03
- BAAI BGE – Open-Source Embedding Models · 2025-09
- MTEB Leaderboard (Hugging Face Spaces) · 2026-05