
What is a vector index? HNSW, IVF, ScaNN and quantisation (May 2026)

A vector index is the data structure inside a vector DB that finds similar embeddings fast. Choosing one means trading off recall, latency and memory.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What is a vector index?

A vector index is a data structure that lets a vector database find the embeddings most similar to a query in milliseconds. Without an index, the database would have to compare the query against every stored vector: for 10 million vectors of 1024 dimensions each, that is roughly 10 billion multiply-add operations per query, too slow for interactive use. The index organises the vectors so that search cost grows sublinearly, often roughly logarithmically, with the number of vectors instead of linearly.

The central idea is Approximate Nearest Neighbor (ANN). An exact search through all vectors would guarantee the true top-k hits but is too slow. ANN algorithms return almost all of the true hits (recall typically 95-99%) while running 100-1000x faster. The deliberate sacrifice of perfection is the trick.
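As a baseline for what ANN approximates, the exact search can be sketched in a few lines of plain Python. The function names and the choice of cosine similarity are assumptions made for illustration; the article does not prescribe a metric.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_top_k(query, vectors, k):
    """Brute-force search: score every stored vector, return the ids of the k most similar."""
    scored = ((cosine_similarity(query, v), i) for i, v in enumerate(vectors))
    return [i for _, i in heapq.nlargest(k, scored)]
```

Every query touches every vector, so the cost is linear in the corpus size. The result of such an exact search is also the ground truth against which the recall of an approximate index is measured.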

In May 2026 one algorithm dominates: HNSW (Hierarchical Navigable Small World). HNSW is the default or a standard index type in Qdrant, Weaviate, Milvus, pgvector (from version 0.5 onwards), Elasticsearch, OpenSearch and Pinecone. Other important algorithms: IVF (Inverted File, good for GPU-accelerated FAISS setups), Annoy (Spotify open source, older but extremely memory-efficient), ScaNN (Google, integrated into TensorFlow Recommenders, very fast on very large corpora). The choice depends on data volume, recall requirement and hardware.

Why it matters

Three properties of a vector index decide the success of a RAG system: recall, latency, memory.

Recall is the share of the true nearest neighbours that the index actually returns. A recall of 95% at top-k means that, on average, 5% of the true top-k hits are missing from the result list. For a RAG application in a fiduciary context with evidentiary obligations, 99% is the minimum; otherwise the system occasionally answers from the wrong document. Recall is controlled via index parameters (M and ef in HNSW; nlist and nprobe in IVF), and higher recall costs latency and memory.
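Recall at top-k can be computed by comparing the ids an index returns with the ids from an exact search over the same data. A minimal sketch, with hypothetical function names:

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k neighbours that the approximate index returned."""
    if not exact_ids:
        raise ValueError("exact_ids must not be empty")
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


def mean_recall(approx_results, exact_results):
    """Average recall over a batch of queries; both arguments are lists of id lists."""
    if len(approx_results) != len(exact_results):
        raise ValueError("need one approximate and one exact result per query")
    scores = [recall_at_k(a, e) for a, e in zip(approx_results, exact_results)]
    return sum(scores) / len(scores)
```

The order of the ids is ignored here; only membership in the true top-k counts.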

Latency is the response time per query. HNSW on modern CPUs reaches 1-5 ms for 1 million vectors, 5-30 ms for 10 million, 50-200 ms for 100 million, all at standard parameters. HNSW's search cost grows roughly logarithmically with corpus size in theory, and sublinearly in practice, rather than linearly as with a full scan. This makes HNSW scalable.

Memory is the RAM footprint of the index. HNSW typically needs 1.5-2x the size of the raw vector data in total: 10 million vectors with 1024 dimensions and 4-byte floats are about 41 GB raw, so roughly 60-80 GB with the index. Quantisation (see below) reduces this to 6-15 GB without meaningful recall loss.
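The arithmetic behind these figures is easy to reproduce. The sketch below assumes the 1.5-2x rule of thumb quoted above; the function names are illustrative.

```python
def raw_vector_bytes(n_vectors, dim, bits_per_dim=32):
    """Size of the bare vector data: 32 bits = float32, 8 = scalar, 1 = binary quantisation."""
    return n_vectors * dim * bits_per_dim // 8


def hnsw_ram_estimate(raw_bytes, overhead=(1.5, 2.0)):
    """Lower and upper RAM estimate using the 1.5-2x rule of thumb from the text."""
    return raw_bytes * overhead[0], raw_bytes * overhead[1]


raw = raw_vector_bytes(10_000_000, 1024)   # float32 vectors
low, high = hnsw_ram_estimate(raw)
print(f"raw: {raw / 1e9:.1f} GB, with index: {low / 1e9:.0f}-{high / 1e9:.0f} GB")
```

Changing `bits_per_dim` to 8 or 1 gives the raw size of the same corpus under scalar or binary quantisation.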

More factors: indexing speed (HNSW is slow to build, fast to search – IVF the other way), update capability (HNSW supports inserts/deletes with small performance loss; some older algorithms do not), hybrid search (vector search combined with filter conditions – in May 2026 strong in Qdrant and Weaviate, improving in pgvector with each release).

Algorithms in detail

Four algorithms dominate the field in May 2026.

HNSW (Hierarchical Navigable Small World, Malkov & Yashunin, 2018). Builds a multilayer graph. Upper layers are sparse (few nodes, long connections), lower layers dense. A search starts at the top, jumps roughly into the area of the target, then refines layer by layer downward. The main parameters are M (connections per node, default 16-32) and ef_construction and ef_search (size of the candidate list kept while building and while searching, default 100-500). Higher ef means more recall and more latency. HNSW is the standard for almost all use cases under 100 million vectors.
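The top-down search can be illustrated with a simplified sketch. It covers only the query side: the layered graphs are assumed to exist already (real implementations build them incrementally during insertion), `layers` is a hypothetical list of adjacency dicts ordered from the sparsest layer to the dense base layer, and squared Euclidean distance is an arbitrary choice.

```python
import heapq


def search_layer(query, entry, graph, vectors, ef):
    """Best-first search on one layer, keeping the ef closest nodes seen so far."""
    def dist(i):
        return sum((a - b) ** 2 for a, b in zip(vectors[i], query))

    visited = {entry}
    d0 = dist(entry)
    candidates = [(d0, entry)]   # min-heap: closest unexpanded node first
    results = [(-d0, entry)]     # max-heap via negated distance: worst kept node first
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -results[0][0]:
            break                # nothing left that could improve the result set
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(nb)
            if len(results) < ef or dn < -results[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(results, (-dn, nb))
                if len(results) > ef:
                    heapq.heappop(results)   # drop the current worst
    return [i for _, i in sorted((-nd, i) for nd, i in results)]


def hnsw_search(query, layers, vectors, entry, ef, k):
    """Greedy descent (ef=1) through the upper layers, then a wider search on the base layer."""
    for graph in layers[:-1]:
        entry = search_layer(query, entry, graph, vectors, ef=1)[0]
    return search_layer(query, entry, layers[-1], vectors, ef=max(ef, k))[:k]
```

A larger `ef` keeps more candidates alive on the base layer, which is why it raises both recall and latency.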

IVF (Inverted File, Sivic & Zisserman 2003, FAISS implementation). Splits the vector space into clusters (typically 1000-100000); each query searches only in the nearest nprobe clusters. Fast to index, good on GPU. Recall is worse than HNSW at equal resources, but for very large corpora (more than 100 million vectors) or GPU-accelerated search it is often the better choice. Known from FAISS, Milvus.
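The inverted-file idea fits in a short sketch. It assumes the cluster centroids are already given (libraries such as FAISS train them with k-means first) and uses squared Euclidean distance; all function names are illustrative.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(vectors, centroids):
    """Put every vector id into the inverted list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
        lists[nearest].append(i)
    return lists


def ivf_search(query, vectors, centroids, lists, nprobe, k):
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in lists[c]]
    return sorted(candidates, key=lambda i: sq_dist(query, vectors[i]))[:k]
```

With a small `nprobe`, a true neighbour that sits just across a cluster boundary is never examined, which is where the recall loss comes from. Setting `nprobe` to the number of clusters turns the search back into an exact scan.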

Annoy (Approximate Nearest Neighbors Oh Yeah, Spotify 2014). Builds multiple random-projection trees. Very memory-light, and the index file can be memory-mapped from disk. Weaker recall than HNSW, but still useful where memory is tight and updates are rare. Known from Spotify recommendation systems.

ScaNN (Scalable Nearest Neighbors, Google 2020). Quantisation-based, very aggressive. With the right configuration it achieves the highest throughput per hardware unit, but indexing is compute-intensive. Integrated in TensorFlow Recommenders and some Google Cloud products. Attractive for hyperscale setups, often overkill for SMEs with 1-10 million vectors.

Quantisation as a cross-cutting trend. In May 2026 HNSW is increasingly combined with quantisation. Binary quantisation reduces each dimension to 1 bit: 32x smaller than float32, with recall dropping to 80-90% without rescoring and returning to about 99% with it. Scalar quantisation stores each dimension in 8 or 4 bits: 4-8x smaller, with marginally lower recall. Qdrant 1.10+, Weaviate 1.25+ and pgvector 0.8 all support quantised HNSW variants. Memory savings 40-90%, recall loss below 1% with a rescoring step.
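Scalar quantisation to 8 bits can be sketched as a linear mapping of each component onto 0-255. The version below uses a single global value range for all dimensions, which is a simplifying assumption; production engines may choose ranges differently. Function names are illustrative.

```python
def sq8_train(vectors):
    """Global minimum and maximum over all components; defines the quantisation range."""
    lo = min(min(v) for v in vectors)
    hi = max(max(v) for v in vectors)
    return lo, hi


def sq8_encode(vector, lo, hi):
    """Map each float to one byte (0-255) by linear scaling, clamping out-of-range values."""
    scale = (hi - lo) / 255 or 1.0   # avoid division by zero when all values are equal
    return bytes(min(255, max(0, round((x - lo) / scale))) for x in vector)


def sq8_decode(code, lo, hi):
    """Approximate reconstruction of the original floats from the byte codes."""
    scale = (hi - lo) / 255 or 1.0
    return [lo + c * scale for c in code]
```

Each float32 component shrinks from 4 bytes to 1, the 4x saving mentioned above. For values inside the trained range, the reconstruction error per component is at most half a quantisation step.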

Index selection in 5 steps

01 Estimate data volume: today, in 12 months, in 36 months. Quantity (vectors) and dimension (typically 384, 768, 1024, 1536).
02 Set requirements: target recall (95% enough? 99% needed?), target latency (50 ms? 5 ms?), RAM budget, update frequency.
03 Pick the algorithm: < 100 million vectors -> HNSW. > 100 million or GPU on hand -> IVF in FAISS/Milvus. Memory tight -> Annoy or quantisation.
04 Quantisation strategy: scalar quantisation as default for 4-8x memory savings; binary quantisation + rescoring for very large corpora.
05 Measure recall: curate 50-200 Q&A pairs manually, iterate on index parameters (ef, M). Only then go live.
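Step 03 can be written down as a small decision function. The thresholds come from the list above; the order in which the rules are checked and the function name are assumptions, since the list does not say which rule wins when several apply.

```python
def pick_index(n_vectors, gpu_available=False, memory_tight=False):
    """Rule-of-thumb algorithm choice following step 03 (precedence is an assumption)."""
    # Exactly 100 million is not covered by the rules; it falls through to HNSW here.
    if n_vectors > 100_000_000 or gpu_available:
        return "IVF in FAISS/Milvus"
    if memory_tight:
        return "Annoy or quantised HNSW"
    return "HNSW"
```

For example, `pick_index(5_000_000)` returns `"HNSW"`, while `pick_index(5_000_000, memory_tight=True)` returns `"Annoy or quantised HNSW"`. The result is a starting point for step 05, not a substitute for measuring recall.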

Which index for which case

For 95% of all SME applications the answer is: HNSW with default parameters, optionally with scalar quantisation, in Qdrant. That is the right answer in May 2026 unless you have unusual constraints.

Concrete decision paths.

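1 million chunks, RAG over a fiduciary knowledge base. HNSW default in Qdrant. 2 ms latency, 99%+ recall. The 1.5 GB index in RAM assumes low-dimensional or quantised vectors; at 1024 dimensions and float32, the 1.5-2x rule above gives 6-8 GB. Quantisation optional.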

10 million chunks, RAG for insurance claims handling with images + text. HNSW with scalar quantisation in Qdrant or Weaviate. 6-10 GB index, 10-20 ms latency. If images need their own index: a second collection.

100 million chunks, group-wide knowledge base. HNSW with binary quantisation and rescoring; alternatively IVF in Milvus with GPU. Check whether sharding is needed (multiple shards, each holding a slice of the index and optionally replicated). With correct configuration, latency 30-100 ms.
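When an index is split into slices, a query is sent to every slice and the partial results are merged. A minimal sketch of the merge step, assuming each slice returns `(distance, doc_id)` pairs where a smaller distance is better; the function name is illustrative.

```python
import heapq


def merge_shard_results(shard_results, k):
    """Combine per-shard hit lists of (distance, doc_id) into the global top-k."""
    all_hits = (hit for hits in shard_results for hit in hits)
    return heapq.nsmallest(k, all_hits)
```

Each slice has to return its own top-k, not fewer, because in the worst case all k global hits live in the same slice.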

Very limited RAM, embedded system. Annoy or pgvector with binary quantisation. Recall lower, but mappable in 1-2 GB RAM.

Hybrid search (vector + SQL filter). Qdrant (very strong) or Weaviate. pgvector can do it but weakens with larger volumes. Elasticsearch and OpenSearch can do it natively but are slower than specialised DBs under pure vector load.
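What a filtered vector query has to compute can be shown with a naive sketch: apply the filter first, then rank the remaining items by distance. The item layout (`id`, `vector`, `meta`) and the function name are hypothetical, and the exact ranking over all survivors stands in for what a real engine does inside its index.

```python
def filtered_search(query, items, predicate, k):
    """Pre-filter on metadata, then rank the survivors by squared Euclidean distance."""
    def dist(item):
        return sum((a - b) ** 2 for a, b in zip(item["vector"], query))

    survivors = [item for item in items if predicate(item["meta"])]
    return [item["id"] for item in sorted(survivors, key=dist)[:k]]
```

A call such as `filtered_search(q, items, lambda m: m["status"] == "CLOSED", 5)` returns only closed files, even if an open one is closer to the query. The difficulty the databases differ on is doing this efficiently when the filter removes most of the corpus.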

Several languages (DE/EN/FR/IT). The index choice is neutral; what matters is the embedding model (Cohere embed-multilingual-v3, BGE-m3, E5-multilingual). That is a different question – see the entry on embeddings and vectors (embeddings-und-vektoren).

When no vector index at all

Three cases where a vector index is redundant or harmful.

First: too little data. Under 1000 vectors a brute-force search completes in under 1 ms. HNSW or IVF pay off only from a few thousand vectors onwards – before that the index overhead is larger than the benefit. Qdrant detects this and automatically uses brute force below a threshold.

Second: keyword search is enough. If the application is primarily about exact hits or boolean logic ("all client files with contract date 2024 and status CLOSED"), classical full-text or SQL search is faster and more precise. Vector indexes shine for semantic proximity ("texts that match the question by MEANING"), not for exact filter logic. Hybrid search combines both.

Third: heavily changing data without update capability. HNSW and ScaNN support updates, but frequent incremental indexing costs performance. If your corpus changes by 10,000 entries per minute, either re-index periodically or pick an implementation with proper update support – pgvector with HNSW has become usable for this in May 2026.

Another trap: blindly accepting default parameters without measuring recall. A test suite of 50-200 manually verified Q&A pairs shows whether the index meets the requirement. Without tests you measure nothing and simply trust the defaults, which is dangerous for compliance-relevant applications.
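Such a test suite needs very little code. The sketch below assumes a `search(query, k)` callable that wraps whatever index is under test and returns document ids, and a list of curated `(query, expected_id)` pairs; both names, like the helper functions, are hypothetical.

```python
def hit_rate(qa_pairs, search, k=10):
    """Share of curated queries whose expected document id appears in the top-k results."""
    if not qa_pairs:
        raise ValueError("qa_pairs must not be empty")
    hits = sum(1 for query, expected_id in qa_pairs if expected_id in search(query, k))
    return hits / len(qa_pairs)


def first_passing(settings, evaluate, target):
    """First setting (for example ef values in ascending order) whose score meets the target."""
    for setting in settings:
        if evaluate(setting) >= target:
            return setting
    return None
```

Passing ef values in ascending order to `first_passing` yields the cheapest setting that still meets the recall requirement; `None` means no tested setting is good enough.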

Trade-offs

STRENGTHS

Sub-second search over millions of vectors – makes RAG practical at all
HNSW is a very robust default – little tuning needed for the 95% case
Quantisation mature in 2026 – 40-90% memory savings with under 1% recall loss when rescoring
Algorithms open source, implementations swappable (FAISS, hnswlib, Qdrant)

WEAKNESSES

Approximate, not exact – 1-5% of hits missing depending on parameters
Indexing can be slow (HNSW: minutes to hours for millions of vectors)
RAM footprint large without quantisation – 1.5-2x the vector data
Parameter tuning needs your own test suite – defaults are not always optimal

FAQ

Qdrant vs pgvector – which is faster in May 2026?

Qdrant with HNSW is 2-5x faster than pgvector with HNSW (pgvector 0.8) at equal parameters under pure vector load. Pgvector's advantage: vectors AND classical columns live in the same Postgres DB – a single source of truth. Below 5 million vectors and with existing Postgres infrastructure, pgvector is the lower-maintenance choice. From 10 million vectors or with high update frequency, Qdrant is the performance pick.

What does binary quantisation cost in recall?

Without rescoring, recall typically drops from 99% to 80-90% – unsuitable for precise applications. WITH rescoring (search the top 100 with the binary index, then re-sort with the original vectors) recall returns to 98-99% with an index roughly 32x smaller than float32. Qdrant 1.10+ does this natively. In practice: 10 million vectors x 1024 dim x 4 bytes = about 41 GB -> binary index 1.3 GB plus original vectors on disk. Rescoring costs 5-15 ms extra latency.
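The two-stage procedure can be sketched as follows. The sketch assumes sign-based binarisation (one bit per dimension, set when the value is positive) and dot-product rescoring on normalised embeddings; both are illustrative choices, as are the function names, and the shortlist is found by a linear Hamming scan instead of an index.

```python
def to_bits(vector):
    """Pack the sign of each dimension into one Python int (1 bit per dimension)."""
    bits = 0
    for x in vector:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits


def hamming(a, b):
    """Number of bit positions in which two codes differ."""
    return bin(a ^ b).count("1")


def binary_search_with_rescoring(query, vectors, codes, k, shortlist=100):
    """Stage 1: Hamming shortlist on 1-bit codes. Stage 2: re-sort with the original floats."""
    q_bits = to_bits(query)
    candidates = sorted(range(len(codes)), key=lambda i: hamming(q_bits, codes[i]))[:shortlist]

    def neg_dot(i):
        return -sum(a * b for a, b in zip(query, vectors[i]))

    return sorted(candidates, key=neg_dot)[:k]
```

Only the shortlisted vectors are read at full precision in stage 2, which is why the original vectors can stay on disk while the 1-bit codes sit in RAM.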

Can I run a vector index on SQLite or MariaDB?

SQLite via sqlite-vec (stable since 2024) yes, usable up to about 100k vectors. MariaDB has its own VECTOR column with HNSW from 11.7 (2024), but as of May 2026 less mature in production than pgvector. For fiduciary/SME setups with < 50k documents and existing MariaDB infrastructure either is an option. For anything larger or for hybrid search, Qdrant or pgvector remains the better choice.

