Milvus: cluster vector database for more than one billion vectors

Milvus is an Apache-2.0 vector DB with separated compute and storage layers. GPU acceleration, HNSW plus IVF plus DiskANN, for volumes from 100M vectors.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What is Milvus?

Milvus is an open-source vector database under Apache 2.0, developed since 2019 by Zilliz. It has been hosted by the LF AI & Data Foundation since 2020 and graduated to top-level status there in 2021. As of May 2026, version 2.5+ is current. Milvus differs from Qdrant and Weaviate through a disaggregated cluster architecture: compute, storage, and coordinator layers run as separate services that scale independently. This separation allows setups where pure vector search runs on many read replicas while updates and index builds attach to a small compute layer.

Milvus supports several index types in parallel: HNSW (default, cosine/dot/Euclidean), IVF_FLAT (classical inverted file), IVF_PQ (product quantisation for smaller RAM footprints), DiskANN (SSD-optimised for datasets that do not fit in RAM), and GPU_IVF_FLAT plus GPU_IVF_PQ (GPU indexes built on NVIDIA's RAFT library, now part of cuVS). Index choice is per collection and can be tuned to the workload. For typical Swiss SME loads, HNSW is enough; for volumes from 500M vectors with tight RAM, DiskANN pays off.
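To make the per-collection index choice concrete, the following is a minimal sketch of build-time parameter sets for the CPU index types named above. The parameter names (M, efConstruction, nlist, m, nbits) follow the Milvus index documentation; the values, the helper name index_spec, and the rule of 8 dimensions per PQ sub-vector are illustrative assumptions, not tuned recommendations.

```python
from typing import Any


def index_spec(index_type: str, dim: int, metric_type: str = "COSINE") -> dict[str, Any]:
    """Return an illustrative build configuration for one Milvus index type."""
    if index_type == "HNSW":
        # Graph degree and build-time candidate list size
        params: dict[str, Any] = {"M": 16, "efConstruction": 200}
    elif index_type == "IVF_FLAT":
        # Number of coarse clusters
        params = {"nlist": 1024}
    elif index_type == "IVF_PQ":
        # Assumption of this sketch: 8 dimensions per PQ sub-vector, 8-bit codes
        if dim % 8 != 0:
            raise ValueError("this sketch needs a dimension divisible by 8 for IVF_PQ")
        params = {"nlist": 1024, "m": dim // 8, "nbits": 8}
    elif index_type == "DISKANN":
        # No build-time parameters are set per index in this sketch
        params = {}
    else:
        raise ValueError(f"unsupported index type: {index_type}")
    return {"index_type": index_type, "metric_type": metric_type, "params": params}


print(index_spec("IVF_PQ", dim=1536))
```

The returned dictionary has the shape that PyMilvus index definitions use (index type, metric type, parameter dict), so it can be handed to the index setup of a collection after review.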

The cluster architecture leans on etcd (coordinator state), MinIO/S3 (storage), Pulsar/Kafka (message queue), and several Milvus-native pods (query, data, and index nodes plus coordinators such as root-coord). For smaller setups, Milvus Lite has existed since 2024 as an embedded Python pip package with a DuckDB-like profile. Milvus Standalone bundles all Milvus services into a single-node Docker deployment without the external message queue; etcd and object storage are embedded or run as sidecar containers.

The managed cloud is called Zilliz Cloud and is available in several AWS and GCP regions, including eu-central-1 (Frankfurt). Self-hosted runs on Kubernetes via Helm chart or as a Docker-Compose stack.
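From the client side, the deployment variants differ mainly in the connection target. The sketch below shows this under assumptions: the file name, the host names, and the placeholder endpoint and token are hypothetical, and the helper connection_kwargs is not part of PyMilvus. It only builds the keyword arguments that would be passed to pymilvus.MilvusClient.

```python
def connection_kwargs(mode: str) -> dict[str, str]:
    """Return MilvusClient keyword arguments for a deployment mode (illustrative values)."""
    if mode == "lite":
        # Milvus Lite: a local database file instead of a server address
        return {"uri": "./milvus_lite.db"}
    if mode == "standalone":
        # Single-node Docker deployment on the default port
        return {"uri": "http://localhost:19530"}
    if mode == "cluster":
        # Hypothetical in-cluster service name of the Milvus proxy
        return {"uri": "http://milvus:19530"}
    if mode == "zilliz":
        # Placeholders; real values come from the Zilliz Cloud console
        return {"uri": "https://<cluster-endpoint>", "token": "<api-key>"}
    raise ValueError(f"unknown deployment mode: {mode}")


print(connection_kwargs("lite"))
```

With PyMilvus installed, MilvusClient(**connection_kwargs("standalone")) would open the connection; the rest of the client code stays largely the same across modes.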

Why it matters

Milvus becomes relevant once vector load exceeds the limits of a single-node DB. Three scenarios trigger this: a client base with millions of documents, a platform with hundreds of clients and a data island each, or a research use case with embeddings over very large corpora (e.g. 30 years of Swiss jurisprudence).

For a five-person fiduciary office, Milvus is typically oversized. Once a platform scales to multi-client setups storing 1-10M vectors per client, the architectural decision shifts. Milvus has two properties that matter in this profile.

First: compute-storage separation. Read load scales over additional query nodes without touching the storage layer. Backup runs on the storage side (S3-compatible object store), independent of the compute side. For an audit under Art. 957a CO, clean separation of data and processing is helpful – the data layer can be versioned and backed up independently.

Second: index flexibility. A collection can start with HNSW and later re-index to DiskANN if the RAM footprint becomes too large. IVF_PQ reduces the RAM footprint by a factor of 10-30 at a moderate recall cost. For setups where hardware cost must be weighed against recall quality, Milvus offers more levers than Qdrant or Weaviate.
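The factor of 10-30 can be checked with simple arithmetic. The sketch below compares raw float32 storage with product-quantised codes for 1536-dimensional vectors; the choice of m = 256 sub-vectors, 8-bit codes, and 100M vectors is an assumption for illustration. Codebooks, IDs, and coarse centroids are ignored, so real footprints are somewhat higher.

```python
def raw_bytes_per_vector(dim: int) -> int:
    """Uncompressed float32 vector: 4 bytes per dimension."""
    return dim * 4


def pq_bytes_per_vector(m: int, nbits: int = 8) -> float:
    """PQ code: one nbits-wide code per sub-vector."""
    return m * nbits / 8


dim, m, n_vectors = 1536, 256, 100_000_000

raw = raw_bytes_per_vector(dim)   # 6144 bytes
pq = pq_bytes_per_vector(m)       # 256 bytes
ratio = raw / pq                  # compression factor

print(f"raw: {raw * n_vectors / 1e9:.1f} GB")
print(f"IVF_PQ codes: {pq * n_vectors / 1e9:.1f} GB")
print(f"factor: {ratio:.0f}x")
```

With these assumptions the factor is 24x, which sits inside the stated 10-30 range; fewer sub-vectors compress harder and cost more recall.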

Operational overhead is markedly higher than with Qdrant. Anyone running Milvus in production needs Kubernetes experience, an understanding of etcd backups, and monitoring across several service layers. For a Swiss fiduciary without platform ambition, that is rarely justified; for a vertical platform with a scaling plan, it is often the right choice.

How it works

Milvus follows a service-oriented cluster model. A collection is created via the Python, Java, Go, Node, or REST client with a schema definition: fields (vector, int64, varchar, json, …) and index configuration.

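Example via PyMilvus, with an explicit schema so that the HNSW settings are passed as an IndexParams object:

    from pymilvus import DataType, MilvusClient

    client = MilvusClient(uri="http://milvus:19530")
    # Dynamic field keeps doc_id, client_id, and content schemaless
    schema = client.create_schema(auto_id=True, enable_dynamic_field=True)
    schema.add_field("id", DataType.INT64, is_primary=True)
    schema.add_field("vector", DataType.FLOAT_VECTOR, dim=1536)
    # HNSW index with cosine distance on the vector field
    index_params = client.prepare_index_params()
    index_params.add_index(field_name="vector", index_type="HNSW", metric_type="COSINE", params={"M": 16, "efConstruction": 200})
    client.create_collection(collection_name="docs", schema=schema, index_params=index_params)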

Insert via client.insert(collection_name="docs", data=[{"vector": [...], "doc_id": "x", "client_id": 42, "content": "..."}]). Milvus accepts batches up to 10 MB per request; for larger loads, bulk import via S3 file (Parquet, JSON) pays off.
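A simple way to stay under the per-request limit is to split rows into size-bounded batches on the client. The following is a minimal sketch: the helper names batches and insert_all are hypothetical, the JSON-encoded length is only a rough proxy for the actual wire size, and the client argument is assumed to be a pymilvus.MilvusClient (or any object with a compatible insert method).

```python
import json
from typing import Any, Iterable, Iterator

MAX_BATCH_BYTES = 10 * 1024 * 1024  # the 10 MB per-request figure from the text


def batches(rows: Iterable[dict[str, Any]], max_bytes: int = MAX_BATCH_BYTES) -> Iterator[list[dict[str, Any]]]:
    """Yield lists of rows whose estimated size stays at or below max_bytes."""
    batch: list[dict[str, Any]] = []
    size = 0
    for row in rows:
        # Rough size estimate; the real request encoding differs
        row_size = len(json.dumps(row).encode("utf-8"))
        if batch and size + row_size > max_bytes:
            yield batch
            batch, size = [], 0
        # A single oversized row still goes out alone in its own batch
        batch.append(row)
        size += row_size
    if batch:
        yield batch


def insert_all(client: Any, collection_name: str, rows: Iterable[dict[str, Any]]) -> int:
    """Insert rows batch by batch and return the number of rows sent."""
    total = 0
    for batch in batches(rows):
        client.insert(collection_name=collection_name, data=batch)
        total += len(batch)
    return total
```

For loads well beyond a few hundred megabytes, the bulk import path mentioned above is likely the better fit than many small insert calls.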

Search via search(collection_name="docs", data=[query_vector], limit=10, filter="client_id == 42", output_fields=["content"]). Filters are written as boolean expressions (Milvus-native syntax with ==, in, like). Milvus applies the filter before the top-k pass, not after it: entities that fail the expression are excluded from the vector search. A scalar index on the filtered fields speeds up the filter evaluation.
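Filter expressions are plain strings, so it helps to build them in one place instead of concatenating by hand. The sketch below is one possible approach: the helpers eq, is_in, all_of, and search_for_client are hypothetical names, doc_type is an assumed field, and the client argument is assumed to be a pymilvus.MilvusClient.

```python
import json
from typing import Any, Iterable


def eq(field: str, value: Any) -> str:
    """Equality clause; json.dumps double-quotes strings and leaves numbers bare."""
    return f"{field} == {json.dumps(value, ensure_ascii=False)}"


def is_in(field: str, values: Iterable[Any]) -> str:
    """Membership clause against a literal list."""
    return f"{field} in {json.dumps(list(values), ensure_ascii=False)}"


def all_of(*clauses: str) -> str:
    """Combine clauses with a logical and, parenthesised to keep precedence explicit."""
    return " and ".join(f"({clause})" for clause in clauses)


def search_for_client(client: Any, query_vector: list[float], client_id: int, limit: int = 10) -> Any:
    """Run a filtered top-k search restricted to one client_id."""
    return client.search(
        collection_name="docs",
        data=[query_vector],
        limit=limit,
        filter=eq("client_id", client_id),
        output_fields=["content"],
    )


print(all_of(eq("client_id", 42), is_in("doc_type", ["invoice", "contract"])))
```

Keeping the expression builder separate from the search call makes it easier to test filters without a running Milvus instance.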

Multi-tenant separation runs via partitions or separate collections. Partitions are logical subsets of a collection with their own filter context – a query can be limited to a specific partition, which is faster in index lookup than a filter over the whole set.
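The partition-per-tenant pattern can be sketched as follows. The naming scheme client_<id> and the helper functions are assumptions for illustration; the client argument is assumed to be a pymilvus.MilvusClient, whose has_partition, create_partition, and search methods are used with keyword arguments.

```python
from typing import Any


def partition_name(client_id: int) -> str:
    """Hypothetical naming scheme: one partition per tenant."""
    return f"client_{client_id}"


def ensure_partition(client: Any, collection_name: str, client_id: int) -> str:
    """Create the tenant's partition if it does not exist yet and return its name."""
    name = partition_name(client_id)
    if not client.has_partition(collection_name=collection_name, partition_name=name):
        client.create_partition(collection_name=collection_name, partition_name=name)
    return name


def search_in_partition(client: Any, collection_name: str, query_vector: list[float],
                        client_id: int, limit: int = 10) -> Any:
    """Restrict the vector search to a single tenant's partition."""
    return client.search(
        collection_name=collection_name,
        data=[query_vector],
        limit=limit,
        partition_names=[partition_name(client_id)],
        output_fields=["content"],
    )
```

Inserts would target the same partition name, so that each tenant's rows and queries stay within one logical subset of the collection.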

GPU indexes run on the milvus-gpu image and require Nvidia CUDA 12+. A collection with GPU_IVF_FLAT runs 5-10x faster in search than the CPU equivalent, but costs a GPU in the cluster (typically A10 or L4 for inference use cases).

Backup runs via the milvus-backup tool: a collection is written as a snapshot to S3 (schema plus data plus index metadata). Recovery pulls the snapshot back. Replication to a second cluster or region is possible via the separate Milvus-CDC tool, but it is overkill for most setups.

Milvus to production in 5 steps

01 Architecture decision: Milvus Lite (embedded, < 1M) vs Standalone (single Docker, < 50M) vs Cluster (Kubernetes, > 50M). Estimate data volume honestly.
02 Deploy Helm chart or Docker-Compose; set up etcd, MinIO, and Pulsar as coordinator and storage layer. Persistent volumes for every service.
03 Plan the collection: dimension, distance, index (HNSW default, IVF_PQ for tight RAM, DiskANN for very large sets, GPU_* only with Nvidia hardware).
04 Set up partitions or separate collections for multi-tenant; create scalar indexes on filter fields, otherwise filtered search does not scale.
05 Configure backups via the milvus-backup tool into S3-compatible storage; monitor via Prometheus on query, index, and coordinator metrics.
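Steps 01 and 03 above amount to two small decision rules, which can be written down as a sketch. The thresholds (1M, 50M, 500M vectors) are the ones stated in this article; the function names and the boolean flags are hypothetical, and real sizing also depends on dimension, QPS, and hardware.

```python
def deployment_tier(n_vectors: int) -> str:
    """Step 01: pick the deployment variant from the expected vector count."""
    if n_vectors < 1_000_000:
        return "lite"
    if n_vectors < 50_000_000:
        return "standalone"
    return "cluster"


def choose_index(n_vectors: int, ram_is_tight: bool = False,
                 has_nvidia_gpu: bool = False, needs_very_high_qps: bool = False) -> str:
    """Step 03: pick an index type following the rules of thumb in the text."""
    if has_nvidia_gpu and needs_very_high_qps:
        return "GPU_IVF_FLAT"
    if ram_is_tight and n_vectors >= 500_000_000:
        return "DISKANN"
    if ram_is_tight:
        return "IVF_PQ"
    return "HNSW"


# Example: 200 clients with 500,000 vectors each
total = 200 * 500_000
print(total, deployment_tier(total), choose_index(total))
```

Encoding the rules this way mainly forces the data-volume estimate to be explicit before any infrastructure is deployed.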

When to use Milvus

Milvus fits when (a) data volume sits permanently above 50-100M vectors, (b) read load must scale independently of write load, (c) GPU acceleration is needed for very high search QPS, or (d) multi-region replication is part of the requirement.

Concrete cases: a vertical platform for fiduciary offices with 200 clients and 500,000 vectors each, 100M vectors in total. As a collection-per-client structure this ends up at 200 collections, which Qdrant still handles, but Milvus models it more cleanly as partitions of a single collection. A legal retrieval system over Swiss Federal Court rulings of the last 30 years that must deliver sub-5 ms response times at 50M vectors via GPU indexes. A recommendation system with non-linearly rising load whose compute layer must autoscale.

Zilliz Cloud (managed) fits setups where Kubernetes operation is not wanted and a US/EU cloud provider is acceptable. The EU region in Frankfurt covers most DACH cases; for strict Swiss cases, self-hosted on Hetzner or Infomaniak remains the clean choice.

When not to use

Below 10M vectors, Milvus is clearly oversized. Qdrant or pgvector solve the task at a fraction of the operational cost. A Milvus cluster with etcd, MinIO, and Pulsar needs 4-8 GB RAM baseline for coordinator services alone; at 100,000 vectors that is absurd.

If the team has no Kubernetes know-how, Milvus is the wrong choice. Milvus Standalone (single-node Docker) reduces complexity but drops the cluster advantages – anyone running Standalone is typically better off with Qdrant.

For pure multi-modal use cases without scaling pressure, Weaviate is the better fit. Milvus has no built-in embedding modules – the vector must be computed externally and passed in. For hybrid search, Milvus has offered sparse-plus-dense retrieval since version 2.4 and built-in BM25 full-text search since 2.5, but Weaviate's GraphQL idiom is more mature.

Milvus Lite is interesting for notebook use cases – as an embedded Python package, it compares to Chroma. But Lite is not API-compatible with cluster Milvus in all details; a direct Lite-to-cluster path does not exist without reconfiguration.

Trade-offs

STRENGTHS

Disaggregated architecture – read and write scale independently
Multiple index types (HNSW, IVF, DiskANN, GPU) selectable per collection
GPU indexes for very high search QPS, built on NVIDIA RAFT/cuVS
Apache 2.0, LF AI & Data Foundation, active community

WEAKNESSES

High operational overhead with etcd, MinIO, Pulsar as cluster dependencies
No built-in embedding module like Weaviate
Single-node standalone drops the central cluster benefits
Oversized for SMEs below 10M vectors

FAQ

What is the difference between Milvus and Zilliz Cloud?

Milvus is the open-source software under Apache 2.0. Zilliz Cloud is the managed SaaS offering by Zilliz (the main Milvus developer). Zilliz Cloud removes operational overhead – etcd, MinIO, Pulsar, worker scaling – for monthly costs from USD 65 per compute unit. EU region Frankfurt available; strict Swiss cases stay self-hosted.

When does GPU indexing pay off?

Only above several hundred million vectors with high search QPS (> 1000/second). For typical fiduciary setups with < 10M vectors and < 100 queries per hour, CPU-HNSW is fully sufficient. GPU pays off only when hardware costs for multiple CPU replicas exceed the GPU cost.
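The break-even rule in the last sentence can be expressed as a short calculation. All figures below (QPS per node, monthly node prices) are hypothetical placeholders to be replaced with measured throughput and actual prices; the function names are likewise illustrative.

```python
import math


def nodes_needed(target_qps: float, qps_per_node: float) -> int:
    """Number of nodes required to serve the target query rate (at least one)."""
    return max(1, math.ceil(target_qps / qps_per_node))


def gpu_pays_off(target_qps: float, qps_per_cpu_replica: float, cpu_replica_cost: float,
                 qps_per_gpu_node: float, gpu_node_cost: float) -> bool:
    """True when the CPU replica fleet costs more than the GPU nodes for the same load."""
    cpu_cost = nodes_needed(target_qps, qps_per_cpu_replica) * cpu_replica_cost
    gpu_cost = nodes_needed(target_qps, qps_per_gpu_node) * gpu_node_cost
    return cpu_cost > gpu_cost


# Hypothetical numbers: 150 QPS per CPU replica at 400 per month,
# 1200 QPS per GPU node at 1500 per month
high_load = gpu_pays_off(1000, 150, 400, 1200, 1500)
low_load = gpu_pays_off(100 / 3600, 150, 400, 1200, 1500)  # 100 queries per hour
print(high_load, low_load)
```

Under these assumed numbers, 1000 queries per second needs seven CPU replicas against one GPU node, while 100 queries per hour is served by a single CPU replica and the GPU never amortises.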

How operationally heavy is the cluster?

Noticeably so. A production Milvus cluster needs continuous monitoring across 6-8 service pods (root-coord, query node, data node, index node, proxy plus etcd, MinIO, Pulsar). Updates must run in the correct order. Backup has multiple components. Plan: 4-8 hours setup, 2-4 hours monthly operational overhead on a stable system. If that is not feasible: Zilliz Cloud or Qdrant.

Sources

Milvus documentation – architecture, index types, partitions · 2026-05
milvus-io/milvus – GitHub releases v2.5+ · 2026-05
Zilliz Cloud pricing and EU region availability · 2026-05
Milvus blog – GPU indexes via cuVS and Raft · 2026-04
ANN-Benchmarks – Milvus performance comparison · 2026-03
