VESPA · TECH
Vespa: search engine with tensor ranking for complex hybrid pipelines
Vespa is an Apache-2.0 search engine written in Java and C++, from the Yahoo ecosystem. Tensor ranking, structured plus vector plus full text in one query. Steep learning curve.
Researched & fact-checked by: DuneDive LLC · As of: 2026-05
What is Vespa?
Vespa is an open-source search and recommendation engine under Apache 2.0. Its roots are in the Yahoo ecosystem (originally AllTheWeb, then Yahoo Search). It has been available as an open-source project since 2017 and is led by the spin-off Vespa.ai. As of May 2026, version 8.x is current. Vespa is written in Java (the stateless container layer) and C++ (the content nodes and other performance-critical paths).
Vespa differs from dedicated vector DBs like Qdrant or Weaviate conceptually. It is primarily a search engine – comparable to Elasticsearch, Solr, OpenSearch – that offers ANN vector search as one of many ranking components. The architecture separates document storage, indexing, and query processing into their own service layers. This separation enables very complex ranking pipelines: a score can combine vector distance, BM25 full text, geographic distance, time decay, personalisation signals, and ML models – all in a single query, in one ranking expression.
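How several signals end up in a single score can be sketched in a few lines of Python. This is not Vespa code, only the arithmetic a ranking expression would express. The weights, the half-life, and the function names are illustrative assumptions; the closeness mapping 1 / (1 + distance) follows the definition of Vespa's closeness rank feature.

```python
def closeness(distance: float) -> float:
    """Map a distance to (0, 1]; mirrors Vespa's closeness = 1 / (1 + distance)."""
    return 1.0 / (1.0 + distance)


def time_decay(age_days: float, half_life_days: float) -> float:
    """Exponential decay: 1.0 for a brand-new document, 0.5 after one half-life."""
    return 0.5 ** (age_days / half_life_days)


def combined_score(bm25: float, distance: float, age_days: float,
                   w_text: float = 0.4, w_vec: float = 0.4, w_fresh: float = 0.2,
                   half_life_days: float = 365.0) -> float:
    """Weighted sum of text, vector, and freshness signals (weights are assumptions)."""
    return (w_text * bm25
            + w_vec * closeness(distance)
            + w_fresh * time_decay(age_days, half_life_days))


# One document: BM25 score 2.0, vector distance 1.0, one year old.
score = combined_score(bm25=2.0, distance=1.0, age_days=365.0)
print(round(score, 3))
```

BM25 scores are unbounded while closeness and decay stay between 0 and 1, so a real profile would usually normalise or re-weight the text signal before summing.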
The ranking system is Vespa's unique selling point. A schema definition describes not only fields but also ranking profiles with tensor operations. Tensor operations are algebraic expressions on multi-dimensional arrays – from dot product through matrix multiply to embedded ONNX models. A RAG system can, for example, perform an initial retrieval step via BM25 plus vector ANN and, in a second step, run a cross-encoder model as reranker directly in the Vespa query.
The cloud variant Vespa Cloud has existed since 2019 with regions in the USA, Europe (Sweden), and Asia. Self-hosted runs on Kubernetes via the Vespa Helm chart or as a multi-node Docker setup. A single-node variant exists for small setups and dev environments.
For Swiss fiduciary and SME setups, Vespa is rarely the first choice. The learning curve is steep, operational overhead high, and the feature set for pure vector search oversized. Vespa pays off when complex multi-signal ranking pipelines are needed – platforms where personalisation, semantic similarity, and full text are weighted simultaneously.
Why it matters
Most RAG setups in Switzerland do not need Vespa. A fiduciary office with 5,000 receipts and semantic search is fine with Qdrant or pgvector. Vespa starts shining when three conditions coincide: large data volume (typically from 50M documents), multiple search signals that must be weighted together, and a platform requirement with hard response-time guarantees.
Three use cases illustrate this. First: a legal platform that indexes precedents, scholarly commentary, and statutes over 50 years and weights them in one query – semantically similar (vector), literally matching (BM25), recent (time decay), from the relevant jurisdiction (filter), preferred in the client's home law (personalisation). Vespa can express all this in one query; in Qdrant it would be a multi-stage pipeline with several services.
Second: a recommendation platform for fiduciary client acquisition that ranks clients by multiple profile vectors (industry, region, size, prior inquiries) and a vector distance to existing success cases. Tensor ranking allows multi-vector distances in one score.
Third: a sub-50ms search platform with 100M entries and 5,000 QPS. Vespa's architecture scales horizontally – content nodes with shards, query nodes as a stateless layer, automatic routing. Qdrant and Weaviate also scale, but Vespa has more operational maturity here (Yahoo, Yahoo News, Yahoo Finance run on Vespa).
The threshold past which Vespa pays off is high. Anyone searching 1M vectors with a filter is productive in a fraction of the time with Qdrant. Vespa justifies its learning curve and operational overhead only when ranking complexity exceeds what hybrid search in Weaviate or Elasticsearch delivers.
How it works
Vespa is deployed as an application package – a directory with schemas, services configuration, ranking profiles, and optional ONNX models. The package is pushed to the Vespa instance via the Vespa CLI:
    vespa config set application example.myapp.docs
    vespa deploy --wait 300 my-app/
A minimal schema:
    schema doc {
        document doc {
            field id type long {
                indexing: summary | attribute
            }
            field client_id type int {
                indexing: summary | attribute
            }
            field title type string {
                indexing: index | summary
            }
            field content type string {
                indexing: index | summary
                # needed if a rank profile uses bm25(content)
                index: enable-bm25
            }
            field embedding type tensor<float>(x[1536]) {
                indexing: attribute | index
                attribute {
                    distance-metric: angular
                }
            }
        }
        rank-profile semantic {
            inputs {
                query(q) tensor<float>(x[1536])
            }
            first-phase {
                expression: closeness(field, embedding)
            }
        }
    }
The schema defines a document with full-text fields, a vector field (tensor with dimension 1536), and a ranking profile semantic that uses vector distance as score.
Feed: vespa feed docs.jsonl with one line per document in Vespa document format.
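The feed file can be produced with a few lines of Python. The sketch below builds one JSONL line as a put operation for the doc schema; the namespace myapp, the helper name to_feed_line, and the sample values are assumptions for illustration, and the all-zero embedding only stands in for a real vector.

```python
import json


def to_feed_line(doc_id: int, client_id: int, title: str, content: str,
                 embedding: list[float], namespace: str = "myapp") -> str:
    """Build one JSONL line holding a Vespa 'put' operation for document type 'doc'."""
    operation = {
        # Document ID scheme: id:<namespace>:<document-type>::<user-specified part>
        "put": f"id:{namespace}:doc::{doc_id}",
        "fields": {
            "id": doc_id,
            "client_id": client_id,
            "title": title,
            "content": content,
            # Indexed tensor fed as a flat list of cell values.
            "embedding": {"values": embedding},
        },
    }
    return json.dumps(operation)


# Placeholder vector with the schema's dimension (1536); real values come from an embedding model.
line = to_feed_line(1, 42, "Invoice 2026-001", "Invoice for bookkeeping services", [0.0] * 1536)
parsed = json.loads(line)
print(parsed["put"])
```

Writing one such line per document into docs.jsonl gives a file that vespa feed can consume.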
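Search via HTTP (shown unencoded for readability):
    GET /search/?yql=select * from doc where userQuery() or ({targetHits:10}nearestNeighbor(embedding,q))&query=invoice&input.query(q)=embed(@query)&ranking=semantic
Here embed(@query) embeds the text of the query parameter and requires an embedder component configured in services.xml. With externally computed embeddings, the vector itself is passed as input.query(q).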
The YQL syntax (Vespa Query Language) combines full-text search, ANN vector search, and filters. The ranking profile determines how hits are combined.
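In practice the request has to be URL-encoded. A minimal sketch, assuming the query vector was computed externally and is passed as a tensor literal; the helper name build_search_query is hypothetical, and the three-element vector is shortened for readability (the schema expects 1536 values).

```python
from urllib.parse import parse_qs, urlencode


def build_search_query(text: str, query_vector: list[float], hits: int = 10) -> str:
    """Return the path and query string for a hybrid text + nearest-neighbour search."""
    yql = (
        "select * from doc where userQuery() or "
        f"({{targetHits:{hits}}}nearestNeighbor(embedding,q))"
    )
    params = {
        "yql": yql,
        "query": text,          # consumed by userQuery()
        "ranking": "semantic",  # rank profile from the schema
        # Indexed tensor in short literal form: [v0,v1,...]
        "input.query(q)": "[" + ",".join(str(v) for v in query_vector) + "]",
        "hits": hits,
    }
    return "/search/?" + urlencode(params)


url = build_search_query("invoice", [0.1, 0.2, 0.3])
decoded = parse_qs(url.split("?", 1)[1])
print(decoded["yql"][0])
```

The returned string can be appended to the endpoint of the container cluster and sent with any HTTP client.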
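More complex ranking profiles can be multi-phase:
    rank-profile hybrid {
        first-phase {
            expression: 0.5 * bm25(content) + 0.5 * closeness(field, embedding)
        }
        second-phase {
            expression: onnx(crossencoder).score
            rerank-count: 100
        }
    }
The profile assumes an ONNX model declared under the name crossencoder with an output named score, and bm25(content) requires index: enable-bm25 on the content field. The cross-encoder then runs as reranker on the top 100 hits of the first phase (per content node). This composition is Vespa's strength.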
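The idea behind phased ranking can be illustrated in plain Python: a cheap score orders all candidates, and only the head of that list is rescored with the expensive function. This is a simplified sketch of the concept, not Vespa's implementation; the function two_phase_rank and the field names cheap and costly are made up for the example.

```python
from typing import Callable


def two_phase_rank(docs: list[dict],
                   first_phase: Callable[[dict], float],
                   second_phase: Callable[[dict], float],
                   rerank_count: int) -> list[dict]:
    """Order docs by a cheap score, then rerank only the top rerank_count."""
    # Phase 1: cheap score for every candidate, best first.
    ranked = sorted(docs, key=first_phase, reverse=True)
    # Phase 2: expensive score only for the head of the list.
    head = sorted(ranked[:rerank_count], key=second_phase, reverse=True)
    # Candidates that were not reranked stay below the head in first-phase order.
    return head + ranked[rerank_count:]


docs = [
    {"id": "a", "cheap": 0.9, "costly": 0.1},
    {"id": "b", "cheap": 0.8, "costly": 0.7},
    {"id": "c", "cheap": 0.2, "costly": 0.99},
]
result = two_phase_rank(docs, lambda d: d["cheap"], lambda d: d["costly"], rerank_count=2)
ids = [d["id"] for d in result]
print(ids)  # ['b', 'a', 'c']
```

Document c has the best expensive score but never reaches the second phase. This is the trade-off behind tuning rerank-count: a larger value costs more model evaluations but lets fewer good candidates slip through.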
Cluster setup: a Vespa cluster has at least three node types – config server, container (query/stateless), content (document storage with sharding). High availability needs 3+ config servers (ZooKeeper consensus) and multiple content nodes with replication.
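Why three config servers and not two follows from majority-quorum arithmetic, which is how ZooKeeper-style consensus works. The helper names below are illustrative only.

```python
def quorum_size(servers: int) -> int:
    """Smallest majority of an ensemble with the given number of servers."""
    return servers // 2 + 1


def tolerated_failures(servers: int) -> int:
    """How many servers may fail while a majority can still be formed."""
    return servers - quorum_size(servers)


for n in (1, 2, 3, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

Two servers tolerate no failure at all, three tolerate one, and five tolerate two. An even-sized ensemble adds cost without adding fault tolerance over the next smaller odd size.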
Vespa to production in 5 steps
- 01 Plan cluster architecture: config server (3 for HA), container nodes (stateless layer), content nodes (document storage with sharding/replication). At least 3 servers for production.
- 02 Structure the application package: schemas/ with document type, services.xml with cluster configuration, ranking profiles inside the schemas. ONNX models in models/.
- 03 Embedding pipeline: compute embeddings externally (OpenAI, Cohere, local sentence-transformers); feed documents in Vespa JSON format via vespa feed.
- 04 Define ranking profile: first-phase for fast top-N preselection, second-phase for more expensive rerankers (e.g. cross-encoder ONNX). Optimise rerank-count.
- 05 Monitoring and backup: Vespa metrics via Prometheus (latency, throughput, indexing queue); content cluster backups, for example by exporting documents with vespa visit; test recovery.
When to use Vespa
Vespa fits when (a) complex multi-signal ranking pipelines are needed, (b) vector search, BM25, filters, and ML models are combined in a single query, (c) data volume sits permanently above 50M documents, or (d) very high QPS (> 1000/s) at low latency (< 50 ms) is required.
Concrete cases: a legal platform with precedent search, where full text, semantic similarity, jurisdiction, recency, and client personalisation must be weighted together. A fiduciary platform with 500,000 industry profiles that should match clients with similar histories – tensor operations allow multi-vector distances in one score. A compliance system over sanctions lists and news streams that combines BM25 keyword hits with embedding similarity and time decay.
Vespa also fits ML-driven ranking applications: a pre-trained cross-encoder model gets embedded as ONNX into Vespa and executed as second-phase reranker. This saves the roundtrip to an external ML service.
Vespa Cloud (managed) covers setups without Kubernetes. The Europe region in Sweden has been available since 2023. With a transfer impact assessment (TIA), it is acceptable for Swiss client data; setups with strict Swiss hosting requirements remain self-hosted.
The team profile is decisive: Vespa pays off when at least one developer brings search-engine experience with Lucene, Elasticsearch, or Solr. Anyone starting from zero needs 2-4 weeks to use Vespa productively – against 1-2 days for Qdrant.
When not to use
For simple semantic search below 5M vectors, Vespa is oversized. Qdrant, Weaviate, or pgvector are usable in a fraction of the time at similar or better performance at this scale.
When the team has no search-engine experience, Vespa is a hard choice. YQL syntax, schema definition with tensor types, ranking profiles, cluster setup with config server, container, and content nodes – these are four separate concepts that all must be learned. A five-person fiduciary office typically lacks the capacity.
For pure vector search without multi-signal needs, Vespa offers no advantage. Anyone who only needs "find 10 similar vectors", without BM25 or time decay, gets there sooner and with a simpler setup on Qdrant or pgvector, at comparable query speed.
For multi-modal with text-and-image embeddings in one collection, Weaviate offers better tools. Vespa can do it but requires custom tensor definitions and schema logic, which Weaviate ships as a module.
For small on-prem installations without Kubernetes, Vespa is hard. A single-node setup exists but is not the primary path – most Vespa setups run in production as a cluster. Anyone seeking a single-node setup with easy maintenance is better served by Qdrant.
For use cases with frequent updates on individual documents (e.g. a user profile changed per request), Vespa is not the best fit. Vespa follows a document-store model – re-indexing on updates is efficient, but not as efficient as a pure vector update operation in Qdrant.
Trade-offs
STRENGTHS
- Tensor ranking enables complex multi-signal score computation in one query
- Hybrid full-text+vector+ML-model in a single pipeline
- Scales horizontally to hundreds of millions of documents, operationally proven at Yahoo
- Apache 2.0, Vespa Cloud with EU region (Sweden)
WEAKNESSES
- Steep learning curve – YQL, tensor schema, cluster setup, ranking profiles
- Cluster operation with config server, container, content nodes requires Kubernetes know-how
- Single-node variant is not the primary path – overkill for small setups
- Community smaller than Qdrant/Weaviate, fewer tutorials and Stack Overflow answers
FAQ
How does Vespa differ from Elasticsearch?
Vespa is built from the ground up for ranking with ML models; Elasticsearch is a full-text index that received ANN vector search and ML inference later. Vespa treats tensor operations as a first-class concept; Elasticsearch does not support them at the same depth. License: Vespa is Apache 2.0; Elasticsearch is offered under Elastic License v2, SSPL, or (since 2024) AGPLv3. Operationally, Elasticsearch is in more stacks; Vespa has a smaller but more specialised community.
What does Vespa Cloud cost?
As of May 2026: Vespa Cloud bills by resource consumption (vCPU-hours, memory-GB-hours, storage-GB). A starter instance with small data starts at USD 100/month; production setups typically cost USD 1,000-5,000/month. The enterprise tier with SLA, BYOC, and EU-region guarantee is priced on request. Self-hosted is cheaper in direct comparison, but operational overhead must be factored in.
How steep is the learning curve for Vespa?
Noticeably so. As a rule of thumb, a developer without search-engine background needs 2-4 weeks to formulate a production-grade Vespa query and tune a ranking profile. With prior Lucene/Elasticsearch experience, 1-2 weeks. Documentation is good but deep – concepts like tensor operations, YQL, ranking profiles, and multi-phase ranking need practice. For planning: a first proof of concept takes 2-4 days, production maturity 3-6 weeks.