
mxbai-embed: compact Apache 2.0 embedding model for edge setups

mxbai-embed-large-v1 from Mixedbread AI is a 1024-dim open-source embedding model, ONNX-capable and fast on edge hardware.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What is mxbai-embed?

mxbai-embed is an embedding model family from Mixedbread AI, a startup founded in Berlin in 2024 specialising in compact, high-quality open-source models. The name mxbai stands for mixedbread.ai, a nod to the mix of open-source and commercial tiers the company operates on. The models are released under Apache 2.0 and free to use; in parallel, Mixedbread offers a paid API running on its own cloud infrastructure in Germany.

The most used variant in May 2026 is mxbai-embed-large-v1: a 335 million parameter model on a BERT-large base producing 1024-dimensional vectors. Alongside it are mxbai-embed-2d-large-v1 with Matryoshka truncation (1024-dim truncatable to 512/256/128) and mxbai-embed-xsmall-v1, a very small model for edge cases. The family is small and tidy; this is part of the Mixedbread philosophy: few, well-maintained models instead of large families.

On MTEB, mxbai-embed-large-v1 ranks near the top among sub-1B parameter models: particularly strong on English, decent on German and French, slightly weaker on Italian. Multilingual coverage is narrower than BGE-M3 or Jina v3; the model was trained primarily on English MS MARCO and Quora corpora with additional multilingual material.

The core advantage in May 2026: compact model size (about 670 MB) plus excellent ONNX performance. With ONNX runtime, mxbai-embed runs on a standard Intel Xeon at 40-80 embeddings per second, faster than most 500M models. It is also fast on Apple Silicon via the MLX backend or llama.cpp. The model is ideal for edge setups, mobile embeddings, and backend services with high throughput on limited hardware.

Why it matters for Switzerland

Three arguments speak for mxbai-embed in Swiss setups. First, price/performance in self-hosting. Anyone running a RAG stack on a small Hetzner VM (CPX31 or CCX13) without a GPU gets 1024-dim embeddings from mxbai-embed that are fully competitive for English-dominated work such as audit reports or English contracts. The ONNX variant runs on 8 cores at about 60 embeddings per second, enough for any SME load.

Second, the Apache 2.0 licence and Berlin origin. Mixedbread AI is a German company from the Berlin ML ecosystem. For anyone who wants an EU-native vendor with full self-hosting freedom, this is the combination to look for. In a client pitch it works as a double argument: German origin (politically clean) and Apache 2.0 (legally clean).

Third, the Matryoshka variant mxbai-embed-2d. Anyone short on storage or planning a very large collection (over 10M vectors) can truncate to 512 or 256 dimensions and halve or quarter storage. On BEIR the recall loss at 512 is typically 1-2 points – barely measurable in fiduciary contexts. A 5M-vector collection then fits on a single standard VM.
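
The storage effect of truncation can be checked with simple arithmetic. The sketch below assumes vectors are stored as float32 (4 bytes per dimension) and counts only the raw vector payload; index structures and payload fields add overhead that is not modelled here. The function name is illustrative.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as float32


def raw_vector_storage_gb(num_vectors: int, dim: int) -> float:
    """Raw vector payload in decimal gigabytes, excluding index overhead."""
    return num_vectors * dim * BYTES_PER_FLOAT32 / 1e9


# Compare the full dimension with the two truncated Matryoshka sizes.
for dim in (1024, 512, 256):
    size_gb = raw_vector_storage_gb(10_000_000, dim)
    print(f"{dim:>4} dims, 10M vectors: {size_gb:.2f} GB")
```

For 10M vectors this works out to roughly 41 GB at 1024 dimensions, 20 GB at 512 and 10 GB at 256, before any index overhead.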

For purely German-speaking corpora with high client load, BGE-M3 remains the stronger pick. mxbai plays its cards in two profiles: English-dominated setups and edge architectures under hardware constraints. In the fiduciary-typical DE/FR/EN mix it is a solid contender.

How it works

mxbai-embed-large-v1 builds on a modified BERT-large architecture with 24 layers and 1024 hidden size. The model was trained in classic contrastive style on a mix of MS MARCO, Quora question pairs, and 4-language web corpora. Unlike E5 or Nomic, mxbai needs no prefix on documents. For retrieval, however, the model card recommends prepending the prompt "Represent this sentence for searching relevant passages: " to queries only; for symmetric tasks such as similarity or clustering, both sides use the same plain input.

For standard setups via sentence-transformers:

```python
from sentence_transformers import SentenceTransformer

# Loads the model from the Hugging Face Hub (downloaded on first use).
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

documents = [
    "Client contests the invoice over item 5.",
    "Le client conteste la facture concernant la position 5.",
]

# normalize_embeddings=True returns unit-length vectors.
vectors = model.encode(
    documents,
    normalize_embeddings=True,
)
```

Normalisation is standard and free; it makes cosine similarity equal to dot product and gives a small performance edge in Qdrant.
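
Why normalisation makes cosine similarity and dot product interchangeable can be seen on a toy example. The sketch below uses plain Python lists and two made-up 3-dimensional vectors instead of real model output; the helper names are illustrative.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed on the raw, unnormalised vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for embeddings (not real model output).
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]
a_n, b_n = l2_normalize(a), l2_normalize(b)

# Both values agree up to floating-point rounding.
print(cosine(a, b), dot(a_n, b_n))
```

Once vectors are unit length, the division by the norms is a division by 1, so the cheaper dot product gives the same ranking and the same score.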

For production we recommend ONNX. Mixedbread publishes pre-converted ONNX files (mixedbread-ai/mxbai-embed-large-v1, branch onnx). In a slim FastAPI wrapper with ONNX runtime mxbai runs on an Intel Xeon E5-2690 at about 80 embeddings per second for 256-token inputs – faster than most 500M models.

For Apple Silicon setups mxbai-embed runs via MLX backend or llama.cpp. On an M3 Pro: 200-300 embeddings per second via MLX. That makes the model a sensible alternative to Nomic Embed v2 for Mac-centric setups when 1024 dimensions are wanted instead of 768.

The Matryoshka variant mxbai-embed-2d-large-v1 works like this: the model was trained so that the first N dimensions are particularly informative. Truncation to 512 or 256 is a simple slice of the output vectors. In Qdrant you create a collection with the desired dimension and slice before upsert. Measure the recall loss on your own eval suite.
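
The slice itself is a one-liner. The sketch below adds one step the text above does not mention and which is an assumption on our side: re-normalising after the slice, so that cosine and dot product stay equivalent on the shortened vectors. It works on plain Python lists; the 8-dimensional input is a toy stand-in for a 1024-dim embedding.

```python
import math


def truncate_and_renormalize(vec, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # all-zero slice: nothing to rescale
    return [x / norm for x in head]


# Toy 8-dim vector standing in for a full-size embedding.
full = [0.6, 0.0, 0.8, 0.0, 0.0, 0.0, 0.0, 0.0]
short = truncate_and_renormalize(full, 4)
print(len(short), short)
```

Apply the same function to documents before upsert and to queries before search, with the same target dimension as the collection.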

For the commercial Mixedbread API (api.mixedbread.com) an OpenAI-like schema applies. The cloud API costs EUR 0.05 per 1M tokens in May 2026 – cheaper than Cohere and Mistral but more expensive than Voyage Lite. The cloud runs in Frankfurt (eu-central-1). DPA contract is available in German.
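
A quick cost estimate at the quoted price is plain arithmetic. In the sketch below, only the price of EUR 0.05 per 1M tokens comes from the text; the document count and average token length are hypothetical example values.

```python
PRICE_EUR_PER_MILLION_TOKENS = 0.05  # price quoted above, as of May 2026


def embedding_cost_eur(num_docs, avg_tokens_per_doc,
                       price=PRICE_EUR_PER_MILLION_TOKENS):
    """Estimated API cost in EUR for embedding a corpus once."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * price


# Hypothetical corpus: 200,000 chunks of about 400 tokens each.
print(f"EUR {embedding_cost_eur(200_000, 400):.2f}")
```

At that volume (80M tokens) a full indexing run costs about EUR 4; re-embedding after a model change costs the same again.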

mxbai-embed to production in 5 steps

01 Pick the variant: -large-v1 for standard 1024-dim, -2d-large-v1 for Matryoshka with 512/256/128, -xsmall for extreme edge cases.
02 Choose the inference path: sentence-transformers for prototype, ONNX runtime for production, MLX backend on Apple Silicon.
03 Build the FastAPI wrapper: POST /v1/embeddings with batches of 16-32 texts, normalisation on by default, logging without content persistence.
04 Create a Qdrant collection with dimension=1024 (or 512/256/128 for Matryoshka), distance=cosine, payload indexes on client, doc_type, language.
05 Eval suite with 30-50 real Q/document pairs: measure Recall@5, document the comparison against BGE-M3 or multilingual-e5-large, final per-language choice.
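
The Recall@5 measurement from step 05 can be sketched in a few lines. The version below assumes each question has exactly one relevant document, so Recall@5 reduces to a hit rate; the document IDs are hypothetical, and the retrieved lists would come from your own vector search.

```python
def recall_at_k(retrieved_ids, relevant_id, k=5):
    """1.0 if the relevant document is among the top k results, else 0.0."""
    return 1.0 if relevant_id in retrieved_ids[:k] else 0.0


def mean_recall_at_k(results, k=5):
    """Average over (retrieved_ids, relevant_id) pairs, one per question."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(recall_at_k(ids, rel, k) for ids, rel in results) / len(results)


# Hypothetical search output for three eval questions.
results = [
    (["d7", "d2", "d9", "d1", "d4", "d3"], "d2"),  # hit at rank 2
    (["d5", "d8", "d6", "d0", "d1", "d3"], "d3"),  # relevant doc at rank 6: miss
    (["d4", "d1", "d2", "d7", "d9", "d0"], "d4"),  # hit at rank 1
]
print(mean_recall_at_k(results, k=5))
```

Run the same question set against each candidate model and each language separately; with 30-50 pairs, a single question moves the score by 2-3 points, so small differences should be read with care.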

When to use mxbai-embed

mxbai-embed is the right choice when (a) a 1024-dim self-host model on small hardware is wanted, (b) the corpus is English-heavy, (c) Matryoshka truncation saves storage, or (d) an EU vendor with Apache 2.0 plus optional cloud API is required.

Concrete cases: a Swiss auditor with US-oriented clientele indexing audit reports in English. An SME marketing department building a product knowledge portal in DE/EN/FR and preferring edge architecture with small VMs. An office on Apple Silicon Macs wanting to use MLX performance for local inference.

In direct comparison with Nomic Embed: mxbai is 1024-dim (Nomic 768-dim), stronger on English (Nomic stronger multilingual), and needs no prefix on documents (Nomic prefixes both queries and documents). Whoever wants to standardise on 1024 dimensions in Qdrant and has an English-dominated corpus picks mxbai. Whoever prioritises multilinguality and accepts 768-dim picks Nomic.

mxbai also fits hybrid stacks well. A central ONNX inference VM runs mxbai-embed-large-v1, and several applications call it via a slim HTTP endpoint. The embedding service is centralised, so every application gets vectors from the same model with the same normalisation. That is operationally simpler than loading a separate model per application.
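
On the application side, calling such a central service can look like the sketch below. It uses only the Python standard library. The URL is a hypothetical internal address, and both the request body ({"input": [...]}) and the response shape ({"data": [{"embedding": [...]}]}) are assumptions modelled on an OpenAI-like schema; adapt them to whatever your own wrapper returns.

```python
import json
import urllib.request

# Hypothetical internal address of the central embedding service.
EMBEDDING_URL = "http://embeddings.internal:8000/v1/embeddings"


def batched(texts, batch_size=32):
    """Yield consecutive slices of at most batch_size texts."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


def build_request(batch, url=EMBEDDING_URL):
    """Build a JSON POST request; the body schema is an assumption."""
    body = json.dumps({"input": batch}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed_all(texts, batch_size=32):
    """Send all texts in batches and collect the vectors in input order."""
    vectors = []
    for batch in batched(texts, batch_size):
        with urllib.request.urlopen(build_request(batch), timeout=30) as resp:
            payload = json.load(resp)
        # Assumed response shape: {"data": [{"embedding": [...]}, ...]}
        vectors.extend(item["embedding"] for item in payload["data"])
    return vectors
```

Keeping batching in a shared client helper means every application sends the same batch sizes and none of them needs to load model weights.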

When not to use

When maximum quality on German or French counts, BGE-M3 is the stronger pick. On MTEB-DE and MTEB-FR mxbai-embed-large-v1 trails BGE-M3 by 2-3 points – noticeable in a setup with many clients and fine recall differences.

If you want hybrid retrieval (dense plus sparse) from a single model, mxbai does not offer it, unlike BGE-M3, which delivers both in one call. You would have to run BM25 separately via Tantivy or Elasticsearch.

If you vectorise very long documents – contracts above 5000 tokens, full rulings – mxbai's 512-token context limit is a bottleneck. Jina v3 at 8192 tokens or BGE-M3 are better there.
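
If you stay with mxbai anyway, long documents have to be split into windows that fit the limit. The sketch below is one simple way to do that on a list of token IDs, using overlapping windows. The window size of 510 (512 minus the two special tokens a BERT-style tokenizer adds) and the overlap of 64 are illustrative assumptions, not recommendations from Mixedbread.

```python
def chunk_token_ids(token_ids, max_tokens=510, overlap=64):
    """Split a token ID list into overlapping windows of at most max_tokens."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_tokens])
        if start + max_tokens >= len(token_ids):
            break  # this window already reaches the end of the document
    return chunks


# Stand-in for a tokenised 5000-token contract.
contract = list(range(5000))
chunks = chunk_token_ids(contract)
print(len(chunks), [len(c) for c in chunks][-1])
```

Each chunk is embedded and stored as its own vector, with the source document ID in the payload, so a hit on any chunk leads back to the full contract.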

If you bet on maximum standard-framework integration – LangChain, LlamaIndex, Haystack – mxbai is less present than BGE-M3 or OpenAI. Sample notebooks are sparser, community knowledge smaller. A slight disadvantage in a first RAG build, not a showstopper.

Trade-offs

STRENGTHS

Apache 2.0, compact (335M parameters, 670 MB), edge-capable
Very strong ONNX and MLX performance on CPU and Apple Silicon
Matryoshka variant for storage optimisation available
EU vendor (Berlin), optional managed cloud in Frankfurt

WEAKNESSES

On DE/FR 2-3 points behind BGE-M3, not top multilingual
No hybrid retrieval (no sparse or multi-vector like BGE-M3)
Context limit 512 tokens – weak for very long documents
Smaller community than BGE or E5 – fewer tutorials and snippets

FAQ

How does mxbai-embed compare to BGE-M3?

BGE-M3 is stronger on multilingual content and delivers dense plus sparse in one call. mxbai is more compact (335M vs 568M parameters), faster in ONNX, and slightly better on English. For pure dense embeddings on English, mxbai is the leaner pick; for multilingual DE/FR/IT, BGE-M3.

Which Matryoshka truncation makes sense?

Rule of thumb: 1024 default; 512 if storage matters (1-2 point recall loss); 256 for extreme cases (4-7 point loss). Below 256 rarely useful – vectors get too noisy. Always measure on your own eval suite, do not trust MTEB numbers.

How high is latency on a CPU VM?

On a CPX31 (4 vCPU AMD EPYC) with ONNX runtime: 60-80 embeddings/s in batch mode, latency per single embedding 15-25 ms. On a CCX23 (8 vCPU dedicated): 120-150 embeddings/s, latency 8-15 ms. Both enough for fiduciary RAG loads.
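
To turn these throughput figures into a planning number, divide the corpus size by the rate. The sketch below does exactly that; the corpus size of 500,000 chunks is a hypothetical example, and the rates are the lower bounds quoted above.

```python
def indexing_hours(num_chunks, embeddings_per_second):
    """Wall-clock hours to embed num_chunks at a sustained throughput."""
    if embeddings_per_second <= 0:
        raise ValueError("embeddings_per_second must be positive")
    return num_chunks / embeddings_per_second / 3600


# Hypothetical corpus of 500,000 chunks at the two quoted lower bounds.
for label, rate in (("CPX31", 60), ("CCX23", 120)):
    print(f"{label}: {indexing_hours(500_000, rate):.1f} h")
```

At 60 embeddings per second the initial indexing of such a corpus takes a little over two hours; day-to-day query load is far below that.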

Can I use the Mixedbread cloud API commercially?

Yes. The cloud API costs EUR 0.05 per 1M tokens and is hosted in the EU (Frankfurt). A German-language DPA and contract addenda for mandates subject to professional secrecy are available. Running the Apache 2.0 model self-hosted in parallel is possible: the vectors are compatible, so no migration step between cloud and self-host is needed.

Sources

Mixedbread AI – mxbai-embed-large-v1 model card · 2026-05
Mixedbread blog – Matryoshka 2D embeddings explained · 2026-04
ONNX Runtime – embedding-model performance reference · 2026-05
MTEB Leaderboard – Massive Text Embedding Benchmark · 2026-05
