
Nomic Embed: locally runnable open-source embedding model

Nomic Embed v2 is an Apache 2.0 model with 768 dimensions, transparent training-data documentation, and excellent local performance on Mac and Linux.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What is Nomic Embed?

Nomic Embed is a model family from Nomic AI, an open-source company founded in 2022 in New York dedicated to democratising AI components. Unlike most embedding providers, Nomic publishes its models, training code, and even training data in full – a detail that makes a real difference for audit-capable RAG setups.

The current generation in May 2026 is nomic-embed-text-v2, a mixture-of-experts model with roughly 305 million active parameters (475M total). It produces 768-dimensional vectors with Matryoshka truncation to 256 or 128 dimensions and supports about 100 languages, with a focus on English, Spanish, French, German, and Chinese. On MTEB-DE, v2 sits roughly level with multilingual-e5-base, behind the top models BGE-M3 and Cohere embed-multilingual-v3.

Nomic Embed's trump card in May 2026 is not the top benchmark score but the combination of three properties. First: an Apache 2.0 licence for model, code, and data, so anyone who needs a complete audit trail can reproduce the training. Second: local runnability without a GPU. The model runs via Ollama, llama.cpp, or ONNX Runtime on Apple Silicon Macs (M1, M2, M3, M4) surprisingly fast, typically 50-100 embeddings per second on an M2 Air. Third: small model size, about 540 MB in float16, which fits in the RAM of a standard laptop.

This makes Nomic Embed the model of choice for privacy-by-distance setups: embedding happens directly on a lawyer's or fiduciary's notebook without any document ever reaching a server. The vector is then sent encrypted to a central Qdrant instance, while the original stays local. This architecture is increasingly requested in especially sensitive mandates (notary work, family offices, confidential arbitration).
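The following is a minimal sketch of what "only the vector leaves the device" could look like in practice. It builds a point in the `id` / `vector` / `payload` shape that Qdrant accepts, deliberately without any document text. The function name `build_point`, the payload keys `doc_ref` and `chunk_index`, and the idea of hashing the local path are assumptions made for illustration; only `client` and `doc_type` come from the article.

```python
import hashlib
import uuid


def build_point(doc_path, chunk_index, vector, client, doc_type):
    """Build a Qdrant-style point that carries no document text."""
    # Hash the local path so even the file name stays on the device
    # (an assumption of this sketch, not something the article prescribes).
    doc_ref = hashlib.sha256(doc_path.encode("utf-8")).hexdigest()
    # Deterministic ID: re-indexing the same chunk overwrites the old point.
    point_id = str(uuid.uuid5(uuid.NAMESPACE_URL, f"{doc_ref}:{chunk_index}"))
    return {
        "id": point_id,
        "vector": list(vector),
        "payload": {
            "client": client,
            "doc_type": doc_type,
            "doc_ref": doc_ref,
            "chunk_index": chunk_index,
        },
    }
```

The local machine keeps its own mapping from `doc_ref` back to the file, so a search hit on the central instance can be resolved to the original only on the device that holds it.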

Why it matters for Switzerland

Three points make Nomic Embed interesting in the Swiss context. First, audit capability. In a formal compliance review – say for an audit body under Art. 957a CO or an external FINMA inspection – you must be able to demonstrate that the embedding model used was trained on known data. With OpenAI or Cohere the training-data list is confidential. With Nomic it is published. In the compliance brief this is a clear advantage.

Second, local performance on Apple Silicon. Swiss fiduciary and law firms use Macs disproportionately often (source: ZHAW 2025 study on IT in the professions). On an M2 Pro or M3 Pro, Nomic Embed runs via Ollama in the background with minimal energy use. An embedding pipeline across 1000 documents takes roughly 15 minutes on an M3 Air, which is fast enough for ad-hoc indexing during the mandate.

Third, the Apache 2.0 licence and small footprint. Anyone who wants to build a RAG pipeline so small that maintenance effort stays minimal even after 10 years is well served by Nomic. No vendor API, no update threat, no multi-GB container. A simple model, a simple inference loop, done.

For pure server setups or for corpora with a DE/FR focus there are better options: BGE-M3 is stronger, and so is multilingual-e5-large. Nomic plays its cards when the architecture is edge-oriented or transparency is a high requirement. In a setup designed around professional secrecy and nFADP compliance, training transparency is an argument that carries weight when a client asks about the AI architecture.

How it works

Nomic Embed v2 uses a mixture-of-experts setup on a BERT-like encoder base. For each input token, two of eight experts activate, which gives efficient inference on a moderate parameter budget. The model was trained in contrastive style on a mix of publicly documented corpora: MS MARCO, Natural Questions, Multilingual Wikipedia, and a collection of web-scraped question-passage pairs.

Important at inference: Nomic v2 expects a task prefix, similar to multilingual-e5. Document inputs start with "search_document:", queries with "search_query:". Other prefixes: "classification:" and "clustering:". Skipping the prefix costs recall.
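The prefix convention can be enforced in one place so that no caller forgets it. The sketch below is one possible way to do that; the helper name `with_task_prefix` and the `TASKS` constant are hypothetical, while the four prefix strings are the ones listed above.

```python
# The four task prefixes named in the text.
TASKS = ("search_document", "search_query", "classification", "clustering")


def with_task_prefix(text, task):
    """Return the text with its task prefix, e.g. 'search_query: ...'."""
    if task not in TASKS:
        raise ValueError(f"unknown task {task!r}, expected one of {TASKS}")
    return f"{task}: {text.strip()}"


def prepare_documents(texts):
    """Prefix a batch of document chunks before embedding."""
    return [with_task_prefix(t, "search_document") for t in texts]
```

Routing every embedding call through a helper like this turns a silent recall loss into an explicit error when an unknown task name is passed.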

The easiest local integration is via Ollama:

```bash
ollama pull nomic-embed-text:latest
```

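```python
import requests

# Ollama's embed endpoint accepts a list of inputs and returns one vector per input.
resp = requests.post(
    "http://localhost:11434/api/embed",
    json={
        "model": "nomic-embed-text",
        "input": [
            # Documents carry the "search_document:" task prefix.
            "search_document: Client contests the Q3 2025 account statements.",
            "search_document: Le client conteste les relevés de compte du T3 2025.",
        ],
    },
    timeout=60,
)
resp.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error body
vectors = resp.json()["embeddings"]
```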
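The query side mirrors the document side, with the "search_query:" prefix and a similarity ranking. The sketch below assumes the same local Ollama endpoint and model name as the call above; the helper names `embed`, `cosine`, and `rank` are hypothetical. It uses only the standard library, so the ranking functions can be tried without a running server.

```python
import json
import math
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embed"  # same endpoint as above


def embed(texts, model="nomic-embed-text"):
    """Send already-prefixed texts to a local Ollama server, return vectors."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["embeddings"]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec, doc_vecs):
    """Return document indices, most similar first."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    docs = embed(["search_document: Client contests the Q3 2025 account statements."])
    query = embed(["search_query: disputed account statements"])[0]
    print(rank(query, docs))
```

In production the ranking would normally be done by the vector database; the local `rank` function is mainly useful for quick sanity checks of the prefix logic.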

For Mac setups Ollama is the dominant path. On Linux/server setups the model runs equally well via Ollama or directly through HuggingFace Transformers or llama.cpp.

For maximum performance on a server CPU, ONNX Runtime is recommended. Nomic publishes ONNX variants on HuggingFace (nomic-ai/nomic-embed-text-v2-moe). In a slim FastAPI wrapper the model runs on an AMD EPYC server at 100-150 embeddings per second, faster than most other multilingual models because mixture-of-experts activates fewer parameters per token.

Matryoshka truncation allows cutting to 256 or 128 dimensions without re-embedding. At 128 dimensions the vector is extremely small (512 bytes per point at float32, 128 bytes with 8-bit quantisation), which is ideal for mobile or edge setups under storage pressure. Recall loss at 256 is typically 2-4 points versus full 768-dim; at 128 it is typically 6-10 points. Anyone saving storage must measure this trade-off.
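As an illustration of the truncation step and the storage arithmetic, here is a minimal sketch. It assumes that keeping the first N components and re-normalising to unit length is sufficient for cosine search; the function names `truncate_embedding` and `bytes_per_vector` are hypothetical, and the model card should be checked for any additional step the model expects.

```python
import math


def truncate_embedding(vector, dims):
    """Keep the first `dims` components and L2-normalise the result."""
    if dims > len(vector):
        raise ValueError("cannot truncate to more dimensions than the vector has")
    head = list(vector[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # all-zero vector: nothing to normalise
    return [x / norm for x in head]


def bytes_per_vector(dims, bytes_per_value=4):
    """Raw storage per point: 4 bytes per value for float32, 1 for int8."""
    return dims * bytes_per_value


print(bytes_per_vector(768))     # 3072 bytes at float32
print(bytes_per_vector(128))     # 512 bytes at float32
print(bytes_per_vector(128, 1))  # 128 bytes with 8-bit quantisation
```

Documents and queries must be truncated to the same dimension, and the Qdrant collection has to be created with that dimension.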

Nomic Embed to production in 5 steps

01 Pick the inference path: Ollama (Mac workstation or server), direct via Transformers/llama.cpp, or ONNX Runtime in a FastAPI wrapper.
02 Pull the model: ollama pull nomic-embed-text or model files from nomic-ai/nomic-embed-text-v2-moe on HuggingFace.
03 Add prefix logic: search_document for documents, search_query for queries. Test early: a forgotten prefix is the top rookie mistake.
04 Create the Qdrant collection with dimension=768 (or 256/128 for Matryoshka), distance=cosine, payload index on client and doc_type.
05 Run an eval suite against a baseline: 30-50 real question/document pairs, measure Recall@5 at full 768-dim vs. truncated, and document the storage-vs-quality trade-off.
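The eval in step 05 can be sketched in a few lines. The version below assumes that each test question has exactly one relevant document, in which case Recall@5 is simply the share of questions whose document appears in the top five results. The function names and the input shape (a list of `(ranked_ids, relevant_id)` pairs) are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_id, k=5):
    """1.0 if the relevant document is among the top k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def mean_recall_at_k(results, k=5):
    """Average Recall@k over (ranked_ids, relevant_id) pairs."""
    if not results:
        raise ValueError("no eval results given")
    return sum(recall_at_k(ranked, rel, k) for ranked, rel in results) / len(results)


def compare(full_results, truncated_results, k=5):
    """Recall@k for full vs. truncated vectors, plus the loss in points."""
    full = mean_recall_at_k(full_results, k)
    truncated = mean_recall_at_k(truncated_results, k)
    return {"full": full, "truncated": truncated, "loss_points": (full - truncated) * 100}
```

Running `compare` once per candidate dimension (768 vs. 256 vs. 128) on the same 30-50 pairs gives the numbers needed to document the trade-off.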

When to use Nomic Embed

Nomic Embed is the right choice when (a) the pipeline must run on endpoints (laptop, Mac) and not on server inference, (b) training-data transparency is a compliance requirement, (c) an extremely small model with a small footprint is required, or (d) LLM and embedding should be served from a single local Ollama setup.

Concrete cases: a Zurich or Geneva notary handling documents per mandate exclusively on the notary's MacBook, where embedding stays local and the original never leaves the device. A family office indexing wealth memos in a reproducibly documented pipeline with external audit oversight. A startup that allows only Apache 2.0 components in the stack for licensing reasons.

For standard RAG setups with a clear DE focus Nomic is not first pick – BGE-M3 or multilingual-e5-large are 2-4 points better on MTEB-DE. Nomic plays its cards in edge architecture, audit capability, and direct Ollama integration.

A special synergy emerges with local LLMs over Ollama. Anyone running Llama 3.3 70B, Qwen 2.5, or Gemma 3 locally can let Nomic Embed use the same Ollama instance: one model server, one update routine, one logging path. As of May 2026, this architectural simplification is increasingly common in Swiss mandates with strict data-protection requirements.

When not to use

When maximum recall on German or French counts, BGE-M3 is the better pick – 2-4 points more on MTEB-DE/FR. In a setup with a large vector DB and heavy client load that difference is noticeable.

If your inputs are mainly very long (contracts above 5000 tokens), Nomic v2 with 8192 token context is technically capable but not specifically optimised for long context. BGE-M3 or Jina v3 are more advisable there.

If your stack is entirely server-centric with no edge or endpoint component, the Nomic advantage is wasted. On pure server hardware multilingual-e5 is faster at comparable quality, BGE-M3 stronger at moderately higher load.

If you do not care about training-data transparency and only look at benchmark scores, Nomic is not in the multilingual top 5. Cohere embed-v3 or BGE-M3 are ahead.

Trade-offs

STRENGTHS

Apache 2.0, model + code + training data fully published
Very fast on Apple Silicon via Ollama – edge setups practical
Small (540 MB), small footprint, small hardware budget
Matryoshka truncation to 256 or 128 dimensions

WEAKNESSES

On MTEB-DE 2-4 points behind BGE-M3 and Cohere embed-v3
Prefix convention (search_document / search_query) is a common source of rookie mistakes
Only 768-dim, no higher-dimensional 1024-dim or 3072-dim variant
Multilingual solid but not at BGE-M3 level

FAQ

How fast is Nomic Embed on a MacBook?

On an M2 Air (8 GB RAM): roughly 50-100 embeddings per second via Ollama with 512-token inputs. On an M3 Pro with 18 GB: 150-250 per second. On a MacBook Pro with M4 Max: over 400 per second, comparable to a mid-range GPU. Apple Silicon is a sweet spot for Nomic's MoE architecture.
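Throughput depends on input length, batch size, and machine load, so it is worth measuring on the actual hardware. The sketch below is one hypothetical way to do that: `measure_throughput` times any embedding callable that takes a list of texts and returns a list of vectors. The function name, the batch size of 32, and the returned keys are assumptions of this example.

```python
import time


def measure_throughput(embed_fn, texts, batch_size=32):
    """Time embed_fn over texts in batches and report embeddings per second."""
    if not texts:
        raise ValueError("need at least one text to measure")
    count = 0
    start = time.perf_counter()
    for i in range(0, len(texts), batch_size):
        vectors = embed_fn(texts[i:i + batch_size])
        count += len(vectors)
    elapsed = time.perf_counter() - start
    per_second = count / elapsed if elapsed > 0 else float("inf")
    return {"count": count, "seconds": elapsed, "per_second": per_second}
```

Passing a function that calls the local Ollama endpoint as `embed_fn`, with a few hundred representative chunks, gives a figure that can be compared against the ranges above.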

Is the training data really fully published?

Yes, Nomic AI publishes both the datasets and datasheets following the "Datasheets for Datasets" standard by Gebru et al. Compliance audits can go through the list and verify it, with no OpenAI-style confidentiality. This is rare in the market and a clear pro for formal audits under Art. 957a CO or ISO 27001.

How does Nomic Embed compare to mxbai-embed-large?

Both are Apache 2.0, self-hostable, and small. Mxbai is 1024-dim and stronger in English; Nomic is 768-dim and stronger multilingually, with Matryoshka truncation. For DE/FR/IT-heavy corpora choose Nomic; for purely English corpora with storage optimisation choose mxbai. Both are very close to the quality of multilingual-e5-base.

Can I use Nomic Embed offline?

Yes, completely. Pull the model once via ollama pull or a HuggingFace download, then everything runs offline. No API calls and no telemetry enabled by default. A clear plus for notaries or clients with air-gap requirements.

Sources

Nomic AI documentation – nomic-embed-text-v2 model card · 2026-05
Nomic AI blog – Open-source training-data documentation · 2026-04
Ollama embeddings – nomic-embed-text integration · 2026-05
MTEB Leaderboard – Massive Text Embedding Benchmark · 2026-05
