Martian: model router with embedding classifier per request

Martian (withmartian.com) is a US cloud router that picks the best LLM per request via embedding comparison. Experimental, USD 0.50-2 per 1M tokens surcharge.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What is Martian?

Martian (withmartian.com) is a US-based startup headquartered in San Francisco, founded in 2023. The core idea: instead of hardcoding a model per request, the decision is delegated to a classifier. The classifier compares the embedding of the incoming request with stored profiles per model and picks the one that historically performs best for this request category. The result: theoretically lower cost and better answers, because expensive models are only invoked for complex requests.
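The comparison step can be illustrated with a minimal sketch. This is not Martian's implementation, which is proprietary: the profile vectors, the model names, and the pick_model helper are hypothetical, and real embeddings have hundreds of dimensions rather than three.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical stored profiles: one centroid per model, standing in for
# "requests this model has historically handled well".
PROFILES = {
    "small-model": [0.9, 0.1, 0.0],
    "large-model": [0.1, 0.2, 0.9],
}

def pick_model(request_embedding, profiles=PROFILES):
    """Return (model, similarity) for the profile closest to the request."""
    return max(
        ((name, cosine(request_embedding, centroid))
         for name, centroid in profiles.items()),
        key=lambda pair: pair[1],
    )

# A toy request embedding that lies close to the "small-model" profile.
model, score = pick_model([0.8, 0.2, 0.1])
print(model, round(score, 3))
```

The similarity value doubles as a rough confidence signal: a request that sits between two profiles yields two similar scores, which is exactly the case where the routing decision becomes hard to predict.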

The product is proprietary and cloud-only with hosting in the US (AWS us-east). As of May 2026, there is no self-host tier and no EU region. The endpoint is OpenAI-compatible (https://withmartian.com/api/openai/v1); in the model field you set router/martian-router-v3 or an explicitly chosen provider/model name. The router then picks among around 20 supported upstream models (Claude family, GPT family, Mistral, Llama, Gemini).

The business model is a USD markup per million tokens: typically USD 0.50 for light routing decisions, USD 1-2 with a full classifier inference pass. The actual model token price is passed through (no extra cut on the provider price). Overall Martian is more expensive than a simple direct provider call but can lower total cost with good classification, because requests sufficiently answered by a 7B model do not land on GPT-4o.
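Whether the markup pays off is simple arithmetic, sketched below. The prices are hypothetical placeholders, not provider list prices, and the model is deliberately simplified: it applies the markup to all tokens and ignores the difference between input and output token prices.

```python
def blended_cost_per_1m(share_cheap, price_cheap, price_expensive, markup):
    """Average USD per 1M tokens when share_cheap of traffic goes to the cheap model."""
    provider = share_cheap * price_cheap + (1 - share_cheap) * price_expensive
    return provider + markup

def break_even_share(price_cheap, price_expensive, markup):
    """Minimum share of traffic on the cheap model needed to offset the markup."""
    return markup / (price_expensive - price_cheap)

# Hypothetical prices in USD per 1M tokens.
CHEAP, EXPENSIVE, MARKUP = 0.25, 5.00, 1.00

baseline = EXPENSIVE  # everything sent directly to the expensive model
routed = blended_cost_per_1m(0.6, CHEAP, EXPENSIVE, MARKUP)
threshold = break_even_share(CHEAP, EXPENSIVE, MARKUP)

print(f"routed: {routed:.2f} vs baseline: {baseline:.2f} USD per 1M tokens")
print(f"break-even share on cheap model: {threshold:.1%}")
```

Under these assumed prices, routing only pays off once the router sends more than the break-even share of traffic to the cheap model; below that share, the markup outweighs the savings.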

For fairlane.systems client mandates, Martian remains an experimental tool as of May 2026. We have tested it in pilot setups but do not recommend it broadly: the classifier decision is not predictable, audit trails are weak, and US hosting is not appropriate for client data. In purely research-driven setups without PII, Martian can serve as a cost optimisation.

Why it matters conceptually

The core idea addresses a real problem. In typical LLM applications, most requests are simple (definition questions, short clarifications, fact queries) and only a small share is complex (multi-step reasoning, long synthesis, legal analysis). Anyone sending every request to Claude Opus pays the premium token rate for simple requests too. Anyone sending every request to Mistral 7B gets weak answers on the complex ones. A router that decides per request is, in theory, the optimal solution.

Practice is harder. First: classifier accuracy. Categorising a request as "simple" or "complex" is itself a non-trivial classification problem. Martian publishes benchmarks showing about 15-25% cost savings versus a pure GPT-4o configuration – at comparable or slightly better answer quality. These numbers are plausible but do not transfer to every use case; in fiduciary practice with domain-specific requests, classification can be worse because the training set does not fit.

Second: audit weakness. When the router assigns one request to Claude Opus and a near-identical second request to Mistral 7B, the reasoning behind the decision is opaque. For an auditor under Art. 957a CO, "the classifier decided" is not a sufficient answer. Martian returns a header with each response that names the chosen model and a confidence score, but compared with deterministic routing that still leaves a clear audit gap.

Third: lock-in. The classifier profile per model is internal and proprietary. Anyone migrating from Martian to LiteLLM cannot reproduce the routing style 1:1. Anyone wanting to keep a deterministic model choice should not replace it with a classifier.

Under the revised Swiss FADP, Martian must be viewed critically: it offers US hosting without an EU region, no Swiss-standard data processing agreement, and no self-host option. For client PII, Martian is not suitable; for open research and marketing queries without PII, it is an option when you explicitly want to measure cost optimisation.

How it works

Onboarding runs via the Martian dashboard. After account creation and topping up USD credits, you get an API key (sk-martian-...). Integration is OpenAI-compatible:

import openai

client = openai.OpenAI(
    api_key="sk-martian-...",
    base_url="https://withmartian.com/api/openai/v1",
)

resp = client.chat.completions.create(
    model="router",  # instead of a concrete model name
    messages=[{"role": "user", "content": "Please explain the revised FADP obligations for fiduciary offices."}],
    extra_body={
        "router_constraints": {
            "max_cost_per_1m_tokens": 5,
            "allowed_providers": ["mistral", "anthropic"],
        }
    },
)

print(resp.model, resp.choices[0].message.content)

Setting the model field to "router" activates the classifier logic. The router_constraints field (a proprietary extra field) allows restrictions such as a maximum token price, permitted providers, and a latency budget. The response reports the model actually used in resp.model (e.g. mistral/mistral-large-2411).

The classifier logic has two modes. In routing-lite mode, the embedding of the request is computed and compared with stored clusters, at a cost of USD 0.50 per 1M embedding tokens. In routing-full mode, an additional lightweight classifier model runs and delivers a complexity prediction, at a cost of USD 1-2 per 1M tokens. In both cases the final model is chosen based on the result.

Observability is built in. The dashboard shows a daily breakdown of chosen models, cost, and response latency. A compare mode sends the same request to multiple models in parallel (at a cost surcharge), which helps tune classifier profiles. Custom eval sets can be uploaded; the classifier is then fine-tuned on these sets.

Log retention defaults to 30 days; logs can be exported via API as CSV or JSON. A WORM compliance layer is not built in – anyone needing audit trails for Art. 957a CO must mirror the logs to their own S3 bucket with Object Lock.
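Mirroring exported logs into a write-once bucket could look roughly like the sketch below. ObjectLockMode and ObjectLockRetainUntilDate are real parameters of the S3 PutObject call; the bucket name, key layout, record fields, retention period, and the build_worm_put helper are assumptions for illustration. The target bucket must have Object Lock enabled.

```python
import hashlib
import json
from datetime import datetime, timedelta, timezone

def build_worm_put(record, bucket, retain_days=3650, now=None):
    """Build keyword arguments for an S3 PutObject call with Object Lock.

    retain_days and the key layout are assumptions; adjust them to the
    retention period that applies to your records.
    """
    now = now or datetime.now(timezone.utc)
    body = json.dumps(record, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(body).hexdigest()  # content hash as object name
    return {
        "Bucket": bucket,
        "Key": f"martian-logs/{now:%Y/%m/%d}/{digest}.json",
        "Body": body,
        "ObjectLockMode": "COMPLIANCE",  # retention cannot be shortened
        "ObjectLockRetainUntilDate": now + timedelta(days=retain_days),
    }

# Hypothetical log record; field names are not taken from the Martian export.
record = {"request_id": "example-1", "model": "mistral/mistral-large-2411", "cost_usd": 0.0042}
kwargs = build_worm_put(record, bucket="example-audit-bucket")

# With boto3 installed and credentials configured, the upload would be:
#   import boto3
#   boto3.client("s3").put_object(**kwargs)
print(kwargs["Key"])
```

Naming each object after the SHA-256 of its content means a later re-export of the same record maps to the same key, which makes tampering or duplication easy to spot.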

Martian comparison pilot in 5 steps

01 Open an account on withmartian.com, top up USD 50 credits, generate an API key (sk-martian-...).
02 Compile a test set of 100-200 real requests without PII, label them in two groups (simple/complex).
03 Run in parallel for one week: the application sends each request once to the Martian router and once to a static baseline model (e.g. Claude Sonnet).
04 Evaluation: classifier decisions per category, cost difference, answer quality via LLM-as-judge or human eval on 30 samples.
05 Decision: at real cost savings > 15% with comparable quality -> permanent use for non-PII workload; otherwise keep model choice static.
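Steps 04 and 05 can be sketched as a small evaluation script. The row fields, the sample values, the 1-5 judge scale, and the tolerance used for "comparable quality" are assumptions; routed_cost is meant to include the Martian markup.

```python
from collections import Counter

def evaluate_pilot(rows, min_savings=0.15, max_quality_drop=0.1):
    """Summarise a pilot run and apply the decision rule from step 05.

    Each row is a dict with: label ("simple"/"complex"), routed_model,
    routed_cost, baseline_cost, routed_score, baseline_score.
    """
    routed_cost = sum(r["routed_cost"] for r in rows)
    baseline_cost = sum(r["baseline_cost"] for r in rows)
    savings = 1 - routed_cost / baseline_cost
    # Mean score difference (router minus baseline) on the judge scale.
    quality_delta = sum(r["routed_score"] - r["baseline_score"] for r in rows) / len(rows)
    # Which model the router chose for each labelled category.
    models_by_label = Counter((r["label"], r["routed_model"]) for r in rows)
    adopt = savings > min_savings and quality_delta >= -max_quality_drop
    return {
        "savings": savings,
        "quality_delta": quality_delta,
        "models_by_label": models_by_label,
        "adopt": adopt,
    }

# Hypothetical pilot rows (costs in USD, scores on a 1-5 judge scale).
rows = [
    {"label": "simple", "routed_model": "small", "routed_cost": 0.002,
     "baseline_cost": 0.010, "routed_score": 4, "baseline_score": 4},
    {"label": "complex", "routed_model": "large", "routed_cost": 0.012,
     "baseline_cost": 0.010, "routed_score": 5, "baseline_score": 4},
    {"label": "simple", "routed_model": "small", "routed_cost": 0.002,
     "baseline_cost": 0.010, "routed_score": 3, "baseline_score": 4},
]
result = evaluate_pilot(rows)
print(f"savings: {result['savings']:.1%}, adopt: {result['adopt']}")
```

The models_by_label counter is the part worth reading by hand: a "complex" request routed to a small model is the failure mode that the cost figures alone will not reveal.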

When Martian fits

First, for research setups that want to explicitly measure model-routing strategies. Anyone who wants to compare scientifically and rigorously whether classifier routing actually saves cost can run Martian as an A-B comparison against a static model choice.

Second, for high-volume applications with mixed complexity and without PII. Example: a marketing tool that generates 50,000 texts of varying depth per day. A classifier router can save 15-25% LLM cost here, which becomes relevant at monthly LLM spend of USD 5,000+.

Third, for test and comparison pipelines. In compare mode, the same request can be sent to multiple models – faster than a custom comparison stack with OpenRouter plus a self-written eval script.

Fourth, for academic setups where routing research is itself the subject. Anyone methodically studying classifier tuning and model selection gets a productive platform with built-in eval workflow in Martian.

Martian is in no case suitable for client PII under the revised Swiss FADP or Art. 321 of the Swiss Criminal Code. The layer is wrong here: US hosting, classifier opacity, and a missing Swiss-standard data processing agreement.

When not to use

First, for client data under professional secrecy. US hosting without an EU tier makes every call a third-country transfer; without formal safeguards (DPA, SCCs, ZDR mode with certificate), that is off-limits for Art. 321 industries.

Second, when deterministic model routing is desired. Anyone who wants to know exactly which model answers per use case (e.g. for reproducibility, eval consistency, or audit) must not insert a classifier in between. LiteLLM or Portkey are the right choice here.

Third, when volume is under a few thousand requests per day. At low volume the absolute savings are small, while the effort to integrate, tune, and monitor the classifier stays constant. The ratio of effort to savings potential does not work out.

Fourth, when audit-trail requirements under Art. 957a CO are mandatory. The classifier decision is opaque; the auditor wants a deterministic justification that Martian does not provide by default. A custom explanation layer on top is effort that neutralises the router advantage.

Fifth, when lock-in must be avoided. The classifier profile is proprietary and not portable. Switching back to deterministic routing means designing the routing rules again from scratch.

Trade-offs

STRENGTHS

Classifier routing can save 15-25% cost at mixed complexity
OpenAI-compatible endpoint – minimal code effort for a pilot
Built-in compare mode for parallel model tests
Custom eval sets for domain-specific classifier fine-tuning

WEAKNESSES

US hosting without EU tier – not appropriate for client PII as of May 2026
Classifier decision opaque – audit under Art. 957a CO is harder
USD 0.50-2 per 1M tokens markup on top of the provider token price
Lock-in: classifier profiles are proprietary and not portable

FAQ

Does Martian work with the OpenAI library?

Yes. Set base_url to https://withmartian.com/api/openai/v1, use an API key of the form sk-martian-..., and set model to "router" for classifier routing or to a direct model name such as anthropic/claude-opus-4.7. Whether streaming and function calling are available depends on the upstream model.

How accurate is the classifier decision?

According to Martian's benchmarks, the classifier reaches around 75-85% accuracy on published eval sets, with 15-25% cost savings at comparable answer quality versus a GPT-4o baseline. In pilot setups we could reproduce savings of 10-20%; accuracy depends strongly on the domain mix. For domain-specific Swiss fiduciary requests, classification is weaker than for generic chat applications.

Can I run Martian behind LiteLLM?

Technically yes: Martian can be entered in LiteLLM as a custom provider (openai-compatible, own base_url and api_key). Practically not recommended: two routing layers in sequence complicate debugging and neutralise the lock-in argument for LiteLLM (you become Martian-dependent anyway). Better: Martian standalone for non-PII research applications, LiteLLM for everything else.

How is data retention configurable?

By default, requests and responses are stored for 30 days for classifier improvement. With the X-Martian-No-Retention: true header, storage is disabled per request; only metadata (token count, model, cost) is then logged. A binding ZDR certificate to Swiss standard is not in place (as of May 2026).
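With the OpenAI Python SDK, a per-request header can be passed through the extra_headers argument. The sketch below assumes the header name and value exactly as given above; the build_request and send helpers are hypothetical names.

```python
def build_request(prompt, no_retention=True):
    """Keyword arguments for client.chat.completions.create()."""
    kwargs = {
        "model": "router",
        "messages": [{"role": "user", "content": prompt}],
    }
    if no_retention:
        # Header name as documented in the answer above; the SDK forwards
        # extra_headers with this single request only.
        kwargs["extra_headers"] = {"X-Martian-No-Retention": "true"}
    return kwargs

def send(prompt, api_key, no_retention=True):
    """Send one request; requires the openai package and a valid API key."""
    import openai  # imported lazily so build_request works without it

    client = openai.OpenAI(
        api_key=api_key,
        base_url="https://withmartian.com/api/openai/v1",
    )
    return client.chat.completions.create(**build_request(prompt, no_retention))

print(build_request("What is a WORM log?")["extra_headers"])
```

Keeping the header in one helper rather than at each call site makes it harder to forget on a single code path.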

Sources

Martian Documentation – router, model orchestration, constraints · 2026-05
Martian Pricing – routing modes, markup, retention · 2026-05
Martian Research papers on model routing · 2026-02
Martian Privacy Policy – data retention and US hosting · 2026-03
