LLM observability compared: Langfuse, Helicone, LangSmith, Phoenix, Lunary, Portkey, OpenLLMetry, Traceloop, HoneyHive, W&B Weave

Ten specialised tools for tracing, cost tracking, prompt versioning, and evaluation of LLM pipelines. Seven decision axes, one concrete recommendation per scenario. As of May 2026.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What is LLM observability?

LLM observability is the discipline that makes every call to a language model, every tool call from an agent, and every RAG retrieval hop traceable. Classic application telemetry (Prometheus, Datadog) is not enough here. It sees that an HTTP call took 1.8 seconds, but it does not see that Claude Opus called 24 tools in the process, three with invalid arguments, and that the final answer ended on a hallucination path. That is exactly what LLM observability tools are built for.

Four building blocks delimit the category. First, tracing: every call is recorded as a span, including prompt, answer, token usage, latency, and model version. Multi-step agent calls are shown as a tree. Second, cost tracking: tokens are converted to prices and aggregated per client, model, and function. Third, prompt versioning: prompts are kept as versioned artefacts, not as hardcoded strings in code. Fourth, evaluation: new model or prompt versions are measured against gold standards or heuristics to determine whether they are better or worse.
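The first two building blocks can be sketched in a few lines. This is a minimal illustration, not the schema of any tool named here: the `Span` fields, the model names, and the prices in `PRICES` are assumptions chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One recorded LLM or tool call (hypothetical schema, not a vendor format)."""
    name: str
    model: str
    prompt_version: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    children: list["Span"] = field(default_factory=list)


# Assumed prices in USD per million tokens: (input, output).
PRICES = {"model-a": (3.0, 15.0), "model-b": (0.5, 1.5)}


def span_cost(span: Span) -> float:
    """Cost of one span, excluding its children."""
    price_in, price_out = PRICES[span.model]
    return (span.input_tokens * price_in + span.output_tokens * price_out) / 1_000_000


def walk(span: Span):
    """Yield the span and all of its descendants, depth first."""
    yield span
    for child in span.children:
        yield from walk(child)


def cost_by_model(root: Span) -> dict[str, float]:
    """Aggregate cost per model over the whole trace tree."""
    totals: dict[str, float] = {}
    for s in walk(root):
        totals[s.model] = totals.get(s.model, 0.0) + span_cost(s)
    return totals


# A two-level trace: one agent call with one nested helper call.
trace = Span("agent", "model-a", "v3", 1000, 200, 1800.0, children=[
    Span("rerank", "model-b", "v1", 2000, 0, 120.0),
])
totals = cost_by_model(trace)
```

The same tree walk can group by client or function instead of model once those fields are added to the span.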

In May 2026 the field has consolidated. Langfuse has emerged as the OSS market leader (24,000 GitHub stars, EU cloud in Frankfurt), Helicone and Lunary are the convenient entry points, LangSmith dominates the LangChain camp, and OpenLLMetry plugs LLM telemetry into the standard OpenTelemetry path. We run Langfuse and Helicone in production for several client mandates - that experience feeds directly into this comparison.

Why the choice matters

Three factors make LLM observability a duty rather than a nice-to-have for Swiss SMEs. First: cost drift. A RAG pipeline with Claude Sonnet costs a few cents per query. A single bug in the prompt template can inflate the context - for instance, because a loop accidentally pushes 200 chunks into the prompt instead of 8 - and explode monthly cost by a factor of 25. Without per-application and per-model cost tracking, this only shows up on the LLM provider invoice, by which time it is too late.
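The factor of 25 follows directly from the chunk counts. Here is a small worked calculation; the chunk size, query volume, and input price are assumed values for illustration only.

```python
def input_tokens(chunks: int, tokens_per_chunk: int, fixed_tokens: int = 0) -> int:
    """Prompt size: fixed part (system prompt, question) plus retrieved chunks."""
    return fixed_tokens + chunks * tokens_per_chunk


def monthly_input_cost(tokens_per_query: int, queries: int, usd_per_million: float) -> float:
    """Monthly spend on input tokens only."""
    return tokens_per_query * queries * usd_per_million / 1_000_000


TOKENS_PER_CHUNK = 400      # assumed average chunk size
QUERIES_PER_MONTH = 10_000  # assumed volume
USD_PER_MILLION = 3.0       # assumed input price

normal = monthly_input_cost(input_tokens(8, TOKENS_PER_CHUNK), QUERIES_PER_MONTH, USD_PER_MILLION)
buggy = monthly_input_cost(input_tokens(200, TOKENS_PER_CHUNK), QUERIES_PER_MONTH, USD_PER_MILLION)
drift_factor = buggy / normal  # 200 chunks / 8 chunks

# With a fixed prompt part the factor is somewhat lower than 25.
with_overhead = input_tokens(200, TOKENS_PER_CHUNK, 500) / input_tokens(8, TOKENS_PER_CHUNK, 500)
```

The factor applies to input tokens. Output tokens do not grow with the context, so the effect on the total invoice depends on the input/output mix.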

Second: hallucination audit. When a client asks "Why did the system tell me my VAT rate is 8.1%?", the answer must be audit-ready: which prompt was used, which sources were supplied, which model answered. That is exactly the tracing artefact of an LLM observability platform. Without that recording, every error turns into a "the system just said so" discussion.

Third: data residency. Prompts and answers often contain sensitive client data. A US-hosted LLM tracking tool that stores prompts is problematic under revDSG and, depending on the mandate, under Art. 321 of the Swiss Criminal Code (professional secrecy). EU hosting is decisive here - and that narrows the list noticeably. Langfuse Cloud offers a Frankfurt region, and Helicone and Portkey also offer EU hosting. LangSmith, HoneyHive, and Weave run primarily in the US. For teams that prefer self-hosting, Langfuse, Helicone, Phoenix, Lunary, and SigNoz with OpenLLMetry can all be run fully on-prem.

The ten tools in detail

Langfuse (MIT, self-host plus cloud, EU region Frankfurt): the OSS market leader. May 2026 in version 3.x with S3-based storage for arbitrary scale. Cost tracking per model, user, tenant. Prompt management with versioning, tagging, A/B testing. Eval framework with LLM-as-judge and custom scoring. We run Langfuse self-host on Hetzner for several mandates. Clear default recommendation.

Helicone (Apache 2.0, self-host plus cloud, EU hosting available): proxy-based or SDK. Setup in under ten minutes - just change the LLM API base URL and all calls flow through Helicone. Convenient entry, good cost caps and rate limits built in. Less extensive eval framework than Langfuse, but quicker to get started.

LangSmith (proprietary cloud, US): the go-to tool of the LangChain community. Anyone using LangChain anyway has the deepest integration here. Very mature eval framework. Drawback: cloud only, US hosting (no EU tier as of May 2026), tied to the LangChain ecosystem.

Phoenix (Arize) (Elastic-2.0 OSS plus Arize cloud): from the ML observability camp. The open-source variant is fully self-host capable and covers embeddings visualisation and drift detection - stronger than pure LLM tools here. The Arize cloud variant targets enterprises, with enterprise pricing.

Lunary (Apache 2.0 plus cloud): simple OSS alternative. Functionally smaller than Langfuse, but very lean and quick to set up. Good for mandates that only need cost tracking and basic traces, no eval framework.

Portkey (proprietary cloud plus self-host tier, EU hosting): combines gateway plus observability. Routes between providers (OpenAI, Anthropic, Google) while tracking simultaneously. Pro: fewer components in the stack. Con: lock-in, because gateway and observability are decided together.

OpenLLMetry (Apache 2.0, SDK from Traceloop): no backend of its own, but an SDK layer that exports LLM telemetry as OpenTelemetry spans. That means every OTLP-capable backend (SigNoz, Grafana Tempo, Datadog, Honeycomb) can receive LLM traces. Standardisation on semantic conventions for GenAI. Best choice when the rest of the stack already speaks OpenTelemetry.
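To make "LLM telemetry as OpenTelemetry spans" concrete, the sketch below builds the attribute set such a span might carry. The `gen_ai.*` names follow the OpenTelemetry GenAI semantic conventions, which are still evolving, so they should be checked against the current spec. The helper functions and the model name are hypothetical and use no OpenTelemetry SDK.

```python
def genai_span_attributes(model: str, input_tokens: int, output_tokens: int,
                          operation: str = "chat") -> dict[str, object]:
    """Attributes for one LLM call, keyed by GenAI semantic-convention names."""
    return {
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }


def total_tokens(spans: list[dict[str, object]]) -> int:
    """What a generic OTLP backend could compute from these attributes alone."""
    return sum(
        int(s["gen_ai.usage.input_tokens"]) + int(s["gen_ai.usage.output_tokens"])
        for s in spans
    )


spans = [
    genai_span_attributes("example-model", 1200, 300),  # model name is a placeholder
    genai_span_attributes("example-model", 800, 150),
]
tokens = total_tokens(spans)
```

Because the attribute keys are standardised, the backend needs no LLM-specific code to filter or aggregate on them.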

Traceloop (Apache 2.0 SDK plus proprietary backend, self-host plus cloud): the company behind OpenLLMetry. Own backend with eval and prompt management, but also fully consumable via OTLP. Dual strategy: the OSS SDK for instrumentation, the hosted backend for convenience.

HoneyHive (proprietary cloud, US): focused on AI eval and tracing, with a strong eval workflow for production setups. As of May 2026 hosted in the US - for Swiss mandates with revDSG requirements, conceivable only with a data processing agreement in place. Strict eval methodology.

Weights & Biases Weave (proprietary cloud plus OSS SDK): W&B has been the ML tracking standard for years and built Weave as the LLM extension. Very good when the team uses W&B for ML experiments anyway. Otherwise overkill - the setup is large for pure LLM tracing.

Selection in six steps

01 Estimate volume: how many LLM calls per day, which models, average token count? Under 1000 calls/day, a free tier is enough.
02 Check data residency: must prompts stay in CH/EU? If yes, LangSmith and HoneyHive are out; that leaves Langfuse or Helicone in the EU cloud, or self-hosting.
03 Clarify eval need: is tracing and cost tracking enough, or is an eval framework needed too? If eval is needed: Langfuse, LangSmith, HoneyHive, Phoenix.
04 Check OpenTelemetry strategy: does the rest of the stack already speak OTLP? Then OpenLLMetry plus existing backend.
05 Gateway question: is multi-provider routing needed as well? If yes, Portkey or LiteLLM plus Langfuse as a separate component.
06 PoC in one application: instrument one production pipeline for a week, review cost reports and trace detail depth. Only then roll out to all pipelines.
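Steps 01 to 05 can be written down as a small filter, which makes the elimination logic explicit. This is a sketch of the checklist only: the `Requirements` fields and the `shortlist` function are hypothetical, and the tool sets encode just what this article states about hosting and eval support.

```python
from dataclasses import dataclass

ALL_TOOLS = ["Langfuse", "Helicone", "LangSmith", "Phoenix", "Lunary", "Portkey",
             "OpenLLMetry", "Traceloop", "HoneyHive", "W&B Weave"]
US_HOSTED = {"LangSmith", "HoneyHive", "W&B Weave"}          # run primarily in the US
EVAL_TOOLS = {"Langfuse", "LangSmith", "HoneyHive", "Phoenix"}  # named in step 03


@dataclass
class Requirements:
    calls_per_day: int
    must_stay_in_ch_eu: bool
    needs_eval: bool
    stack_speaks_otlp: bool
    needs_gateway: bool


def shortlist(req: Requirements) -> tuple[list[str], list[str]]:
    """Return (remaining candidates, notes) after applying steps 01 to 05."""
    candidates = list(ALL_TOOLS)
    notes: list[str] = []
    if req.calls_per_day < 1000:                                  # step 01
        notes.append("a free tier is enough at this volume")
    if req.must_stay_in_ch_eu:                                    # step 02
        candidates = [t for t in candidates if t not in US_HOSTED]
    if req.needs_eval:                                            # step 03
        candidates = [t for t in candidates if t in EVAL_TOOLS]
    if req.stack_speaks_otlp:                                     # step 04
        notes.append("consider OpenLLMetry plus the existing OTLP backend")
    if req.needs_gateway:                                         # step 05
        notes.append("add Portkey, or LiteLLM plus Langfuse as separate components")
    return candidates, notes


candidates, notes = shortlist(Requirements(500, True, True, False, True))
```

Step 06, the one-week PoC, stays a manual step: the filter narrows the list, it does not replace looking at real traces.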

Recommendation by scenario

Swiss fiduciary or law firm with a RAG pipeline, revDSG-strict: Langfuse self-host on Hetzner Falkenstein. A CPX21 server (3 vCPU, 4 GB RAM, around CHF 12/month) covers tens of thousands of traces per month. Postgres and ClickHouse as backend. Setup in one day. All prompts and traces stay in the EU.

Fast start, SME without self-host appetite: Helicone EU cloud. Proxy mode, change API base URL, done. Pricing starts at USD 20/month plus a usage-based fee per 1000 requests. Cost caps and rate limits built in - useful for protection against pipeline bugs.

LangChain-first setup: LangSmith. If the code already uses LangChain or LangGraph, LangSmith tracing needs no code changes, only an API key and the tracing environment variables. But: US hosting must be signed off in the data flow analysis.

OpenTelemetry-first strategy: OpenLLMetry SDK plus SigNoz (or Grafana Tempo). LLM traces get recorded as normal OTLP spans in the same backend as the rest of the stack. Saves a separate tool category.

Multi-provider setup, gateway plus observability in one: Portkey EU region. Routes between OpenAI, Anthropic, Google, Mistral and tracks simultaneously. Worth it if a gateway is needed anyway. Alternative: LiteLLM plus Langfuse as separate components - more flexible, more setup.

ML team that already uses W&B: Weave. Clear integration with existing W&B experiments. Alternative: Phoenix, when drift detection and embeddings analysis are central.

When LLM observability is overkill

Anyone running a single ChatGPT wrapper with fewer than 1000 requests per month does not need a dedicated tracking tool. The OpenAI usage dashboard and a few logs are enough. Likewise, anyone still in the experimentation phase, only testing prototype calls, should save the effort until the pipeline goes live.

The typical mistake in tool selection is picking the expensive cloud variant too early. LangSmith for five beta users is wasted money. Likewise, HoneyHive or W&B Weave for a fiduciary office with two production pipelines is overengineering. Rule of thumb: if monthly LLM cost is under CHF 50, Langfuse cloud hobby tier (free) or Helicone free is enough. Only at CHF 200 and up does self-host effort or a paid tier pay off.

Be careful mixing several observability tools on the same stack. We have seen mandates running LangSmith, Langfuse Cloud, and Helicone in parallel, "because each one does something different". Result: three sources of truth, three UIs, threefold cost reporting effort. We recommend one primary system per pipeline plus possibly OpenLLMetry as the standardisation layer.

Trade-offs

STRENGTHS

Langfuse: OSS market leader, EU cloud Frankfurt, eval framework included
Helicone: 10-minute setup via proxy, cost caps and rate limits included
OpenLLMetry: OpenTelemetry standard, any OTLP backend usable
LangSmith: deepest integration with LangChain/LangGraph
Portkey: gateway plus observability in one component

WEAKNESSES

LangSmith: US hosting only (as of May 2026), tied to LangChain
HoneyHive: US hosting, no EU tier
W&B Weave: large for pure LLM tracing, only worth it with an ML stack
Helicone: fewer eval functions than Langfuse
Portkey: lock-in, gateway and observability decided as a package

FAQ

What does Langfuse self-host realistically cost?

Hardware: a Hetzner CPX21 covers tens of thousands of traces per month - around CHF 12/month. For larger volume a CPX31 (4 vCPU, 8 GB RAM, around CHF 25/month). External storage on an S3-compatible backend (Hetzner Object Storage, MinIO). Setup effort: one to two days. Maintenance about two hours per month. First-year total including setup at market hourly rate: around CHF 3000 to 5000. Langfuse Cloud Pro at comparable scope: from USD 59/month, around CHF 700/year - cloud wins in year one, self-host from year two.
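The first-year figure can be reproduced from the numbers above. The hourly rate and the eight-hour working day are assumptions picked for illustration; a different rate shifts the result accordingly.

```python
HARDWARE_CHF_PER_MONTH = 12     # CPX21, from the text
MAINTENANCE_HOURS_PER_MONTH = 2  # from the text
HOURS_PER_DAY = 8               # assumed working day
HOURLY_RATE_CHF = 120           # assumed market rate


def first_year_self_host(setup_days: int) -> int:
    """Hardware plus setup and maintenance labour for the first twelve months."""
    hardware = HARDWARE_CHF_PER_MONTH * 12
    setup = setup_days * HOURS_PER_DAY * HOURLY_RATE_CHF
    maintenance = MAINTENANCE_HOURS_PER_MONTH * 12 * HOURLY_RATE_CHF
    return hardware + setup + maintenance


low = first_year_self_host(setup_days=1)
high = first_year_self_host(setup_days=2)
cloud_usd_per_year = 59 * 12  # Langfuse Cloud Pro entry price, from the text
```

Under these assumptions both ends of the range land inside the CHF 3000 to 5000 band, and labour rather than hardware accounts for almost all of it.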

Do I really need prompt versioning?

Once more than one person works on prompts, yes. Prompts are configuration, not an implementation detail - and configuration belongs under version control. Concretely: an update to the system prompt of a RAG pipeline can flip answer quality. Without versioning, the question "what was the old prompt?" cannot be answered later. Langfuse, LangSmith, and Portkey support this natively. Helicone and Lunary offer more basic support, and OpenLLMetry none at all (it is an SDK, not a backend).
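What "prompts as versioned artefacts" means in practice can be shown with an in-memory registry. This is a minimal sketch and does not mirror the API of Langfuse, LangSmith, or Portkey; the class and method names are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class PromptVersion:
    name: str
    version: int        # 1-based, increases with every publish
    text: str
    checksum: str       # lets a trace prove which exact text was used
    created_at: datetime


class PromptRegistry:
    """Append-only store: old versions are never overwritten."""

    def __init__(self) -> None:
        self._versions: dict[str, list[PromptVersion]] = {}

    def publish(self, name: str, text: str) -> PromptVersion:
        history = self._versions.setdefault(name, [])
        entry = PromptVersion(
            name=name,
            version=len(history) + 1,
            text=text,
            checksum=hashlib.sha256(text.encode("utf-8")).hexdigest(),
            created_at=datetime.now(timezone.utc),
        )
        history.append(entry)
        return entry

    def get(self, name: str, version: Optional[int] = None) -> PromptVersion:
        """Return a specific version, or the latest when version is None."""
        history = self._versions[name]  # raises KeyError for unknown prompts
        if version is None:
            return history[-1]
        if not 1 <= version <= len(history):
            raise KeyError(f"{name} has no version {version}")
        return history[version - 1]


registry = PromptRegistry()
registry.publish("rag-system", "Answer only from the supplied sources.")
registry.publish("rag-system", "Answer only from the supplied sources. Cite each source.")
old = registry.get("rag-system", version=1)
latest = registry.get("rag-system")
```

Storing the version number or checksum on every trace is what makes "what was the old prompt?" answerable afterwards.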

How do Helicone and Langfuse practically differ?

Helicone is a proxy: you change the LLM API base URL to https://oai.helicone.ai/v1 instead of https://api.openai.com/v1, and all calls flow through automatically. Apart from the base URL and a Helicone auth header, the application code stays unchanged. Langfuse is SDK-based: you wrap OpenAI calls in a Langfuse wrapper. More setup, but finer control over traces, user tagging, session tracking. Rule of thumb: Helicone for quick cost overview, Langfuse for serious tracing including multi-step agents, prompt management, eval.
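The proxy switch amounts to a different request target. The sketch below only builds the URL and headers and sends nothing. The two base URLs are the ones quoted above; the `Helicone-Auth` header is an assumption here and should be confirmed against Helicone's documentation, and the function name and keys are placeholders.

```python
from typing import Optional

OPENAI_BASE_URL = "https://api.openai.com/v1"
HELICONE_BASE_URL = "https://oai.helicone.ai/v1"


def chat_request_target(provider_key: str,
                        helicone_key: Optional[str] = None) -> tuple[str, dict[str, str]]:
    """Return (url, headers) for a chat completion, direct or via the proxy."""
    headers = {
        "Authorization": f"Bearer {provider_key}",
        "Content-Type": "application/json",
    }
    base_url = OPENAI_BASE_URL
    if helicone_key is not None:
        # Proxy mode: same path and payload, different host plus one extra header.
        base_url = HELICONE_BASE_URL
        headers["Helicone-Auth"] = f"Bearer {helicone_key}"  # assumed header name
    return f"{base_url}/chat/completions", headers


direct_url, direct_headers = chat_request_target("sk-placeholder")
proxy_url, proxy_headers = chat_request_target("sk-placeholder", "hk-placeholder")
```

Request body and response handling are identical in both modes, which is why the proxy approach is quick to adopt and equally quick to remove.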

Can I send LLM traces to my existing Datadog or Grafana?

Yes, with OpenLLMetry. The SDK exports LLM spans in OpenTelemetry format. Every OTLP-capable backend (Datadog, Grafana Tempo, SigNoz, Honeycomb, New Relic) can receive them. Pro: no second UI. Con: specialised cost reports and prompt versioning are missing - those are noticeably stronger in dedicated LLM tools (Langfuse etc.). We recommend the combination for stacks with an existing OTLP pipeline.

Sources

Langfuse Documentation - Open-source LLM engineering platform · 2026-05
Helicone Documentation - LLM observability proxy · 2026-04
OpenLLMetry - OpenTelemetry semantic conventions for GenAI · 2026-04
LangSmith Documentation - LangChain observability · 2026-04
Arize Phoenix - open-source ML and LLM observability · 2026-03
