
Langfuse: OSS market leader for LLM tracing, prompt management, and eval

Langfuse (MIT, v3+) is the OSS standard tool for LLM tracing, cost tracking, prompt versioning, and eval. Self-host or EU cloud Frankfurt.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What is Langfuse?

Langfuse (langfuse.com) is the open-source platform for LLM observability with the largest market share in the OSS space. The project was founded in 2023, is MIT licensed (GitHub langfuse/langfuse, over 24,000 stars as of May 2026), and went through Y Combinator's W23 batch. Behind the project is the German company Langfuse GmbH (Berlin), which raised a USD 4M Series A in summer 2024 and had a Series B round in preparation as of Q1 2026.

The platform covers four building blocks. First, tracing: every LLM call is recorded as a span with prompt, response, tokens, latency, and model. Multi-step agent calls (LLM → tool → LLM → database → LLM) appear as a call tree with a clear parent-child hierarchy. Second, cost tracking: token counts are priced against current provider price lists and aggregated by client, application, function, and model. Third, prompt management: prompts exist as versioned artefacts that can be deployed per environment and compared against eval sets. Fourth, evaluation: gold standards, LLM-as-judge, or custom heuristics measure whether a new prompt or model is actually better.
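The cost-tracking block boils down to multiplying token counts by per-token prices and grouping the result by a dimension. The following is a minimal sketch of that arithmetic, not Langfuse's implementation; the price table, model names, and record fields are invented for illustration.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1M tokens (input, output); not real provider prices.
PRICES = {
    "model-a": (2.00, 6.00),
    "model-b": (0.50, 1.50),
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one call in USD under the assumed price table."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

def aggregate_costs(calls: list, key: str) -> dict:
    """Sum call costs grouped by one dimension (client, function, model, ...)."""
    totals = defaultdict(float)
    for call in calls:
        totals[call[key]] += call_cost(
            call["model"], call["input_tokens"], call["output_tokens"]
        )
    return dict(totals)

# Invented call records, shaped like the fields a trace would carry.
calls = [
    {"client": "tenant-12", "model": "model-a", "input_tokens": 1_000, "output_tokens": 500},
    {"client": "tenant-12", "model": "model-b", "input_tokens": 2_000, "output_tokens": 1_000},
    {"client": "tenant-7", "model": "model-a", "input_tokens": 500_000, "output_tokens": 0},
]
by_client = aggregate_costs(calls, "client")
by_model = aggregate_costs(calls, "model")
```

Grouping by a different key (`"model"` instead of `"client"`) reuses the same records, which is why a consistent metadata schema per call matters for later cost reports.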

Architecture: as of May 2026, version 3.x is stable with ClickHouse as the primary logging backend (instead of the earlier PostgreSQL-only setup) and S3-compatible object storage for prompt artefacts. That removes the earlier Postgres bottleneck: production installations handle tens of thousands of traces per hour without performance problems.

Deployment variants: First, Langfuse Cloud (langfuse.com) with an EU region in Frankfurt (on AWS eu-central-1). Free tier up to 50,000 traces/month, Pro tier from USD 59/month, Team tier from USD 199/month, Enterprise tier on request. Second, self-host: a Docker Compose stack with langfuse-web, langfuse-worker, ClickHouse, Postgres, and MinIO or S3. A full production installation on Hetzner CPX21/CPX31 can be set up in one day.

For fairlane.systems, Langfuse is the standard observability tool. We run Langfuse self-host for several mandates on Hetzner and use Langfuse Cloud EU Frankfurt as a backup option for pilot projects without hardware budget.

Why it is the standard tool

Three properties explain the market position. First: MIT licence with full self-host capability. In contrast to tools like LangSmith (proprietary, US-only) or W&B Weave (proprietary, ML-centric), Langfuse lives fully in OSS. Anyone who wants to fork the tool can do so at any time. Migrating away from Langfuse is easier than migrating away from proprietary platforms.

Second: EU hosting plus self-host. Langfuse Cloud has a Frankfurt region (AWS eu-central-1) where all data (prompts, responses, traces) is processed exclusively in the EU. Self-host runs on any hardware in CH or EU. That makes Langfuse the only one of the major LLM tracing tools that can be used without discussion in setups that apply the revised Swiss FADP strictly.

Third: the feature scope hits the sweet spot. Langfuse delivers tracing, cost tracking, prompt management, and eval in one platform; other tools have individual blocks but rarely all of them together. Anyone running a RAG pipeline with 8 agent steps sees the call hierarchy as a tree, can compare prompt versions, and can review eval metrics over time in the same place. Helicone is weaker on the cost side, Lunary is functionally more minimal, and Phoenix comes from the ML drift camp and fits LLM production less well.

For Swiss fiduciary and law-firm setups, Langfuse brings two additional advantages. First, prompt versioning as an audit trail: every change to a system prompt is traceable with diff, author, and deployment date. The question "Which prompt was active on April 12?" has a deterministic answer. Second, eval sets as quality assurance: before a model switch, the new model candidate can be evaluated against 50 real client questions – without risking loss of answer quality.
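The audit question about which prompt was active on a given day is a lookup over the deployment history. A minimal sketch of that lookup follows, assuming the history has been exported as a date-sorted list; the dates, version numbers, and the function name `active_version` are hypothetical and not part of the Langfuse SDK.

```python
from bisect import bisect_right
from datetime import date
from typing import Optional

# Hypothetical deployment history of one prompt: (deployment date, version), sorted by date.
DEPLOYMENTS = [
    (date(2026, 3, 1), 1),
    (date(2026, 4, 10), 2),
    (date(2026, 4, 20), 3),
]

def active_version(on: date) -> Optional[int]:
    """Return the version deployed most recently on or before the given day."""
    dates = [deployed for deployed, _ in DEPLOYMENTS]
    idx = bisect_right(dates, on)  # number of deployments on or before `on`
    return DEPLOYMENTS[idx - 1][1] if idx else None

version_on_april_12 = active_version(date(2026, 4, 12))
```

With the invented history above, April 12 falls between the second and third deployment, so the lookup resolves to version 2; a date before the first deployment yields `None`.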

Limits: Langfuse is SDK-based, not proxy-based. Code must be adjusted (LLM calls are wrapped in a Langfuse wrapper), which is more effort than Helicone's base URL change. In setups where code changes are undesired, the Helicone proxy is the faster alternative.

How it works

Integration runs via an SDK in the application's programming language (Python, TypeScript, Go). In the Python example:

from langfuse import Langfuse
from langfuse.openai import openai  # drop-in replacement for the openai module

langfuse = Langfuse(
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
    host="https://cloud.langfuse.com",  # Langfuse Cloud EU region, or your self-host URL
)

client = openai.OpenAI()  # automatic tracing is active
resp = client.chat.completions.create(
    model="mistral-large-2411",
    messages=[{"role": "user", "content": "..."}],
    metadata={"client": "tenant-12", "function": "rag-search"},
)

The metadata fields appear in the Langfuse dashboard as filterable dimensions. For agent tracing an @observe() decorator goes on functions; nested calls are automatically connected as parent-child spans.

Prompt management runs via the dashboard plus SDK. Prompts are created in the dashboard as versioned artefacts (name, version, body, variables). In code the prompt is fetched by name and label, so a prompt switch is a configuration change in the dashboard, not a code change. Example:

prompt = langfuse.get_prompt("rag-system", label="production")
formatted = prompt.compile(client_name="Client A", language="en")

That guarantees reproducibility: every response knows which prompt version was used.
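The fetch-by-label and compile steps can be pictured with a small in-memory stand-in. This is a sketch for illustration only: the store layout, the function names `get_prompt_sketch` and `compile_sketch`, and the prompt bodies are hypothetical, and the `{{variable}}` placeholder syntax is an assumption of this example.

```python
import re

# Hypothetical prompt store: name -> {version: body}, plus (name, label) -> version pointers.
PROMPTS = {
    "rag-system": {
        1: "Answer for {{client_name}}.",
        2: "Answer for {{client_name}} in {{language}}.",
    }
}
LABELS = {("rag-system", "production"): 2, ("rag-system", "staging"): 1}

def get_prompt_sketch(name: str, label: str):
    """Resolve a label to a concrete version and return (version, body)."""
    version = LABELS[(name, label)]
    return version, PROMPTS[name][version]

def compile_sketch(body: str, **variables) -> str:
    """Replace each {{name}} placeholder; raises KeyError if a variable is missing."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: str(variables[m.group(1)]), body)

version, body = get_prompt_sketch("rag-system", "production")
formatted = compile_sketch(body, client_name="Client A", language="en")
```

Because the application only names a label, re-pointing `production` from version 2 to version 1 changes what the next call receives without touching application code, and storing `version` next to each response is what makes the answer reproducible later.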

Evaluation runs either as LLM-as-judge (a second model evaluates the response against a gold standard) or via custom score functions (regex match, BLEU score, custom heuristic). Eval sets are managed in the dashboard; a run executes as a batch against a prompt and delivers metrics per response.
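Custom score functions are ordinary functions that map a response (and optionally a gold answer) to a number. The sketch below shows two simple ones and a batch runner under stated assumptions: the eval-set layout, the function names, and the sample data are invented, and nothing here calls the Langfuse API.

```python
import re
from statistics import mean

def exact_match(response: str, gold: str) -> float:
    """1.0 if the response equals the gold answer, ignoring case and outer whitespace."""
    return 1.0 if response.strip().lower() == gold.strip().lower() else 0.0

def regex_match(pattern: str):
    """Build a score function that checks the response against a regular expression."""
    compiled = re.compile(pattern)
    def score(response: str, gold: str) -> float:
        return 1.0 if compiled.search(response) else 0.0
    return score

def run_eval(eval_set, responses, score_fn):
    """Score each response against its gold answer; return per-item scores and the mean."""
    scores = [score_fn(resp, item["gold"]) for item, resp in zip(eval_set, responses)]
    return scores, mean(scores)

# Invented eval set and model responses.
eval_set = [
    {"question": "q1", "gold": "Bern"},
    {"question": "q2", "gold": "30 days"},
]
responses = [" bern ", "The period is 60 days."]

exact_scores, exact_avg = run_eval(eval_set, responses, exact_match)
regex_scores, regex_avg = run_eval(eval_set, responses, regex_match(r"\d+ days"))
```

The two scorers disagree on the second item: the regex only checks that a duration is present, while exact match checks correctness. That gap is the usual argument for combining cheap heuristics with a gold standard or an LLM judge.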

For self-host the following containers run: langfuse-web (UI, Next.js), langfuse-worker (background jobs in TypeScript), ClickHouse (traces), Postgres (configuration and metadata), MinIO or S3 (prompt archives). Recommended Hetzner configuration: CPX31 with 4 vCPU/8 GB RAM for medium load (CHF 25/month), CPX41 with 8 vCPU/16 GB RAM for high load (CHF 50/month). Backups to Hetzner Storage Box with append-only mode deliver WORM compliance for Art. 957a CO.

Langfuse setup in 5 steps

01 Deploy Langfuse self-host on Hetzner CPX31 (Docker Compose with langfuse-web, worker, ClickHouse, Postgres, MinIO) or create a Langfuse Cloud EU Frankfurt account.
02 Generate API keys (pk-lf-..., sk-lf-...), embed SDK in applications (Python: pip install langfuse, drop-in openai replacement).
03 Define metadata schema: per call set client, application, function, environment fields – for later filters and cost reports.
04 Migrate prompts to the repository: replace existing hardcoded strings in code with langfuse.get_prompt() and version tag.
05 Create eval sets: 30-50 real requests with gold-standard responses, configure LLM-as-judge scoring, plan regular runs after prompt changes.
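For step 03, one way to keep the metadata schema consistent across applications is to validate it in code before it is attached to a call. This is a sketch under assumptions: the class name `CallMetadata`, the allowed environment names, and the sample values are hypothetical; only the four field names come from the step above.

```python
from dataclasses import dataclass, asdict

ALLOWED_ENVIRONMENTS = {"dev", "staging", "production"}  # assumed set, adjust as needed

@dataclass(frozen=True)
class CallMetadata:
    """The four dimensions used later for dashboard filters and cost reports."""
    client: str
    application: str
    function: str
    environment: str

    def __post_init__(self):
        # Reject empty fields so that reports never contain unlabeled calls.
        for field_name, value in asdict(self).items():
            if not value:
                raise ValueError(f"metadata field {field_name!r} must not be empty")
        if self.environment not in ALLOWED_ENVIRONMENTS:
            raise ValueError(f"unknown environment: {self.environment!r}")

# The resulting dict can be passed as the metadata argument of a traced call.
meta = asdict(CallMetadata("tenant-12", "client-portal", "rag-search", "production"))
```

Failing fast on a missing or misspelled field is cheaper than discovering months later that cost reports cannot be split by client.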

When Langfuse fits

First, for all productive LLM applications with more than occasional use. As soon as an application serves more than a few hundred requests per day, observability is a duty – not a nice-to-have. Langfuse self-host on Hetzner covers this requirement at server cost of CHF 25/month.

Second, for RAG pipelines and agent workflows. Anyone building an agent with multiple tool calls, RAG retrievals, and multi-step reasoning needs trace trees in order to debug it. Without Langfuse or LangSmith every debug session is blind.

Third, when prompt versioning should be taken seriously. As soon as more than one person works on prompts, diff view, deployment per environment, and rollback capability are central. Langfuse's prompt repository is one of the cleanest solutions on the market here.

Fourth, for eval-driven model/prompt choice. Before migrating from Claude Sonnet to Mistral Large, you want to know whether answer quality remains stable on the real client corpus. Langfuse eval sets allow that measurement – A-B test against gold standard, metrics per model candidate.

Fifth, for Swiss mandates with hard data-residency requirements. Langfuse self-host on Swiss/EU hardware is the only configuration in which all prompts, responses, and traces remain under your control – no US cloud layer in between.

Sixth, as a complementary layer to routing gateways. A standard configuration: LiteLLM for routing and virtual keys, Langfuse for observability and prompt management, with both tools running in parallel. LiteLLM has a Langfuse callback hook, so every call is automatically recorded in Langfuse.

When not to use

First, in extremely small setups with under 1,000 calls/month. A single ChatGPT wrapper for internal staff notes needs no dedicated tracking tool. The OpenAI usage dashboard plus some Postgres logs is enough.

Second, when code changes are absolutely forbidden. Langfuse is SDK-based – code must wrap LLM calls in the Langfuse wrapper or set the @observe() decorator. Anyone unable to (e.g. due to contractual code-freeze clauses) uses Helicone as a proxy alternative without code change.

Third, when the team lacks container/Docker knowledge and no cloud budget is allocated. Langfuse self-host needs a cleanly maintained container stack with backup, monitoring, and update discipline. Anyone who cannot deliver that uses Langfuse Cloud EU Frankfurt – but forgoes the self-host advantages.

Fourth, for setups that exclusively want Prometheus/Grafana-based monitoring without a separate UI. Anyone wanting to export LLM telemetry only as OpenTelemetry spans into an existing monitoring backend uses OpenLLMetry (SDK from Traceloop) and SigNoz or Grafana Tempo. Langfuse brings its own UI that runs in parallel.

Fifth, at very tight latency budgets where no additional SDK overhead is tolerated. The Langfuse SDK typically adds 1-3 ms per call (background tracing); for voice bots with a 200 ms budget that is acceptable, but for extremely latency-critical sub-100-ms applications you should measure the overhead yourself.
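Measuring the overhead means timing the same call with and without instrumentation and comparing medians. The harness below is a generic sketch: `bare_call` and `instrumented_call` are placeholders invented for this example, and in a real test they would be replaced by the actual LLM call without and with the tracing wrapper.

```python
import time
from statistics import median

def median_latency_ms(func, runs: int = 200) -> float:
    """Median wall-clock duration of func() in milliseconds over several runs."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        samples.append((time.perf_counter() - start) * 1000)
    return median(samples)

def bare_call():
    # Placeholder workload standing in for the uninstrumented call.
    return sum(range(1000))

def instrumented_call():
    # Placeholder for the same call with tracing: builds a small record around it.
    record = {"name": "bare_call", "start": time.time()}
    result = bare_call()
    record["end"] = time.time()
    return result

overhead_ms = median_latency_ms(instrumented_call) - median_latency_ms(bare_call)
```

The median is used instead of the mean so that a few slow outliers (garbage collection, scheduler noise) do not dominate the result; on a noisy machine the difference can even come out slightly negative for a workload this small.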

Trade-offs

STRENGTHS

MIT licence, fully self-hostable with an S3-scalable backend
EU region Frankfurt in the cloud tier – clean Swiss FADP configuration
Tracing plus cost plus prompt versioning plus eval in one platform
Market leader in OSS LLM observability with an active community

WEAKNESSES

SDK-based – code change required (no proxy mode like Helicone)
Self-host needs ClickHouse plus Postgres plus S3 – more complex stack
Steep eval learning curve in LLM-as-judge setups with custom score functions
No built-in guardrails (PII filter, toxicity, prompt-injection detection)

FAQ

What are the real self-host costs?

Hardware: Hetzner CPX31 (4 vCPU, 8 GB RAM, around CHF 25/month) covers tens of thousands of traces per month. At larger volume CPX41 with 8 vCPU/16 GB (CHF 50/month). Hetzner Storage Box for backups CHF 10-20/month. Setup effort: 1-2 days. Maintenance: around 2-3 hours per month. First-year total including setup at Swiss hourly rate: around CHF 3,000-5,000. Langfuse Cloud Pro in comparison: USD 59/month plus volume surcharge – around USD 700-1,500/year. Cloud wins in year one, self-host from year two.
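The first-year figure can be reproduced with simple arithmetic. In the sketch below the server, backup, setup, and maintenance figures are the ones quoted above, while the hourly rate of CHF 80 and the 8-hour working day are assumptions of this example, since the text names neither.

```python
HOURS_PER_DAY = 8   # assumed length of a working day
HOURLY_RATE = 80    # assumed rate in CHF per hour; not stated in the text

def self_host_first_year(server_month, backup_month, setup_days,
                         maint_hours_month, hourly_rate):
    """First-year cost in CHF: 12 months of infrastructure plus setup and maintenance labour."""
    infra = 12 * (server_month + backup_month)
    labour_hours = setup_days * HOURS_PER_DAY + 12 * maint_hours_month
    return infra + labour_hours * hourly_rate

# Lower bound: CPX31, CHF 10 backup, 1 day setup, 2 h/month maintenance.
low = self_host_first_year(25, 10, 1, 2, HOURLY_RATE)
# Upper bound: CPX31, CHF 20 backup, 2 days setup, 3 h/month maintenance.
high = self_host_first_year(25, 20, 2, 3, HOURLY_RATE)
```

Under these assumptions the range is CHF 2,980 to 4,700, close to the figure quoted above. Labour dominates the total, so the result scales almost linearly with whatever hourly rate is plugged in.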

How does Langfuse v3 differ from v2?

Main change: ClickHouse instead of PostgreSQL as the primary logging backend. That enables scaling into millions of traces per month without a Postgres bottleneck. In addition, S3-compatible object storage for prompt artefacts, improved eval workflows, and new UI components. Migration from v2 to v3 is documented but needs planned downtime of 2-4 hours depending on data volume.

Does Langfuse work behind LiteLLM or other gateways?

Yes. LiteLLM has a built-in Langfuse callback hook: in config.yaml, langfuse is entered as a success_callback, and every call is automatically mirrored to Langfuse. Helicone, Portkey, and Kong AI Gateway can also be combined with Langfuse in one of two ways: either the gateway sends a webhook to Langfuse, or the application sends to both in parallel. The combination of LiteLLM (routing) plus Langfuse (observability) is our standard for Swiss mandates.

Is Langfuse audit-grade for Art. 957a CO?

Conceptually yes. Langfuse writes every trace with timestamp, model, tokens, cost, prompt version, and response hash to ClickHouse. Backups run via pg_dump and ClickHouse backup mechanisms to S3 Object Lock or a Hetzner Storage Box in append-only mode. This configuration is WORM-compliant and thus suitable for Art. 957a CO. Important: WORM must be actively configured. A default install writes updatable ClickHouse records without an audit-proof layer.

Sources

Langfuse Documentation – tracing, prompts, eval, self-host · 2026-05
Langfuse GitHub repository – MIT licence, v3+ source · 2026-05
Langfuse Cloud Pricing and EU Region Frankfurt · 2026-05
Langfuse v3 Architecture announcement – ClickHouse + S3 · 2026-02
