tool · Langfuse GmbH · watching

Langfuse

Open-source LLM observability — traces every prompt, retrieval, and answer through a single SDK; aggregates scores from automated and human evals; self-hostable via Docker. The dashboard answer to "is this DSPy compile actually getting better over time?" Free self-hosted; paid cloud tier (skip).

Updated May 24, 2026

Langfuse is the trace + eval dashboard layer that turns "did this drafter improve?" from a vibes-question into a chart. Self-hostable in Docker; the SDK wraps every LLM call so you see the full chain. This page is the orient-and-set-up surface — official docs at langfuse.com/docs own the SDK contract.

What it is

An LLM-application observability platform. Open source (MIT) for the self-hosted version; commercial cloud + enterprise tiers exist (skip those). Three core surfaces:

Traces — every LLM call gets logged with input, output, model, latency, token cost. Multi-step chains nest naturally (a DSPy module that calls retrieval then synthesis shows up as a tree).
Scores — attach quality metrics to traces. Automated scores (rule-based, LLM-as-judge, structural rubrics) and human scores (review queue UI) both go to the same place. Filter and chart by version, by date, by metric name.
Datasets — store eval sets (input → expected output) and replay them against any version of a prompt or module; compare scores across runs.

The pitch: when "non-determinism is the regression suite," you need a queryable record of every run. Langfuse is the record.

When to use it

Reach for it when:

You're shipping more than one DSPy compile or fine-tune iteration and need to see whether each new version is better than the last.
You have an eval set and want the scores from re-running it persisted, charted, queryable — not just printed to stdout.
The agent layer makes multi-step calls (retrieval → rerank → synthesize) and you want the chain visible, not flattened into one log line per call.
You want a human review queue for selective draft scoring without standing up your own UI.

Skip it when:

The project is small enough that printing JSON to a log file covers the need.
You don't have an eval set yet — Langfuse without scores is just a fancy logger.
A single existing observability stack (Honeycomb, Datadog, etc.) already handles LLM traces and the marginal new surface isn't worth it.

At a glance

Core concepts

Trace — one user request from start to finish. Contains generations, spans, retrievals as nested children.
Generation — a single LLM call. Inputs, outputs, model, parameters, token counts, latency, cost.
Span — a non-LLM step (retrieval, rerank, tool call, postprocessor).
Score — (trace_id, name, value, comment). The unit of "did this work."
Dataset — a versioned list of (input, expected_output) examples. Replayable.
Prompt — versioned prompts stored in Langfuse; production code fetches by name. Optional; DSPy compiles supersede this for compiled modules.
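
To make the nesting concrete, here is a minimal Python sketch of how these concepts relate. The class and field names are illustrative assumptions, not Langfuse's actual schema or SDK types. The only point is that generations and spans hang off a trace as a tree, and a score points back at the trace by ID.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional, Union


@dataclass
class Generation:
    """A single LLM call with its model, token counts, and latency."""
    name: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float


@dataclass
class Span:
    """A non-LLM step (retrieval, rerank, tool call); may nest children."""
    name: str
    latency_ms: float
    children: list[Union[Span, Generation]] = field(default_factory=list)


@dataclass
class Trace:
    """One user request; observations nest underneath it."""
    id: str
    name: str
    observations: list[Union[Span, Generation]] = field(default_factory=list)


@dataclass
class Score:
    """(trace_id, name, value, comment), as listed above."""
    trace_id: str
    name: str
    value: float
    comment: Optional[str] = None


def total_tokens(node) -> int:
    """Walk the tree and sum token counts over every Generation."""
    if isinstance(node, Generation):
        return node.input_tokens + node.output_tokens
    children = node.observations if isinstance(node, Trace) else node.children
    return sum(total_tokens(child) for child in children)


# Hypothetical chat request: retrieval -> rerank -> synthesis.
trace = Trace(
    id="t-1",
    name="chat",
    observations=[
        Span("retrieval", 40.0, children=[
            Generation("query-rewrite", "example-model", 30, 12, 210.0),
        ]),
        Span("rerank", 15.0),
        Generation("synthesis", "example-model", 900, 250, 1800.0),
    ],
)
score = Score(trace_id=trace.id, name="post_system_rubric", value=0.8)
print(total_tokens(trace))
```

Because the score carries only a trace ID, re-scoring later never has to touch the trace itself.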

Distribution

Self-hosted (free, MIT) — Docker Compose for a full stack (Postgres + ClickHouse + web + worker). GL default.
Langfuse Cloud — managed, paid by ingest volume. Free tier exists but defeats the "stay local" benefit. Skip.
SDKs — Python and TypeScript, plus framework integrations (DSPy, LlamaIndex, LangChain, OpenAI SDK, Anthropic SDK).

How to integrate

Default integration order:

Self-host. git clone https://github.com/langfuse/langfuse && cd langfuse && docker compose up -d. Defaults to localhost:3000. Create a project; grab the LANGFUSE_PUBLIC_KEY + LANGFUSE_SECRET_KEY.
Persist behind the existing infra surface. Don't expose :3000 publicly; reverse-proxy through the same Cloudflare Tunnel pattern the chat-bridge uses, gated by a token.
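Wire to DSPy. Native integration, registered once via dspy.configure(callbacks=[LangfuseCallback()]). The callback's class name and import path have shifted between releases, so confirm both against the Langfuse DSPy integration docs for your pinned versions. Once registered, every compile run and every module call writes a trace + generation node automatically.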
Wire to LlamaIndex. from langfuse.llama_index import LlamaIndexInstrumentor → LlamaIndexInstrumentor().start(). Every retrieval / rerank / synthesis step nests under the parent trace.
Add scores. Each draft / chat answer goes through your eval metric; call langfuse.score(trace_id, name="post_system_rubric", value=...). Same call for LLM-as-judge or human review.
Datasets. Upload the eval set once via langfuse.create_dataset_item(...), one call per example. Re-run any version of a module against the dataset; scores accumulate.
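
One cheap safeguard once the keys from step 1 exist: a startup check that the environment is complete and points at the self-hosted instance. This is a sketch under assumptions. check_langfuse_env is a hypothetical helper, LANGFUSE_HOST is assumed to be the variable your pinned SDK version reads for the server URL (confirm in the official docs), and treating any langfuse.com hostname as "cloud" is a heuristic.

```python
from typing import Mapping
from urllib.parse import urlparse

# The two key names come from the project settings page; LANGFUSE_HOST is
# an assumption about which variable the SDK reads for the server URL.
REQUIRED_VARS = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY", "LANGFUSE_HOST")


def check_langfuse_env(env: Mapping[str, str]) -> str:
    """Return the configured host, or raise if the config looks wrong."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing Langfuse settings: " + ", ".join(missing))

    host = env["LANGFUSE_HOST"]
    hostname = urlparse(host).hostname or ""
    # Heuristic: any langfuse.com hostname is treated as the managed cloud.
    if hostname == "langfuse.com" or hostname.endswith(".langfuse.com"):
        raise RuntimeError(host + " looks like the cloud tier, not self-hosted")
    return host


# Example values only; real code would pass os.environ instead.
example_env = {
    "LANGFUSE_PUBLIC_KEY": "pk-example",
    "LANGFUSE_SECRET_KEY": "sk-example",
    "LANGFUSE_HOST": "http://localhost:3000",
}
print(check_langfuse_env(example_env))
```

Calling this once at process start turns a silent misconfiguration into a loud failure before any trace is sent.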

In the GL stack

builddaily.io

Chat-bridge traces (slice 1). Every chat call writes a trace with: query embedding span → retrieval span → rerank span → answer-synthesis generation. When latency spikes or quality drops, the trace shows where.
Drafter eval persistence (slice 2). Every post_writer compile run + every draft generation writes a trace + score. The "structural rubric" and "voice match" metrics from the eval plan land here as charted scores over time.
Fine-tune A/B (slice 4). Two project tags, model:vanilla-8b and model:fine-tuned-8b. Run the same eval set against both; compare aggregate scores in one Langfuse chart. The flywheel either earns its keep on that chart or it doesn't.
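
The A/B comparison in slice 4 reduces to "same eval set, two variants, one aggregate per tag." A minimal sketch of that shape in plain Python follows. The eval items, the exact_match metric, and the two stand-in model functions are all hypothetical placeholders. In the real setup each run would call a model and the scores would be written to Langfuse rather than averaged in memory.

```python
from __future__ import annotations

from collections import defaultdict
from statistics import mean
from typing import Callable, Iterable

# Hypothetical eval set: input -> expected output.
ANSWERS = {"2+2": "4", "3+3": "6", "5+5": "10"}
EVAL_SET = list(ANSWERS.items())


def exact_match(output: str, expected: str) -> float:
    """Placeholder metric; a real rubric would be richer than this."""
    return 1.0 if output.strip() == expected else 0.0


def replay(
    variants: dict[str, Callable[[str], str]],
    items: Iterable[tuple[str, str]],
) -> dict[str, float]:
    """Run every variant over the same items; return the mean score per tag."""
    scores: dict[str, list[float]] = defaultdict(list)
    for prompt, expected in items:
        for tag, run in variants.items():
            scores[tag].append(exact_match(run(prompt), expected))
    return {tag: mean(values) for tag, values in scores.items()}


# Stand-ins for the two models, keyed by the tags used above.
variants = {
    "model:vanilla-8b": lambda prompt: "4",  # always answers "4"
    "model:fine-tuned-8b": lambda prompt: ANSWERS[prompt],
}

result = replay(variants, EVAL_SET)
for tag, value in result.items():
    print(f"{tag}: {value:.2f}")
```

Keeping the metric and the item list fixed across both variants is what makes the two aggregates comparable.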

paiddaily.io

Pendle classifier traces. Every catalyst classification call traces with the input announcement, the classified output, and the human-applied correction if any. The corrections become eval items.
Aerodrome pool risk explainer. Same shape — traces capture which signals fed which explanation; scores capture whether the explanation matched a held-out human read.

sagedaily.io

Per-reading traces. Every Oracle reading writes a trace with: deck draw → chart compute → DSPy generation. User feedback (thumbs / followup question) becomes a score.
Behavior-node mirror. Qualitative corrections still land in Neo4j Behavior nodes (canonical state); Langfuse holds the quantitative trace. Two systems, two layers, no duplication.

Gotchas

Self-host needs ~2GB RAM at idle. Postgres + ClickHouse + web + worker. Fine on a development laptop; budget the resources for a long-running daemon.
Trace volume scales with chain depth. A single agent call that fans out into 10 sub-calls writes 10 generations. Set sampling rules early if volume becomes a cost.
Scores are append-only. Re-scoring a trace creates a new score record; the old one stays. Filter on the latest by name when querying.
DSPy native integration is recent and minor versions drift. Pin both langfuse and dspy versions; budget a re-pin per upgrade.
The cloud tier is where pricing applies. Don't accidentally configure the SDK with the cloud host URL. Set the host explicitly and point it at http://localhost:3000 (or your tunnel) for the self-hosted instance.
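
The append-only gotcha above means any query has to pick the newest record per score name itself. A minimal sketch of that filter, assuming score rows have already been exported into Python objects. ScoreRecord and its created_at field are hypothetical names for illustration, not the SDK's types.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ScoreRecord:
    trace_id: str
    name: str
    value: float
    created_at: datetime  # hypothetical timestamp field


def latest_scores(records):
    """Keep only the newest record for each (trace_id, name) pair."""
    latest = {}
    for record in records:
        key = (record.trace_id, record.name)
        current = latest.get(key)
        if current is None or record.created_at > current.created_at:
            latest[key] = record
    return latest


records = [
    ScoreRecord("t-1", "voice_match", 0.4, datetime(2026, 5, 1)),
    ScoreRecord("t-1", "voice_match", 0.7, datetime(2026, 5, 3)),  # re-score
    ScoreRecord("t-1", "structural_rubric", 0.9, datetime(2026, 5, 2)),
]
current = latest_scores(records)
print(current[("t-1", "voice_match")].value)  # 0.7
```

The older 0.4 record stays in the input list untouched, which mirrors the append-only behavior: nothing is overwritten, the query just ignores it.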

Risks

Single-vendor open-source. Langfuse GmbH owns the project. The open-source license means you keep the data and the binary even if they pivot: the data sits in databases you run (Postgres, plus ClickHouse for traces), and you own them.
Trace storage growth. ClickHouse compresses well but a long-running production trace stream needs occasional pruning. Not a Day-1 concern.
Observability is not eval. Langfuse stores scores; it doesn't compute them. The metric function still has to be honest. Bad metric + clean dashboard = dressed-up vibes.

Alternatives · 5 substitutes
Pick Langfuse unless one of these wins on your specific brief.

01
LangSmith · LangChain Inc.
Hosted, closed-source observability tied to LangChain.
Wins when: the project is already deep into LangChain. Trades open-source + self-host for tighter LangChain integration. Paid; not the GL path.
02
Phoenix · Arize AI
Open-source LLM tracing + eval; OpenTelemetry-native.
Wins when: you're already running OpenTelemetry and want LLM traces in the same surface. Different mental model: Phoenix leans into OTel conventions; Langfuse stands alone.
03
Helicone
Proxy-based LLM observability — point your SDK base URL at it.
Wins when: you want zero-code observability, where the only change is the API base URL. Trades trace depth (no nested spans) for setup simplicity. Mostly hosted.
04
OpenLLMetry
OpenTelemetry semantic conventions for LLM calls.
Wins when: you want LLM telemetry in your existing OTel-native observability stack (Honeycomb, Datadog, Grafana). A library, not a backend; you still need a destination.
05
Hand-rolled JSON logs
Structured log lines + grep + a notebook for analysis.
Wins when: volume is low, queries are rare, no review UI needed. Graduates to Langfuse when you find yourself writing the same notebook against the same JSON for the third time.

DSPy — every compile and every module call lands in Langfuse with the native callback.
LlamaIndex — retrieval / rerank / synthesis spans nest under the parent trace via the LlamaIndex instrumentor.
Neo4j — qualitative state (Behavior nodes, Topic nodes) lives in Neo4j; quantitative traces live in Langfuse. Two systems, two jobs.