Build Daily

Tinley Park · May 29, 2026
tool · Langfuse GmbH · watching

Langfuse

Open-source LLM observability — traces every prompt, retrieval, and answer through a single SDK; aggregates scores from automated and human evals; self-hostable via Docker. The dashboard answer to "is this DSPy compile actually getting better over time?" Free self-hosted; paid cloud tier (skip).

Updated May 24, 2026

Langfuse is the trace + eval dashboard layer that turns "did this drafter improve?" from a vibes-question into a chart. Self-hostable in Docker; the SDK wraps every LLM call so you see the full chain. This page is the orient-and-set-up surface — official docs at langfuse.com/docs own the SDK contract.

What it is

An LLM-application observability platform. Open source (MIT) for the self-hosted version; commercial cloud + enterprise tiers exist (skip those). Three core surfaces:

  • Traces — every LLM call gets logged with input, output, model, latency, token cost. Multi-step chains nest naturally (a DSPy module that calls retrieval then synthesis shows up as a tree).
  • Scores — attach quality metrics to traces. Automated scores (rule-based, LLM-as-judge, structural rubrics) and human scores (review queue UI) both go to the same place. Filter and chart by version, by date, by metric name.
  • Datasets — store eval sets (input → expected output) and replay them against any version of a prompt or module; compare scores across runs.

The pitch: when "non-determinism is the regression suite," you need a queryable record of every run. Langfuse is the record.
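The Datasets surface boils down to replaying a fixed list of examples against each version and keeping the scores side by side. A minimal, SDK-free sketch of that loop follows. The names replay, exact_match, drafter_v1, and drafter_v2 are hypothetical and not part of the Langfuse API, and the eval set is made up for illustration; a real run would persist each score to Langfuse rather than hold it in a dict.

```python
from statistics import mean
from typing import Callable

# Hypothetical eval set: (input, expected_output) pairs, as a dataset would hold.
EVAL_SET = [
    ("2+2", "4"),
    ("capital of France", "Paris"),
    ("3*3", "9"),
]

def exact_match(output: str, expected: str) -> float:
    """Toy metric: 1.0 on an exact string match, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def replay(module: Callable[[str], str], dataset, metric) -> float:
    """Run every dataset item through `module` and return the mean score."""
    return mean(metric(module(inp), expected) for inp, expected in dataset)

# Two stand-in "versions" of a module (assumptions, not real drafters).
def drafter_v1(prompt: str) -> str:
    return {"2+2": "4"}.get(prompt, "unknown")

def drafter_v2(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "unknown")

# Same dataset, two versions, one comparable number per version.
scores = {
    "v1": replay(drafter_v1, EVAL_SET, exact_match),
    "v2": replay(drafter_v2, EVAL_SET, exact_match),
}
```

The point of the shape: the dataset and the metric stay fixed while the module changes, so any movement in the number is attributable to the version.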

When to use it

Reach for it when:

  • You're shipping more than one DSPy compile or fine-tune iteration and need to see whether each new version is better than the last.
  • You have an eval set and want the scores from re-running it persisted, charted, queryable — not just printed to stdout.
  • The agent layer makes multi-step calls (retrieval → rerank → synthesize) and you want the chain visible, not flattened into one log line per call.
  • You want a human review queue for selective draft scoring without standing up your own UI.

Skip it when:

  • The project is small enough that printing JSON to a log file covers the need.
  • You don't have an eval set yet — Langfuse without scores is just a fancy logger.
  • A single existing observability stack (Honeycomb, Datadog, etc.) already handles LLM traces and the marginal new surface isn't worth it.

At a glance

Core concepts

  • Trace — one user request from start to finish. Contains generations, spans, retrievals as nested children.
  • Generation — a single LLM call. Inputs, outputs, model, parameters, token counts, latency, cost.
  • Span — a non-LLM step (retrieval, rerank, tool call, postprocessor).
  • Score(trace_id, name, value, comment). The unit of "did this work."
  • Dataset — a versioned list of (input, expected_output) examples. Replayable.
  • Prompt — versioned prompts stored in Langfuse; production code fetches by name. Optional; DSPy compiles supersede this for compiled modules.
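To make the nesting concrete, here is a small local model of a trace as a tree of spans and generations. This is a sketch for orientation only: Observation, Trace, walk, and slowest are hypothetical names, not Langfuse SDK types, and the latencies are invented.

```python
from dataclasses import dataclass, field

@dataclass
class Observation:
    """One node in a trace: a span (non-LLM step) or a generation (LLM call)."""
    name: str
    kind: str  # "span" or "generation"
    latency_ms: float
    children: list["Observation"] = field(default_factory=list)

@dataclass
class Trace:
    """One user request; its observations nest as a tree."""
    trace_id: str
    observations: list[Observation] = field(default_factory=list)

def walk(nodes):
    """Yield every observation in the tree, depth-first."""
    for node in nodes:
        yield node
        yield from walk(node.children)

def slowest(trace: Trace) -> Observation:
    """Return the single slowest observation anywhere in the trace."""
    return max(walk(trace.observations), key=lambda o: o.latency_ms)

# Illustrative trace: retrieval (with a nested rerank) followed by synthesis.
trace = Trace(
    trace_id="t-1",
    observations=[
        Observation("retrieval", "span", 120.0, children=[
            Observation("rerank", "span", 45.0),
        ]),
        Observation("synthesis", "generation", 900.0),
    ],
)
```

Calling slowest(trace) on this example returns the synthesis generation, which is the kind of question the nested view answers when latency spikes.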

Distribution

  • Self-hosted (free, MIT) — Docker Compose for a full stack (Postgres + ClickHouse + web + worker). GL default.
  • Langfuse Cloud — managed, paid by ingest volume. Free tier exists but defeats the "stay local" benefit. Skip.
  • SDKs — Python and TypeScript, plus framework integrations (DSPy, LlamaIndex, LangChain, OpenAI SDK, Anthropic SDK).

How to integrate

Default integration order:

  1. Self-host. git clone https://github.com/langfuse/langfuse && cd langfuse && docker compose up -d. Defaults to localhost:3000. Create a project; grab the LANGFUSE_PUBLIC_KEY + LANGFUSE_SECRET_KEY.
  2. Persist behind the existing infra surface. Don't expose :3000 publicly; reverse-proxy through the same Cloudflare Tunnel pattern the chat-bridge uses, gated by a token.
  3. Wire to DSPy. Native integration: dspy.configure(callbacks=[LangfuseCallback()]). Every compile run and every module call writes a trace + generation node automatically.
  4. Wire to LlamaIndex. from langfuse.llama_index import LlamaIndexInstrumentor, then LlamaIndexInstrumentor().start(). Every retrieval / rerank / synthesis step nests under the parent trace.
  5. Add scores. Each draft / chat answer goes through your eval metric; call langfuse.score(trace_id, name="post_system_rubric", value=...). Same call for LLM-as-judge or human review.
  6. Datasets. Create the dataset once with langfuse.create_dataset(name=...), then upload each example via langfuse.create_dataset_item(...). Re-run any version of a module against the dataset; scores accumulate.
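Steps 1 and 2 leave you with a host URL and two keys. One way to hand them to the SDK is through environment variables, with a guard that refuses a cloud host. This is a sketch under assumptions: langfuse_env is a hypothetical helper, LANGFUSE_HOST is assumed to be the variable your SDK version reads for the host (check the official docs), and the keys shown are placeholders.

```python
import os
from urllib.parse import urlparse

def langfuse_env(host: str, public_key: str, secret_key: str) -> dict[str, str]:
    """Build the environment variables for a self-hosted instance.

    Raises ValueError if the host is on a langfuse.com domain, which would
    send traces to the managed cloud instead of the local stack.
    """
    hostname = urlparse(host).hostname or ""
    if hostname == "langfuse.com" or hostname.endswith(".langfuse.com"):
        raise ValueError(f"refusing cloud host: {host}")
    if not public_key or not secret_key:
        raise ValueError("both keys are required")
    return {
        "LANGFUSE_HOST": host,  # assumed variable name; verify for your SDK version
        "LANGFUSE_PUBLIC_KEY": public_key,
        "LANGFUSE_SECRET_KEY": secret_key,
    }

# Placeholder keys for illustration; use the ones from your project settings.
env = langfuse_env("http://localhost:3000", "pk-lf-example", "sk-lf-example")
os.environ.update(env)
```

Putting the guard in one helper means a stray cloud URL fails loudly at startup rather than quietly shipping traces off the machine.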

In the GL stack

builddaily.io

  • Chat-bridge traces (slice 1). Every chat call writes a trace with: query embedding span → retrieval span → rerank span → answer-synthesis generation. When latency spikes or quality drops, the trace shows where.
  • Drafter eval persistence (slice 2). Every post_writer compile run + every draft generation writes a trace + score. The "structural rubric" and "voice match" metrics from the eval plan land here as charted scores over time.
  • Fine-tune A/B (slice 4). Two project tags, model:vanilla-8b and model:fine-tuned-8b. Run the same eval set against both; compare aggregate scores in one Langfuse chart. The flywheel either earns its keep on that chart or it doesn't.
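The fine-tune A/B comparison reduces to a mean per tag on the same metric. A rough sketch of that aggregation, assuming the scores have been exported as plain rows; mean_by_tag is a hypothetical helper and the values are made up for illustration, not results.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical exported score rows: (model tag, metric name, value).
rows = [
    ("model:vanilla-8b", "structural_rubric", 0.60),
    ("model:vanilla-8b", "structural_rubric", 0.70),
    ("model:fine-tuned-8b", "structural_rubric", 0.80),
    ("model:fine-tuned-8b", "structural_rubric", 0.75),
]

def mean_by_tag(rows, metric):
    """Group rows for one metric by model tag and return the mean per tag."""
    buckets = defaultdict(list)
    for tag, name, value in rows:
        if name == metric:
            buckets[tag].append(value)
    return {tag: mean(values) for tag, values in buckets.items()}

means = mean_by_tag(rows, "structural_rubric")
# Positive delta means the fine-tuned model scored higher on this metric.
delta = means["model:fine-tuned-8b"] - means["model:vanilla-8b"]
```

The comparison is only meaningful when both tags ran the same eval set with the same metric; otherwise the delta mixes model change with data change.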

paiddaily.io

  • Pendle classifier traces. Every catalyst classification call traces with the input announcement, the classified output, and the human-applied correction if any. The corrections become eval items.
  • Aerodrome pool risk explainer. Same shape — traces capture which signals fed which explanation; scores capture whether the explanation matched a held-out human read.
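The "corrections become eval items" step could look roughly like this. The function name, the dict layout, and the labels are assumptions for illustration; the only idea taken from the text above is that a human correction turns a traced call into an (input, expected output) pair.

```python
from typing import Optional

def correction_to_eval_item(announcement: str, classified: str,
                            correction: Optional[str]) -> Optional[dict]:
    """Turn a human-corrected classification into an eval item.

    Returns None when there was no correction or the human agreed with
    the model, since those calls add no new expected output.
    """
    if correction is None or correction == classified:
        return None
    return {
        "input": {"announcement": announcement},
        "expected_output": {"label": correction},
        "metadata": {"model_output": classified},
    }

# Hypothetical labels; the real classifier's label set is not shown here.
item = correction_to_eval_item(
    "Example announcement text", classified="neutral", correction="catalyst"
)
```

Keeping the original model output in metadata preserves the disagreement, which is useful when auditing why an item entered the eval set.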

sagedaily.io

  • Per-reading traces. Every Oracle reading writes a trace with: deck draw → chart compute → DSPy generation. User feedback (thumbs / followup question) becomes a score.
  • Behavior-node mirror. Qualitative corrections still land in Neo4j Behavior nodes (canonical state); Langfuse holds the quantitative trace. Two systems, two layers, no duplication.

Gotchas

  • Self-host needs ~2GB RAM at idle. Postgres + ClickHouse + web + worker. Fine on a development laptop; budget the resources for a long-running daemon.
  • Trace volume scales with chain depth. A single agent call that fans out into 10 sub-calls writes 10 generations. Set sampling rules early if volume becomes a cost.
  • Scores are append-only. Re-scoring a trace creates a new score record; the old one stays. Filter on the latest by name when querying.
  • DSPy native integration is recent and minor versions drift. Pin both langfuse and dspy versions; budget a re-pin per upgrade.
  • The cloud tier is the paid tier. Don't accidentally configure the SDK with the cloud host URL; point it at http://localhost:3000 (or your tunnel) for the self-hosted instance.
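Because scores are append-only, any query that wants "the current score" has to pick the newest record per trace and metric name. A minimal sketch of that filter over exported rows; latest_scores and the row layout are assumptions, not the shape the Langfuse API returns.

```python
from datetime import datetime

def latest_scores(scores):
    """Keep only the newest score per (trace_id, name) pair."""
    latest = {}
    for s in scores:
        key = (s["trace_id"], s["name"])
        if key not in latest or s["timestamp"] > latest[key]["timestamp"]:
            latest[key] = s
    return latest

# Illustrative rows: trace t-1 was re-scored, so it has two records.
rows = [
    {"trace_id": "t-1", "name": "post_system_rubric", "value": 0.4,
     "timestamp": datetime(2026, 5, 1, 9, 0)},
    {"trace_id": "t-1", "name": "post_system_rubric", "value": 0.9,
     "timestamp": datetime(2026, 5, 2, 9, 0)},
    {"trace_id": "t-2", "name": "post_system_rubric", "value": 0.7,
     "timestamp": datetime(2026, 5, 1, 9, 0)},
]

current = latest_scores(rows)
```

Charting without this filter counts the stale record alongside the new one, which drags averages toward the old value.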

Risks

  • Single-vendor open-source. Langfuse GmbH owns the project. The open-source license means you keep the data and the binary even if they pivot: the Postgres and ClickHouse stores run on your own host, so you own them.
  • Trace storage growth. ClickHouse compresses well but a long-running production trace stream needs occasional pruning. Not a Day-1 concern.
  • Observability is not eval. Langfuse stores scores; it doesn't compute them. The metric function still has to be honest. Bad metric + clean dashboard = dressed-up vibes.

Related

  • DSPy — every compile and every module call lands in Langfuse with the native callback.
  • LlamaIndex — retrieval / rerank / synthesis spans nest under the parent trace via the LlamaIndex instrumentor.
  • Neo4j — qualitative state (Behavior nodes, Topic nodes) lives in Neo4j; quantitative traces live in Langfuse. Two systems, two jobs.