Langfuse
Open-source LLM observability — traces every prompt, retrieval, and answer through a single SDK; aggregates scores from automated and human evals; self-hostable via Docker. The dashboard answer to "is this DSPy compile actually getting better over time?" Free self-hosted; paid cloud tier (skip).
Langfuse is the trace + eval dashboard layer that turns "did this drafter improve?" from a vibes-question into a chart. Self-hostable in Docker; the SDK wraps every LLM call so you see the full chain. This page is the orient-and-set-up surface — official docs at langfuse.com/docs own the SDK contract.
What it is
An LLM-application observability platform. Open source (MIT) for the self-hosted version; commercial cloud + enterprise tiers exist (skip those). Three core surfaces:
- Traces — every LLM call gets logged with input, output, model, latency, token cost. Multi-step chains nest naturally (a DSPy module that calls retrieval then synthesis shows up as a tree).
- Scores — attach quality metrics to traces. Automated scores (rule-based, LLM-as-judge, structural rubrics) and human scores (review queue UI) both go to the same place. Filter and chart by version, by date, by metric name.
- Datasets — store eval sets (input → expected output) and replay them against any version of a prompt or module; compare scores across runs.
The pitch: when "non-determinism is the regression suite," you need a queryable record of every run. Langfuse is the record.
When to use it
Reach for it when:
- You're shipping more than one DSPy compile or fine-tune iteration and need to see whether each new version is better than the last.
- You have an eval set and want the scores from re-running it persisted, charted, queryable — not just printed to stdout.
- The agent layer makes multi-step calls (retrieval → rerank → synthesize) and you want the chain visible, not flattened into one log line per call.
- You want a human review queue for selective draft scoring without standing up your own UI.
Skip it when:
- The project is small enough that printing JSON to a log file covers the need.
- You don't have an eval set yet — Langfuse without scores is just a fancy logger.
- A single existing observability stack (Honeycomb, Datadog, etc.) already handles LLM traces and the marginal new surface isn't worth it.
At a glance
Core concepts
- Trace — one user request from start to finish. Contains generations, spans, retrievals as nested children.
- Generation — a single LLM call. Inputs, outputs, model, parameters, token counts, latency, cost.
- Span — a non-LLM step (retrieval, rerank, tool call, postprocessor).
- Score — `(trace_id, name, value, comment)`. The unit of "did this work."
- Dataset — a versioned list of `(input, expected_output)` examples. Replayable.
- Prompt — versioned prompts stored in Langfuse; production code fetches by name. Optional; DSPy compiles supersede this for compiled modules.
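To make the nesting concrete, here is a minimal sketch of the concepts above as plain Python dataclasses. The class and field names (`Observation`, `kind`, `children`, and so on) are illustrative assumptions, not Langfuse's actual schema or SDK types.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class Observation:
    """One node in the trace tree: an LLM call ("generation") or a non-LLM step ("span")."""
    kind: str                      # "generation" or "span"
    name: str
    input: Any = None
    output: Any = None
    model: Optional[str] = None    # only meaningful for generations
    children: List["Observation"] = field(default_factory=list)


@dataclass
class Score:
    """The (trace_id, name, value, comment) tuple described above."""
    trace_id: str
    name: str
    value: float
    comment: str = ""


@dataclass
class Trace:
    """One user request; holds the observation tree and any attached scores."""
    id: str
    root: Observation
    scores: List[Score] = field(default_factory=list)


def count_generations(node: Observation) -> int:
    """Count LLM calls in a subtree, e.g. to estimate write volume per request."""
    own = 1 if node.kind == "generation" else 0
    return own + sum(count_generations(child) for child in node.children)


# A retrieval -> rerank -> synthesize chain nested under one request.
chain = Observation(
    kind="span",
    name="chat_request",
    children=[
        Observation(kind="span", name="retrieval"),
        Observation(kind="span", name="rerank"),
        Observation(kind="generation", name="synthesize", model="example-model"),
    ],
)
trace = Trace(id="trace-001", root=chain)
trace.scores.append(Score("trace-001", "structural_rubric", 0.8, "missing summary"))
```

The point of the tree shape is that a score attaches to the whole trace, while latency and cost live on the individual nodes beneath it.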
Distribution
- Self-hosted (free, MIT) — Docker Compose for a full stack (Postgres + ClickHouse + Redis + MinIO blob storage + web + worker). GL default.
- Langfuse Cloud — managed, paid by ingest volume. Free tier exists but defeats the "stay local" benefit. Skip.
- SDKs — Python and TypeScript, plus framework integrations (DSPy, LlamaIndex, LangChain, OpenAI SDK, Anthropic SDK).
How to integrate
Default integration order:
- Self-host. `git clone https://github.com/langfuse/langfuse && cd langfuse && docker compose up -d`. Defaults to `localhost:3000`. Create a project; grab the `LANGFUSE_PUBLIC_KEY` + `LANGFUSE_SECRET_KEY`.
- Persist behind the existing infra surface. Don't expose `:3000` publicly; reverse-proxy through the same Cloudflare Tunnel pattern the chat-bridge uses, gated by a token.
- Wire to DSPy. Native integration: `dspy.configure(callbacks=[LangfuseCallback()])`. Every compile run and every module call writes a trace + generation node automatically.
- Wire to LlamaIndex. `from langfuse.llama_index import LlamaIndexInstrumentor`, then `LlamaIndexInstrumentor().start()`. Every retrieval / rerank / synthesis step nests under the parent trace.
- Add scores. Each draft / chat answer goes through your eval metric; call `langfuse.score(trace_id, name="post_system_rubric", value=...)`. Same call for LLM-as-judge or human review.
- Datasets. Upload the eval set once via `langfuse.create_dataset_item(...)`. Re-run any version of a module against the dataset; scores accumulate.
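The last two steps can be sketched as a pair of small helpers. This is a sketch under assumptions: the client is passed in rather than imported so the snippet runs without a server, `score(...)` follows the call shown on this page, and `create_dataset_item(dataset_name=..., input=..., expected_output=...)` is the Python SDK method as best understood here. Both names drift between SDK versions, so check them against the pinned release. `RecordingClient` and `length_rubric` are hypothetical stand-ins.

```python
from typing import Any, Callable, Iterable, Tuple


def score_draft(client: Any, trace_id: str, draft: str,
                metric: Callable[[str], float],
                name: str = "post_system_rubric") -> float:
    """Run the metric on a draft and persist the result as a score on its trace."""
    value = float(metric(draft))
    # Assumption: `client.score(...)` as shown on this page; newer SDKs may rename it.
    client.score(trace_id=trace_id, name=name, value=value)
    return value


def upload_eval_set(client: Any, dataset_name: str,
                    examples: Iterable[Tuple[Any, Any]]) -> int:
    """Upload (input, expected_output) pairs; returns how many items were sent."""
    count = 0
    for item_input, expected in examples:
        # Assumption: the dataset itself already exists on the server.
        client.create_dataset_item(
            dataset_name=dataset_name,
            input=item_input,
            expected_output=expected,
        )
        count += 1
    return count


class RecordingClient:
    """Hypothetical stand-in for the real client: records calls instead of sending them."""

    def __init__(self) -> None:
        self.calls = []

    def score(self, **kwargs: Any) -> None:
        self.calls.append(("score", kwargs))

    def create_dataset_item(self, **kwargs: Any) -> None:
        self.calls.append(("create_dataset_item", kwargs))


def length_rubric(draft: str) -> float:
    """Hypothetical metric: 1.0 if the draft is between 50 and 2000 characters."""
    return 1.0 if 50 <= len(draft) <= 2000 else 0.0
```

In real use the `RecordingClient` would be replaced by the configured Langfuse client; keeping the metric as a plain function makes it testable without any tracing in the loop.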
In the GL stack
builddaily.io
- Chat-bridge traces (slice 1). Every chat call writes a trace with: query embedding span → retrieval span → rerank span → answer-synthesis generation. When latency spikes or quality drops, the trace shows where.
- Drafter eval persistence (slice 2). Every `post_writer` compile run + every draft generation writes a trace + score. The "structural rubric" and "voice match" metrics from the eval plan land here as charted scores over time.
- Fine-tune A/B (slice 4). Two project tags, `model:vanilla-8b` and `model:fine-tuned-8b`. Run the same eval set against both; compare aggregate scores in one Langfuse chart. The flywheel either earns its keep on that chart or it doesn't.
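The A/B comparison in slice 4 boils down to a mean per tag per metric. A minimal sketch of that arithmetic, assuming score rows have already been exported as `(tag, metric, value)` tuples; the row shape and the sample values are made up for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

# (tag, metric name, value) rows; a hypothetical export shape.
ScoreRow = Tuple[str, str, float]


def mean_by_tag(rows: Iterable[ScoreRow]) -> Dict[str, Dict[str, float]]:
    """Group scores by metric, then by tag, and average each group."""
    buckets = defaultdict(lambda: defaultdict(list))
    for tag, metric, value in rows:
        buckets[metric][tag].append(value)
    return {
        metric: {tag: mean(values) for tag, values in tags.items()}
        for metric, tags in buckets.items()
    }


def delta(summary: Dict[str, Dict[str, float]], metric: str,
          baseline: str, candidate: str) -> float:
    """Candidate mean minus baseline mean; positive means the candidate scored higher."""
    return summary[metric][candidate] - summary[metric][baseline]


rows = [
    ("model:vanilla-8b", "voice_match", 0.50),
    ("model:vanilla-8b", "voice_match", 0.75),
    ("model:fine-tuned-8b", "voice_match", 0.75),
    ("model:fine-tuned-8b", "voice_match", 1.00),
]
summary = mean_by_tag(rows)
improvement = delta(summary, "voice_match", "model:vanilla-8b", "model:fine-tuned-8b")
```

A raw difference of means says nothing about variance; with a small eval set, it is worth looking at the per-item scores before trusting the delta.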
paiddaily.io
- Pendle classifier traces. Every catalyst classification call traces with the input announcement, the classified output, and the human-applied correction if any. The corrections become eval items.
- Aerodrome pool risk explainer. Same shape — traces capture which signals fed which explanation; scores capture whether the explanation matched a held-out human read.
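"Corrections become eval items" can be sketched as a small filter-and-reshape step. The record fields used here (`id`, `input`, `output`, `correction`) are hypothetical; the real shape depends on how the classifier traces are exported.

```python
from typing import Any, Dict, List


def corrections_to_eval_items(traces: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Turn human-corrected traces into (input, expected_output) eval items."""
    items = []
    for record in traces:
        correction = record.get("correction")
        if correction is None:
            continue  # no human correction, so there is no trusted label to keep
        items.append({
            "input": record["input"],
            "expected_output": correction,      # the human's answer, not the model's
            "source_trace_id": record["id"],    # keep provenance for later audits
        })
    return items


exported = [
    {"id": "t1", "input": "announcement A", "output": "bullish", "correction": "neutral"},
    {"id": "t2", "input": "announcement B", "output": "bearish", "correction": None},
]
eval_items = corrections_to_eval_items(exported)
```

Only corrected traces are kept, which biases the eval set toward cases the classifier got wrong; that is usually the intent, but it means the resulting score is not an overall accuracy estimate.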
sagedaily.io
- Per-reading traces. Every Oracle reading writes a trace with: deck draw → chart compute → DSPy generation. User feedback (thumbs / followup question) becomes a score.
- Behavior-node mirror. Qualitative corrections still land in Neo4j Behavior nodes (canonical state); Langfuse holds the quantitative trace. Two systems, two layers, no duplication.
Gotchas
- Self-host needs ~2GB RAM at idle. That covers the whole Compose stack (Postgres, ClickHouse, Redis, MinIO, web, worker). Fine on a development laptop; budget the resources for a long-running daemon.
- Trace volume scales with chain depth. A single agent call that fans out into 10 sub-calls writes 10 generations. Set sampling rules early if volume becomes a cost.
- Scores are append-only. Re-scoring a trace creates a new score record; the old one stays. Filter on the latest by name when querying.
- DSPy native integration is recent and minor versions drift. Pin both langfuse and dspy versions; budget a re-pin per upgrade.
- The cloud tier is the paid tier. Don't accidentally configure the SDK with the cloud host URL; point it at `http://localhost:3000` (or your tunnel) for the self-hosted instance.
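Two of the gotchas above lend themselves to small guards. The sketch below assumes score records can be exported with a creation timestamp (the `ScoreRecord` shape is hypothetical) and that Langfuse Cloud hosts live under `langfuse.com`; neither helper calls the SDK.

```python
from datetime import datetime
from typing import Dict, Iterable, NamedTuple, Tuple
from urllib.parse import urlparse


class ScoreRecord(NamedTuple):
    """Hypothetical export shape for one score row."""
    trace_id: str
    name: str
    value: float
    created_at: datetime


def latest_scores(records: Iterable[ScoreRecord]) -> Dict[Tuple[str, str], ScoreRecord]:
    """Keep only the newest record per (trace_id, name), since re-scoring appends."""
    latest: Dict[Tuple[str, str], ScoreRecord] = {}
    for record in records:
        key = (record.trace_id, record.name)
        if key not in latest or record.created_at > latest[key].created_at:
            latest[key] = record
    return latest


def is_self_hosted(host: str) -> bool:
    """False if the URL points at a langfuse.com domain, True otherwise.

    Expects a full URL with a scheme, e.g. "http://localhost:3000".
    """
    hostname = (urlparse(host).hostname or "").lower()
    return not (hostname == "langfuse.com" or hostname.endswith(".langfuse.com"))
```

A reasonable place for the host check is application startup, run against whatever value is passed to the SDK (commonly the `LANGFUSE_HOST` environment variable), failing fast if it resolves to a cloud domain.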
Risks
- Single-vendor open-source. Langfuse GmbH owns the project. The open-source license means you keep the data and the binary even if they pivot: everything is stored in databases you run yourself (Postgres for project metadata, ClickHouse for traces and scores), so you own it.
- Trace storage growth. ClickHouse compresses well but a long-running production trace stream needs occasional pruning. Not a Day-1 concern.
- Observability is not eval. Langfuse stores scores; it doesn't compute them. The metric function still has to be honest. Bad metric + clean dashboard = dressed-up vibes.
Related
- DSPy — every compile and every module call lands in Langfuse with the native callback.
- LlamaIndex — retrieval / rerank / synthesis spans nest under the parent trace via the LlamaIndex instrumentor.
- Neo4j — qualitative state (Behavior nodes, Topic nodes) lives in Neo4j; quantitative traces live in Langfuse. Two systems, two jobs.
