Measuring hallucination rates in production systems #12111

terrywerk · 2026-03-08T16:47:31Z

terrywerk
Mar 8, 2026

We've been experimenting with stress testing LLM systems
for hallucinations and prompt injection.

Curious how people here measure hallucination rates
in production systems?

Thanks!
Terry

Empreiteiro · 2026-03-09T17:00:04Z

Empreiteiro
Mar 9, 2026
Maintainer

@terrywerk One option for this kind of monitoring is Langfuse's LLM-as-a-judge evaluation: https://langfuse.com/docs/evaluation/evaluation-methods/llm-as-a-judge

For prompt injection, Langflow 1.8 already has a native Guardrail Component that can help block prompt injection attempts.

3 replies

terrywerk Mar 10, 2026
Author

Thanks — Langfuse looks solid for tracing and observability.

One thing I’ve been trying to understand better is how teams move from observability to actual reliability signals. Traces are great for debugging individual failures, but in production it still seems hard to answer questions like:

- how often are answers unsupported by retrieved documents
- how often retrieval misses the correct document entirely
- how often the model should abstain but doesn’t

Curious if teams using Langfuse typically layer additional evaluation pipelines on top of traces, or if people are mostly doing manual review of samples.

Still researching how different teams handle this in production.
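The second question in the list above (how often retrieval misses the correct document) can be measured offline if each eval question is labeled with the id of the document that should be retrieved. A minimal sketch under that assumption; the function name, the ids, and the choice of top-k are all hypothetical and not taken from Langfuse or Langflow:

```python
from typing import Dict, List


def retrieval_miss_rate(
    gold_doc_ids: Dict[str, str],
    retrieved_doc_ids: Dict[str, List[str]],
    k: int = 5,
) -> float:
    """Fraction of questions whose labeled document is absent from the top-k results."""
    if not gold_doc_ids:
        raise ValueError("need at least one labeled question")
    misses = 0
    for question_id, gold_id in gold_doc_ids.items():
        top_k = retrieved_doc_ids.get(question_id, [])[:k]
        if gold_id not in top_k:
            misses += 1
    return misses / len(gold_doc_ids)


# Illustrative data only: question id -> id of the document that should be retrieved.
gold = {"q1": "doc-a", "q2": "doc-b", "q3": "doc-c", "q4": "doc-d"}
# Illustrative data only: question id -> ranked ids returned by the retriever.
retrieved = {
    "q1": ["doc-a", "doc-x"],
    "q2": ["doc-x", "doc-b"],
    "q3": ["doc-x", "doc-y"],
    "q4": ["doc-y", "doc-z", "doc-d"],
}

miss_rate_at_2 = retrieval_miss_rate(gold, retrieved, k=2)
miss_rate_at_3 = retrieval_miss_rate(gold, retrieved, k=3)
print(miss_rate_at_2, miss_rate_at_3)
```

Reporting the rate at more than one k helps separate "the document was never found" from "the document was found but ranked too low to reach the prompt".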

Empreiteiro Mar 10, 2026
Maintainer

@terrywerk, great questions!

There isn't a single answer, but the best way to find out is to build a dataset for evaluating your agent.

The dataset should contain examples of the questions the agent will receive, along with the expected answers (an answer key).

With these question-answer pairs, you can use an evaluation system (like Langfuse) to compute evaluation metrics for the agent.

With these metrics, you will be able to:

- Adjust the prompt
- Adjust the context
- Adjust the RAG setup (chunk size, number of chunks)

https://langfuse.com/docs/evaluation/overview
https://developers.openai.com/cookbook/examples/agents_sdk/evaluate_agents/
https://docs.langchain.com/langsmith/evaluation-quickstart
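
To make the dataset idea concrete, here is a rough sketch of scoring an agent against an answer key. It does not use the Langfuse or LangSmith APIs; `EvalExample`, `exact_match_accuracy`, and the stub agent are hypothetical names for illustration. Exact match is a deliberately crude scorer, and in practice it would likely be swapped for an LLM judge or a similarity measure.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalExample:
    question: str
    expected_answer: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not count."""
    return " ".join(text.lower().split())


def exact_match_accuracy(
    examples: List[EvalExample], agent: Callable[[str], str]
) -> float:
    """Share of examples where the agent's answer matches the answer key."""
    if not examples:
        raise ValueError("the eval dataset is empty")
    correct = sum(
        normalize(agent(ex.question)) == normalize(ex.expected_answer)
        for ex in examples
    )
    return correct / len(examples)


# Illustrative dataset and a stub agent standing in for the real flow.
dataset = [
    EvalExample("What is the refund window?", "30 days"),
    EvalExample("Which plan includes SSO?", "Enterprise"),
    EvalExample("What is the API rate limit?", "100 requests per minute"),
]
canned_answers = {
    "What is the refund window?": "30 Days",
    "Which plan includes SSO?": "enterprise",
    "What is the API rate limit?": "1000 requests per minute",
}
accuracy = exact_match_accuracy(dataset, lambda q: canned_answers[q])
print(f"accuracy: {accuracy:.2f}")
```

Re-running the same dataset after each prompt, context, or chunking change gives a like-for-like comparison.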

terrywerk Mar 16, 2026
Author

Totally agree, having a representative eval dataset with question/answer pairs is foundational. Without that, it’s hard to tell whether changes to prompts, retrieval, or chunking are actually improving behavior or just shifting failure modes.

One thing we’ve run into in production is that accuracy scores alone don’t explain why an agent failed. It often helps to layer in behavioral checks (e.g., grounding, retrieval coverage, abstention) so regressions show up immediately when prompts or models change.
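
One way to make regressions show up immediately is a gate that compares per-check failure rates against a stored baseline. This is a generic sketch, not the Veritell CLI; the metric names, the rates, and the tolerance are made-up values for illustration.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.02,
) -> List[str]:
    """Names of failure-rate metrics that rose by more than `tolerance` versus baseline."""
    regressions = []
    for name, base_rate in baseline.items():
        new_rate = current.get(name)
        if new_rate is None:
            # A metric that silently disappeared is treated as a regression.
            regressions.append(name)
        elif new_rate - base_rate > tolerance:
            regressions.append(name)
    return sorted(regressions)


# Illustrative rates only.
baseline_rates = {
    "unsupported_claim_rate": 0.04,
    "retrieval_miss_rate": 0.10,
    "abstention_miss_rate": 0.06,
}
current_rates = {
    "unsupported_claim_rate": 0.05,
    "retrieval_miss_rate": 0.18,
    "abstention_miss_rate": 0.06,
}

regressed = find_regressions(baseline_rates, current_rates)
print(regressed)
# In CI, a non-empty list would typically fail the job, e.g. via sys.exit(1).
```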

I’ve been working on Veritell CLI to turn those kinds of checks into repeatable tests that run alongside an eval dataset; you define assertions once and run veritell test before deployment or in CI.

The resources you linked are great starting points for building the dataset itself.
Examples + access for the testing side: https://veritell.ai/

Curious how large your eval sets typically are for real workloads.

aniruddhaadak80 · 2026-03-09T22:13:44Z

aniruddhaadak80
Mar 9, 2026

From my point of view, hallucination rate is too broad to be useful unless you break it into narrower failure classes.

I usually separate unsupported claims, retrieval misses, wrong transformations, and cases where the model should have abstained but did not. Those behave very differently in production, and a single aggregate number tends to hide the real problem.

What has worked best for me is a small human-labeled eval set for calibration, plus online sampling where I log retrieval coverage, citation support, and abstention behavior. That gives a much more stable signal than asking for one global hallucination score.
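
One reading of "calibration" here is checking how well an automated flag agrees with the human labels before trusting it on sampled production traffic. A small sketch under that assumption; the label lists are invented and `precision_recall` is a hypothetical helper:

```python
from typing import List, Tuple


def precision_recall(human: List[bool], automated: List[bool]) -> Tuple[float, float]:
    """Precision and recall of an automated failure flag, using human labels as truth."""
    if len(human) != len(automated):
        raise ValueError("label lists must have the same length")
    true_pos = sum(h and a for h, a in zip(human, automated))
    false_pos = sum((not h) and a for h, a in zip(human, automated))
    false_neg = sum(h and (not a) for h, a in zip(human, automated))
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Illustrative labels: True means "this response contains an unsupported claim".
human_labels = [True, True, False, False, True, False]
automated_flags = [True, False, False, True, True, False]

precision, recall = precision_recall(human_labels, automated_flags)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Running this per failure class (unsupported claims, retrieval misses, abstention) keeps the calibration as granular as the metrics themselves.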

1 reply

terrywerk Mar 10, 2026
Author

This breakdown matches what I’ve been seeing as well.

When people report a single hallucination rate it usually mixes several different failure modes together, and the fixes end up being completely different (retrieval tuning vs prompt changes vs abstention logic).

Out of curiosity — how much of that workflow is automated for you today?

Are you mostly labeling a calibration set and then sampling production traffic, or have you found a good way to automatically detect things like unsupported claims or retrieval misses at scale?

Trying to understand what parts of this are still mostly manual for teams in production systems.

Nyrok · 2026-03-10T12:31:55Z

Nyrok
Mar 10, 2026

Hallucination rate depends a lot on how the prompt is structured. One pattern that helps: separate the "retrieval context" from the "task instructions" with explicit XML-tagged blocks. Models like Claude treat tagged sections differently, so mixing retrieved docs into a prose prompt increases hallucination.

Measuring it in prod: I log the structured prompt + response, then run a lightweight eval prompt asking "did the answer introduce any claim not present in [context]?" Works well as a cheap automated check before human review.
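
A rough sketch of both ideas: building a prompt with separate tagged blocks, and wrapping the "any claim not present in context?" question as an automated check. The tag names and function names are assumptions for illustration, and `judge` is any callable that sends a prompt to a model and returns its text; no specific provider API is implied.

```python
from typing import Callable, List


def build_prompt(instructions: str, documents: List[str], question: str) -> str:
    """Keep task instructions, retrieved context, and the question in separate blocks."""
    docs = "\n".join(
        f'<document index="{i}">\n{doc}\n</document>'
        for i, doc in enumerate(documents, start=1)
    )
    return (
        f"<instructions>\n{instructions}\n</instructions>\n"
        f"<context>\n{docs}\n</context>\n"
        f"<question>\n{question}\n</question>"
    )


def build_judge_prompt(context: str, answer: str) -> str:
    """Eval prompt asking whether the answer goes beyond the supplied context."""
    return (
        "Did the answer introduce any claim not present in the context? "
        "Reply with exactly YES or NO.\n"
        f"<context>\n{context}\n</context>\n"
        f"<answer>\n{answer}\n</answer>"
    )


def is_ungrounded(context: str, answer: str, judge: Callable[[str], str]) -> bool:
    """True when the judge reports at least one unsupported claim."""
    verdict = judge(build_judge_prompt(context, answer)).strip().upper()
    return verdict.startswith("YES")


def stub_judge(prompt: str) -> str:
    # Stand-in for a real model call; always reports an unsupported claim.
    return "YES"


prompt = build_prompt(
    "Answer only from the context.",
    ["Langflow is a visual builder."],
    "What is Langflow?",
)
flagged = is_ungrounded(
    "Langflow is a visual builder.", "Langflow was released in 1999.", stub_judge
)
print(prompt)
print(flagged)
```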

I built flompt.dev for the prompt structuring side of this, a visual builder that decomposes prompts into semantic blocks and compiles to Claude-optimized XML. Keeps the prompt shape consistent across runs. github.com/Nyrok/flompt

0 replies

terrywerk · 2026-03-16T01:20:31Z

terrywerk
Mar 16, 2026
Author

This matches what we’ve seen — prompt structure has a big impact on hallucination behavior, especially when retrieval context and instructions get blended. Explicit separation (XML/sections) tends to reduce unsupported claims and makes downstream evaluation easier.

Your production check (“did the answer introduce claims not present in context?”) is basically a grounding test — we’ve been formalizing that as an automated assertion so it can run consistently in CI, not just post-hoc review.

I’m building Veritell CLI to turn these kinds of checks into repeatable tests (e.g., unsupported claims, retrieval coverage, abstention). You define them once and run veritell test to catch regressions when prompts or models change.

flompt looks interesting for keeping prompt shape stable — that’s often half the battle.
Examples + access: https://veritell.ai/

Happy to compare notes if you’re testing against real RAG workloads.

0 replies

kinthaiofficial · 2026-04-28T23:57:46Z

kinthaiofficial
Apr 28, 2026

Measuring hallucination rates in production is hard because ground truth is expensive. A few practical approaches we use:

1. Self-consistency checking: Run the same prompt through the agent 3 times with temperature > 0. If the factual claims diverge across runs, flag as potential hallucination. Cheap to implement, catches ~60% of factual hallucinations.

2. Source attribution tracking: When an agent makes a claim, require it to cite which context chunk or tool result it's drawing from. If a claim has no attribution, it's either reasoning (fine) or hallucination (not fine). The tricky part is distinguishing the two — we use a lightweight classifier on the claim type.

3. Behavioral drift detection: Track the distribution of agent actions over time using KL divergence. When an agent's behavior distribution shifts significantly from baseline, it often correlates with increased hallucination — the agent is "making stuff up" because it's in unfamiliar territory.

4. Cost-aware verification: Not every output is worth verifying. We prioritize verification for high-stakes outputs (financial data, code that will execute, user-facing claims) and skip it for low-stakes conversational responses. This keeps verification costs at ~5% of inference costs rather than doubling them.
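
A simplified sketch of approaches 1 and 3 above. For self-consistency it compares whole normalized answers, which is only a crude proxy for comparing individual factual claims. For drift it computes KL divergence over action counts with additive smoothing. All names and counts are hypothetical, and `agent` is assumed to sample with temperature > 0.

```python
import math
from collections import Counter
from typing import Callable, Dict


def self_consistency(prompt: str, agent: Callable[[str], str], runs: int = 3) -> float:
    """Share of runs that agree with the most common normalized answer."""
    answers = [" ".join(agent(prompt).lower().split()) for _ in range(runs)]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / runs


def kl_divergence(
    current: Dict[str, int], baseline: Dict[str, int], smoothing: float = 1.0
) -> float:
    """KL(current || baseline) over action counts, smoothed so no probability is zero."""
    actions = set(current) | set(baseline)
    cur_total = sum(current.values()) + smoothing * len(actions)
    base_total = sum(baseline.values()) + smoothing * len(actions)
    kl = 0.0
    for action in actions:
        p = (current.get(action, 0) + smoothing) / cur_total
        q = (baseline.get(action, 0) + smoothing) / base_total
        kl += p * math.log(p / q)
    return kl


# Stub agent that returns canned answers in place of sampled model outputs.
canned = iter(["Paris", "paris ", "Lyon"])
agreement = self_consistency("Capital of France?", lambda _prompt: next(canned), runs=3)

# Illustrative action counts for a baseline window and a later window.
baseline_actions = {"search": 80, "answer": 15, "abstain": 5}
shifted_actions = {"search": 30, "answer": 65, "abstain": 5}
drift_same = kl_divergence(baseline_actions, baseline_actions)
drift_shifted = kl_divergence(shifted_actions, baseline_actions)
print(agreement, drift_same, drift_shifted)
```

Low agreement or a large divergence would only flag a response or time window for closer review; neither number proves a hallucination on its own.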

For multi-agent systems, hallucination can cascade: Agent A hallucinates a fact, Agent B treats it as ground truth. Delegation chains need provenance tracking — every claim should trace back to its original source.
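
A minimal sketch of what provenance tracking across a delegation chain could look like, assuming each claim records either an original source or the upstream claim it was derived from. The `Claim` structure and field names are hypothetical, not taken from any particular framework.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Claim:
    claim_id: str
    text: str
    agent: str
    source: Optional[str] = None        # id of a document or tool result
    derived_from: Optional[str] = None  # claim_id of an upstream claim


def trace_to_source(claim_id: str, claims: Dict[str, Claim]) -> Optional[str]:
    """Follow derived_from links until a claim with an original source is found."""
    seen = set()
    current = claims.get(claim_id)
    while current is not None and current.claim_id not in seen:
        if current.source is not None:
            return current.source
        seen.add(current.claim_id)  # guards against cycles in the chain
        current = claims.get(current.derived_from) if current.derived_from else None
    return None


# Illustrative chain: agent B builds on claims made by agent A.
claims = {
    "c1": Claim("c1", "Invoice 88 is overdue.", agent="A", source="doc-17"),
    "c2": Claim("c2", "Send a reminder for invoice 88.", agent="B", derived_from="c1"),
    "c3": Claim("c3", "The customer has churned.", agent="A"),
    "c4": Claim("c4", "Close the account.", agent="B", derived_from="c3"),
}

grounded_source = trace_to_source("c2", claims)
ungrounded_source = trace_to_source("c4", claims)
print(grounded_source, ungrounded_source)
```

A claim that traces back to `None` is the cascade case described above: a downstream agent is acting on something with no recorded source.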

More on maintaining quality across multi-agent systems: https://blog.kinthai.ai/221-agents-multi-agent-coordination-lessons

0 replies

musaabhasan · 2026-05-08T18:14:35Z

musaabhasan
May 8, 2026

For a Langflow deployment, I would measure hallucination at the flow level rather than only at the final answer level. The same bad answer can come from different places: weak retrieval, an overly broad prompt, a transformation component, a missing guardrail, or a model that ignored evidence.

A useful production trace for each sampled run would include:

- flow id and version
- prompt/template version
- model/provider/version
- retrieved document ids and chunk scores
- final answer with citation map
- guardrail decisions
- tool/component outputs used by the answer
- whether the answer should have abstained
- reviewer labels for failure class
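
The fields listed above could be captured in a record like the following. This is an illustrative schema only, not a Langflow or Langfuse data model; every class and field name is an assumption.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class RetrievedChunk:
    document_id: str
    score: float


@dataclass
class SampledRunTrace:
    flow_id: str
    flow_version: str
    prompt_version: str
    model: str
    retrieved: List[RetrievedChunk]
    final_answer: str
    citation_map: Dict[str, List[str]]  # claim text -> supporting document ids
    guardrail_decisions: Dict[str, str]
    component_outputs: Dict[str, str]
    should_have_abstained: Optional[bool] = None  # filled in by a reviewer
    failure_labels: List[str] = field(default_factory=list)


# Illustrative values only.
trace = SampledRunTrace(
    flow_id="support-bot",
    flow_version="12",
    prompt_version="2026-05-01",
    model="example-provider/example-model",
    retrieved=[RetrievedChunk("doc-17", 0.82), RetrievedChunk("doc-03", 0.41)],
    final_answer="Refunds are available within 30 days.",
    citation_map={"Refunds are available within 30 days.": ["doc-17"]},
    guardrail_decisions={"prompt_injection": "pass"},
    component_outputs={"retriever": "2 chunks"},
    should_have_abstained=False,
    failure_labels=["citation_mismatch"],
)

# asdict() converts nested dataclasses too, so the record serializes directly.
payload = json.dumps(asdict(trace))
restored = json.loads(payload)
print(restored["retrieved"][0])
```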

Then track separate rates:

- Unsupported answer claims: claims not grounded in retrieved context or tool output.
- Citation mismatch: citation exists but does not support the cited claim.
- Retrieval miss: answer failed because the right source was not retrieved.
- Stale context: retrieved source was once valid but no longer current.
- Abstention miss: model answered when confidence or evidence was insufficient.
- Prompt-injection compliance: answer followed an instruction from retrieved content or user input that should have been treated as untrusted.

This makes the metric operational. If retrieval miss rate rises, tune retrieval and source freshness. If unsupported claims rise with stable retrieval, tune prompt structure and grounding checks. If citation mismatch is high, improve citation extraction and answer formatting. One aggregate hallucination rate is useful for trend reporting, but it is too blunt for remediation.
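
As a sketch of tracking the separate rates, the reviewer labels on sampled runs can be counted per failure class. The label strings below are assumed spellings of the classes named above, and the sample data is invented.

```python
from collections import Counter
from typing import Dict, List

FAILURE_CLASSES = [
    "unsupported_claim",
    "citation_mismatch",
    "retrieval_miss",
    "stale_context",
    "abstention_miss",
    "prompt_injection_compliance",
]


def failure_rates(labels_per_run: List[List[str]]) -> Dict[str, float]:
    """Rate of each failure class over all sampled runs; a run may carry several labels."""
    if not labels_per_run:
        raise ValueError("need at least one sampled run")
    # set() ensures a label repeated within one run is counted once for that run.
    counts = Counter(label for labels in labels_per_run for label in set(labels))
    total = len(labels_per_run)
    return {name: counts[name] / total for name in FAILURE_CLASSES}


# Illustrative reviewer labels for five sampled runs; an empty list means no failure.
sampled_runs = [
    [],
    ["retrieval_miss"],
    ["unsupported_claim", "citation_mismatch"],
    [],
    ["retrieval_miss"],
]

rates = failure_rates(sampled_runs)
for name, rate in rates.items():
    print(f"{name}: {rate:.2f}")
```

Because one run can fail in several ways, the per-class rates do not need to sum to the aggregate failure rate.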

0 replies
