Replies: 6 comments 4 replies
-
|
@terrywerk One option for this type of monitoring is Langfuse's LLM-as-a-judge evaluation: https://langfuse.com/docs/evaluation/evaluation-methods/llm-as-a-judge For prompt injection, Langflow 1.8 already has a native Guardrail Component that can help you block prompt injection attempts. |
-
|
From my point of view, hallucination rate is too broad to be useful unless you break it into narrower failure classes. I usually separate unsupported claims, retrieval misses, wrong transformations, and cases where the model should have abstained but did not. Those behave very differently in production, and a single aggregate number tends to hide the real problem. What has worked best for me is a small human-labeled eval set for calibration, plus online sampling where I log retrieval coverage, citation support, and abstention behavior. That gives a much more stable signal than asking for one global hallucination score. |
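A rough sketch of what per-class rates could look like on a labeled sample. The `Failure` enum, the `failure_rates` helper, and the sample data are all hypothetical and made up for illustration; only the four class names come from the comment above.

```python
from collections import Counter
from enum import Enum


class Failure(Enum):
    """Hypothetical labels; the four failure classes mirror the comment above."""
    NONE = "none"
    UNSUPPORTED_CLAIM = "unsupported_claim"
    RETRIEVAL_MISS = "retrieval_miss"
    WRONG_TRANSFORMATION = "wrong_transformation"
    MISSED_ABSTENTION = "missed_abstention"


def failure_rates(labels):
    """Return the share of sampled responses that fall in each class."""
    if not labels:
        return {}
    counts = Counter(labels)
    total = len(labels)
    return {cls: counts.get(cls, 0) / total for cls in Failure}


# Invented sample: ten human-labeled responses.
sample = (
    [Failure.NONE] * 6
    + [Failure.UNSUPPORTED_CLAIM]
    + [Failure.RETRIEVAL_MISS] * 2
    + [Failure.MISSED_ABSTENTION]
)
rates = failure_rates(sample)

# The single aggregate number hides which class dominates.
aggregate = 1.0 - rates[Failure.NONE]
```

In this invented sample, retrieval misses account for half of all failures, which the aggregate alone would not show.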
-
|
Hallucination rate depends a lot on how the prompt is structured. One pattern that helps: separate the "retrieval context" from the "task instructions" with explicit XML-tagged blocks. Models like Claude treat tagged sections differently, so mixing retrieved docs into a prose prompt tends to increase hallucination. Measuring it in prod: I log the structured prompt + response, then run a lightweight eval prompt asking "did the answer introduce any claim not present in [context]?" It works well as a cheap automated check before human review. I built flompt.dev for the prompt structuring side of this: a visual builder that decomposes prompts into semantic blocks and compiles them to Claude-optimized XML. It keeps the prompt shape consistent across runs. github.com/Nyrok/flompt |
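A minimal sketch of the two pieces described above: a prompt with context and instructions in separate tagged blocks, and the follow-up grounding check. The tag names, function names, and the YES/NO reply convention are assumptions for illustration, not flompt's output format or any vendor's required schema. The model call itself is left out.

```python
from xml.sax.saxutils import escape


def build_prompt(instructions, documents, question):
    """Keep task instructions and retrieved context in separate tagged blocks."""
    docs = "\n".join(
        f'<document index="{i}">{escape(doc)}</document>'
        for i, doc in enumerate(documents, start=1)
    )
    return (
        f"<instructions>\n{escape(instructions)}\n</instructions>\n"
        f"<context>\n{docs}\n</context>\n"
        f"<question>\n{escape(question)}\n</question>"
    )


def build_grounding_check(context, answer):
    """Eval prompt asking whether the answer goes beyond the context."""
    return (
        "Did the answer introduce any claim not present in the context? "
        "Reply with exactly one word: YES or NO.\n"
        f"<context>\n{escape(context)}\n</context>\n"
        f"<answer>\n{escape(answer)}\n</answer>"
    )


def parse_verdict(judge_reply):
    """True if the judge flagged an unsupported claim, None if unparseable."""
    verdict = judge_reply.strip().upper()
    if verdict.startswith("YES"):
        return True
    if verdict.startswith("NO"):
        return False
    return None  # ambiguous reply: send to human review
```

Escaping the retrieved text keeps a stray `<` or `&` in a document from being read as a tag boundary. Treating unparseable judge replies as "needs review" rather than "pass" keeps the check conservative.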
-
|
This matches what we’ve seen — prompt structure has a big impact on hallucination behavior, especially when retrieval context and instructions get blended. Explicit separation (XML/sections) tends to reduce unsupported claims and makes downstream evaluation easier. Your production check (“did the answer introduce claims not present in context?”) is basically a grounding test — we’ve been formalizing that as an automated assertion so it can run consistently in CI, not just post-hoc review. I’m building Veritell CLI to turn these kinds of checks into repeatable tests (e.g., unsupported claims, retrieval coverage, abstention). You define them once and run veritell test to catch regressions when prompts or models change. flompt looks interesting for keeping prompt shape stable — that’s often half the battle. Happy to compare notes if you’re testing against real RAG workloads. |
-
|
Measuring hallucination rates in production is hard because ground truth is expensive. A few practical approaches we use:
1. Self-consistency checking: Run the same prompt through the agent 3 times with temperature > 0. If the factual claims diverge across runs, flag as potential hallucination. Cheap to implement, catches ~60% of factual hallucinations.
2. Source attribution tracking: When an agent makes a claim, require it to cite which context chunk or tool result it's drawing from. If a claim has no attribution, it's either reasoning (fine) or hallucination (not fine). The tricky part is distinguishing the two; we use a lightweight classifier on the claim type.
3. Behavioral drift detection: Track the distribution of agent actions over time using KL divergence. When an agent's behavior distribution shifts significantly from baseline, it often correlates with increased hallucination: the agent is "making stuff up" because it's in unfamiliar territory.
4. Cost-aware verification: Not every output is worth verifying. We prioritize verification for high-stakes outputs (financial data, code that will execute, user-facing claims) and skip it for low-stakes conversational responses. This keeps verification costs at ~5% of inference costs rather than doubling them.
For multi-agent systems, hallucination can cascade: Agent A hallucinates a fact, Agent B treats it as ground truth. Delegation chains need provenance tracking: every claim should trace back to its original source.
More on maintaining quality across multi-agent systems: https://blog.kinthai.ai/221-agents-multi-agent-coordination-lessons |
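A hedged sketch of approaches 1 and 3 from the list above. It assumes claims have already been extracted from each run and normalized to comparable strings (that extraction step is not shown and is the hard part), and that agent actions are logged as simple counts. The function names and the smoothing constant are illustrative choices, not the commenter's implementation.

```python
import math


def divergent_claims(runs):
    """Claims that appear in at least one run but not in every run.

    `runs` is a list of sets of normalized claim strings, one set per
    sampled generation of the same prompt.
    """
    if not runs:
        return set()
    all_claims = set().union(*runs)
    shared = set.intersection(*runs)
    return all_claims - shared


def kl_divergence(current, baseline, epsilon=1e-6):
    """KL(current || baseline) over action counts, in nats.

    Additive smoothing (epsilon) avoids division by zero when an action
    appears in only one of the two windows.
    """
    actions = set(current) | set(baseline)
    cur_total = sum(current.values()) + epsilon * len(actions)
    base_total = sum(baseline.values()) + epsilon * len(actions)
    kl = 0.0
    for action in actions:
        p = (current.get(action, 0) + epsilon) / cur_total
        q = (baseline.get(action, 0) + epsilon) / base_total
        kl += p * math.log(p / q)
    return kl
```

For example, `divergent_claims([{"a", "b"}, {"a", "b"}, {"a", "c"}])` returns `{"b", "c"}`, the claims worth flagging. The alert threshold on the KL value has to be calibrated against your own baseline windows; no universal cutoff is implied here.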
-
|
For a Langflow deployment, I would measure hallucination at the flow level rather than only at the final answer level. The same bad answer can come from different places: weak retrieval, an overly broad prompt, a transformation component, a missing guardrail, or a model that ignored evidence. A useful production trace for each sampled run would include:
Then track separate rates:
This makes the metric operational. If retrieval miss rate rises, tune retrieval and source freshness. If unsupported claims rise with stable retrieval, tune prompt structure and grounding checks. If citation mismatch is high, improve citation extraction and answer formatting. One aggregate hallucination rate is useful for trend reporting, but it is too blunt for remediation. |
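The remediation rules in the last paragraph can be written down as a small triage function. This is only a sketch: the `StageRates` fields, the `tolerance` margin, and the `citation_ceiling` threshold are hypothetical values chosen for illustration and would need tuning against real traffic.

```python
from dataclasses import dataclass


@dataclass
class StageRates:
    """Per-window rates, each in [0, 1]. Field names are illustrative."""
    retrieval_miss: float
    unsupported_claim: float
    citation_mismatch: float


def triage(current, baseline, tolerance=0.02, citation_ceiling=0.10):
    """Map rate changes to the remediation steps described above."""
    actions = []
    retrieval_rose = current.retrieval_miss > baseline.retrieval_miss + tolerance
    claims_rose = current.unsupported_claim > baseline.unsupported_claim + tolerance

    if retrieval_rose:
        actions.append("tune retrieval and source freshness")
    # Only blame the prompt when retrieval is stable.
    if claims_rose and not retrieval_rose:
        actions.append("tune prompt structure and grounding checks")
    if current.citation_mismatch > citation_ceiling:
        actions.append("improve citation extraction and answer formatting")
    return actions
```

With a baseline of `StageRates(0.05, 0.04, 0.03)` and a current window of `StageRates(0.05, 0.09, 0.03)`, retrieval is stable while unsupported claims rose, so the function returns only the prompt and grounding recommendation.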
-
We've been experimenting with stress testing LLM systems for hallucinations and prompt injection.
Curious how people here measure hallucination rates in production systems?
Thanks!
Terry