feat: inspect and bound GEval retrieval context prompts #2742

Open
RitwijParmar wants to merge 6 commits into confident-ai:main from RitwijParmar:codex/deepeval-geval-retrieval-budget

Conversation

@RitwijParmar RitwijParmar commented Jun 10, 2026

Contributor

Fixes #1764.

This adds an opt-in retrieval context budget for GEval, a no-LLM inspection path for large RAG judge prompts, and an evidence coverage report so users can see whether prompt compaction removed terms the judge still needs.

What changed

  • Added max_retrieval_context_tokens to GEval. Default behavior is unchanged unless this option is set.
  • Large retrieval_context values are compacted before they enter the judge prompt.
  • The compactor preserves source labels for RetrievedContextData, keeps head and tail evidence from visible chunks, and inserts explicit omission markers.
  • Chunks are ranked by lexical overlap against selected input, actual output, and expected output fields.
  • Added structured budget reports via get_retrieval_context_budget_report(test_case).
  • Budget reports include original token estimate, rendered token estimate, compression ratio, visible chunks, omitted chunks, per-chunk source metadata, relevance scores, and evidence coverage.
  • Added get_retrieval_context_evidence_coverage(test_case) for the coverage ratio, covered terms, missing terms, and warning message.
  • Added preview_evaluation_prompt(test_case) so users can inspect the exact bounded judge prompt in CI or while tuning RAG metrics without calling the evaluation model.
  • Refactored GEval prompt construction so sync, async, and preview paths share the same prompt builder.
  • Documented the debugging workflow in the G-Eval docs.
  • Added synthetic large-RAG prompt-budget regression tests for compression, source preservation, relevance ranking, missing evidence detection, and prompt preview.
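
For reviewers who want a feel for the compaction behavior listed above, here is a minimal standalone sketch. It is not the code in this PR: the helper names (`tokenize`, `overlap_score`, `head_tail`, `compact`), the whitespace-based token estimate, the per-chunk cap, and the marker wording are all assumptions chosen for illustration. It ranks chunks by lexical overlap with the query fields, keeps head and tail text from each visible chunk, preserves a source label, and inserts explicit omission markers.

```python
import re


def tokenize(text):
    """Lowercase alphanumeric tokens (an assumed, simplified tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(chunk_text, query_terms):
    """Fraction of query terms that also appear in the chunk."""
    if not query_terms:
        return 0.0
    return len(set(tokenize(chunk_text)) & query_terms) / len(query_terms)


def head_tail(text, max_tokens):
    """Keep the first and last words of a chunk and mark what was dropped."""
    max_tokens = max(max_tokens, 0)
    words = text.split()
    if len(words) <= max_tokens:
        return text
    head = max_tokens // 2
    tail = max_tokens - head
    marker = f"[... {len(words) - max_tokens} tokens omitted ...]"
    return " ".join(words[:head] + [marker] + words[len(words) - tail:])


def compact(chunks, query, budget, per_chunk=40):
    """Compact (source, text) chunks into labeled lines within a word budget.

    Returns (rendered_lines, omitted_sources). Token counts are estimated by
    whitespace splitting, which is only a rough stand-in for a real tokenizer.
    """
    query_terms = set(tokenize(query))
    # Most relevant chunks first; ties keep their original order.
    ranked = sorted(
        chunks, key=lambda c: overlap_score(c[1], query_terms), reverse=True
    )
    rendered, omitted, used = [], [], 0
    for source, text in ranked:
        clipped = head_tail(text, per_chunk)
        cost = len(clipped.split())
        if used + cost > budget:
            omitted.append(source)
            continue
        rendered.append(f"[{source}] {clipped}")
        used += cost
    if omitted:
        rendered.append(
            f"[{len(omitted)} chunk(s) omitted: {', '.join(omitted)}]"
        )
    return rendered, omitted


if __name__ == "__main__":
    demo_chunks = [
        ("a.md", "the cat sat on the mat"),
        ("b.md", "refund policy allows returns within 30 days"),
        ("c.md", " ".join(["filler"] * 100)),
    ]
    lines, dropped = compact(
        demo_chunks, "What is the refund policy for returns?", budget=20, per_chunk=10
    )
    print("\n".join(lines))
```

Ranking before clipping means the budget is spent on the chunks most likely to matter, and the trailing marker makes it obvious to the judge (and to a human reading the preview) that some context was left out.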

Why this matters
Large RAG contexts can make custom GEval prompts expensive and hard to trust. Clipping blindly is risky because it can remove the evidence that supports or refutes the answer. This PR gives users a bounded prompt plus a diagnostic report showing what survived the budget and which evidence terms were dropped.
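
To make the "what evidence terms disappeared" idea concrete, here is a rough sketch of a coverage check. It is an illustration only and does not reproduce this PR's implementation: the function names, the stopword list, the length filter, the `warn_below` threshold, and the dictionary keys are assumptions. It collects candidate evidence terms from the input and output fields, then reports which of them still appear in the compacted context.

```python
import re

# A tiny illustrative stopword list; a real one would be much larger.
STOPWORDS = {"the", "what", "are", "with", "and", "for", "that", "this", "from"}


def evidence_terms(*fields):
    """Collect lowercase terms from the given fields, skipping short words and stopwords."""
    terms = set()
    for field in fields:
        for token in re.findall(r"[a-z0-9]+", field.lower()):
            if len(token) > 2 and token not in STOPWORDS:
                terms.add(token)
    return terms


def coverage_report(rendered_context, *fields, warn_below=0.8):
    """Report which evidence terms survive in the compacted context."""
    terms = evidence_terms(*fields)
    present = set(re.findall(r"[a-z0-9]+", rendered_context.lower()))
    covered = sorted(terms & present)
    missing = sorted(terms - present)
    # With no terms to look for, treat coverage as complete.
    ratio = len(covered) / len(terms) if terms else 1.0
    warning = None
    if ratio < warn_below:
        warning = (
            f"{len(missing)} evidence term(s) missing from the bounded prompt: "
            f"{', '.join(missing)}"
        )
    return {
        "coverage_ratio": ratio,
        "covered_terms": covered,
        "missing_terms": missing,
        "warning": warning,
    }


if __name__ == "__main__":
    report = coverage_report(
        "refund policy allows returns within 30 days",
        "refund policy",
        "returns need a receipt",
    )
    print(report)
```

A low ratio does not prove the judge will be wrong, but it is a cheap signal that the budget may be too tight for the test case at hand.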

Validation

  • uv run --isolated --python 3.11 --with-editable . python -m pytest tests/test_metrics/test_g_eval_utils.py tests/test_metrics/test_g_eval_prompt_budget.py -q
    • 14 passed
  • python3 -m black --check deepeval/metrics/g_eval/g_eval.py deepeval/metrics/g_eval/utils.py deepeval/metrics/g_eval/__init__.py tests/test_metrics/test_g_eval_utils.py tests/test_metrics/test_g_eval_prompt_budget.py
  • uvx ruff check deepeval/metrics/g_eval/g_eval.py deepeval/metrics/g_eval/utils.py deepeval/metrics/g_eval/__init__.py tests/test_metrics/test_g_eval_utils.py tests/test_metrics/test_g_eval_prompt_budget.py
  • python3 -m compileall -q deepeval/metrics/g_eval tests/test_metrics/test_g_eval_prompt_budget.py tests/test_metrics/test_g_eval_utils.py
  • git diff --check

Note on current CI

  • Vercel requires maintainer authorization for fork deployment.

vercel Bot commented Jun 10, 2026


@RitwijParmar is attempting to deploy a commit to the Confident AI Team on Vercel.

A member of the Team first needs to authorize it.

RitwijParmar changed the title from "feat: bound GEval retrieval context prompts" to "feat: inspect and bound GEval retrieval context prompts" on Jun 10, 2026

Development

Successfully merging this pull request may close these issues.

Deepeval not evaluating properly for large retrived context.
