Evaluator tooling: concurrency, delta display, and tone rubric #327
Open
yangm2 wants to merge 3 commits into codeforpdx:main from
Conversation
- repair UTF-8-as-Latin-1 mojibake in Vertex AI passages
- fix city filter to include state-level (null) docs alongside city docs
- add max_extractive_answer_count/segment_count to CityStateLawsInputSchema
- RagBuilder: retry on httpx.ReadError and ServiceUnavailable
- exclude surrogates from mojibake property test strategy
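The mojibake repair in the first commit is presumably the standard round-trip fix for UTF-8 bytes that were mis-decoded as Latin-1. A minimal sketch of that technique (the function name is hypothetical, not the PR's actual code):

```python
def repair_mojibake(text: str) -> str:
    """Reverse UTF-8 bytes mis-decoded as Latin-1 (e.g. 'cafÃ©' -> 'café')."""
    try:
        # Re-encode as Latin-1 to recover the original bytes, then decode as UTF-8.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not mojibake (or not reversible): leave the text unchanged.
        return text

print(repair_mojibake("cafÃ©"))  # café
```

Wrapping the round trip in a try/except keeps already-clean text (and text containing characters outside Latin-1) untouched, which is what makes the transform safe to apply blindly to every passage.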
measure_evaluator_variance.py:
- run evaluator calls concurrently (--max-workers, default 4)
- --show-delta: fetch stored scores from LangSmith and show mean/sigma delta inline in the Per-Scenario Consistency table as 0.95(+0.04)

results_display.py:
- widen stat columns when baseline present to fit delta annotation

tone.md:
- add hedging-language and citation-completeness scoring criteria

EVALUATION.md:
- document --show-delta and --max-workers flags with example output
What type of PR is this? (check all applicable)
Description
Improvements to the LLM-as-judge variance measurement tooling.
Concurrency — measure_evaluator_variance.py now runs evaluator calls with ThreadPoolExecutor (default 4 workers, --max-workers to override). On a typical 20-scenario run this cuts wall time roughly 4×.

--show-delta flag — fetches stored scores from LangSmith for the same experiment and prints mean/sigma deltas inline in the Per-Scenario Consistency table as 0.95(+0.04) / 1.00(+0.00). Useful when testing a rubric change to see which scenarios improved or regressed without re-running the full agent.

Tone evaluator — added hedging-language and citation-completeness scoring criteria to tone.md, reflecting feedback that vague hedges ("generally", "typically") and missing statute links were the most common failure modes.

Related Tickets & Documents
QA Instructions, Screenshots, Recordings
Added/updated tests?
test_measure_evaluator_variance.py covers the unit-testable parts

Documentation
EVALUATION.md updated with new flags and example output

If this PR changes the system architecture, Architecture.md has been updated

[optional] Are there any post deployment tasks we need to perform?