Evaluator tooling: concurrency, delta display, and tone rubric#327

Open
yangm2 wants to merge 3 commits into codeforpdx:main from yangm2:pr/evaluator-tooling

yangm2 commented Apr 18, 2026

What type of PR is this? (check all applicable)

  • Feature
  • Optimization

Description

Improvements to the LLM-as-judge variance measurement tooling.

Concurrency: measure_evaluator_variance.py now runs evaluator calls with ThreadPoolExecutor (default 4 workers, --max-workers to override). On a typical 20-scenario run this cuts wall time roughly 4×.
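For reviewers unfamiliar with the pattern, here is a minimal sketch of fanning evaluator calls out over a thread pool. It is not the code in this PR: `evaluate_scenario`, `run_evaluations`, and `parse_args` are hypothetical names, and the stand-in evaluator just returns a fixed score instead of calling an LLM judge. Only the `--max-workers` flag and its default of 4 come from the description above.

```python
import argparse
from concurrent.futures import ThreadPoolExecutor


def evaluate_scenario(scenario: str) -> float:
    """Hypothetical stand-in for one LLM-as-judge call (I/O bound in practice)."""
    return 1.0 if scenario else 0.0


def run_evaluations(scenarios, max_workers=4):
    """Score every scenario concurrently and return {scenario: score}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so zip() lines up correctly.
        scores = list(pool.map(evaluate_scenario, scenarios))
    return dict(zip(scenarios, scores))


def parse_args(argv=None):
    """Parse the worker-count flag; default mirrors the PR description."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--max-workers", type=int, default=4)
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(run_evaluations(["scenario-a", "scenario-b"], max_workers=args.max_workers))
```

Threads suit this workload because each call spends most of its time waiting on a network response, so the speedup should track the worker count until rate limits or latency variance take over.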

--show-delta flag — fetches stored scores from LangSmith for the same experiment and prints mean/sigma deltas inline in the Per-Scenario Consistency table as 0.95(+0.04) / 1.00(+0.00). Useful when testing a rubric change to see which scenarios improved or regressed without re-running the full agent.
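The delta annotation can be illustrated with a short sketch. This is an assumption about how such a cell might be formatted, not the implementation in results_display.py: `format_with_delta` and `consistency_cells` are hypothetical helpers, and the use of population standard deviation for sigma is a guess.

```python
from statistics import mean, pstdev
from typing import Optional, Sequence, Tuple


def format_with_delta(current: float, baseline: Optional[float]) -> str:
    """Render a stat as '0.95', or '0.95(+0.04)' when a stored baseline exists."""
    if baseline is None:
        return f"{current:.2f}"
    # The '+' format flag forces an explicit sign on the delta.
    return f"{current:.2f}({current - baseline:+.2f})"


def consistency_cells(
    scores: Sequence[float],
    stored_scores: Optional[Sequence[float]] = None,
) -> Tuple[str, str]:
    """Return (mean cell, sigma cell) for one scenario row.

    Assumption: sigma is the population standard deviation of the scores.
    """
    base_mean = mean(stored_scores) if stored_scores else None
    base_sigma = pstdev(stored_scores) if stored_scores else None
    return (
        format_with_delta(mean(scores), base_mean),
        format_with_delta(pstdev(scores), base_sigma),
    )


if __name__ == "__main__":
    print(format_with_delta(0.95, 0.91))  # 0.95(+0.04)
    print(format_with_delta(1.00, 1.00))  # 1.00(+0.00)
```

Leaving the annotation off when no baseline is available keeps the table readable for runs without --show-delta.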

Tone evaluator — added hedging-language and citation-completeness scoring criteria to tone.md, reflecting feedback that vague hedges ("generally", "typically") and missing statute links were the most common failure modes.

Merge order: merge #324 first. This branch is based on pr/rag-bug-fixes and will need to be rebased on main after #324 lands.

Related Tickets & Documents

QA Instructions, Screenshots, Recordings

cd backend
uv run python -m evaluate.measure_evaluator_variance --help

# with delta (requires LANGSMITH_API_KEY and an existing experiment)
uv run python -m evaluate.measure_evaluator_variance \
  --experiment-name <name> --show-delta --max-workers 8

Added/updated tests?

  • No, and this is why: the evaluator tooling calls LangSmith APIs that require a live key; existing test_measure_evaluator_variance.py covers the unit-testable parts

Documentation

  • EVALUATION.md updated with new flags and example output

  • If this PR changes the system architecture, Architecture.md has been updated

[optional] Are there any post deployment tasks we need to perform?

yangm2 added 2 commits April 17, 2026 18:41
- repair UTF-8-as-Latin-1 mojibake in Vertex AI passages
- fix city filter to include state-level (null) docs alongside city docs
- add max_extractive_answer_count/segment_count to CityStateLawsInputSchema
- RagBuilder: retry on httpx.ReadError and ServiceUnavailable
- exclude surrogates from mojibake property test strategy

measure_evaluator_variance.py:
- run evaluator calls concurrently (--max-workers, default 4)
- --show-delta: fetch stored scores from LangSmith and show mean/sigma
  delta inline in the Per-Scenario Consistency table as 0.95(+0.04)

results_display.py:
- widen stat columns when baseline present to fit delta annotation

tone.md:
- add hedging-language and citation-completeness scoring criteria

EVALUATION.md:
- document --show-delta and --max-workers flags with example output
yangm2 mentioned this pull request Apr 18, 2026
11 tasks
yangm2 self-assigned this Apr 18, 2026
yangm2 added labels Apr 18, 2026: backend (Bot implementation and other backend concerns), enhancement (New feature or request), infrastructure (Pull requests related to infrastructure and underlying workflows)
yangm2 requested a review from leekahung April 18, 2026 01:59