Evaluator tooling: concurrency, delta display, and tone rubric #327
Open
yangm2 wants to merge 3 commits into codeforpdx:main from
Conversation
- repair UTF-8-as-Latin-1 mojibake in Vertex AI passages
- fix city filter to include state-level (null) docs alongside city docs
- add max_extractive_answer_count/segment_count to CityStateLawsInputSchema
- RagBuilder: retry on httpx.ReadError and ServiceUnavailable
- exclude surrogates from mojibake property test strategy
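The mojibake repair in the first commit is presumably the standard round-trip fix for UTF-8 bytes that were mis-decoded as Latin-1. A minimal sketch of that technique (the function name is hypothetical, not the PR's actual code):

```python
def repair_mojibake(text: str) -> str:
    """Reverse UTF-8 bytes mis-decoded as Latin-1 (e.g. 'cafÃ©' -> 'café')."""
    try:
        # Re-encode as Latin-1 to recover the original bytes, then decode as UTF-8.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not mojibake (or not reversible): leave the text unchanged.
        return text

print(repair_mojibake("cafÃ©"))  # café
```

Wrapping the round trip in a try/except keeps already-clean text (and text containing characters outside Latin-1) untouched, which is what makes the transform safe to apply blindly to every passage.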
measure_evaluator_variance.py:
- run evaluator calls concurrently (--max-workers, default 4)
- --show-delta: fetch stored scores from LangSmith and show mean/sigma delta inline in the Per-Scenario Consistency table as 0.95(+0.04)

results_display.py:
- widen stat columns when baseline present to fit delta annotation

tone.md:
- add hedging-language and citation-completeness scoring criteria

EVALUATION.md:
- document --show-delta and --max-workers flags with example output
What type of PR is this? (check all applicable)
Description
Improvements to the LLM-as-judge variance measurement tooling.
Concurrency — measure_evaluator_variance.py now runs evaluator calls with ThreadPoolExecutor (default 4 workers, --max-workers to override). On a typical 20-scenario run this cuts wall time roughly 4×.

--show-delta flag — fetches stored scores from LangSmith for the same experiment and prints mean/sigma deltas inline in the Per-Scenario Consistency table as 0.95(+0.04) / 1.00(+0.00). Useful when testing a rubric change to see which scenarios improved or regressed without re-running the full agent.

Tone evaluator — added hedging-language and citation-completeness scoring criteria to tone.md, reflecting feedback that vague hedges ("generally", "typically") and missing statute links were the most common failure modes.

Related Tickets & Documents
QA Instructions, Screenshots, Recordings
Added/updated tests?
test_measure_evaluator_variance.py covers the unit-testable parts

Documentation
EVALUATION.md updated with new flags and example output

If this PR changes the system architecture, Architecture.md has been updated

[optional] Are there any post deployment tasks we need to perform?