
Improve SimpleQA eval observability with structured logging and latency metadata #1790

Open
SeanHe727 wants to merge 1 commit into assafelovic:main from SeanHe727:add-simpleqa-eval-logging

Conversation

@SeanHe727


Summary

This PR improves the SimpleQA evaluation pipeline by adding structured logging, per-query latency tracking, and reproducibility metadata.

Changes

  • Added per-query latency measurement using time.perf_counter()
  • Added structured JSON logging with:
    • run metadata: researcher model, timestamp, git commit
    • aggregate metrics: accuracy, average latency, p50 latency, p95 latency
    • per-query evaluation results
  • Updated deprecated SMART_LLM_MODEL usage to SMART_LLM
  • Added a guard for empty generator results to avoid next() failures
  • Added documentation / example output for the structured eval log
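A minimal sketch of how the pieces listed above might fit together. This is not the PR's actual code: the helper names (`first_or_none`, `timed_call`, `percentile`, `git_commit`, `build_log`), the JSON field names, and the nearest-rank percentile method are all assumptions made for illustration.

```python
import json
import statistics
import subprocess
import time
from datetime import datetime, timezone


def first_or_none(results):
    """Return the first item of an iterable, or None if it is empty."""
    # next() with a default avoids StopIteration on an empty generator.
    return next(iter(results), None)


def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct; None for an empty list."""
    if not values:
        return None
    ordered = sorted(values)
    # Ceiling division in integers, so no floating-point rounding surprises.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


def git_commit():
    """Best-effort short commit hash; None if git or the repo is unavailable."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None


def build_log(model, per_query):
    """Assemble run metadata, aggregate metrics, and per-query results."""
    latencies = [q["latency_s"] for q in per_query]
    graded = [q["correct"] for q in per_query]
    return {
        "run": {
            "researcher_model": model,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "git_commit": git_commit(),
        },
        "metrics": {
            "accuracy": sum(graded) / len(graded) if graded else None,
            "avg_latency_s": statistics.fmean(latencies) if latencies else None,
            "p50_latency_s": percentile(latencies, 50),
            "p95_latency_s": percentile(latencies, 95),
        },
        "results": per_query,
    }
```

A caller would wrap each query in `timed_call`, append a dict with `question`, `correct`, and `latency_s` to a list, and finally write `json.dumps(build_log(model, results), indent=2)` to a file. `time.perf_counter()` is used rather than `time.time()` because it is monotonic and intended for measuring short durations.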

Motivation

The previous eval output was primarily console-based, which made it harder to compare runs, track latency regressions, and reproduce evaluation results across model and configuration changes.

Compatibility

Existing eval behavior is preserved unless structured logging is enabled.

Testing

  • Ran SimpleQA eval on a small sample
  • Verified structured log output is valid JSON
  • Verified aggregate latency metrics are generated correctly
  • Verified existing console output still works
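The JSON and metric checks above could be scripted roughly as follows. This is a sketch only: the log layout (a top-level `metrics` object with `accuracy`, `avg_latency_s`, `p50_latency_s`, and `p95_latency_s`) and the `check_log` helper are hypothetical and may not match the format this PR emits.

```python
import json

# Hypothetical metric keys; adjust to the real log schema.
REQUIRED_METRICS = ("accuracy", "avg_latency_s", "p50_latency_s", "p95_latency_s")


def check_log(text):
    """Return a list of problems found in a structured eval log (empty if none)."""
    try:
        log = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(log, dict) or not isinstance(log.get("metrics"), dict):
        return ["missing 'metrics' object"]

    metrics = log["metrics"]
    problems = [f"missing metric: {key}" for key in REQUIRED_METRICS
                if key not in metrics]
    if problems:
        return problems

    # Basic sanity checks on the aggregate values.
    if not 0.0 <= metrics["accuracy"] <= 1.0:
        problems.append("accuracy outside [0, 1]")
    if metrics["p50_latency_s"] > metrics["p95_latency_s"]:
        problems.append("p50 latency exceeds p95 latency")
    return problems
```

Running `check_log(open(path).read())` on a log file and asserting that the returned list is empty would turn the manual verification into a repeatable check.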

@SeanHe727

Author

Thanks for reviewing! This PR is intended as a small, backward-compatible improvement to make SimpleQA eval runs easier to reproduce and compare.

It adds structured JSON logging, per-query latency tracking, and run metadata without changing the default eval behavior. Happy to adjust the output format or scope based on maintainer preferences.

@SeanHe727 force-pushed the add-simpleqa-eval-logging branch from 863ce4b to d45c2c7 on May 31, 2026 20:04
