
Conversation

@lkacenja lkacenja commented Sep 9, 2025

The exception check is currently using gemini-2.5-pro-preview-03-25. We should investigate some other models and try to get off the preview.

This PR adds the following:

  • Changes summary model from gemini-flash-2.0 to gemini-flash-2.5.
  • Changes exception model from gemini-2.5-pro-preview-03-25 to gemini-2.5-pro.
  • Adds configuration and model support for OpenAI models.
  • Updates dependencies for document inference.
  • Improves the evaluation workflow:
    • More consistent error handling.
    • Waits for images to be deployed to Lambdas (see the sketch after this list).
    • Ability to run a specific file.
    • Tracking inference duration.
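
For reference, here is a minimal sketch of what the Lambda image-deploy wait could look like, assuming a Python harness and boto3; the `wait_for_lambda_update` helper and any function names passed to it are illustrative, not the PR's actual code.

```python
# Hypothetical sketch: poll a Lambda function until its new container image
# finishes deploying. Assumes boto3 credentials are configured; callers pass
# in their own function name.
import time

import boto3

def wait_for_lambda_update(function_name: str, timeout: float = 300.0, interval: float = 5.0) -> None:
    """Block until the function's last update succeeds or the timeout expires."""
    client = boto3.client("lambda")
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        config = client.get_function_configuration(FunctionName=function_name)
        status = config.get("LastUpdateStatus")
        if status == "Successful":
            return
        if status == "Failed":
            raise RuntimeError(f"{function_name}: {config.get('LastUpdateStatusReason')}")
        time.sleep(interval)  # still "InProgress"; wait and re-poll
    raise TimeoutError(f"{function_name} not updated within {timeout}s")
```

boto3 also ships built-in waiters (for example, `client.get_waiter("function_updated_v2")`) that can replace hand-rolled polling like this.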

Evaluation

The following two dashboards were used in assessing evaluation results:

Evaluation Results: Summary

The summarization process currently uses gemini-flash-2.0. It seems likely that the 2.0 family of models will eventually be discontinued, so this seemed like a reasonable moment to compare a few other options. I ran an evaluation of summaries created by gemini-flash-2.5 and gemini-flash-lite-2.5.

Baseline Values (52e7441, ~200 samples):
Rouge Mean: 0.3810
LLM Summarization Mean: 0.2382

| Model | Inference Duration Mean | Rouge Mean | LLM Summarization Mean |
| --- | --- | --- | --- |
| gemini-flash-2.5 (~200 samples, Feature: c3e0b48, 0876604) | 7.31s | 0.3499 (-3.1%) | 0.298 (+6%) |
| gemini-flash-lite-2.5 (~200 samples, 0dfb731, dac863e) | 6.7s | 0.35 (-3.5%) | 0.27 (+2.8%) |

Given the speed and metric changes, gemini-flash-2.5 seems like a safe move for summarization. The drop in Rouge score and the increase in the LLM summarization metric essentially wash out. The shift could be random, though t-test values suggest otherwise; it may also reflect the model's increased reasoning capacity, allowing summarization with less direct regurgitation.
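
For context on the t-test mention above, here is a sketch of how that significance check could be run, assuming the evaluation emits per-document Rouge scores for each run; the short lists below are placeholders, not real data.

```python
# Hypothetical sketch: Welch's t-test over per-document Rouge scores from the
# baseline (52e7441) and candidate (c3e0b48) runs. The short lists stand in
# for the ~200 per-document scores each run actually produces.
from scipy.stats import ttest_ind

baseline_rouge = [0.41, 0.36, 0.39, 0.37]   # placeholder values
candidate_rouge = [0.36, 0.33, 0.35, 0.34]  # placeholder values

stat, p_value = ttest_ind(baseline_rouge, candidate_rouge, equal_var=False)
print(f"t={stat:.3f}, p={p_value:.4f}")  # a small p suggests the shift is not random noise
```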

I wanted to try gpt-5-mini for summarization, but I had issues running the summarization metric on gpt-5-mini output. This may have been severe rate limiting by Google; the summarization metric makes at least four LLM calls per document. Here is a ticket to explore and possibly replace this approach.
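
If the failures were in fact rate limiting, wrapping the metric's LLM calls in exponential backoff is the usual mitigation. A minimal sketch, assuming a generic callable; `call_metric_llm` is hypothetical and stands in for whatever client the metric actually uses:

```python
# Hypothetical sketch: retry an LLM call with exponential backoff plus jitter.
import random
import time

def with_backoff(fn, max_attempts: int = 5, base_delay: float = 1.0):
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Sleep 1s, 2s, 4s, ... plus jitter to spread out retries.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 1))

# Usage (call_metric_llm is a placeholder):
# result = with_backoff(lambda: call_metric_llm(document, summary))
```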

Evaluation Results: Exception Check

The exception check currently uses the gemini-2.5-pro-preview-03-25 model, which feels like it could be discontinued at any time. I ran evaluation exception checks with gemini-pro-2.5 and gpt-5 with the default reasoning setting.

Baseline Values (b7fee4e, ~200 samples):
Overall accuracy: 98.1%
Exception accuracy: 33%
Archival accuracy: 81.5%

| Model | Inference Duration Mean | Overall Accuracy |
| --- | --- | --- |
| gemini-pro-2.5 (~200 samples, a44a6ad) | 18.5s | 98.9% (+0.8%) |
| gpt-5 (~100 samples, default reasoning, df0128b) | 39.31s | 54.1% (-44%) |
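
For anyone reproducing the breakdown, here is a sketch of how the overall and per-category accuracies above could be computed; the record fields (`category`, `expected`, `predicted`) are assumptions about the evaluation output, not its actual schema.

```python
# Hypothetical sketch: overall and per-category accuracy over evaluation records.
from collections import defaultdict

records = [
    {"category": "exception", "expected": True, "predicted": True},
    {"category": "archival", "expected": False, "predicted": True},
    # ... ~200 records in practice
]

hits, totals = defaultdict(int), defaultdict(int)
for record in records:
    for key in ("overall", record["category"]):
        totals[key] += 1
        hits[key] += record["expected"] == record["predicted"]

for key in totals:
    print(f"{key} accuracy: {hits[key] / totals[key]:.1%}")
```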

Here is an atomic breakdown of the metrics comparing the two Gemini models:

[Screenshot: atomic metric breakdown comparing the two Gemini models, 2025-09-09]

Given the relatively small change in overall accuracy and in the individual metrics, changing from gemini-2.5-pro-preview-03-25 to gemini-pro-2.5 seems safe. The large drop-off with gpt-5 is hard to explain and would require more digging. For now, sticking with Gemini makes the most sense.

  • What additional steps are required to test this branch locally?

    Rebuild the images: `docker-compose build --no-cache`.

  • Are there any areas you would like extra review?

    The above evaluation results.

  • Are there any rake tasks to run on production?

    No.

@lkacenja lkacenja self-assigned this Sep 10, 2025
@lkacenja lkacenja marked this pull request as ready for review September 10, 2025 18:48
@lkacenja lkacenja requested a review from 4dh September 10, 2025 18:49
@4dh left a comment

Looks good to merge. Seeing gemini-2.5-pro as expected (thanks for your help!), and the inference time is generally in line with your testing.

@lkacenja lkacenja merged commit 275cc33 into dev Sep 15, 2025
109 of 133 checks passed