
Conversation

@lkacenja lkacenja commented Sep 9, 2025

The exception check is currently using gemini-2.5-pro-preview-03-25. We should investigate some other models and try to get off the preview.

This PR adds the following:

  • Changes summary model from gemini-flash-2.0 to gemini-flash-2.5.
  • Changes exception model from gemini-2.5-pro-preview-03-25 to gemini-2.5-pro.
  • Adds configuration and model support for OpenAI models.
  • Updates dependencies for document inference.
  • Improves the evaluation workflow:
    • More consistent error handling.
    • Waits for images to be deployed to Lambdas (see the sketch after this list).
    • Ability to run a specific file.
    • Tracking inference duration.
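
For reference, here is a minimal sketch of what the Lambda image-deploy wait could look like, assuming a Python harness and boto3; the `wait_for_lambda_update` helper and any function names passed to it are illustrative, not the PR's actual code.

```python
# Hypothetical sketch: poll a Lambda function until its new container image
# finishes deploying. Assumes boto3 credentials are configured; callers pass
# in their own function name.
import time

import boto3

def wait_for_lambda_update(function_name: str, timeout: float = 300.0, interval: float = 5.0) -> None:
    """Block until the function's last update succeeds or the timeout expires."""
    client = boto3.client("lambda")
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        config = client.get_function_configuration(FunctionName=function_name)
        status = config.get("LastUpdateStatus")
        if status == "Successful":
            return
        if status == "Failed":
            raise RuntimeError(f"{function_name}: {config.get('LastUpdateStatusReason')}")
        time.sleep(interval)  # still "InProgress"; wait and re-poll
    raise TimeoutError(f"{function_name} not updated within {timeout}s")
```

boto3 also ships built-in waiters (for example, `client.get_waiter("function_updated_v2")`) that can replace hand-rolled polling like this.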

Evaluation

The following two dashboards were used in assessing evaluation results:

Evaluation Results: Summary

The summarization process currently uses gemini-flash-2.0. It seems likely that the 2.0 family of models will eventually be discontinued, so this seemed like a reasonable moment to compare a few other options. I ran an evaluation of summaries created by gemini-flash-2.5 and gemini-flash-lite-2.5.

Baseline Values (52e7441, ~200 samples):
Rouge Mean: 0.3810
LLM Summarization Mean: 0.2382

| Model | Inference Duration Mean | Rouge Mean | LLM Summarization Mean |
| --- | --- | --- | --- |
| gemini-flash-2.5 (~200 samples, Feature: c3e0b48, 0876604) | 7.31s | 0.3499 (-3.1%) | 0.298 (+6%) |
| gemini-flash-lite-2.5 (~200 samples, 0dfb731, dac863e) | 6.7s | 0.35 (-3.5%) | 0.27 (+2.8%) |

Given the speed and metric changes, gemini-flash-2.5 seems like a safe move for summarization. The drop in Rouge score and the increase in the LLM summarization metric essentially wash out. The shift could be random, though t-test values suggest otherwise; it may also reflect the model's increased reasoning capacity, allowing summarization with less direct regurgitation.
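
For context on the t-test mention above, here is a sketch of how that significance check could be run, assuming the evaluation emits per-document Rouge scores for each run; the short lists below are placeholders, not real data.

```python
# Hypothetical sketch: Welch's t-test over per-document Rouge scores from the
# baseline (52e7441) and candidate (c3e0b48) runs. The short lists stand in
# for the ~200 per-document scores each run actually produces.
from scipy.stats import ttest_ind

baseline_rouge = [0.41, 0.36, 0.39, 0.37]   # placeholder values
candidate_rouge = [0.36, 0.33, 0.35, 0.34]  # placeholder values

stat, p_value = ttest_ind(baseline_rouge, candidate_rouge, equal_var=False)
print(f"t={stat:.3f}, p={p_value:.4f}")  # a small p suggests the shift is not random noise
```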

I wanted to try gpt-5-mini for summarization, but I had issues running the summarization metric on gpt-5-mini output. This may have been severe rate limiting by Google; the summarization metric makes at least four LLM calls per document. Here is a ticket to explore and possibly replace this approach.
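
If the failures were in fact rate limiting, wrapping the metric's LLM calls in exponential backoff is the usual mitigation. A minimal sketch, assuming a generic callable; `call_metric_llm` is hypothetical and stands in for whatever client the metric actually uses:

```python
# Hypothetical sketch: retry an LLM call with exponential backoff plus jitter.
import random
import time

def with_backoff(fn, max_attempts: int = 5, base_delay: float = 1.0):
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Sleep 1s, 2s, 4s, ... plus jitter to spread out retries.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 1))

# Usage (call_metric_llm is a placeholder):
# result = with_backoff(lambda: call_metric_llm(document, summary))
```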

Evaluation Results: Exception Check

The exception check currently uses the gemini-2.5-pro-preview-03-25 model, which feels like it could be discontinued at any time. I ran evaluation exception checks with gemini-pro-2.5 and gpt-5 with the default reasoning setting.

Baseline Values (b7fee4e, ~200 samples):
Overall accuracy: 98.1%
Exception accuracy: 33%
Archival accuracy: 81.5%

| Model | Inference Duration Mean | Overall Accuracy |
| --- | --- | --- |
| gemini-pro-2.5 (~200 samples, a44a6ad) | 18.5s | 98.9% (+0.8%) |
| gpt-5 (~100 samples, default reasoning, df0128b) | 39.31s | 54.1% (-44%) |
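
For anyone reproducing the breakdown, here is a sketch of how the overall and per-category accuracies above could be computed; the record fields (`category`, `expected`, `predicted`) are assumptions about the evaluation output, not its actual schema.

```python
# Hypothetical sketch: overall and per-category accuracy over evaluation records.
from collections import defaultdict

records = [
    {"category": "exception", "expected": True, "predicted": True},
    {"category": "archival", "expected": False, "predicted": True},
    # ... ~200 records in practice
]

hits, totals = defaultdict(int), defaultdict(int)
for record in records:
    for key in ("overall", record["category"]):
        totals[key] += 1
        hits[key] += record["expected"] == record["predicted"]

for key in totals:
    print(f"{key} accuracy: {hits[key] / totals[key]:.1%}")
```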

Here is an atomic breakdown of the metrics comparing the two Gemini models:

[Screenshot: atomic metric breakdown comparing the two Gemini models, 2025-09-09]

Given the relatively small change in overall accuracy and in the individual metrics, changing from gemini-2.5-pro-preview-03-25 to gemini-pro-2.5 seems safe. The large drop-off with gpt-5 is hard to explain and would require more digging. For now, sticking with Gemini makes the most sense.

  • What additional steps are required to test this branch locally?

    Rebuild the images: `docker-compose build --no-cache`.

  • Are there any areas you would like extra review?

    The above evaluation results.

  • Are there any rake tasks to run on production?

    No.

@lkacenja lkacenja self-assigned this Sep 10, 2025
@lkacenja lkacenja marked this pull request as ready for review September 10, 2025 18:48
@lkacenja lkacenja requested a review from 4dh September 10, 2025 18:49
@4dh left a comment

Looks good to merge. Seeing gemini-2.5-pro as expected (thanks for your help!), and the inference time is generally in line with your testing.

@lkacenja lkacenja merged commit 275cc33 into dev Sep 15, 2025
109 of 133 checks passed