Conversation

Contributor

@lkacenja lkacenja commented Jul 9, 2025

DeepEval added a multimodal faithfulness metric while we weren't looking. We had created a custom one and used it both in our LLM-based summary metric and standalone in the exception check suite. We should replace those usages with the DeepEval version.

This PR removes the custom faithfulness metric and replaces it with the DeepEval version. We did fairly robust testing with the evaluation framework this time. This branch produced about a 2% drop in overall accuracy compared to main over the same period. The change in accuracy did not appear to be related to the faithfulness metric; we think the dip reflects overall variability in the evaluation suite, perhaps related to sample size.
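The claim that a ~2% accuracy dip is within run-to-run variability can be sanity-checked with a quick binomial standard-error estimate. A minimal sketch (the sample size and accuracy figures below are hypothetical, not taken from this PR's evaluation runs):

```python
import math

def accuracy_standard_error(p: float, n: int) -> float:
    """Standard error of an accuracy estimate measured over n independent test cases."""
    return math.sqrt(p * (1 - p) / n)

# Hypothetical numbers: 85% accuracy measured over 200 evaluation cases.
se = accuracy_standard_error(0.85, 200)
print(round(se, 3))  # → 0.025, so a 2% swing is within one standard error
```

Under these assumed numbers, a 2% difference between branches is smaller than one standard error, which is consistent with attributing the dip to sample-size noise rather than the metric change.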

Here are the dashboard links for posterity:
Main
This Branch

  • What additional steps are required to test this branch locally?

Standard docker compose rebuild and up. Don't forget to add the API keys.

  • Are there any rake tasks to run on production?

No

@lkacenja lkacenja changed the base branch from main to dev July 9, 2025 15:49
@lkacenja lkacenja self-assigned this Jul 9, 2025
@lkacenja lkacenja requested a review from allisonmorgan July 17, 2025 21:11
@lkacenja lkacenja marked this pull request as ready for review July 17, 2025 21:12
@lkacenja
Contributor Author

@allisonmorgan I think this PR is ready for review, but we can hold off on merging it if we want to try to run the main branch evaluation a few more times to build up a larger statistical base.

@allisonmorgan
Contributor

These two links look the same to me:

Here are the dashboard links for posterity:
Main
This Branch

I didn't see a saved view for the main branch, so I saved one here.

Contributor

@allisonmorgan allisonmorgan left a comment


Leaving some comments for now. Want to follow up with you about more testing next week.

type: choice
description: Evaluation Model
options:
- gemini-2.5-flash
Contributor


Thank you!

Contributor


I assume this wasn't meant to be committed?

Contributor Author


Whoops. Time for a repo cleanup. Removed now.

# Add some inferences.
DocumentInference.create(
  inference_type: "exception:is_application",
  inference_value: "True",
  inference_reason: "This is not used as an application or means of participation in government services.",
  document_id: doc.id
)
DocumentInference.create(
  inference_type: "exception:is_third_party",
  inference_value: "False",
  inference_reason: "This is not third party.",
  document_id: doc.id
)
DocumentInference.create(
  inference_type: "exception:is_archival",
  inference_value: "True",
  inference_reason: "This thing was made in 1988 and hasn't been opened since then.",
  document_id: doc.id
)
Contributor


Thanks for catching this test change.

{
    "metric_name": f"deepeval_mllm_faithfulness:{exception}",
-   "metric_version": FAITHFULNESS_VERSION,
+   "metric_version": 2,
Contributor


I was looking at the requirements.txt file and I see that the deepeval dependency points at CFA's fork of it. Just wanted to confirm: this means that if deepeval makes any changes to their definition of the MultimodalFaithfulnessMetric, our definition (and the metric version) shouldn't change, right?

Contributor Author


our definition (and the metric version) shouldn't change right?

Right, the dependency is frozen unless we update the fork. They finally merged the PR that I created the fork to work on, so I've been meaning to point this back at the official repository. This seems like a good time to do so.
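For context, freezing on a fork like this is typically done with a direct-URL requirement pinned to a fixed ref. A hypothetical sketch (the org, repository path, and tag below are placeholders, not the actual CFA fork):

```
# requirements.txt -- hypothetical pin to a fork at a fixed commit/tag
deepeval @ git+https://github.com/example-org/deepeval.git@example-tag
```

Pinning to a commit SHA or tag rather than a branch guarantees the installed code cannot drift under us.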

Contributor Author


Actually, let's handle this as a separate effort. I created a ticket for it here and put it at the top of the heap.

@lkacenja
Contributor Author

I didn't see a saved view for the main branch, so I saved one here.

Thanks for creating the hex link for main!

@allisonmorgan allisonmorgan self-requested a review July 22, 2025 18:55
Contributor

@allisonmorgan allisonmorgan left a comment


This looks great! Glad we found that running the experiment more times stabilized the results and confirmed that the deepeval changes aren't qualitatively different.

@lkacenja lkacenja merged commit 487434b into dev Jul 22, 2025
204 checks passed