Conversation

Contributor

@lkacenja lkacenja commented Jul 9, 2025

DeepEval added a multimodal faithfulness metric while we weren't looking. We had created a custom one and used it both in our LLM-based summary metric and standalone in the exception check suite. We should replace those usages with the DeepEval version.

This PR removes the custom faithfulness metric and replaces it with the DeepEval version. We did fairly robust testing with the evaluation framework this time. This branch produced about a 2% drop in overall accuracy compared to main over the same period. The change in accuracy did not appear to be related to the faithfulness metric; we think the dip reflects overall variability in the evaluation suite, perhaps related to sample size.
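The claim that a ~2% accuracy dip is within run-to-run variability can be sanity-checked with a quick binomial standard-error estimate. A minimal sketch (the sample size and accuracy figures below are hypothetical, not taken from this PR's evaluation runs):

```python
import math

def accuracy_standard_error(p: float, n: int) -> float:
    """Standard error of an accuracy estimate measured over n independent test cases."""
    return math.sqrt(p * (1 - p) / n)

# Hypothetical numbers: 85% accuracy measured over 200 evaluation cases.
se = accuracy_standard_error(0.85, 200)
print(round(se, 3))  # → 0.025, so a 2% swing is within one standard error
```

Under these assumed numbers, a 2% difference between branches is smaller than one standard error, which is consistent with attributing the dip to sample-size noise rather than the metric change.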

Here are the dashboard links for posterity:
Main
This Branch

  • What additional steps are required to test this branch locally?

Standard docker compose rebuild and up. Don't forget to add the API keys.

  • Are there any rake tasks to run on production?

No

@lkacenja lkacenja changed the base branch from main to dev July 9, 2025 15:49
@lkacenja lkacenja self-assigned this Jul 9, 2025
@lkacenja lkacenja requested a review from allisonmorgan July 17, 2025 21:11
@lkacenja lkacenja marked this pull request as ready for review July 17, 2025 21:12
@lkacenja
Contributor Author

@allisonmorgan I think this PR is ready for review, but we can hold off on merging it if we want to try to run the main branch evaluation a few more times to build up a larger statistical base.

@allisonmorgan
Contributor

These two links look the same to me:

Here are the dashboard links for posterity:
Main
This Branch

I didn't see a saved view for the main branch, so I saved one here.

Contributor

@allisonmorgan allisonmorgan left a comment


Leaving some comments for now. Want to follow up with you about more testing next week.

type: choice
description: Evaluation Model
options:
- gemini-2.5-flash
Contributor


Thank you!

Contributor


I assume this wasn't meant to be committed?

Contributor Author


Whoops. Time for a repo cleanup. Removed now.

# Add some inferences.
DocumentInference.create(
  inference_type: "exception:is_application",
  inference_value: "True",
  inference_reason: "This is not used as an application or means of participation in government services.",
  document_id: doc.id
)
DocumentInference.create(
  inference_type: "exception:is_third_party",
  inference_value: "False",
  inference_reason: "This is not third party.",
  document_id: doc.id
)
DocumentInference.create(
  inference_type: "exception:is_archival",
  inference_value: "True",
  inference_reason: "This thing was made in 1988 and hasn't been opened since then.",
  document_id: doc.id
)
Contributor


Thanks for catching this test change.

{
    "metric_name": f"deepeval_mllm_faithfulness:{exception}",
-   "metric_version": FAITHFULNESS_VERSION,
+   "metric_version": 2,
Contributor


I was looking at the requirements.txt file and I see that the deepeval dependency points at CFA's fork of it. Just wanted to confirm: this means that if deepeval makes any changes to their definition of the MultimodalFaithfulnessMetric, our definition (and the metric version) shouldn't change, right?

Contributor Author


our definition (and the metric version) shouldn't change right?

Right, the dependency is frozen unless we update the fork. They finally merged the PR that I created the fork to work on, so I've been meaning to point this back at the official repository. This seems like a good time to do so.
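For context, freezing on a fork like this is typically done with a direct-URL requirement pinned to a fixed ref. A hypothetical sketch (the org, repository path, and tag below are placeholders, not the actual CFA fork):

```
# requirements.txt -- hypothetical pin to a fork at a fixed commit/tag
deepeval @ git+https://github.com/example-org/deepeval.git@example-tag
```

Pinning to a commit SHA or tag rather than a branch guarantees the installed code cannot drift under us.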

Contributor Author


Actually, let's handle this as a separate effort. I created a ticket for it here and put it at the top of the heap.

@lkacenja
Contributor Author

I didn't see a saved view for the main branch, so I saved one here.

Thanks for creating the hex link for main!

@allisonmorgan allisonmorgan self-requested a review July 22, 2025 18:55
Contributor

@allisonmorgan allisonmorgan left a comment


This looks great! Glad we found that running the experiment more times stabilized the results and confirmed that the deepeval changes aren't qualitatively different.

@lkacenja lkacenja merged commit 487434b into dev Jul 22, 2025
204 checks passed