Use DeepEval faithfulness metric and remove custom version. #219
Conversation
@allisonmorgan I think this PR is ready for review, but we can hold off on merging it if we want to try to run the main branch evaluation a few more times to build up a larger statistical base.
These two links look the same to me:
I didn't see a saved view for the main branch, so I saved one here.
allisonmorgan left a comment
Leaving some comments for now. Want to follow up with you about more testing next week.
type: choice
description: Evaluation Model
options:
  - gemini-2.5-flash
Thank you!
I assume this wasn't meant to be committed?
Whoops. Time for a repo cleanup. Removed now.
# Add some inferences.
DocumentInference.create(inference_type: "exception:is_application", inference_value: "True", inference_reason: "This is not used as an application or means of participation in government services.", document_id: doc.id)
DocumentInference.create(inference_type: "exception:is_third_party", inference_value: "False", inference_reason: "This is not third party.", document_id: doc.id)
DocumentInference.create(inference_type: "exception:is_archival", inference_value: "True", inference_reason: "This thing was made in 1988 and hasn't been opened since then.", document_id: doc.id)
Thanks for catching this test change.
  {
      "metric_name": f"deepeval_mllm_faithfulness:{exception}",
-     "metric_version": FAITHFULNESS_VERSION,
+     "metric_version": 2,
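To illustrate the change in the hunk above: the version field is now pinned to a literal 2 rather than the removed custom constant. A minimal sketch of how such a record might be assembled (the `faithfulness_record` helper and `metric_value` field are illustrative stand-ins, not the repo's actual API):

```python
def faithfulness_record(exception: str, score: float) -> dict:
    """Build one metric record for an exception check (hypothetical helper)."""
    return {
        "metric_name": f"deepeval_mllm_faithfulness:{exception}",
        # Pinned to 2 to mark the switch from the custom metric to DeepEval's.
        "metric_version": 2,
        "metric_value": score,
    }

record = faithfulness_record("is_archival", 0.87)
```

Pinning the version by hand means dashboards can distinguish scores produced by the old custom metric from scores produced by the DeepEval-backed one.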
I was looking at the requirements.txt file and I see that the deepeval library is referring to CFA's fork of it. Just wanted to confirm: this means that if deepeval makes any changes to their definition of the MultimodalFaithfulnessMetric, our definition (and the metric version) shouldn't change right?
> our definition (and the metric version) shouldn't change right?
Right, the dependency is frozen, unless we update the fork. They finally merged the PR I made that fork to work on. I've been meaning to return this to the official repository. This seems like a good time to do so.
Actually, let's do this on a separate effort. I created a ticket for it here and put it at the top of the heap.
Thanks for creating the hex link for main!
allisonmorgan left a comment
This looks great! Glad we found that running the experiment more served to stabilize the results and confirmed that the deepeval changes aren't qualitatively different.
DeepEval added a multimodal faithfulness metric while we weren't looking. We had created a custom one and used it both in our LLM-based summary metric and standalone in the exception check suite. We should replace these usages with the DeepEval version.
This PR removes the custom faithfulness metric and replaces it with the DeepEval version. We did some fairly robust testing with the evaluation framework this time. This branch produced about a 2% drop in overall accuracy compared to the main branch evaluated at the same time. The changes in accuracy did not seem to be related to the faithfulness metric; we think the dip reflects overall variability in the evaluation suite, perhaps related to sample size.
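The sample-size point can be made concrete with a back-of-the-envelope check (the accuracy and sample counts below are illustrative, not the suite's actual numbers): the standard error of an observed accuracy p over n independent documents is sqrt(p(1-p)/n), so a ~2% swing sits within one standard error for runs of a few hundred samples.

```python
import math

def accuracy_standard_error(p: float, n: int) -> float:
    """Standard error of an observed accuracy p over n independent samples."""
    return math.sqrt(p * (1 - p) / n)

# Illustrative numbers only: at 90% accuracy over 200 documents,
# one standard error is about 2.1%, so a 2% dip is within run-to-run noise.
se = accuracy_standard_error(0.90, 200)
```

This is why re-running the main branch evaluation a few more times, as discussed above, is a reasonable way to build up a larger statistical base before reading much into the gap.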
Here are the dashboard links for posterity:
Main
This Branch
Standard docker compose rebuild and up. Don't forget to add the API keys.
No