Batch evaluations #111

jacobreesmontgomery · 2025-07-25T14:19:02Z

jacobreesmontgomery
Jul 25, 2025

Hi! I have a question regarding batch evaluations.

Let's say I want to run a batch evaluation on a set of questions using two evaluators: the ResponseCompletenessEvaluator, which requires the parameters response and ground_truth, and the GroundednessEvaluator, which requires query, context, and response. Furthermore, let's say that ALL Q&A results to be fed into the batch evaluation meet the requirements for groundedness but not for response completeness (i.e., not all questions have a ground truth). Would the batch evaluation be dynamic enough to handle this scenario correctly? In other words, can I expand the below evaluator_config to include ground_truth and see all questions receive a groundedness evaluation, with those that have a ground truth also receiving the response completeness evaluation? Or would I have to run two separate batch evaluations, one for groundedness and another for completeness?

Answered by nitya

Aug 4, 2025


nitya · 2025-07-29T18:09:36Z

nitya
Jul 29, 2025
Maintainer

Hey @jacobreesmontgomery thanks for asking the question!! This can also help us in future documentation.

Evaluators do have a data validation step to check if the required data values are specified before evaluating that row. In a batch evaluation, my understanding is that if a specific row fails, the rest of the evaluation will still continue.

However, we have reached out to the Evaluations SDK team to confirm if this is the case. We hope to have a response to you by tomorrow. If you have any other clarifications to add, please do so.
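To make the validation idea above concrete, here is a minimal plain-Python sketch of a per-row required-field check. It makes no SDK calls and is not the Evaluations SDK's implementation: `REQUIRED_FIELDS`, `missing_fields`, and `runnable_evaluators` are hypothetical names, and the required parameters are simply the ones listed in the original question.

```python
# Hypothetical illustration only: this is NOT the azure-ai-evaluation code.
# Required parameters per evaluator, as stated in the question above.
REQUIRED_FIELDS = {
    "groundedness": ("query", "context", "response"),
    "response_completeness": ("response", "ground_truth"),
}


def missing_fields(row, evaluator_name):
    """Return the required fields that are absent (or None) in this row."""
    return [f for f in REQUIRED_FIELDS[evaluator_name] if row.get(f) is None]


def runnable_evaluators(row):
    """Return the evaluators whose required fields are all present in this row."""
    return [name for name in REQUIRED_FIELDS if not missing_fields(row, name)]


rows = [
    {"query": "q1", "context": "c1", "response": "r1", "ground_truth": "g1"},
    {"query": "q2", "context": "c2", "response": "r2"},  # no ground truth
]

for row in rows:
    print(row["query"], runnable_evaluators(row))
```

In this sketch the second row qualifies for groundedness only, which is the per-row behavior the question is asking about.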

1 reply

jacobreesmontgomery Aug 4, 2025
Author

Okay, thank you!

Any updates on this front?

One problem I've found is that, in running the above scenario, the response completeness is heavily skewed by the Q&A sets where the ground truth was not provided. I would think that it should ignore the scoring of response completeness for those sets but instead it gives a rating of one.
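The skew described here can be illustrated with a small plain-Python sketch. The per-row results and key names below are made up for illustration (they are not actual SDK output), and `mean_completeness` is a hypothetical helper; it only shows how rows scored 1 because of an empty ground truth pull the average down, and how filtering them out changes it.

```python
from statistics import mean

# Hypothetical per-row results; keys and scores are illustrative only.
results = [
    {"ground_truth": "Paris is the capital of France.", "completeness": 5},
    {"ground_truth": "", "completeness": 1},  # no ground truth supplied
    {"ground_truth": "", "completeness": 1},  # no ground truth supplied
    {"ground_truth": "Water boils at 100 C at sea level.", "completeness": 4},
]


def mean_completeness(rows, skip_missing_ground_truth):
    """Average the completeness score, optionally ignoring rows without ground truth."""
    scores = [
        r["completeness"]
        for r in rows
        if not skip_missing_ground_truth or r["ground_truth"]
    ]
    return mean(scores)


naive = mean_completeness(results, skip_missing_ground_truth=False)
filtered = mean_completeness(results, skip_missing_ground_truth=True)
print(naive, filtered)
```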

nitya · 2025-08-04T23:09:44Z

nitya
Aug 4, 2025
Maintainer

Hi @jacobreesmontgomery - thanks for your patience!! I talked to the engineer on the team and we ran some tests to see if the issue you experienced was reproducible. It was. And the behavior may be the result of a different change - so their recommendation was to have you file this as an issue.

What we saw when reproducing your issue:

If you have a batch evaluation with multiple evaluators, and a data row is missing values for parameters required by a specific evaluator, then the evaluation RUN will complete, but the relevant evaluators will show failures in the logs.
If the ground truth in the data was represented by an empty string, the evaluator scores that row a 1. But if it resolved to None (a missing value), the evaluator would register a failure.
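Given the behavior above, one possible interim workaround is to pre-process the data so that rows without a ground truth never reach the completeness evaluator. This is only a sketch under assumptions, not a fix from the SDK team: it is plain Python with no SDK calls, `has_ground_truth`, `split_by_ground_truth`, and `to_jsonl` are hypothetical helpers, and the `ground_truth` key is assumed to match the column name in your data.

```python
import json


def has_ground_truth(row, key="ground_truth"):
    """True only for a non-empty string; None, missing, and "" all count as absent."""
    value = row.get(key)
    return isinstance(value, str) and value.strip() != ""


def split_by_ground_truth(rows, key="ground_truth"):
    """Partition rows into (with ground truth, without ground truth)."""
    with_gt = [r for r in rows if has_ground_truth(r, key)]
    without_gt = [r for r in rows if not has_ground_truth(r, key)]
    return with_gt, without_gt


def to_jsonl(rows):
    """Serialize rows as JSON Lines text, one JSON object per line."""
    return "\n".join(json.dumps(r) for r in rows)


rows = [
    {"query": "q1", "context": "c1", "response": "r1", "ground_truth": "g1"},
    {"query": "q2", "context": "c2", "response": "r2", "ground_truth": ""},
    {"query": "q3", "context": "c3", "response": "r3", "ground_truth": None},
    {"query": "q4", "context": "c4", "response": "r4"},
]

with_gt, without_gt = split_by_ground_truth(rows)
print(to_jsonl(with_gt))
```

With this split, the full dataset could go to a groundedness run and only `with_gt` to a completeness run, which corresponds to the "two separate batch evaluations" option raised in the question.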

Recommendation:
Submit it as an issue to the Azure Python SDK / Evaluations (see repo link below), and specify which version of the package you were using. This gets routed to the same engineering team, and they can take a look and ask for clarifications.

Repo (and current evaluation issues) here:
https://github.com/Azure/azure-sdk-for-python/issues?q=is%3Aissue%20state%3Aopen%20label%3AEvaluation

2 replies

jacobreesmontgomery Aug 5, 2025
Author

Hi @nitya , thank you for your reply! I have submitted an issue here.

nitya Aug 5, 2025
Maintainer

Thanks @jacobreesmontgomery for the discussion and for submitting the issue. I will close the loop internally as well. Appreciate this.
