Skip to content

Fix internal error in sort when hitting memory limit #15692

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

DerGut
Copy link
Contributor

@DerGut DerGut commented Apr 12, 2025

Which issue does this PR close?

Rationale for this change

I noticed an internal error in the sort implementation. This PR is exposing it as a proper user-facing error instead.

What changes are included in this PR?

If a user hits this condition, they now receive a DataFusionError:ResourcesExhausted instead of DataFusionError::Internal.

Are these changes tested?

Yes.

Are there any user-facing changes?

In away, the code now exposes a different error to the user facing API. However, since the old error was internal, this was a bug and I don't expect any documentation changes are necessary.

@DerGut DerGut force-pushed the main branch 2 times, most recently from 50f7266 to ae0b6a5 Compare April 12, 2025 15:47
DerGut added 3 commits April 12, 2025 17:48
squash! Fix clone_on_ref_ptr Clippy warning

squash! Move test import to test module

squash! Be more explicit about triggering config values
let result = self.reservation.try_grow(size);
if result.is_err() {
if self.in_mem_batches.is_empty() {
return result;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@2010YOUY01 since you commented on the issue #15675 (comment) with suggestions:

I think the fix (for a more user-friendly error message) would be

  1. When sort_spill_reservation_byte reservation failed, the error message should include considering set the memory limit larger than this reservation memory size
  2. For the code path triggered this internal error, return an ExecutionError instead with a similar error message

This code now emits a DataFusionError::ResourcesExhausted instead of the suggested DataFusionError::Execution. I think this makes sense, given the descriptions of the errors.

E.g. Execution error doesn't seem to apply:

This error is returned when an error happens during execution due to a malformed input.

but ResourcesExhausted does:

This error is thrown when a consumer cannot acquire additional memory


I still like the idea of suggesting to bump the memory limit in the error message.

However, since the code simply bubbles up the error emitted by the memory pool's insufficient_capacity_err macro, we would either need to map it, or change the macro's message.

I don't think mapping makes sense here because the error is generally used for reservation-related issues - I don't think this is a special case.
Changing the error message would probably require a separate discussion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense that we should preserve the raw resource error. In this case, ContextError comes in handy:

Context(String, Box<DataFusionError>),

It can add extra notes to the original error message, to make it more understandable.

Here, the extra note can be like "When the memory limit is reached and there are no buffered batches to spill, external sort cannot continue. Consider setting a higher memory limit."

Copy link
Contributor Author

@DerGut DerGut Apr 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! Thanks a lot for the reference 🙏

I added some context with d4e654c

I dropped the detail about buffered batches because I figured that it's probably not relevant to the user. Regardless of whether in-memory batches could be spilled or not, the actionable info is to increase the memory limit. This also allowed me to reuse the same context in a14e585 (and 0ee1e8e for completeness). This means all failing memory reservations in the external sorter are wrapped with context. Let me know if you don't think these are necessary.

I also added a suggestion to decrease sort_spill_reservation_bytes instead. But I'm not convinced this is good advice to the user. I can leave it as a comment instead 🤷

@DerGut DerGut marked this pull request as ready for review April 12, 2025 17:12
Copy link
Contributor

@2010YOUY01 2010YOUY01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this rapid fix! I left some comments, and would love to hear your take.

let result = self.reservation.try_grow(size);
if result.is_err() {
if self.in_mem_batches.is_empty() {
return result;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense that we should preserve the raw resource error. In this case, ContextError comes in handy:

Context(String, Box<DataFusionError>),

It can add extra notes to the original error message, to make it more understandable.

Here, the extra note can be like "When the memory limit is reached and there are no buffered batches to spill, external sort cannot continue. Consider setting a higher memory limit."

Comment on lines 801 to 804
DataFusionError::ResourcesExhausted(_) => e.context(format!(
"Not enough memory to continue external sort. \
Consider increasing the memory limit, or decreasing sort_spill_reservation_bytes"
)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I match on the result of this function will it still be resources exhausted or different type, if different type this can break usages that depend on that, no?

Copy link
Contributor Author

@DerGut DerGut Apr 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point! It will be a different type (or enum variant), that's also why the test is currently failing (see https://github.com/apache/datafusion/pull/15692/files#r2041251481)

Copy link
Contributor Author

@DerGut DerGut Apr 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would actually be a good argument for adding a function to DataFusionError that allows you to safely compare against any of its nested errors (I'm thinking something similar to Go's errors.Is).

Without such an API, it would never be safe to just wrap an error with context 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it's better to have a utility to assert an error's root cause, like
Context(Context(ResourceExhausted, ...'s root cause is ResourceExhanusted, however, I think this is not required for this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works for me :) CI is passing again and I'l create a separate issue 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created this: #15713


let err = result.unwrap_err();
assert!(
matches!(err, DataFusionError::ResourcesExhausted(_)),
Copy link
Contributor Author

@DerGut DerGut Apr 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note, this is now failing because it gets a DataFusionError::Context instead. I'm still pondering whether to unwrap unnest the error here, or add something like an DataFusionError::is(&self, target: &DataFusionError) -> bool to match the nested error.
I don't think matching against Context instead would be interesting here. I'd rather test the cause of the error, less so whether it has added context or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made it pass by doing the unnesting in-line for now d2a0765. I'll create a separate issue to discuss a potential API extension of DataFusionError.

@2010YOUY01
Copy link
Contributor

Thank you, it looks good to me 🙏🏼
Let's make the CI pass, I think we can change the assertion type for simplicity here, and do a separate PR for this utility DataFusionError::is(&self, target: &DataFusionError) -> bool if you prefer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Internal error in ExternalSorter when running with memory limit
3 participants