-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Fix internal error in sort when hitting memory limit #15692
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
50f7266
to
ae0b6a5
Compare
squash! Fix clone_on_ref_ptr Clippy warning squash! Move test import to test module squash! Be more explicit about triggering config values
let result = self.reservation.try_grow(size); | ||
if result.is_err() { | ||
if self.in_mem_batches.is_empty() { | ||
return result; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@2010YOUY01 since you commented on the issue #15675 (comment) with suggestions:
I think the fix (for a more user-friendly error message) would be
- When sort_spill_reservation_byte reservation failed, the error message should include considering set the memory limit larger than this reservation memory size
- For the code path triggered this internal error, return an ExecutionError instead with a similar error message
This code now emits a DataFusionError::ResourcesExhausted
instead of the suggested DataFusionError::Execution
. I think this makes sense, given the descriptions of the errors.
E.g. Execution error doesn't seem to apply:
This error is returned when an error happens during execution due to a malformed input.
but ResourcesExhausted does:
This error is thrown when a consumer cannot acquire additional memory
I still like the idea of suggesting to bump the memory limit in the error message.
However, since the code simply bubbles up the error emitted by the memory pool's insufficient_capacity_err
macro, we would either need to map it, or change the macro's message.
I don't think mapping makes sense here because the error is generally used for reservation-related issues - I don't think this is a special case.
Changing the error message would probably require a separate discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes sense that we should preserve the raw resource error. In this case, ContextError
comes in handy:
datafusion/datafusion/common/src/error.rs
Line 132 in 0b01fdf
Context(String, Box<DataFusionError>), |
It can add extra notes to the original error message, to make it more understandable.
Here, the extra note can be like "When the memory limit is reached and there are no buffered batches to spill, external sort cannot continue. Consider setting a higher memory limit."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great! Thanks a lot for the reference 🙏
I added some context with d4e654c
I dropped the detail about buffered batches because I figured that it's probably not relevant to the user. Regardless of whether in-memory batches could be spilled or not, the actionable info is to increase the memory limit. This also allowed me to reuse the same context in a14e585 (and 0ee1e8e for completeness). This means all failing memory reservations in the external sorter are wrapped with context. Let me know if you don't think these are necessary.
I also added a suggestion to decrease sort_spill_reservation_bytes
instead. But I'm not convinced this is good advice to the user. I can leave it as a comment instead 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this rapid fix! I left some comments, and would love to hear your take.
let result = self.reservation.try_grow(size); | ||
if result.is_err() { | ||
if self.in_mem_batches.is_empty() { | ||
return result; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes sense that we should preserve the raw resource error. In this case, ContextError
comes in handy:
datafusion/datafusion/common/src/error.rs
Line 132 in 0b01fdf
Context(String, Box<DataFusionError>), |
It can add extra notes to the original error message, to make it more understandable.
Here, the extra note can be like "When the memory limit is reached and there are no buffered batches to spill, external sort cannot continue. Consider setting a higher memory limit."
Signed-off-by: DerGut <[email protected]> squash! Use internal_err macro
Signed-off-by: DerGut <[email protected]>
Signed-off-by: DerGut <[email protected]>
Signed-off-by: DerGut <[email protected]>
DataFusionError::ResourcesExhausted(_) => e.context(format!( | ||
"Not enough memory to continue external sort. \ | ||
Consider increasing the memory limit, or decreasing sort_spill_reservation_bytes" | ||
)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I match on the result of this function will it still be resources exhausted or different type, if different type this can break usages that depend on that, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point! It will be a different type (or enum variant), that's also why the test is currently failing (see https://github.com/apache/datafusion/pull/15692/files#r2041251481)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would actually be a good argument for adding a function to DataFusionError
that allows you to safely compare against any of its nested errors (I'm thinking something similar to Go's errors.Is
).
Without such an API, it would never be safe to just wrap an error with context 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it's better to have a utility to assert an error's root cause, like
Context(Context(ResourceExhausted, ...
's root cause is ResourceExhanusted
, however, I think this is not required for this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Works for me :) CI is passing again and I'l create a separate issue 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created this: #15713
|
||
let err = result.unwrap_err(); | ||
assert!( | ||
matches!(err, DataFusionError::ResourcesExhausted(_)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note, this is now failing because it gets a DataFusionError::Context
instead. I'm still pondering whether to unwrap unnest the error here, or add something like an DataFusionError::is(&self, target: &DataFusionError) -> bool
to match the nested error.
I don't think matching against Context
instead would be interesting here. I'd rather test the cause of the error, less so whether it has added context or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made it pass by doing the unnesting in-line for now d2a0765. I'll create a separate issue to discuss a potential API extension of DataFusionError
.
Thank you, it looks good to me 🙏🏼 |
Signed-off-by: Jannik Steinmann <[email protected]>
Which issue does this PR close?
Rationale for this change
I noticed an internal error in the sort implementation. This PR is exposing it as a proper user-facing error instead.
What changes are included in this PR?
If a user hits this condition, they now receive a
DataFusionError:ResourcesExhausted
instead ofDataFusionError::Internal
.Are these changes tested?
Yes.
Are there any user-facing changes?
In away, the code now exposes a different error to the user facing API. However, since the old error was internal, this was a bug and I don't expect any documentation changes are necessary.