Skip to content

GH-46710: [C++] Fix ownership and lifetime issues in Dataset Writer #46711

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 12, 2025

Conversation

pitrou
Copy link
Member

@pitrou pitrou commented Jun 4, 2025

Rationale for this change

The dataset writer currently uses raw pointers of the pieces of state it needs to track, even though some of this state is accessed in async callbacks that might run at arbitrary points in time (especially in the case an error forces an early return of the synchronously running logic).

What changes are included in this PR?

This PR strives to strengthen the dataset writer code by using safe pointers everywhere possible, and also trying to guard against potential leaks due to cyclic shared_ptr references.

Besides making it potentially more robust, it should also make future maintenance and evolutions easier.

This PR seems to fix #45235, though it will have to be confirmed after multiple CI runs.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

@pitrou
Copy link
Member Author

pitrou commented Jun 4, 2025

@github-actions crossbow submit -g cpp

This comment was marked as outdated.

@pitrou
Copy link
Member Author

pitrou commented Jun 10, 2025

@zanmato1984 Would you have time to look at this? I fear that not many devs are able to review this code...

@pitrou
Copy link
Member Author

pitrou commented Jun 10, 2025

Also cc @bkietz

@zanmato1984
Copy link
Contributor

Sorry for being late. I'm actually reviewing this today. Didn't touch much of this part of the code so it may take some time.

@pitrou
Copy link
Member Author

pitrou commented Jun 10, 2025

Thank you! I'll rebase to get a newer baseline for CI.

@pitrou pitrou force-pushed the dataset-writer-memory-safety branch from 44c6f48 to 0117ee4 Compare June 10, 2025 07:55
@pitrou
Copy link
Member Author

pitrou commented Jun 10, 2025

@github-actions crossbow submit -g cpp

Copy link

Revision: 0117ee4

Submitted crossbow builds: ursacomputing/crossbow @ actions-3309186726

Task Status
example-cpp-minimal-build-static GitHub Actions
example-cpp-minimal-build-static-system-dependency GitHub Actions
example-cpp-tutorial GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-bundled GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-bundled-offline GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions
test-ubuntu-24.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-24.04-cpp-thread-sanitizer GitHub Actions

Comment on lines +230 to +234
auto st = PopAndDeliverStagedBatch().status();
if (!st.ok()) {
file_tasks_.reset();
return st;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto st = PopAndDeliverStagedBatch().status();
if (!st.ok()) {
file_tasks_.reset();
return st;
}
RETURN_NOT_OK_ELSE(PopAndDeliverStagedBatch(), file_tasks_.reset());

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had forgotten about this little-used macro. I honestly find it a bit confusing, I had to look up the definition to ensure I understood the semantics correctly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree the name is confusing. I do believe when given a less confusing name, this can be an actually useful macro. How about we rename it to something like RETURN_NOT_OK_AFTER (in follow-up PR)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I'm not sure RETURN_NOT_OK_AFTER would be much less confusing, honestly.
I wonder if we can add some kind of functional API to Status, for example:

      RETURN_NOT_OK(PopAndDeliverStagedBatch().OrElse([&] { file_tasks_.reset() }));

@@ -377,8 +375,8 @@ class ThrottledAsyncTaskSchedulerImpl
if (maybe_backoff) {
lk.unlock();
if (!maybe_backoff->TryAddCallback([&] {
return [self = shared_from_this()](const Status& st) {
if (st.ok()) {
return [weak_self = weak_from_this()](const Status& st) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

pause_callback_(std::move(pause_callback)),
resume_callback_(std::move(resume_callback)) {}

~DatasetWriterImpl() {
// In case something went wrong (e.g. an IO error occurred), some tasks
// may be left dangling in a ThrottledAsyncTaskScheduler and that may
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are such tasks left dangling exactly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly I haven't tried to analyze this very deeply. It might be the file_finish_task actually? In any case, this is necessary to suppress the memory leaks in error cases.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jun 11, 2025
Copy link
Contributor

@zanmato1984 zanmato1984 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though we are not sure about how exactly this fix addresses the issue, I think it's generally safe and improves the life-cycle safety. +1.

@pitrou
Copy link
Member Author

pitrou commented Jun 12, 2025

Thanks for the review @zanmato1984 !

@pitrou pitrou merged commit 57439e0 into apache:main Jun 12, 2025
46 of 48 checks passed
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Jun 12, 2025
@pitrou pitrou deleted the dataset-writer-memory-safety branch June 12, 2025 12:48
Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 57439e0.

There were 68 benchmark results with an error:

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 18 possible false positives for unstable benchmarks that are known to sometimes produce them.

alinaliBQ pushed a commit to Bit-Quill/arrow that referenced this pull request Jun 17, 2025
…iter (apache#46711)

### Rationale for this change

The dataset writer currently uses raw pointers of the pieces of state it needs to track, even though some of this state is accessed in async callbacks that might run at arbitrary points in time (especially in the case an error forces an early return of the synchronously running logic).

### What changes are included in this PR?

This PR strives to strengthen the dataset writer code by using safe pointers everywhere possible, and also trying to guard against potential leaks due to cyclic shared_ptr references.

Besides making it potentially more robust, it should also make future maintenance and evolutions easier.

This PR seems to fix apache#45235, though it will have to be confirmed after multiple CI runs.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.

* GitHub Issue: apache#46710

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
pitrou added a commit that referenced this pull request Jun 30, 2025
…46792)

### Rationale for this change

In #46711 (comment) it was mentioned that the macro `RETURN_NOT_OK_ELSE` is confusing and can easily be misunderstood. We would like a better way to conditionally chain error-handling code if a Status does not indicate success.

### What changes are included in this PR?

1. Add a type trait `IntoStatus<T>` that can be implemented to provide conversions from other error-like types
2. Add a global `ToStatus` function that calls the aforementioned type trait
3. Add a `Status::OrElse` method that calls a functor on error
4. Remove the `RETURN_NOT_OK_ELSE` macro

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No, the `RETURN_NOT_OK_ELSE` was not supposed to be called by third-party code as it's not prefixed with `ARROW_`.

* GitHub Issue: #46791

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <pitrou@free.fr>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++] Fix ASAN issue in the arrow-dataset-dataset-writer-test
2 participants