[Managed Iceberg] Make manifest file writes and commits more efficient #32666

Open: ahmedabu98 wants to merge 2 commits into base: master


@ahmedabu98 ahmedabu98 commented Oct 5, 2024

When writing to Iceberg, we need to write just one manifest file per snapshot.

However, we currently write one manifest file per bundle (or, for streaming writes, one per GroupIntoBatches (GIB) batch), which is far more often than needed. In medium and large streaming jobs, this can produce thousands of extra manifest files. An Iceberg table feels this inefficiency in two ways:

  • Writing more files than necessary
  • During query planning, having to open and read more files than necessary

Solution:
Continue writing bundles/batches to data files, but stop writing manifest files at that cadence. Instead, group data files by destination, then write and commit just one manifest file per destination. In effect, manifest files should map 1-1 to snapshots/commits (currently, they map roughly 1-1 to data files).
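
The grouping step can be sketched roughly as follows. This is a simplified, hypothetical model rather than the code in this PR: `ManifestGroupingSketch`, `WrittenFile`, `groupByDestination`, and `manifestsAfterGrouping` are made-up names, and plain strings stand in for Iceberg data files and table identifiers. It only shows that collecting data files per destination before committing yields one manifest per destination instead of one per data file.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model, not the Beam or Iceberg API: it only illustrates the grouping step.
class ManifestGroupingSketch {

  // A data file that has already been written, tagged with its destination table.
  static final class WrittenFile {
    final String destination;
    final String path;

    WrittenFile(String destination, String path) {
      this.destination = destination;
      this.path = path;
    }
  }

  // Groups data file paths by destination, keeping destinations in first-seen order.
  static Map<String, List<String>> groupByDestination(List<WrittenFile> files) {
    Map<String, List<String>> grouped = new LinkedHashMap<>();
    for (WrittenFile file : files) {
      grouped.computeIfAbsent(file.destination, k -> new ArrayList<>()).add(file.path);
    }
    return grouped;
  }

  // With grouping, each commit round writes one manifest per destination,
  // no matter how many data files arrived for that destination.
  static int manifestsAfterGrouping(List<WrittenFile> files) {
    return groupByDestination(files).size();
  }
}
```

Under this model, three data files spread across two tables would produce two manifests (one per table) rather than three. In a real pipeline, each per-destination group would then be turned into a single manifest and committed as one snapshot.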

@ahmedabu98 ahmedabu98 changed the title [Managed Iceberg] Make file writes and commits more efficient [Managed Iceberg] Make manifest file writes and commits more efficient Oct 5, 2024

github-actions bot commented Oct 5, 2024

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment "assign set of reviewers".

@ahmedabu98 ahmedabu98 added this to the 2.60.0 Release milestone Oct 6, 2024
@ahmedabu98

Added as a release blocker because these changes are update-incompatible. Streaming writes will be officially supported in 2.60.0, so this should land in the same release to avoid breaking pipeline updates later.

@ahmedabu98

assign set of reviewers


github-actions bot commented Oct 6, 2024

Assigning reviewers. If you would like to opt out of this review, comment "assign to next reviewer":

R: @kennknowles for label java.
R: @damccorm for label build.
R: @chamikaramj for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).
