Parallel writing to shards #311
Conversation
aliddell left a comment
I think this looks OK; just one thing I have a question about.
    Defaults to None.
    scale : Tuple[float], optional
    shards_ratio : tuple[int, ...], optional
        TCZYX shards ratio of the plate.
What does "shards ratio" mean?
How many chunks per shard along each dimension. I think this is easier to use than the exact shard size, since the shard size has to be divisible by the chunk size.
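To make that convention concrete, here is a minimal sketch of the arithmetic described above; the variable names are illustrative, not iohub's API:

```python
# Hypothetical names; only the arithmetic reflects the comment above.
chunks = (1, 1, 16, 256, 256)    # TCZYX chunk shape
shards_ratio = (1, 1, 4, 2, 2)   # chunks per shard along each axis

# The shard shape is the elementwise product, so it is divisible by the
# chunk shape by construction, which is the constraint sharding imposes.
shard_shape = tuple(c * r for c, r in zip(chunks, shards_ratio))
assert all(s % c == 0 for s, c in zip(shard_shape, chunks))
print(shard_shape)  # (1, 1, 64, 512, 512)
```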
iohub/ngff/utils.py (outdated)
    if output_plate.version == "0.4" and shards_ratio is not None:
        raise ValueError("Sharding is not supported in OME-Zarr version 0.4.")
Why not ignore it in this case? Based on the documentation, users should not expect an exception here.
Raising an error here could help catch input mistakes, but maybe a warning is better.
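A warn-and-ignore variant might look like the sketch below; this illustrates the alternative being discussed against the quoted diff, not the merged code:

```python
import warnings

# Sketch only: `output_plate` and `shards_ratio` are as in the quoted
# diff above. Warn about the likely input error instead of raising,
# then proceed as if sharding had not been requested.
if output_plate.version == "0.4" and shards_ratio is not None:
    warnings.warn(
        "shards_ratio is ignored: OME-Zarr version 0.4 does not "
        "support sharding (sharding requires zarr v3)."
    )
    shards_ratio = None
```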
@aliddell the test is failing for the recent acquire-zarr release. Do you have a migration guide?
… API and downsampling behavior
No, but I should write one. I've fixed the fixture and test for now.
@ziw-liu as you outline in the PR description, here you've worked around some bugs in libraries we depend on. Could you leave notes (here or some other place) on what the code would ideally look like once these bugs are resolved? For example, we should be able to use both tensorstore and zarr-python to write arrays.
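As a rough sketch of that ideal end state, the same zarr v3 array could be written with either library. The store path and array layout below are assumptions for illustration, not iohub's actual output:

```python
import numpy as np
import tensorstore as ts
import zarr

data = np.zeros((16, 256, 256), dtype=np.uint16)

# Writing an existing zarr v3 array with zarr-python:
z = zarr.open_array("out.zarr/0", mode="r+")
z[:16, :256, :256] = data

# Writing the same array with tensorstore's zarr3 driver:
arr = ts.open({
    "driver": "zarr3",
    "kvstore": {"driver": "file", "path": "out.zarr/0"},
}).result()
arr[:16, :256, :256].write(data).result()
```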
In this specific case it doesn't really matter to the user that …
I'm not really suggesting two code paths here; I'm just looking to document your thought process for the next person who'd carry on this work. For example, would the code be better off if we didn't do explicit GC, or use spawn on all platforms, when creating multiprocessing workers?
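For context, the pattern under discussion roughly takes the shape below; this is a sketch of the workaround with a hypothetical helper name, not the exact iohub code:

```python
import gc
import multiprocessing as mp

def run_in_workers(func, jobs, n_workers=4):
    """Hypothetical helper illustrating the workaround discussed above."""
    # Explicit GC: drop lingering references (e.g. open store handles)
    # before starting workers.
    gc.collect()
    # Force "spawn" on all platforms so workers don't inherit state from
    # the parent via fork, which can deadlock with some native libraries.
    ctx = mp.get_context("spawn")
    with ctx.Pool(n_workers) as pool:
        return pool.map(func, jobs)
```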
Commits:

* Fix tensorstore empty array handling
  - Add validation for empty arrays in _save_transformed before tensorstore write
  - Skip write operations for empty arrays with warning messages
  - Add comprehensive error handling with detailed diagnostics for tensorstore failures
  - Improve error messages to include array shapes, sizes, and tensorstore details
  This resolves the "ValueError: Error aligning dimensions" issue when empty arrays are passed to tensorstore write operations.
* Add empty results check to prevent tensorstore alignment errors
  Adds validation in apply_transform_to_tczyx_and_save() to check for an empty results dictionary before calling _save_transformed(). When no valid time points are available, it logs a diagnostic message and skips the write operation instead of attempting to write empty arrays to tensorstore, which causes alignment dimension mismatches.
* Revert "Fix tensorstore empty array handling" (reverts commit 65c9ddb)
* better handling of output_time_indices
* style
ieivanov left a comment
I'm happy with this PR. I've tested that it correctly saves data in zarr v3 format using biahub concatenate in conjunction with czbiohub-sf/biahub#104, and I've also tested that the changes here don't break existing pipelines (specifically biahub deskew), which for now will continue writing in zarr v2.
I'm seeing that …
Added an error message; sharding along the channel dimension is not immediately straightforward. First attempt here: https://github.com/czbiohub-sf/iohub/tree/batched_channel_processing. If it's at all possible, then …
* create_empty_plate to expose sharding (a usage sketch follows below).
* apply_transform_to_tczyx_and_save to loop within a shard and multi-process across shards with process_single_position.
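A rough usage sketch of the first item, based on the docstring quoted in the review above; only shards_ratio is confirmed by this PR, and the surrounding argument names and values are illustrative assumptions:

```python
from iohub.ngff.utils import create_empty_plate

# Illustrative call; only `shards_ratio` is confirmed by this PR, the
# other arguments and values are assumptions for the example's sake.
create_empty_plate(
    store_path="output.zarr",
    position_keys=[("A", "1", "0")],
    channel_names=["DAPI", "GFP"],
    shape=(1, 2, 64, 512, 512),     # TCZYX
    chunks=(1, 1, 16, 256, 256),    # chunk shape
    shards_ratio=(1, 1, 4, 2, 2),   # chunks per shard along each axis
)
```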