Refactor converter to use pyvips streaming and multiprocessing for large files #159
Hi there,
Thanks for the encouragement to create a PR on this! This pull request addresses the core issue of memory errors (like `loci.formats.FormatException: Image plane too large`) that occur when trying to convert very large Whole-Slide Images using the `BioFormats` library.

The main changes are:
- `Converter.py` has been refactored to use `pyvips`'s streaming capabilities (`access="sequential"`). Instead of loading the entire image into RAM, it now processes the file in chunks. This removes the memory bottleneck and allows very large files to be converted.
- The `multiprocessing` library has been integrated into the `process_all` method. This allows the script to use multiple CPU cores to process files in parallel, which can substantially reduce the time required to convert a large directory of images.

Important Note on a Design Choice:
In implementing this, I've focused on making the primary use case (handling large WSI files like `.svs`, `.ndpi`, etc.) as robust and efficient as possible. To simplify the logic and dependencies, I have removed the fallback mechanism that used `BioFormatsSlideReader`.

My reasoning is that I'm not very familiar with the `Bio-Formats` library and, more importantly, I don't have access to many of the files listed in `BIOFORMAT_EXTENSIONS` (like `.ome.tif`, `.lif`, etc.) to properly test and validate a fallback implementation. The current `pyvips`-based solution already handles the most common large-file formats exceptionally well.

Given this change, I wanted to check with you if this contribution is still desired for the project. I'm happy to discuss this further or make any adjustments you see fit.
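
For reference, the two changes described above (sequential streaming plus one worker process per file) can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `WSI_EXTENSIONS`, `find_slides`, `output_path_for`, `convert_one`, the `workers` parameter, and the choice of a pyramidal tiled TIFF as the output format are all assumptions made for the example; only `process_all` and `access="sequential"` come from the description.

```python
from multiprocessing import Pool
from pathlib import Path

# Hypothetical extension list for illustration; the project's real list may differ.
WSI_EXTENSIONS = {".svs", ".ndpi"}


def find_slides(input_dir):
    """Return the slide files directly inside input_dir, sorted for a stable order."""
    return sorted(
        p for p in Path(input_dir).iterdir()
        if p.is_file() and p.suffix.lower() in WSI_EXTENSIONS
    )


def output_path_for(src, output_dir):
    """Map e.g. slides/a.svs to <output_dir>/a.tif."""
    return Path(output_dir) / (Path(src).stem + ".tif")


def convert_one(job):
    """Convert a single slide; runs inside a worker process."""
    import pyvips  # imported lazily so each worker process loads libvips itself

    src, dst = job
    # access="sequential" asks libvips to read the file top to bottom in
    # strips rather than decoding the whole image into memory first.
    image = pyvips.Image.new_from_file(str(src), access="sequential")
    # Assumed output format: tiled, pyramidal BigTIFF with JPEG compression.
    image.tiffsave(str(dst), tile=True, pyramid=True,
                   compression="jpeg", bigtiff=True)
    return str(dst)


def process_all(input_dir, output_dir, workers=4):
    """Convert every slide in input_dir, one file per worker process."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    jobs = [(src, output_path_for(src, output_dir))
            for src in find_slides(input_dir)]
    with Pool(processes=workers) as pool:
        # imap_unordered yields each result as soon as its file is done.
        return list(pool.imap_unordered(convert_one, jobs))
```

When calling something like `process_all("slides", "converted")` from a script, the call should sit under an `if __name__ == "__main__":` guard so that `multiprocessing` also works on platforms that spawn fresh worker processes. Parallelising per file rather than per tile keeps each worker's read strictly sequential, which is what the streaming mode expects.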
Thanks for your consideration!