Improve ZIP store performance and filesystem listing#70

Open
kulvait wants to merge 5 commits into zarr-developers:main from kulvait:zip-clean

kulvait commented Apr 1, 2026

The motivation for this work was that, on large ZIP stores, indexing was very slow, and so was opening arrays.
I therefore replaced Apache Commons ZipArchiveInputStream with Java's random-access ZipFile class.
The change in FilesystemStore.java is more cosmetic but also slightly improves performance.

  • Replace stream-based ZIP access with random-access ZipFile for faster and more efficient reads
  • Optimize ZIP store initialization by reading the ZIP index only, avoiding full stream traversal
  • Introduce improved caching of directory structure and file sizes using synchronized maps
  • Simplify internal logic and remove unnecessary dependencies
  • Improve filesystem listing performance

Details:

  • Removed dependency on Apache Commons ZipArchiveInputStream and related classes in ReadOnlyZipStore
  • Added efficient caching using maps for directories and file sizes, async friendly
  • Normalized entry names for consistent lookup
  • Reduced redundant computations (e.g. cached entry sizes)
  • Simplified stream handling and chunk calculation

These changes significantly improve performance, reduce complexity, and make the ZIP store implementation more maintainable.
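
To illustrate the "read the ZIP index only" idea from the list above, here is a minimal sketch that is not taken from the PR. It assumes a local ZIP file and uses the standard `java.util.zip.ZipFile`, which exposes entry names and sizes from the central directory without decompressing entry data. The class name `ZipIndexSketch` and the method `indexFileSizes` are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not the PR's code: maps each file entry name to its
// uncompressed size using only the metadata that ZipFile exposes per entry.
final class ZipIndexSketch {
    static Map<String, Long> indexFileSizes(Path zipPath) throws IOException {
        Map<String, Long> sizes = new HashMap<>();
        try (ZipFile zf = new ZipFile(zipPath.toFile())) {
            Enumeration<? extends ZipEntry> entries = zf.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (!entry.isDirectory()) {
                    // No entry data is read here, only name and size metadata.
                    sizes.put(entry.getName(), entry.getSize());
                }
            }
        }
        return sizes;
    }
}
```

With a stream-based reader, the same map can only be built by walking the archive from start to end, which is likely where the slow indexing on large stores came from.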

kulvait added 2 commits April 1, 2026 18:34
… and directory existence

- Problem:
  * Some tests failed because ReadOnlyZipStore could not locate ZIP entries when the store
    was created by simply zipping a directory. Tools differ: some produce entry names with
    a leading slash, others without. This caused getInputStream(), read(), and getSize() to return null.
  * testExists() failed for root keys because exists() included directories, but by design it should only test file existence.

- Solution:
  * Added resolvePathWithLeadingSlashFromKeys() to try a secondary lookup with a leading slash.
    Primary lookup uses the standard key without leading slash; secondary is only for compatibility.
  * Modified exists(String[] keys) to check only fileSizeIndex, ignoring directories, to match the intended design.

- Effect:
  * ReadOnlyZipStoreTest passes regardless of whether ZIP entries have leading slashes or not.
  * Test logic now clearly distinguishes between file existence and directory entries.
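
The two-step lookup and the file-only existence check described above can be sketched as follows. This is an illustration under assumptions, not the PR's `resolvePathWithLeadingSlashFromKeys()`: the class `LeadingSlashLookupSketch`, the method `resolveEntryName`, and the choice to pass the index in as a parameter are all hypothetical.

```java
import java.util.Map;

// Hypothetical sketch: primary lookup without a leading slash, secondary
// lookup with one, and an exists() that only consults the file index.
final class LeadingSlashLookupSketch {
    // Returns the name under which the entry is indexed, or null if absent.
    static String resolveEntryName(Map<String, Long> fileSizeIndex, String[] keys) {
        String primary = String.join("/", keys);
        if (fileSizeIndex.containsKey(primary)) {
            return primary;
        }
        // Compatibility path for archives whose entries start with "/".
        String secondary = "/" + primary;
        return fileSizeIndex.containsKey(secondary) ? secondary : null;
    }

    // Directories are never in fileSizeIndex, so they do not "exist" here.
    static boolean exists(Map<String, Long> fileSizeIndex, String[] keys) {
        return resolveEntryName(fileSizeIndex, keys) != null;
    }
}
```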
Author

kulvait commented Apr 16, 2026

I have updated the PR so that it fixes the failed workflow and tests. Could you please run it again?

This reverts a previous attempt to simplify the FilesystemStore#list
implementation using a parallel stream and String-based path splitting:

    Files.walk(rootPath)
         .filter(Files::isRegularFile)
         .parallel()
         .map(path -> rootPath.relativize(path)
             .toString()
             .split(File.separator));

While this approach showed performance improvements in some scenarios
(likely due to .parallel() decoupling filesystem traversal from mapping),
it introduced a platform-specific issue: String.split() expects a regex,
and File.separator ("\" on Windows) leads to a PatternSyntaxException.

A correct fix would require quoting the separator, e.g.:

    split(java.util.regex.Pattern.quote(File.separator))

However, this change is outside the scope of the current PR, which focuses
on ZipStore improvements rather than altering core FilesystemStore behavior.

Revert to the upstream commit c9a5ee1 for correctness and consistency.
As a follow-up, it may be worth evaluating the use of .parallel() separately,
as it appears to provide measurable benefits in some workloads.
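
For reference, a small sketch of the two portable ways to split a relative path discussed above. It is not part of this PR. `Pattern.quote` and `Path.getName(int)` are standard JDK calls; the class `PathSplitSketch` and its method names are hypothetical.

```java
import java.nio.file.Path;
import java.util.regex.Pattern;

// Hypothetical sketch of two separator-safe ways to split a relative path.
final class PathSplitSketch {
    // Splits on a literal separator. Pattern.quote keeps "\" from being
    // parsed as the start of a regex escape sequence.
    static String[] splitQuoted(String relativePath, String separator) {
        return relativePath.split(Pattern.quote(separator));
    }

    // Alternative without any regex: copy the Path's name elements.
    static String[] splitByElements(Path relativePath) {
        String[] parts = new String[relativePath.getNameCount()];
        for (int i = 0; i < parts.length; i++) {
            parts[i] = relativePath.getName(i).toString();
        }
        return parts;
    }
}
```

The second variant sidesteps the separator question entirely, which may be the simpler option if the parallel listing is revisited in a follow-up.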

The change to ReadOnlyZipStore.java is formatting only.

kulvait commented Apr 17, 2026

Due to remaining Windows test failures, I reverted FilesystemStore to the upstream implementation; see the commit message. This change should restore cross-platform compatibility and allow all tests to pass. Please approve the workflow.

kulvait added 2 commits April 23, 2026 13:34
…resource leaks

Both get(...) and getInputStream(...) now delegate to a shared helper method
that reads ZIP entry data into a byte[]. This refactor removes duplicated ZIP handling logic (entry lookup, range
validation, skip/read loop) and ensures consistent behavior across both APIs.
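
A shared helper of the kind described above could look roughly like the sketch below. It is an assumption-laden illustration, not the PR's implementation: the signature (a `Path` and an entry name instead of `String[] keys`), the half-open range `[start, end)`, and the class name `ZipRangeReadSketch` are hypothetical. Because the `ZipFile` is opened in try-with-resources and the data is copied into a `byte[]`, no file handle outlives the call.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical sketch: entry lookup, range validation, and a skip/read loop.
final class ZipRangeReadSketch {
    // Returns bytes [start, end) of the entry, or null if the entry is missing.
    static byte[] readEntryBytes(Path zipPath, String entryName, long start, long end)
            throws IOException {
        try (ZipFile zf = new ZipFile(zipPath.toFile())) {
            ZipEntry entry = zf.getEntry(entryName);
            if (entry == null || entry.isDirectory()) {
                return null;
            }
            if (start < 0 || end < start || end > entry.getSize()) {
                throw new IllegalArgumentException("Invalid range for " + entryName);
            }
            try (InputStream in = zf.getInputStream(entry)) {
                long remaining = start;
                while (remaining > 0) {
                    long skipped = in.skip(remaining);
                    if (skipped > 0) {
                        remaining -= skipped;
                    } else if (in.read() >= 0) {
                        remaining--; // skip() made no progress; consume one byte
                    } else {
                        throw new EOFException("Entry ended before range start");
                    }
                }
                byte[] out = new byte[Math.toIntExact(end - start)];
                int offset = 0;
                while (offset < out.length) {
                    int n = in.read(out, offset, out.length - offset);
                    if (n < 0) {
                        throw new EOFException("Entry ended before range end");
                    }
                    offset += n;
                }
                return out;
            }
        }
    }
}
```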

The previous implementation of getInputStream() returned an InputStream
backed by a ZipFile entry. If callers failed to close the stream, as is
the case in one of the tests in StoreTest.java, this caused file locking
issues on Windows, which prevented deletion of the underlying ZIP
file during tests.
- Clean up debugging and temporary logging statements
- Rename ensureCacheNew to buildZipIndex for clarity
- Replace synchronized HashSet wrappers with ConcurrentHashMap.newKeySet()
  to improve performance and reduce locking overhead during indexing
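
As a brief illustration of the last point, the sketch below uses `ConcurrentHashMap.newKeySet()`, a standard JDK factory for a thread-safe `Set` view that does not lock the whole set on every operation the way `Collections.synchronizedSet(new HashSet<>())` does. The class `DirectoryIndexSketch` and its methods are hypothetical and only show the usage pattern.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a directory index backed by a concurrent key set.
final class DirectoryIndexSketch {
    private final Set<String> directoryIndex = ConcurrentHashMap.newKeySet();

    // Returns true only for the first insertion of a given directory name.
    boolean addDirectory(String name) {
        return directoryIndex.add(name);
    }

    boolean isDirectory(String name) {
        return directoryIndex.contains(name);
    }
}
```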

kulvait commented Apr 23, 2026

I hope this PR will be considered for merge, as I put a lot of effort into addressing issues that were triggering test failures in CI. The changes also meaningfully improve the performance and efficiency of the ReadOnlyZipStore implementation.

Member

normanrz left a comment

Thanks for the PR!
I think this should be a new class, because this implementation with ZipFile only works on the filesystem (FilesystemStore), whereas the existing ReadOnlyZipStore also works with remote stores (e.g. HttpStore, S3Store).

Comment on lines +63 to +73
    // Helper for buildZipIndex to add all parent directories of a given entry to the directory index, ensuring they are present for lookups
    private void addParentDirs(String entryName, Set<String> dirIndex) {
        int lastSlash = entryName.lastIndexOf('/'); // Find the last '/' in the file name
        while (lastSlash > 0) { // Keep going until no more slashes are found
            String parentDir = entryName.substring(0, lastSlash + 1); // Extract the parent directory path
            if (!dirIndex.add(parentDir)) { // Add the parent to the directory index if it’s not already added
                break; // Exit if this parent directory has already been added
            }
            lastSlash = entryName.lastIndexOf('/', lastSlash - 1); // Move the search for slashes up
        }
    }

Suggested change
    - // Helper for buildZipIndex to add all parent directories of a given entry to the directory index, ensuring they are present for lookups
    - private void addParentDirs(String entryName, Set<String> dirIndex) {
    -     int lastSlash = entryName.lastIndexOf('/'); // Find the last '/' in the file name
    -     while (lastSlash > 0) { // Keep going until no more slashes are found
    -         String parentDir = entryName.substring(0, lastSlash + 1); // Extract the parent directory path
    -         if (!dirIndex.add(parentDir)) { // Add the parent to the directory index if it’s not already added
    -             break; // Exit if this parent directory has already been added
    -         }
    -         lastSlash = entryName.lastIndexOf('/', lastSlash - 1); // Move the search for slashes up
    -     }
    - }

Seems unused

    public ByteBuffer get(String[] keys, long start, long end) {
        byte[] bytes = readEntryBytes(keys, start, end);
        if (bytes == null) {
            return ByteBuffer.allocate(0);

Suggested change
    - return ByteBuffer.allocate(0);
    + return null;

If a key is not present, null should be returned.

Comment on lines +131 to +136
    } else if (entryStrippedPath.startsWith("/")) {
        entryStrippedPath = entryStrippedPath.substring(1);
        logger.log(Level.WARNING,
                "Directory entry '{0}' did start with '/' not removed by normalizeEntryName()",
                e.getName());
    }

Suggested change
    - } else if (entryStrippedPath.startsWith("/")) {
    -     entryStrippedPath = entryStrippedPath.substring(1);
    -     logger.log(Level.WARNING,
    -             "Directory entry '{0}' did start with '/' not removed by normalizeEntryName()",
    -             e.getName());
    - }
    + }

Dead code, because normalizeEntryName already strips leading slashes.

return "ReadOnlyZipStore(" + underlyingStore.toString() + ")";
}

public static String[] concatPaths(String[] prefix, String[] child) {

Suggested change
    - public static String[] concatPaths(String[] prefix, String[] child) {
    + private static String[] concatPaths(String[] prefix, String[] child) {

}


public void addChildrenRecursively(String[] prefixZarrPath, String[] childrenZarrPath, Stream.Builder<String[]> builder) {

Suggested change
    - public void addChildrenRecursively(String[] prefixZarrPath, String[] childrenZarrPath, Stream.Builder<String[]> builder) {
    + private void addChildrenRecursively(String[] prefixZarrPath, String[] childrenZarrPath, Stream.Builder<String[]> builder) {

private Map<String, Long> fileIndex;
private Set<String> directoryIndex;
private static final Logger logger = Logger.getLogger(ReadOnlyZipStore.class.getName());
final Path zipStorePath; // Store the resolved zip path for logging and potential future use

Suggested change
    - final Path zipStorePath; // Store the resolved zip path for logging and potential future use
    + private final Path zipStorePath; // Store the resolved zip path for logging and potential future use

}
if (entry.isDirectory() || !entryName.equals(resolveKeys(keys))) {
continue;
try (ZipFile zf = new ZipFile(zipStorePath.toFile())) {

I wonder if the ZipFile instance could be cached? Reopening probably has some overhead.
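
One way the caching idea could be explored is sketched below. It assumes the store would then need an explicit close step, which the current code may not have. The class `CachedZipFileSketch` and its accessor are hypothetical. A trade-off worth noting is that a cached `ZipFile` keeps a file handle open until `close()` is called, which matters for the Windows file-locking behavior described earlier in this PR.

```java
import java.io.Closeable;
import java.io.IOException;
import java.nio.file.Path;
import java.util.zip.ZipFile;

// Hypothetical sketch: open the ZipFile lazily once and reuse it for all reads.
final class CachedZipFileSketch implements Closeable {
    private final Path zipPath;
    private ZipFile zipFile; // guarded by "this"

    CachedZipFileSketch(Path zipPath) {
        this.zipPath = zipPath;
    }

    // Returns the shared instance, opening it on first use.
    synchronized ZipFile zipFile() throws IOException {
        if (zipFile == null) {
            zipFile = new ZipFile(zipPath.toFile());
        }
        return zipFile;
    }

    // Releases the file handle so the archive can be moved or deleted.
    @Override
    public synchronized void close() throws IOException {
        if (zipFile != null) {
            zipFile.close();
            zipFile = null;
        }
    }
}
```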

String name = normalizeEntryName(entry.getName());
if (entry.isDirectory()) {
directoryIndex.add(name);
private synchronized void buildZipIndex() {

I think either synchronized or ConcurrentHashMap should be used. Using both seems unnecessary.
I would lean towards keeping synchronized and using a HashMap.
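
A minimal sketch of the "synchronized plus plain HashMap" variant, under the assumption that the index is built once and then only read. The class `SynchronizedIndexSketch`, the placeholder entry, and the build counter are hypothetical and exist only to make the example self-contained.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: one lock guards both the lazy build and the lookups,
// so the map itself does not need to be a concurrent collection.
final class SynchronizedIndexSketch {
    private Map<String, Long> fileIndex; // guarded by "this"
    private int builds = 0;              // counts index builds, for illustration

    private void buildIndexIfNeeded() {
        if (fileIndex == null) {
            Map<String, Long> index = new HashMap<>();
            index.put("zarr.json", 42L); // placeholder for real ZIP entries
            fileIndex = index;
            builds++;
        }
    }

    // Returns the indexed size, or null if the name is not a known file.
    synchronized Long getSize(String name) {
        buildIndexIfNeeded();
        return fileIndex.get(name);
    }

    synchronized int buildCount() {
        return builds;
    }
}
```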
