ZIP entry overwritten when streaming artifact chunks in download_artifact_files_from_response #4349

@detail-app

Description

Summary

  • Context: The download_artifact_files_from_response function downloads artifact files from ZenML's artifact store and packages them into a ZIP file for users to retrieve their ML models, datasets, and other artifacts.
  • Bug: The function calls zipf.writestr() inside a loop, which overwrites the ZIP entry on each iteration instead of appending chunks, resulting in only the last chunk being preserved.
  • Actual vs. expected: For files larger than 8,192 bytes (CHUNK_SIZE), the downloaded ZIP file contains only the final chunk (at most 8,192 bytes) instead of the complete file; the larger the file, the closer the data loss gets to 100%.
  • Impact: Users downloading artifacts larger than 8 KB receive truncated files containing only the last chunk, breaking ML model deployment, dataset downloads, and artifact retrieval workflows.

Code with bug

# Lines 612-630 in src/zenml/artifacts/utils.py
with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as zipf:
    for file in filepaths:
        # Ensure 'file' is a string for path operations
        # and ZIP entry naming
        file_str = (
            file.decode() if isinstance(file, bytes) else file
        )
        file_path = str(Path(artifact.uri) / file_str)
        with artifact_store.open(
            file_path, mode="rb"
        ) as store_file:
            # Use a loop to read and write chunks of the file
            # instead of reading the entire file into memory
            CHUNK_SIZE = 8192
            while True:
                if file_content := store_file.read(CHUNK_SIZE):
                    zipf.writestr(file_str, file_content)  # <-- BUG 🔴 Overwrites on each loop iteration
                else:
                    break

The comment on lines 623-624 states the intent: "read and write chunks of the file instead of reading the entire file into memory". The implementation, however, calls zipf.writestr() repeatedly with the same filename, so each call replaces the previous entry instead of appending to it.
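
This behavior is easy to confirm in isolation with just the standard library. A minimal sketch (the file name and payloads here are purely illustrative):

import io
import zipfile

# Write two chunks under the same archive name, exactly as the buggy loop does.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zipf:
    zipf.writestr("model.pkl", b"chunk-1 " * 1024)  # first chunk
    zipf.writestr("model.pkl", b"chunk-2")          # triggers "Duplicate name" UserWarning

with zipfile.ZipFile(buffer, "r") as zipf:
    print(zipf.namelist())         # ['model.pkl', 'model.pkl'] - two entries were written
    print(zipf.read("model.pkl"))  # b'chunk-2' - only the last write is read back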

Evidence

Example

Consider a 25,000-byte artifact file being downloaded:

  1. First iteration: read 8,192 bytes, call zipf.writestr("model.pkl", chunk1) → the ZIP entry holds 8,192 bytes
  2. Second iteration: read 8,192 bytes, call zipf.writestr("model.pkl", chunk2) → the entry is replaced and still holds 8,192 bytes (chunk 2 only)
  3. Third iteration: read 8,192 bytes, call zipf.writestr("model.pkl", chunk3) → the entry is replaced again (chunk 3 only)
  4. Fourth iteration: read the remaining 424 bytes, call zipf.writestr("model.pkl", chunk4) → the entry is replaced a final time and now holds only 424 bytes

Result: User downloads a ZIP containing only 424 bytes (the last chunk) instead of the complete 25,000-byte file, resulting in 98.3% data loss.
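
The arithmetic behind these numbers can be double-checked in a couple of lines (illustrative, using the 25,000-byte example and the 8,192-byte CHUNK_SIZE from above):

# Chunk arithmetic for the 25,000-byte example file read in 8,192-byte chunks.
file_size = 25_000
chunk_size = 8_192

full_chunks, tail = divmod(file_size, chunk_size)  # 3 full chunks, 424-byte tail
lost = file_size - tail                            # everything except the last chunk
print(full_chunks, tail)                 # 3 424
print(f"{100 * lost / file_size:.1f}%")  # 98.3%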

Failing test

Test script

#!/usr/bin/env python3
"""
Test demonstrating the bug in download_artifact_files_from_response.

This test creates a realistic scenario with artifact files larger than 8192 bytes
and shows that the current implementation loses data when downloading artifacts.
"""

import os
import tempfile
import zipfile
from pathlib import Path


def simulate_buggy_download():
    """Simulate the buggy download_artifact_files_from_response behavior."""

    # Create a temporary directory structure like an artifact store would have
    artifact_dir = tempfile.mkdtemp()

    try:
        # Create a test file with content larger than CHUNK_SIZE (8192 bytes)
        # This simulates a realistic artifact file
        test_file_path = os.path.join(artifact_dir, "model.pkl")
        test_content = b"X" * 10000 + b"Y" * 10000 + b"Z" * 5000  # 25,000 bytes

        with open(test_file_path, 'wb') as f:
            f.write(test_content)

        print(f"Created test artifact file: {test_file_path}")
        print(f"Original file size: {len(test_content)} bytes")
        print(f"Original file content hash: {hash(test_content)}")

        # Simulate the buggy code from download_artifact_files_from_response
        zip_path = tempfile.mktemp(suffix='.zip')

        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zipf:
            # This is exactly how the buggy code works
            file_str = "model.pkl"
            file_path = test_file_path

            with open(file_path, mode="rb") as store_file:
                CHUNK_SIZE = 8192
                chunk_count = 0
                while True:
                    if file_content := store_file.read(CHUNK_SIZE):
                        chunk_count += 1
                        print(f"  Chunk {chunk_count}: {len(file_content)} bytes, first byte: {file_content[0]:c}")
                        # BUG: This overwrites the entry each time!
                        zipf.writestr(file_str, file_content)
                    else:
                        break

        print(f"\nTotal chunks read: {chunk_count}")
        print(f"Expected: All {chunk_count} chunks should be in the ZIP file")

        # Now extract and verify
        with zipfile.ZipFile(zip_path, "r") as zipf:
            extracted_content = zipf.read("model.pkl")

        print(f"\nExtracted file size: {len(extracted_content)} bytes")
        print(f"Extracted content hash: {hash(extracted_content)}")

        if extracted_content == test_content:
            print("\n✓ SUCCESS: File was downloaded correctly")
            return True
        else:
            print(f"\n✗ BUG DETECTED: Data loss occurred!")
            print(f"  Expected {len(test_content)} bytes, got {len(extracted_content)} bytes")
            print(f"  Data loss: {len(test_content) - len(extracted_content)} bytes ({100 * (len(test_content) - len(extracted_content)) / len(test_content):.1f}%)")
            print(f"  First byte should be 'X' ({ord('X')}), got '{chr(extracted_content[0])}' ({extracted_content[0]})")

            # The last chunk should be preserved
            if extracted_content[0] == ord('Z'):
                print(f"  ⚠ Only the LAST chunk was preserved (starts with 'Z')")

            return False

    finally:
        # Cleanup
        if os.path.exists(artifact_dir):
            import shutil
            shutil.rmtree(artifact_dir)
        if os.path.exists(zip_path):
            os.unlink(zip_path)


if __name__ == "__main__":
    print("=" * 70)
    print("Testing download_artifact_files_from_response with realistic data")
    print("=" * 70)
    print()

    result = simulate_buggy_download()

    print("\n" + "=" * 70)
    if result:
        print("RESULT: No bug detected (unexpected)")
    else:
        print("RESULT: BUG CONFIRMED - Data loss in artifact downloads!")
        print("\nIMPACT:")
        print("  - Users downloading artifacts larger than 8192 bytes will get")
        print("    corrupted/truncated files containing only the last chunk")
        print("  - This affects ML models, datasets, and other large artifacts")
        print("  - Silent data corruption - no error is raised")
    print("=" * 70)

Test output

======================================================================
Testing download_artifact_files_from_response with realistic data
======================================================================

Created test artifact file: /tmp/tmpp142h27a/model.pkl
Original file size: 25000 bytes
Original file content hash: 8802438793657415455
  Chunk 1: 8192 bytes, first byte: X
  Chunk 2: 8192 bytes, first byte: X
  Chunk 3: 8192 bytes, first byte: Y
  Chunk 4: 424 bytes, first byte: Z

Total chunks read: 4
Expected: All 4 chunks should be in the ZIP file

Extracted file size: 424 bytes
Extracted content hash: 7089535107576022998

✗ BUG DETECTED: Data loss occurred!
  Expected 25000 bytes, got 424 bytes
  Data loss: 24576 bytes (98.3%)
  First byte should be 'X' (88), got 'Z' (90)
  ⚠ Only the LAST chunk was preserved (starts with 'Z')

======================================================================
RESULT: BUG CONFIRMED - Data loss in artifact downloads!

IMPACT:
  - Users downloading artifacts larger than 8192 bytes will get
    corrupted/truncated files containing only the last chunk
  - This affects ML models, datasets, and other large artifacts
  - Silent data corruption - no error is raised
======================================================================
/root/.pyenv/versions/3.12.10/lib/python3.12/zipfile/__init__.py:1611: UserWarning: Duplicate name: 'model.pkl'
  return self._open_to_write(zinfo, force_zip64=force_zip64)

Note the Python warning UserWarning: Duplicate name: 'model.pkl'. This is the zipfile module warning that the same filename is being written to the archive multiple times.
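
One way to make this failure loud instead of silent is to escalate that warning to an error while the archive is written. A minimal sketch (the filter scope is an assumption; in pytest the same effect can be had globally with filterwarnings = error::UserWarning or -W error):

import io
import warnings
import zipfile

# Turn the zipfile "Duplicate name" UserWarning into an exception so the bug
# cannot slip through unnoticed.
with warnings.catch_warnings():
    warnings.simplefilter("error", UserWarning)
    with zipfile.ZipFile(io.BytesIO(), "w", zipfile.ZIP_DEFLATED) as zipf:
        zipf.writestr("model.pkl", b"chunk-1")
        zipf.writestr("model.pkl", b"chunk-2")  # now raises UserWarning: Duplicate name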

Full context

The download_artifact_files_from_response function is part of ZenML's artifact management system. It enables users to download their ML artifacts (models, datasets, etc.) from the artifact store to their local machine as a ZIP file.

The function is called by:

  • ArtifactVersionResponse.download_files() method in src/zenml/models/v2/core/artifact_version.py:507 - This is the public API users call to download artifacts

The function works by:

  1. Getting the artifact store associated with the artifact
  2. Listing all files in the artifact's URI using artifact_store.listdir()
  3. For each file, opening it from the artifact store and reading it in chunks
  4. Writing the chunks to a ZIP file

The chunking approach was intentionally designed to avoid loading large files entirely into memory (as stated in the comment on lines 623-624). However, the implementation is incorrect because zipfile.ZipFile.writestr() writes a complete entry per call: calling it repeatedly with the same filename adds a new entry each time, and only the last one is returned on read or kept on extraction, so the earlier chunks are effectively discarded.

External documentation

ZipFile.writestr(zinfo_or_arcname, data, compress_type=None, compresslevel=None)

Write a file into the archive. The contents is 'data', which may be either
a 'str' or a 'bytes' instance; if it is a 'str', it is encoded as UTF-8 first.
'zinfo_or_arcname' is either a ZipInfo instance or the name of the file in
the archive.

The documentation makes it clear that writestr writes a complete file entry with the provided data. It does not append to an existing entry.

ZipFile.write(filename, arcname=None, compress_type=None, compresslevel=None)

Write the file named filename to the archive, giving it the archive name
arcname (by default, this will be the same as filename, but without a drive
letter and with leading path separators removed).

The write() method is the proper way to add files from disk to a ZIP archive without loading them entirely into memory - the implementation should use this method instead.
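
For illustration, adding an on-disk file this way keeps memory usage flat regardless of file size (the paths below are placeholders):

import zipfile

# zipf.write() copies the named file from disk into the archive in small
# internal chunks, so the whole file never has to fit in memory at once.
with zipfile.ZipFile("artifacts.zip", "w", zipfile.ZIP_DEFLATED) as zipf:
    zipf.write("/tmp/model.pkl", arcname="model.pkl")

Because artifact store files are not necessarily on the local filesystem, this usually needs to be combined with a temporary local copy, as in Option 2 of the recommended fix below.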

Why has this bug gone undetected?

This bug has gone undetected for several reasons:

  1. Test data is too small: The existing integration test (test_download_artifact_files_from_response in tests/integration/functional/artifacts/test_utils.py:273) only validates downloading an artifact containing the single character "7" (1 byte). Since this is far below the 8192-byte chunk size, it never triggers the chunking loop, and the bug remains hidden.

  2. Silent failure: The function doesn't raise any exceptions; it silently produces corrupted output. Python's zipfile module does emit a warning (UserWarning: Duplicate name), but warnings are typically not visible in production and don't cause tests to fail.

  3. Limited real-world usage: The download functionality may not be heavily used yet, or users downloading small artifacts (< 8KB) wouldn't notice the issue. Additionally, users who encounter corrupted downloads might attribute the problem to network issues or other causes rather than reporting it as a ZenML bug.

  4. Recent introduction: The bug was introduced in commit 72dbb31 (PR Add download_files method for ArtifactVersion #2434, merged February 15, 2024), so it's less than a year old and may not have had extensive real-world exposure yet.

  5. Specific conditions required: The bug only manifests when:

    • Users explicitly call the download_files() method on an artifact
    • The artifact contains files larger than 8192 bytes
    • Users attempt to use the downloaded files

Recommended fix

Replace the chunked writestr approach with either:

Option 1: Collect all chunks in memory, then write once (simple but uses more memory):

with artifact_store.open(file_path, mode="rb") as store_file:
    CHUNK_SIZE = 8192
    chunks = []
    while True:
        if file_content := store_file.read(CHUNK_SIZE):
            chunks.append(file_content)  # <-- FIX 🟢 Collect chunks
        else:
            break
    zipf.writestr(file_str, b''.join(chunks))  # <-- FIX 🟢 Write once

Option 2: Use a temporary file (memory efficient for very large files):

import os
import tempfile

with artifact_store.open(file_path, mode="rb") as store_file:
    with tempfile.NamedTemporaryFile(delete=False) as temp_file:
        CHUNK_SIZE = 8192
        while True:
            if file_content := store_file.read(CHUNK_SIZE):
                temp_file.write(file_content)  # <-- FIX 🟢 Write chunk to temp file
            else:
                break
    # The temp file is closed (and fully flushed) once the inner block exits
    zipf.write(temp_file.name, arcname=file_str)  # <-- FIX 🟢 Add temp file to ZIP
    os.unlink(temp_file.name)  # Clean up the temporary file

Option 1 is simpler and acceptable for most artifact sizes. Option 2 is more memory-efficient for very large artifacts but requires additional temporary file management.
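
A third possible approach, sketched here under the assumption that Python 3.6+ is available (ZipFile.open() supports mode "w" from that version on), streams chunks directly into a single ZIP entry; it slots into the same loop as Options 1 and 2:

with artifact_store.open(file_path, mode="rb") as store_file:
    CHUNK_SIZE = 8192
    # zipf.open(..., mode="w") returns a writable stream for a single entry,
    # so successive chunks are appended to that entry instead of replacing it.
    with zipf.open(file_str, mode="w") as zip_entry:
        while True:
            if file_content := store_file.read(CHUNK_SIZE):
                zip_entry.write(file_content)
            else:
                break

This keeps the constant-memory intent of the original comment while writing each archive entry exactly once.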
