Description
Summary
- Context: The download_artifact_files_from_response function downloads artifact files from ZenML's artifact store and packages them into a ZIP file so that users can retrieve their ML models, datasets, and other artifacts.
- Bug: The function calls zipf.writestr() inside a loop, which overwrites the ZIP entry on each iteration instead of appending chunks, so only the last chunk is preserved.
- Actual vs. expected: For files larger than 8192 bytes (CHUNK_SIZE), the downloaded ZIP file contains only the final chunk (typically under 8 KB) instead of the complete file, causing 95-99% data loss.
- Impact: Users downloading artifacts larger than 8 KB receive severely corrupted files, breaking ML model deployment, dataset downloads, and artifact retrieval workflows.
Code with bug
# Lines 612-630 in src/zenml/artifacts/utils.py
with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as zipf:
    for file in filepaths:
        # Ensure 'file' is a string for path operations
        # and ZIP entry naming
        file_str = (
            file.decode() if isinstance(file, bytes) else file
        )
        file_path = str(Path(artifact.uri) / file_str)
        with artifact_store.open(
            file_path, mode="rb"
        ) as store_file:
            # Use a loop to read and write chunks of the file
            # instead of reading the entire file into memory
            CHUNK_SIZE = 8192
            while True:
                if file_content := store_file.read(CHUNK_SIZE):
                    zipf.writestr(file_str, file_content)  # <-- BUG 🔴 Overwrites on each loop iteration
                else:
                    break

The comment on lines 623-624 states that the intent is to "read and write chunks of the file instead of reading the entire file into memory", but the implementation incorrectly calls zipf.writestr() multiple times with the same filename, which overwrites the entry each time.
Evidence
Example
Consider a 25,000-byte artifact file being downloaded:
- First iteration: Read 8,192 bytes (chunk 1: 'X' * 8192), call zipf.writestr("model.pkl", chunk1) → ZIP contains 8,192 bytes
- Second iteration: Read 8,192 bytes (chunk 2: 'X' * 8192), call zipf.writestr("model.pkl", chunk2) → ZIP entry overwritten, now contains 8,192 bytes (just chunk 2)
- Third iteration: Read 8,192 bytes (chunk 3: 'Y' * 8192), call zipf.writestr("model.pkl", chunk3) → ZIP entry overwritten again, now contains 8,192 bytes (just chunk 3)
- Fourth iteration: Read 424 bytes (chunk 4: 'Z' * 424), call zipf.writestr("model.pkl", chunk4) → ZIP entry overwritten a final time, now contains only 424 bytes
Result: User downloads a ZIP containing only 424 bytes (the last chunk) instead of the complete 25,000-byte file, resulting in 98.3% data loss.
Failing test
Test script
#!/usr/bin/env python3
"""
Test demonstrating the bug in download_artifact_files_from_response.
This test creates a realistic scenario with artifact files larger than 8192 bytes
and shows that the current implementation loses data when downloading artifacts.
"""
import os
import tempfile
import zipfile
from pathlib import Path


def simulate_buggy_download():
    """Simulate the buggy download_artifact_files_from_response behavior."""
    # Create a temporary directory structure like an artifact store would have
    artifact_dir = tempfile.mkdtemp()
    try:
        # Create a test file with content larger than CHUNK_SIZE (8192 bytes)
        # This simulates a realistic artifact file
        test_file_path = os.path.join(artifact_dir, "model.pkl")
        test_content = b"X" * 10000 + b"Y" * 10000 + b"Z" * 5000  # 25,000 bytes
        with open(test_file_path, 'wb') as f:
            f.write(test_content)

        print(f"Created test artifact file: {test_file_path}")
        print(f"Original file size: {len(test_content)} bytes")
        print(f"Original file content hash: {hash(test_content)}")

        # Simulate the buggy code from download_artifact_files_from_response
        zip_path = tempfile.mktemp(suffix='.zip')
        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zipf:
            # This is exactly how the buggy code works
            file_str = "model.pkl"
            file_path = test_file_path
            with open(file_path, mode="rb") as store_file:
                CHUNK_SIZE = 8192
                chunk_count = 0
                while True:
                    if file_content := store_file.read(CHUNK_SIZE):
                        chunk_count += 1
                        print(f"  Chunk {chunk_count}: {len(file_content)} bytes, first byte: {file_content[0]:c}")
                        # BUG: This overwrites the entry each time!
                        zipf.writestr(file_str, file_content)
                    else:
                        break

        print(f"\nTotal chunks read: {chunk_count}")
        print(f"Expected: All {chunk_count} chunks should be in the ZIP file")

        # Now extract and verify
        with zipfile.ZipFile(zip_path, "r") as zipf:
            extracted_content = zipf.read("model.pkl")

        print(f"\nExtracted file size: {len(extracted_content)} bytes")
        print(f"Extracted content hash: {hash(extracted_content)}")

        if extracted_content == test_content:
            print("\n✓ SUCCESS: File was downloaded correctly")
            return True
        else:
            print("\n✗ BUG DETECTED: Data loss occurred!")
            print(f"  Expected {len(test_content)} bytes, got {len(extracted_content)} bytes")
            print(f"  Data loss: {len(test_content) - len(extracted_content)} bytes ({100 * (len(test_content) - len(extracted_content)) / len(test_content):.1f}%)")
            print(f"  First byte should be 'X' ({ord('X')}), got '{chr(extracted_content[0])}' ({extracted_content[0]})")
            # The last chunk should be preserved
            if extracted_content[0] == ord('Z'):
                print("  ⚠ Only the LAST chunk was preserved (starts with 'Z')")
            return False
    finally:
        # Cleanup
        if os.path.exists(artifact_dir):
            import shutil
            shutil.rmtree(artifact_dir)
        if os.path.exists(zip_path):
            os.unlink(zip_path)


if __name__ == "__main__":
    print("=" * 70)
    print("Testing download_artifact_files_from_response with realistic data")
    print("=" * 70)
    print()
    result = simulate_buggy_download()
    print("\n" + "=" * 70)
    if result:
        print("RESULT: No bug detected (unexpected)")
    else:
        print("RESULT: BUG CONFIRMED - Data loss in artifact downloads!")
        print("\nIMPACT:")
        print("  - Users downloading artifacts larger than 8192 bytes will get")
        print("    corrupted/truncated files containing only the last chunk")
        print("  - This affects ML models, datasets, and other large artifacts")
        print("  - Silent data corruption - no error is raised")
    print("=" * 70)

Test output
======================================================================
Testing download_artifact_files_from_response with realistic data
======================================================================
Created test artifact file: /tmp/tmpp142h27a/model.pkl
Original file size: 25000 bytes
Original file content hash: 8802438793657415455
Chunk 1: 8192 bytes, first byte: X
Chunk 2: 8192 bytes, first byte: X
Chunk 3: 8192 bytes, first byte: Y
Chunk 4: 424 bytes, first byte: Z
Total chunks read: 4
Expected: All 4 chunks should be in the ZIP file
Extracted file size: 424 bytes
Extracted content hash: 7089535107576022998
✗ BUG DETECTED: Data loss occurred!
Expected 25000 bytes, got 424 bytes
Data loss: 24576 bytes (98.3%)
First byte should be 'X' (88), got 'Z' (90)
⚠ Only the LAST chunk was preserved (starts with 'Z')
======================================================================
RESULT: BUG CONFIRMED - Data loss in artifact downloads!
IMPACT:
- Users downloading artifacts larger than 8192 bytes will get
corrupted/truncated files containing only the last chunk
- This affects ML models, datasets, and other large artifacts
- Silent data corruption - no error is raised
======================================================================
/root/.pyenv/versions/3.12.10/lib/python3.12/zipfile/__init__.py:1611: UserWarning: Duplicate name: 'model.pkl'
return self._open_to_write(zinfo, force_zip64=force_zip64)
Note the Python warning: UserWarning: Duplicate name: 'model.pkl' - this is Python's zipfile module warning that the same filename is being written multiple times.
Full context
The download_artifact_files_from_response function is part of ZenML's artifact management system. It enables users to download their ML artifacts (models, datasets, etc.) from the artifact store to their local machine as a ZIP file.
The function is called by:
- ArtifactVersionResponse.download_files() method in src/zenml/models/v2/core/artifact_version.py:507 - this is the public API users call to download artifacts
The function works by:
- Getting the artifact store associated with the artifact
- Listing all files in the artifact's URI using artifact_store.listdir()
- For each file, opening it from the artifact store and reading it in chunks
- Writing the chunks to a ZIP file
The chunking approach was intentionally designed to avoid loading large files entirely into memory (as stated in the comment on lines 623-624). However, the implementation is incorrect because zipfile.ZipFile.writestr() creates a complete entry in the ZIP file - calling it multiple times with the same filename adds a new duplicate entry each time rather than appending data, and when the archive is read back only the last entry under that name is returned, so the earlier chunks are effectively discarded.
External documentation
ZipFile.writestr(zinfo_or_arcname, data, compress_type=None, compresslevel=None)
Write a file into the archive. The contents is 'data', which may be either
a 'str' or a 'bytes' instance; if it is a 'str', it is encoded as UTF-8 first.
'zinfo_or_arcname' is either a ZipInfo instance or the name of the file in
the archive.
The documentation makes it clear that writestr writes a complete file entry with the provided data. It does not append to an existing entry.
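A minimal standalone illustration of this behavior (using only the standard library; the in-memory buffer and entry name are made up for the example) shows that repeated writestr() calls add duplicate entries and that reading the name back returns only the last one:

import io
import zipfile

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zipf:
    zipf.writestr("model.pkl", b"chunk-1")  # first complete entry
    zipf.writestr("model.pkl", b"chunk-2")  # emits "UserWarning: Duplicate name" and adds a second entry

with zipfile.ZipFile(buffer, "r") as zipf:
    print(zipf.namelist())         # ['model.pkl', 'model.pkl'] - both entries are stored
    print(zipf.read("model.pkl"))  # b'chunk-2' - only the last entry is returned on read

This mirrors the "Duplicate name" warning seen in the test output above.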
ZipFile.write(filename, arcname=None, compress_type=None, compresslevel=None)
Write the file named filename to the archive, giving it the archive name
arcname (by default, this will be the same as filename, but without a drive
letter and with leading path separators removed).
The write() method is the proper way to add files from disk to a ZIP archive without loading them entirely into memory - the implementation should use this method instead.
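For reference, a hedged usage sketch (file paths and names are illustrative, not from the ZenML codebase): write() takes a path on disk and streams it into the archive under the name given by arcname, so no chunking loop is needed.

import zipfile

with zipfile.ZipFile("artifacts.zip", "w", zipfile.ZIP_DEFLATED) as zipf:
    # write() copies the file from disk into the archive in buffered chunks internally,
    # so the whole file never has to be held in memory
    zipf.write("downloads/model.pkl", arcname="model.pkl")

Note that write() requires a local filesystem path, whereas artifact_store.open() yields a file-like object from a potentially remote store; that is why Option 2 in the recommended fix below stages the data in a temporary file before calling write().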
Why has this bug gone undetected?
This bug has gone undetected for several reasons:
- Test data is too small: The existing integration test (test_download_artifact_files_from_response in tests/integration/functional/artifacts/test_utils.py:273) only validates downloading an artifact containing the single character "7" (1 byte). Since this is far below the 8192-byte chunk size, it never triggers the chunking loop, and the bug remains hidden.
- Silent failure: The function doesn't raise any exceptions - it silently produces corrupted output. Python's zipfile module does emit a warning (UserWarning: Duplicate name), but warnings are typically not visible in production and don't cause tests to fail.
- Limited real-world usage: The download functionality may not be heavily used yet, and users downloading small artifacts (< 8KB) wouldn't notice the issue. Additionally, users who encounter corrupted downloads might attribute the problem to network issues or other causes rather than reporting it as a ZenML bug.
- Recent introduction: The bug was introduced in commit 72dbb31 (PR "Add download_files method for ArtifactVersion" #2434, merged February 15, 2024), so it's less than a year old and may not have had extensive real-world exposure yet.
- Specific conditions required: The bug only manifests when:
  - Users explicitly call the download_files() method on an artifact
  - The artifact contains files larger than 8192 bytes
  - Users attempt to use the downloaded files
Recommended fix
Replace the chunked writestr approach with either:
Option 1: Collect all chunks in memory, then write once (simple but uses more memory):
with artifact_store.open(file_path, mode="rb") as store_file:
    CHUNK_SIZE = 8192
    chunks = []
    while True:
        if file_content := store_file.read(CHUNK_SIZE):
            chunks.append(file_content)  # <-- FIX 🟢 Collect chunks
        else:
            break
    zipf.writestr(file_str, b''.join(chunks))  # <-- FIX 🟢 Write once

Option 2: Use a temporary file (memory efficient for very large files):
import os
import tempfile

with artifact_store.open(file_path, mode="rb") as store_file:
    with tempfile.NamedTemporaryFile(delete=False) as temp_file:
        CHUNK_SIZE = 8192
        while True:
            if file_content := store_file.read(CHUNK_SIZE):
                temp_file.write(file_content)  # <-- FIX 🟢 Write to temp file
            else:
                break
    # The temp file is closed (and flushed) when the inner block exits,
    # so it can safely be added to the archive from disk
    zipf.write(temp_file.name, arcname=file_str)  # <-- FIX 🟢 Add temp file to ZIP
    os.unlink(temp_file.name)

Option 1 is simpler and acceptable for most artifact sizes. Option 2 is more memory-efficient for very large artifacts but requires additional temporary file management.
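As a sanity check, either fix can be exercised with a self-contained round-trip test along the lines of the failing test above. The sketch below (entry name, sizes, and the in-memory stand-in for artifact_store.open() are illustrative assumptions, not ZenML code) applies the Option 1 approach and asserts that all 25,000 bytes survive:

import io
import zipfile

CHUNK_SIZE = 8192
test_content = b"X" * 10000 + b"Y" * 10000 + b"Z" * 5000  # 25,000 bytes, same shape as the failing test

source = io.BytesIO(test_content)  # stands in for artifact_store.open(file_path, mode="rb")
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zipf:
    chunks = []
    while True:
        if file_content := source.read(CHUNK_SIZE):
            chunks.append(file_content)   # collect every chunk
        else:
            break
    zipf.writestr("model.pkl", b"".join(chunks))  # single writestr call per file

with zipfile.ZipFile(buffer, "r") as zipf:
    extracted = zipf.read("model.pkl")

assert extracted == test_content, "round-trip should preserve all 25,000 bytes"
print(f"OK: {len(extracted)} bytes preserved")

The same check against the buggy implementation fails with only 424 bytes extracted, so it doubles as a regression test once the fix lands.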