ZIP entry overwritten when streaming artifact chunks in download_artifact_files_from_response #4349

@detail-app

Description

Summary

  • Context: The download_artifact_files_from_response function downloads artifact files from ZenML's artifact store and packages them into a ZIP file for users to retrieve their ML models, datasets, and other artifacts.
  • Bug: The function calls zipf.writestr() inside a loop, which overwrites the ZIP entry on each iteration instead of appending chunks, resulting in only the last chunk being preserved.
  • Actual vs. expected: For files larger than 8,192 bytes (CHUNK_SIZE), the downloaded ZIP file contains only the final chunk (at most 8,192 bytes) instead of the complete file; the larger the file, the closer the data loss gets to 100%.
  • Impact: Users downloading artifacts larger than 8 KB receive truncated files containing only the last chunk, breaking ML model deployment, dataset downloads, and artifact retrieval workflows.

Code with bug

# Lines 612-630 in src/zenml/artifacts/utils.py
with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as zipf:
    for file in filepaths:
        # Ensure 'file' is a string for path operations
        # and ZIP entry naming
        file_str = (
            file.decode() if isinstance(file, bytes) else file
        )
        file_path = str(Path(artifact.uri) / file_str)
        with artifact_store.open(
            file_path, mode="rb"
        ) as store_file:
            # Use a loop to read and write chunks of the file
            # instead of reading the entire file into memory
            CHUNK_SIZE = 8192
            while True:
                if file_content := store_file.read(CHUNK_SIZE):
                    zipf.writestr(file_str, file_content)  # <-- BUG 🔴 Overwrites on each loop iteration
                else:
                    break

The comment on lines 623-624 states the intent: "read and write chunks of the file instead of reading the entire file into memory". The implementation, however, calls zipf.writestr() repeatedly with the same filename, so each call replaces the previous entry instead of appending to it.
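
This behavior is easy to confirm in isolation with just the standard library. A minimal sketch (the file name and payloads here are purely illustrative):

import io
import zipfile

# Write two chunks under the same archive name, exactly as the buggy loop does.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zipf:
    zipf.writestr("model.pkl", b"chunk-1 " * 1024)  # first chunk
    zipf.writestr("model.pkl", b"chunk-2")          # triggers "Duplicate name" UserWarning

with zipfile.ZipFile(buffer, "r") as zipf:
    print(zipf.namelist())         # ['model.pkl', 'model.pkl'] - two entries were written
    print(zipf.read("model.pkl"))  # b'chunk-2' - only the last write is read back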

Evidence

Example

Consider a 25,000-byte artifact file being downloaded:

  1. First iteration: read 8,192 bytes, call zipf.writestr("model.pkl", chunk1) → the ZIP entry holds 8,192 bytes
  2. Second iteration: read 8,192 bytes, call zipf.writestr("model.pkl", chunk2) → the entry is replaced and still holds 8,192 bytes (chunk 2 only)
  3. Third iteration: read 8,192 bytes, call zipf.writestr("model.pkl", chunk3) → the entry is replaced again (chunk 3 only)
  4. Fourth iteration: read the remaining 424 bytes, call zipf.writestr("model.pkl", chunk4) → the entry is replaced a final time and now holds only 424 bytes

Result: User downloads a ZIP containing only 424 bytes (the last chunk) instead of the complete 25,000-byte file, resulting in 98.3% data loss.
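
The arithmetic behind these numbers can be double-checked in a couple of lines (illustrative, using the 25,000-byte example and the 8,192-byte CHUNK_SIZE from above):

# Chunk arithmetic for the 25,000-byte example file read in 8,192-byte chunks.
file_size = 25_000
chunk_size = 8_192

full_chunks, tail = divmod(file_size, chunk_size)  # 3 full chunks, 424-byte tail
lost = file_size - tail                            # everything except the last chunk
print(full_chunks, tail)                 # 3 424
print(f"{100 * lost / file_size:.1f}%")  # 98.3%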

Failing test

Test script

#!/usr/bin/env python3
"""
Test demonstrating the bug in download_artifact_files_from_response.

This test creates a realistic scenario with artifact files larger than 8192 bytes
and shows that the current implementation loses data when downloading artifacts.
"""

import os
import tempfile
import zipfile
from pathlib import Path


def simulate_buggy_download():
    """Simulate the buggy download_artifact_files_from_response behavior."""

    # Create a temporary directory structure like an artifact store would have
    artifact_dir = tempfile.mkdtemp()

    try:
        # Create a test file with content larger than CHUNK_SIZE (8192 bytes)
        # This simulates a realistic artifact file
        test_file_path = os.path.join(artifact_dir, "model.pkl")
        test_content = b"X" * 10000 + b"Y" * 10000 + b"Z" * 5000  # 25,000 bytes

        with open(test_file_path, 'wb') as f:
            f.write(test_content)

        print(f"Created test artifact file: {test_file_path}")
        print(f"Original file size: {len(test_content)} bytes")
        print(f"Original file content hash: {hash(test_content)}")

        # Simulate the buggy code from download_artifact_files_from_response
        zip_path = tempfile.mktemp(suffix='.zip')

        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zipf:
            # This is exactly how the buggy code works
            file_str = "model.pkl"
            file_path = test_file_path

            with open(file_path, mode="rb") as store_file:
                CHUNK_SIZE = 8192
                chunk_count = 0
                while True:
                    if file_content := store_file.read(CHUNK_SIZE):
                        chunk_count += 1
                        print(f"  Chunk {chunk_count}: {len(file_content)} bytes, first byte: {file_content[0]:c}")
                        # BUG: This overwrites the entry each time!
                        zipf.writestr(file_str, file_content)
                    else:
                        break

        print(f"\nTotal chunks read: {chunk_count}")
        print(f"Expected: All {chunk_count} chunks should be in the ZIP file")

        # Now extract and verify
        with zipfile.ZipFile(zip_path, "r") as zipf:
            extracted_content = zipf.read("model.pkl")

        print(f"\nExtracted file size: {len(extracted_content)} bytes")
        print(f"Extracted content hash: {hash(extracted_content)}")

        if extracted_content == test_content:
            print("\n✓ SUCCESS: File was downloaded correctly")
            return True
        else:
            print(f"\n✗ BUG DETECTED: Data loss occurred!")
            print(f"  Expected {len(test_content)} bytes, got {len(extracted_content)} bytes")
            print(f"  Data loss: {len(test_content) - len(extracted_content)} bytes ({100 * (len(test_content) - len(extracted_content)) / len(test_content):.1f}%)")
            print(f"  First byte should be 'X' ({ord('X')}), got '{chr(extracted_content[0])}' ({extracted_content[0]})")

            # The last chunk should be preserved
            if extracted_content[0] == ord('Z'):
                print(f"  ⚠ Only the LAST chunk was preserved (starts with 'Z')")

            return False

    finally:
        # Cleanup
        if os.path.exists(artifact_dir):
            import shutil
            shutil.rmtree(artifact_dir)
        if os.path.exists(zip_path):
            os.unlink(zip_path)


if __name__ == "__main__":
    print("=" * 70)
    print("Testing download_artifact_files_from_response with realistic data")
    print("=" * 70)
    print()

    result = simulate_buggy_download()

    print("\n" + "=" * 70)
    if result:
        print("RESULT: No bug detected (unexpected)")
    else:
        print("RESULT: BUG CONFIRMED - Data loss in artifact downloads!")
        print("\nIMPACT:")
        print("  - Users downloading artifacts larger than 8192 bytes will get")
        print("    corrupted/truncated files containing only the last chunk")
        print("  - This affects ML models, datasets, and other large artifacts")
        print("  - Silent data corruption - no error is raised")
    print("=" * 70)

Test output

======================================================================
Testing download_artifact_files_from_response with realistic data
======================================================================

Created test artifact file: /tmp/tmpp142h27a/model.pkl
Original file size: 25000 bytes
Original file content hash: 8802438793657415455
  Chunk 1: 8192 bytes, first byte: X
  Chunk 2: 8192 bytes, first byte: X
  Chunk 3: 8192 bytes, first byte: Y
  Chunk 4: 424 bytes, first byte: Z

Total chunks read: 4
Expected: All 4 chunks should be in the ZIP file

Extracted file size: 424 bytes
Extracted content hash: 7089535107576022998

✗ BUG DETECTED: Data loss occurred!
  Expected 25000 bytes, got 424 bytes
  Data loss: 24576 bytes (98.3%)
  First byte should be 'X' (88), got 'Z' (90)
  ⚠ Only the LAST chunk was preserved (starts with 'Z')

======================================================================
RESULT: BUG CONFIRMED - Data loss in artifact downloads!

IMPACT:
  - Users downloading artifacts larger than 8192 bytes will get
    corrupted/truncated files containing only the last chunk
  - This affects ML models, datasets, and other large artifacts
  - Silent data corruption - no error is raised
======================================================================
/root/.pyenv/versions/3.12.10/lib/python3.12/zipfile/__init__.py:1611: UserWarning: Duplicate name: 'model.pkl'
  return self._open_to_write(zinfo, force_zip64=force_zip64)

Note the Python warning UserWarning: Duplicate name: 'model.pkl'. This is the zipfile module warning that the same filename is being written to the archive multiple times.
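
One way to make this failure loud instead of silent is to escalate that warning to an error while the archive is written. A minimal sketch (the filter scope is an assumption; in pytest the same effect can be had globally with filterwarnings = error::UserWarning or -W error):

import io
import warnings
import zipfile

# Turn the zipfile "Duplicate name" UserWarning into an exception so the bug
# cannot slip through unnoticed.
with warnings.catch_warnings():
    warnings.simplefilter("error", UserWarning)
    with zipfile.ZipFile(io.BytesIO(), "w", zipfile.ZIP_DEFLATED) as zipf:
        zipf.writestr("model.pkl", b"chunk-1")
        zipf.writestr("model.pkl", b"chunk-2")  # now raises UserWarning: Duplicate name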

Full context

The download_artifact_files_from_response function is part of ZenML's artifact management system. It enables users to download their ML artifacts (models, datasets, etc.) from the artifact store to their local machine as a ZIP file.

The function is called by:

  • ArtifactVersionResponse.download_files() method in src/zenml/models/v2/core/artifact_version.py:507 - This is the public API users call to download artifacts

The function works by:

  1. Getting the artifact store associated with the artifact
  2. Listing all files in the artifact's URI using artifact_store.listdir()
  3. For each file, opening it from the artifact store and reading it in chunks
  4. Writing the chunks to a ZIP file

The chunking approach was intentionally designed to avoid loading large files entirely into memory (as stated in the comment on lines 623-624). However, the implementation is incorrect because zipfile.ZipFile.writestr() writes a complete entry per call: calling it repeatedly with the same filename adds a new entry each time, and only the last one is returned on read or kept on extraction, so the earlier chunks are effectively discarded.

External documentation

ZipFile.writestr(zinfo_or_arcname, data, compress_type=None, compresslevel=None)

Write a file into the archive. The contents is 'data', which may be either
a 'str' or a 'bytes' instance; if it is a 'str', it is encoded as UTF-8 first.
'zinfo_or_arcname' is either a ZipInfo instance or the name of the file in
the archive.

The documentation makes it clear that writestr writes a complete file entry with the provided data. It does not append to an existing entry.

ZipFile.write(filename, arcname=None, compress_type=None, compresslevel=None)

Write the file named filename to the archive, giving it the archive name
arcname (by default, this will be the same as filename, but without a drive
letter and with leading path separators removed).

The write() method is the proper way to add files from disk to a ZIP archive without loading them entirely into memory - the implementation should use this method instead.
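
For illustration, adding an on-disk file this way keeps memory usage flat regardless of file size (the paths below are placeholders):

import zipfile

# zipf.write() copies the named file from disk into the archive in small
# internal chunks, so the whole file never has to fit in memory at once.
with zipfile.ZipFile("artifacts.zip", "w", zipfile.ZIP_DEFLATED) as zipf:
    zipf.write("/tmp/model.pkl", arcname="model.pkl")

Because artifact store files are not necessarily on the local filesystem, this usually needs to be combined with a temporary local copy, as in Option 2 of the recommended fix below.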

Why has this bug gone undetected?

This bug has gone undetected for several reasons:

  1. Test data is too small: The existing integration test (test_download_artifact_files_from_response in tests/integration/functional/artifacts/test_utils.py:273) only validates downloading an artifact containing the single character "7" (1 byte). Since this is far below the 8192-byte chunk size, it never triggers the chunking loop, and the bug remains hidden.

  2. Silent failure: The function doesn't raise any exceptions; it silently produces corrupted output. Python's zipfile module does emit a warning (UserWarning: Duplicate name), but warnings are typically not visible in production and don't cause tests to fail.

  3. Limited real-world usage: The download functionality may not be heavily used yet, or users downloading small artifacts (< 8KB) wouldn't notice the issue. Additionally, users who encounter corrupted downloads might attribute the problem to network issues or other causes rather than reporting it as a ZenML bug.

  4. Recent introduction: The bug was introduced in commit 72dbb31 (PR Add download_files method for ArtifactVersion #2434, merged February 15, 2024), so it's less than a year old and may not have had extensive real-world exposure yet.

  5. Specific conditions required: The bug only manifests when:

    • Users explicitly call the download_files() method on an artifact
    • The artifact contains files larger than 8192 bytes
    • Users attempt to use the downloaded files

Recommended fix

Replace the chunked writestr approach with either:

Option 1: Collect all chunks in memory, then write once (simple but uses more memory):

with artifact_store.open(file_path, mode="rb") as store_file:
    CHUNK_SIZE = 8192
    chunks = []
    while True:
        if file_content := store_file.read(CHUNK_SIZE):
            chunks.append(file_content)  # <-- FIX 🟢 Collect chunks
        else:
            break
    zipf.writestr(file_str, b''.join(chunks))  # <-- FIX 🟢 Write once

Option 2: Use a temporary file (memory efficient for very large files):

import os
import tempfile

with artifact_store.open(file_path, mode="rb") as store_file:
    with tempfile.NamedTemporaryFile(delete=False) as temp_file:
        CHUNK_SIZE = 8192
        while True:
            if file_content := store_file.read(CHUNK_SIZE):
                temp_file.write(file_content)  # <-- FIX 🟢 Write chunk to temp file
            else:
                break
    # The temp file is closed (and fully flushed) once the inner block exits
    zipf.write(temp_file.name, arcname=file_str)  # <-- FIX 🟢 Add temp file to ZIP
    os.unlink(temp_file.name)  # Clean up the temporary file

Option 1 is simpler and acceptable for most artifact sizes. Option 2 is more memory-efficient for very large artifacts but requires additional temporary file management.
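
A third possible approach, sketched here under the assumption that Python 3.6+ is available (ZipFile.open() supports mode "w" from that version on), streams chunks directly into a single ZIP entry; it slots into the same loop as Options 1 and 2:

with artifact_store.open(file_path, mode="rb") as store_file:
    CHUNK_SIZE = 8192
    # zipf.open(..., mode="w") returns a writable stream for a single entry,
    # so successive chunks are appended to that entry instead of replacing it.
    with zipf.open(file_str, mode="w") as zip_entry:
        while True:
            if file_content := store_file.read(CHUNK_SIZE):
                zip_entry.write(file_content)
            else:
                break

This keeps the constant-memory intent of the original comment while writing each archive entry exactly once.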
