Refactor shard download to use local paths #625
base: main
Conversation
Updated the download function to save files locally and ensure directories exist before downloading.
Walkthrough
Implements a safer, directory-aware download workflow in SharedShardedDataset.download_files: accepts local destination paths, derives S3 keys from basenames, ensures directories exist, streams objects without immediate load, downloads to temporary locations with progress, and atomically moves them to final paths. Adds shutil import and updates docstrings accordingly.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant C as Caller
    participant SSD as SharedShardedDataset
    participant S3 as S3 Storage
    participant TMP as Temp File
    participant FS as Local FS
    C->>SSD: download_files(local_tokens_path, local_ids_path)
    SSD->>FS: ensure parent dirs exist
    SSD->>SSD: derive s3 keys from basenames
    SSD->>S3: s3_get_object(key, load_data=false, progress=true)
    S3-->>SSD: temp file handle/path
    SSD->>TMP: validate temp paths
    SSD->>FS: shutil.move(tmp_tokens -> tokens_path)
    SSD->>FS: shutil.move(tmp_ids -> ids_path)
    SSD-->>C: return individual download results
```
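For orientation, the flow above boils down to something like the sketch below. This is a simplified standalone function, not the actual SharedShardedDataset method; the argument order for `s3_get_object` is taken from the hunks quoted later in this review and may differ in detail from the real signature in `comms.py`.

```python
import asyncio
import os
import shutil


async def download_files(comms, bucket, tokens_file: str, ids_file: str):
    """Sketch of the reworked flow: local destination paths in, shards on disk out."""
    # Make sure the destination directories exist before downloading.
    # (The "or '.'" guards against bare filenames with no directory part.)
    for path in (tokens_file, ids_file):
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)

    # S3 object keys are derived from the shard filenames.
    tokens_key = os.path.basename(tokens_file)
    ids_key = os.path.basename(ids_file)

    # load_data=False streams each object to a temporary file and returns its path.
    results = await asyncio.gather(
        comms.s3_get_object(tokens_key, bucket, load_data=False, show_progress=True),
        comms.s3_get_object(ids_key, bucket, load_data=False, show_progress=True),
    )

    # Move each downloaded temp file into its final location.
    for tmp_path, final_path in zip(results, (tokens_file, ids_file)):
        if tmp_path and os.path.exists(tmp_path):
            shutil.move(tmp_path, final_path)

    return results
```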
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (1 warning), ✅ Passed checks (2 passed)
Actionable comments posted: 3
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/tplr/sharded_dataset.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/tplr/sharded_dataset.py (1)
src/tplr/comms.py (1)
s3_get_object (477-625)
🔇 Additional comments (1)
src/tplr/sharded_dataset.py (1)
20-20: LGTM! Necessary import for file operations.
The `shutil` import is required for the `shutil.move` operations added in the `download_files` method.
```python
# Extract just the filenames for S3 object keys
tokens_filename = os.path.basename(tokens_file)
ids_filename = os.path.basename(ids_file)

# Ensure the local directory exists
os.makedirs(os.path.dirname(tokens_file), exist_ok=True)
```
🛠️ Refactor suggestion | 🟠 Major
🧩 Analysis chain
Design flaw: Workaround creates intermediate files in the current directory.
The approach of extracting basenames and passing them to s3_get_object while managing full paths separately creates a fragile two-step move operation:

- `s3_get_object` with `key="train_000001.npy"` (basename) downloads to a temp file, then moves it into the current working directory
- Lines 294-296 then perform a second move to the final destination
This workaround stems from s3_get_object conflating the S3 object key with the local destination path when load_data=False. The issues are:
- Intermediate files: Files land in the current directory before the second move, which can fail if the directory is not writable or on a different filesystem
- Efficiency: Two move operations instead of one direct temp→destination move
- Cleanup risk: If the second move fails, intermediate files remain in the current directory
Consider one of these solutions:

- Modify `s3_get_object` (in `comms.py`) to accept separate `s3_key` and `local_path` parameters when `load_data=False`
- Pass full paths to `s3_get_object` and modify it to derive S3 keys from basenames internally
- Download to memory (avoid `load_data=False`) and write to the desired location directly in this method (a rough sketch of this option follows below)
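A rough sketch of that last option, with the caveat that it assumes `load_data=True` returns the raw object bytes; if `s3_get_object` deserializes the payload instead, the write step would need the matching save call:

```python
import os


async def download_shard_in_memory(comms, bucket, s3_key: str, local_path: str):
    """Hypothetical variant: fetch the object in memory, then write it where it belongs."""
    os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)

    # Assumption: load_data=True yields the object's raw bytes.
    data = await comms.s3_get_object(s3_key, bucket, load_data=True)
    if data is None:
        raise RuntimeError(f"Failed to download {s3_key}")

    with open(local_path, "wb") as f:
        f.write(data)
    return local_path
```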
To verify the current directory behavior, run:
🏁 Script executed:

```bash
#!/bin/bash
# Verify where files land when s3_get_object is called with basenames
# Check s3_get_object implementation for load_data=False path
rg -A 10 'if not load_data:' src/tplr/comms.py
```

Length of output: 0

🏁 Script executed:

```bash
#!/bin/bash
rg -n 'def s3_get_object' -C5 .
```

Length of output: 5573

🏁 Script executed:

```bash
#!/bin/bash
rg -n -C50 'async def s3_get_object' src/tplr/comms.py
```

Length of output: 11226

🏁 Script executed:

```bash
rg -n 'if load_data' -C10 src/tplr/comms.py
```

Length of output: 924

🏁 Script executed:

```bash
sed -n '550,650p' src/tplr/comms.py
```

Length of output: 4133
Refactor s3_get_object to accept an explicit download path
In src/tplr/sharded_dataset.py (lines 268–274), passing only the basename as key with load_data=False makes s3_get_object first move the file into the current working directory and then your code moves it again—this two-step move is fragile, inefficient, and leaves orphaned files on failure. Change s3_get_object to take separate s3_key and local_path parameters (or accept a full destination path), so it writes directly to the final directory.
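One possible shape for that refactor, sketched as a free function; the helper names `download_to_temp` and `load_from_file` are placeholders for the existing streaming and deserialization logic in `comms.py`, not real methods:

```python
import os
import shutil


async def s3_get_object(comms, s3_key: str, bucket, *, local_path: str | None = None,
                        load_data: bool = True, show_progress: bool = False):
    """Hypothetical refactor: the S3 key and the local destination are separate arguments."""
    # Stands in for the existing code that streams the object to a temp file.
    tmp_path = await comms.download_to_temp(s3_key, bucket, show_progress=show_progress)

    if load_data:
        # Existing behaviour: deserialize and return the object.
        return comms.load_from_file(tmp_path)

    # New behaviour: one move, straight from the temp file to the caller's path.
    destination = local_path or os.path.basename(s3_key)
    os.makedirs(os.path.dirname(destination) or ".", exist_ok=True)
    shutil.move(tmp_path, destination)
    return destination
```

With something like this in place, download_files could pass the basename as `s3_key` and the full destination as `local_path`, and drop its second `shutil.move` entirely.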
```diff
 results = await asyncio.gather(
     self.comms.s3_get_object(
-        tokens_file,
+        tokens_filename,  # S3 object key (just filename)
         bucket,
         load_data=False,
         show_progress=True,
     ),
     self.comms.s3_get_object(
-        ids_file,
+        ids_filename,  # S3 object key (just filename)
         bucket,
         load_data=False,
         show_progress=True,
     ),
 )
```
Missing error handling for download failures.
The code doesn't handle cases where s3_get_object returns None or a status dictionary (e.g., {"__status": "TOO_EARLY"}, {"__status": "TOO_LATE"}). Lines 293-296 will fail when checking os.path.exists(results[0]) if results[0] is None or a dict.
Add error handling after the asyncio.gather call:
```diff
 results = await asyncio.gather(
     self.comms.s3_get_object(
         tokens_filename,  # S3 object key (just filename)
         bucket,
         load_data=False,
         show_progress=True,
     ),
     self.comms.s3_get_object(
         ids_filename,  # S3 object key (just filename)
         bucket,
         load_data=False,
         show_progress=True,
     ),
 )
+
+# Validate download results
+for i, (result, filename) in enumerate(zip(results, [tokens_file, ids_file])):
+    if result is None:
+        raise RuntimeError(f"Failed to download {os.path.basename(filename)}")
+    if isinstance(result, dict) and "__status" in result:
+        raise RuntimeError(f"Download skipped for {os.path.basename(filename)}: {result['__status']}")
```

Committable suggestion skipped: line range outside the PR's diff.
```python
# Move downloaded files to correct locations
# s3_get_object with load_data=False returns the temp file path
if results[0] and os.path.exists(results[0]):
    shutil.move(results[0], tokens_file)
if results[1] and os.path.exists(results[1]):
    shutil.move(results[1], ids_file)

return results
```
Improve robustness of file move operations.
The current implementation has several issues:
- Type safety: `results[0]` might be `None`, a dict, or a string path - the code only handles the string case
- Incomplete validation: the `os.path.exists` check happens after potential type errors
- Silent failures: If a file doesn't exist or a move fails, the code continues without error
- Misleading return: Returns the intermediate paths from `s3_get_object`, which no longer exist after the moves
Apply this diff to improve error handling and validation:
```diff
-# Move downloaded files to correct locations
-# s3_get_object with load_data=False returns the temp file path
-if results[0] and os.path.exists(results[0]):
-    shutil.move(results[0], tokens_file)
-if results[1] and os.path.exists(results[1]):
-    shutil.move(results[1], ids_file)
-
-return results
+# Move downloaded files to correct locations
+# s3_get_object with load_data=False returns the intermediate file path
+moved_files = []
+for result, final_path in zip(results, [tokens_file, ids_file]):
+    if not isinstance(result, str):
+        raise RuntimeError(f"Expected file path from download, got {type(result)}")
+    if not os.path.exists(result):
+        raise FileNotFoundError(f"Downloaded file not found: {result}")
+
+    try:
+        final_location = shutil.move(result, final_path)
+        moved_files.append(final_location)
+    except Exception as e:
+        raise RuntimeError(f"Failed to move {result} to {final_path}: {e}")
+
+return moved_files
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```suggestion
# Move downloaded files to correct locations
# s3_get_object with load_data=False returns the intermediate file path
moved_files = []
for result, final_path in zip(results, [tokens_file, ids_file]):
    if not isinstance(result, str):
        raise RuntimeError(f"Expected file path from download, got {type(result)}")
    if not os.path.exists(result):
        raise FileNotFoundError(f"Downloaded file not found: {result}")

    try:
        final_location = shutil.move(result, final_path)
        moved_files.append(final_location)
    except Exception as e:
        raise RuntimeError(f"Failed to move {result} to {final_path}: {e}")

return moved_files
```
🤖 Prompt for AI Agents
In src/tplr/sharded_dataset.py around lines 291 to 298, the file-move logic is
fragile: results entries can be None or non-string, existence is checked too
late, failures are silent, and the function returns intermediate temp paths that
no longer exist; update the code to validate each result is a non-empty string
before touching the filesystem, check os.path.exists on that validated path,
wrap shutil.move in a try/except to raise or log a clear error on failure, and
after successful moves replace the corresponding entries in the returned value
with the final destination paths (tokens_file and ids_file) so the caller
receives current, correct paths — if a move cannot be completed, raise an
exception rather than silently continuing.
Description
Related Issue(s)
Type of Change
Branch Naming
Commit Messages
Code Quality
Testing
Documentation
If this is a breaking change
Screenshots/Examples
Additional Notes
Summary by CodeRabbit
New Features
Bug Fixes
Documentation