[RFC] Design proposal for `model_download` custom download path

## Proposal: Adding `output_dir` to `ModelHttpResolver` for Direct Downloading

This proposal outlines a plan to add an `output_dir` argument to the `__call__` method of the `ModelHttpResolver` class. This new argument will allow users to download models directly to a specified directory, bypassing the default caching mechanism and using a flat directory structure.

**Motivation:**

This feature is primarily driven by user-feedback (https://github.com/Kaggle/kagglehub/issues/175) and the need for better integration between `kagglehub` and other libraries, specifically `torchtune` (see https://github.com/pytorch/torchtune/issues/1852). 

> NOTE: `torchtune` currently uses `huggingface_hub` for model downloading, leveraging its `snapshot_download` function and its [`local_dir`](https://huggingface.co/docs/huggingface_hub/en/guides/download#download-files-to-a-local-folder) argument for specifying custom download locations.

**Proposed Changes:**

1. **Introduce a `Cache` class:**

   This class will encapsulate the caching logic from `cache.py`, providing a cleaner interface and enabling the flexible directory structure based on the `override_dir` argument.

   ```python
   class Cache:
       def __init__(self, override_dir: Optional[str] = None):
           self.override_dir = override_dir

       def get_path(self, handle: ResourceHandle, path: Optional[str] = None) -> str:
           if self.override_dir:
               return os.path.join(self.override_dir, path or "")  # Flat structure
           else:
               return get_cached_path(handle, path)  # Existing nested structure
   
       def get_archive_path(self, handle: ResourceHandle) -> str:
           if self.override_dir:
               return os.path.join(self.override_dir, f"{handle.version!s}.archive")
           else:
               return get_cached_archive_path(handle)
   
       def get_completion_marker_filepath(self, handle: ResourceHandle, path: Optional[str] = None) -> str:
           if self.override_dir:
                return os.path.join(self.override_dir, f"{path or handle.version!s}.complete")
           else:
               return _get_completion_marker_filepath(handle, path)

       def load_from_cache(self, handle: ResourceHandle, path: Optional[str] = None) -> Optional[str]:
           """Return path for the requested resource from the cache or output_dir."""
           marker_path = self.get_completion_marker_filepath(handle, path)
           full_path = self.get_path(handle, path)
           return full_path if os.path.exists(marker_path) and os.path.exists(full_path) else None

       # Migrate other relevant methods from cache.py (e.g., mark_as_complete, delete_from_cache)  to the Cache class, adapting them similarly.
   ```

3. **Modify `ModelHttpResolver.__call__`:**

   Add the `output_dir` argument and instantiate the `Cache` class based on its value.

   ```python
   def __call__(
       self, h: ModelHandle, path: Optional[str] = None, *, force_download: Optional[bool] = False, output_dir: Optional[str] = None
   ) -> str:
       cache = Cache(override_dir=output_dir)
       # ...use cache.get_path(), cache.get_archive_path(), cache.load_from_cache(), and cache.get_completion_marker_filepath()...
   ```

4. **Pass `output_dir` through `model_download`:**

   ```python
   def model_download(
       # ...existing arguments
       output_dir: Optional[str] = None,

   ):
       # Existing code...
       return registry.model_resolver(h, path, force_download=force_download, output_dir=output_dir)
   ```

5. **Handle `output_dir` in other Resolvers:**

   Add warning logs to other resolvers to inform users that the `output_dir` argument is not supported and will be ignored.

**Advantages:**

*   **Flexibility:** Users can choose between using the cache or downloading directly to a specific directory. This also allows users to bypass the deeply nested subdirectory structure in the default cache.
*   **Improved Code Organization:** Encapsulating caching logic within the `Cache` class discourages use of global variables, promoting better maintainability and extensibility.

**Request for Feedback:**

This proposal outlines one possible implementation. Any feedback on alternative implementations is appreciated. Some other passing thoughts include:

*   **`Cache` class design:** Is this the best way to encapsulate the caching logic? Are there alternative approaches?
*   **Flat directory structure:** Is this the most user-friendly approach for direct downloads? Should we consider other directory structures?
*   **Error handling:** How should errors related to `output_dir` (e.g., permissions issues) be handled?
*   **Interaction with existing cache:** If `output_dir` is specified, should we completely bypass the cache or should we consider using it as a secondary storage location? How might we handle cache invalidation in such a case?
*   **Naming**: I chose `output_dir` to align with `torchtune`. But `download_path` was suggested in https://github.com/Kaggle/kagglehub/issues/175. This may or may not make the intention clearer. 



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[RFC] Design proposal for `model_download` custom download path #179

Proposal: Adding `output_dir` to `ModelHttpResolver` for Direct Downloading

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[RFC] Design proposal for model_download custom download path #179

Description

Proposal: Adding output_dir to ModelHttpResolver for Direct Downloading

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

[RFC] Design proposal for `model_download` custom download path #179

Proposal: Adding `output_dir` to `ModelHttpResolver` for Direct Downloading