
Osdf optimization #332

Open
atripathy86 wants to merge 3 commits into master from osdf_optimization

Conversation

@atripathy86
Collaborator

Related Issues / Pull Requests

Improve Handling of OSDF pulls. Debug why pulls from caches fail occasionally.

Description

  • Switches OSDF downloads from in-memory (response.content) to streaming (iter_content) with inline MD5 computation, avoiding large memory allocations for big files.
  • Adds a --debug flag to cmf artifact pull for OSDF download diagnostics. The flag prints per-file debug info during OSDF artifact pulls: the cache URL actually attempted, file size, download time, and rate (KB/s or MB/s).
  • Resolves the actual cache server URL upfront via a HEAD request (following redirects, mirroring curl -L --head -w "%{url_effective}"), then retries the resolved URL up to 3 times before falling back to origin.
  • Origin downloads also retry up to 3 times. Logs per-attempt failures with attempt number.
  • In --debug mode, prints the resolved cache URL and per-attempt failure details.
  • Also surfaces the HTTP-resolved URL on successful downloads when it differs from the requested URL.
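The streaming-with-inline-MD5 idea in the first bullet can be sketched as follows. This is a minimal illustration using an in-memory stream in place of `response.iter_content`, and `stream_to_file_with_md5` is an invented name, not the PR's function:

```python
import hashlib
import io
import tempfile

def stream_to_file_with_md5(chunks, dest_path):
    """Write an iterable of byte chunks to dest_path, computing MD5 inline
    so the whole payload never has to sit in memory at once."""
    md5 = hashlib.md5()
    total = 0
    with open(dest_path, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip keep-alive empty chunks
                f.write(chunk)
                md5.update(chunk)
                total += len(chunk)
    return md5.hexdigest(), total

# Simulate a 256 KiB payload arriving in 64 KiB chunks, as iter_content would yield it.
payload = b"x" * (256 * 1024)
stream = io.BytesIO(payload)
chunks = iter(lambda: stream.read(65536), b"")

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    dest = tmp.name
digest, total = stream_to_file_with_md5(chunks, dest)
```

The digest computed chunk-by-chunk is identical to hashing the whole file at once, so the hash check needs no second pass over the data.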

What changes are proposed in this pull request?

  • Bug fix (non-breaking change which fixes an issue).
  • New feature (non-breaking change which adds functionality).
  • Breaking change (fix or feature that would cause existing functionality to not work as expected; for instance,
    examples in this repository need to be updated too).
  • This change requires a documentation update.

Checklist:

  • My code follows the style guidelines of this project (PEP-8 with Google-style docstrings).
  • My code modifies existing public API, or introduces new public API, and I updated or wrote docstrings that use Google-style formatting and any other formatting supported by mkdocs and the plugins this project uses.
  • I have commented my code.
  • My code requires documentation updates, and I have made corresponding changes to the documentation.
  • I have added tests that prove my fix is effective or that my feature works.
  • New and existing unit tests pass locally with my changes.

- Switches OSDF downloads from in-memory (response.content)
  to streaming (iter_content) with inline MD5 computation, avoiding large memory
  allocations for big files.
- Add --debug flag to cmf artifact pull for OSDF download diagnostics.
  The CLI flag prints per-file debug info during OSDF artifact pulls:
  the cache URL actually attempted, file size, download time, and rate
  (KB/s or MB/s).
…oads

- Resolves the actual cache server URL upfront via a HEAD request (following
redirects, mirroring `curl -L --head -w "%{url_effective}"`), then retries
the resolved URL up to 3 times before falling back to origin.
- Origin downloads also retry up to 3 times. Logs per-attempt failures with attempt number.
- In --debug mode, prints the resolved cache URL and per-attempt failure details.
- Also surfaces the HTTP-resolved URL on successful downloads when it differs from the requested URL.
- Moved 'Resolved cache to' print to before the download loop (using HEAD result)
  so it appears immediately for large files instead of after the download
  completes.
- Print resolved URL exactly once per file, always stripping auth
  params (?authz=). Print failure reason in logs unconditionally.
- Add 15s sleep with log message between retries.
- Refactor download_and_verify_file to return a 3-tuple (success, message, resolved_url)
  so callers have access to the actual GET-resolved URL.
- Add separator line and
  newline after each downloaded file in debug mode. (Martin's request)
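The retry shape described in these commit messages (three attempts, a fixed 15s sleep between them, the last failure propagated) can be sketched generically. `fetch_with_retries` and its parameters are illustrative names, not the PR's API; the sleep and log hooks are injected here only so the sketch runs instantly:

```python
import time

def fetch_with_retries(fetch, attempts=3, delay=15, sleep=time.sleep, log=print):
    """Call fetch() up to `attempts` times, sleeping `delay` seconds between
    tries. fetch() must return (success, message); if every attempt fails,
    the last (success, message) pair is returned to the caller."""
    success, message = False, "Not attempted"
    for attempt in range(1, attempts + 1):
        success, message = fetch()
        if success:
            return success, message
        log(f"Attempt {attempt}/{attempts} failed: {message}")
        if attempt < attempts:
            log(f"Sleeping {delay}s before retry {attempt + 1}/{attempts}...")
            sleep(delay)
    return success, message

# Example: a fetch that fails twice, then succeeds (sleep stubbed out).
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    return (calls["n"] >= 3, f"call {calls['n']}")

ok, msg = fetch_with_retries(flaky, sleep=lambda s: None, log=lambda m: None)
```

Keeping the loop generic like this also makes the cache path and the origin path share one retry policy instead of two hand-copied loops.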
Copilot AI (Contributor) left a comment
Pull request overview

Improves OSDF artifact pull reliability and observability by switching downloads to a streaming approach, adding retry logic for cache/origin, and exposing a --debug CLI flag for per-file diagnostics.

Changes:

  • Stream OSDF downloads with inline MD5 computation (reduces memory usage).
  • Add cache/origin retry loops and resolve cache redirect targets via HEAD.
  • Add cmf artifact pull --debug to print per-file diagnostics during OSDF pulls.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| cmflib/storage_backends/osdf_artifacts.py | Implements streaming download + MD5, retry logic, cache URL resolution, and debug printing. |
| cmflib/commands/artifact/pull.py | Wires a new --debug flag into OSDF artifact pulls and prints separators in debug mode. |


Comment on lines +196 to +197

```python
head_resp = requests.head(cached_s_url, headers=self.headers, timeout=5, verify=True, allow_redirects=True)
resolved_cache_url = head_resp.url
```
Comment on lines 170 to +235

```diff
 if cache == "":
-    #Fetch from Origin
-    success, result = download_and_verify_file(host, self.headers, download_path, local_path, artifact_hash, timeout=10)
-    if success:
-        #logger.info(result)
-        return success, result
-    #logger.error(f"Failed to download and verify file: {result}")
+    #Fetch from Origin, retry up to 3 times
+    success, result = False, "Not attempted"
+    for attempt in range(1, 4):
+        success, result, resolved_url = download_and_verify_file(host, self.headers, download_path, local_path, artifact_hash, timeout=10, debug=self.debug)
+        if success:
+            #logger.info(result)
+            return success, result
+        logger.error(f"Failed to download and verify file from origin (attempt {attempt}/3): {result}")
+        if self.debug:
+            print(f"  [DEBUG] Origin attempt {attempt}/3 failed: {result}")
+        if attempt < 3:
+            logger.info(f"Sleeping 15s before retry {attempt+1}/3...")
+            if self.debug:
+                print(f"  Sleeping 15s before retry {attempt+1}/3...")
+            time.sleep(15)
     return success, result
 else:
     #Generate Cached path for artifact
     cached_s_url = generate_cached_url(host, cache)
-    #Try to fetch from cache first
-    success, cached_result = download_and_verify_file(cached_s_url, self.headers, download_path, local_path, artifact_hash, timeout=5)
-    if success:
-        #logger.info(cached_result)
-        return success, cached_result
-    else:
-        logger.error(f"Failed to download and verify file from cache: {cached_result}")
-        logger.info(f"Trying Origin at {host}")
-        #Fetch from Origin
-        success, origin_result = download_and_verify_file(host, self.headers, download_path, local_path, artifact_hash, timeout=10)
-        if success:
-            #logger.info(origin_result)
-            return success, origin_result
-        #logger.error(f"Failed to download and verify file: {origin_result}")
-        return success, origin_result
+    if self.debug:
+        print(f"Fetching Artifact {os.path.basename(local_path)}")
+        #print(f"  [DEBUG] Trying cache at {cached_s_url}")
+    # Follow redirects via HEAD to find the actual cache server (mirrors: curl -L --head -w "%{url_effective}")
+    resolved_cache_url = cached_s_url
+    try:
+        head_resp = requests.head(cached_s_url, headers=self.headers, timeout=5, verify=True, allow_redirects=True)
+        resolved_cache_url = head_resp.url
+    except Exception:
+        pass  # If HEAD fails, fall back to original URL for the GET
+    if self.debug:
+        print(f"  [DEBUG] Resolved cache to {resolved_cache_url.split('?')[0]}")
+    #Try to fetch from cache first (using resolved URL to skip redirect in the download)
+    #Retry up to 3 times on the resolved URL before falling back to origin
+    success, cached_result = False, "Not attempted"
+    for attempt in range(1, 4):
+        success, cached_result, resolved_url = download_and_verify_file(resolved_cache_url, self.headers, download_path, local_path, artifact_hash, timeout=5, debug=self.debug)
+        if success:
+            #logger.info(cached_result)
+            return success, cached_result
+        logger.error(f"Failed to download and verify file from cache (attempt {attempt}/3): {cached_result}")
+        if self.debug:
+            print(f"  [DEBUG] Cache attempt {attempt}/3 failed: {cached_result}")
+        if attempt < 3:
+            logger.info(f"Sleeping 15s before retry {attempt+1}/3...")
+            if self.debug:
+                print(f"  Sleeping 15s before retry {attempt+1}/3...")
+            time.sleep(15)
+    logger.error(f"All 3 cache attempts failed, falling back to origin.")
+    logger.info(f"Trying Origin at {host}")
+    #Fetch from Origin, retry up to 3 times
+    success, origin_result = False, "Not attempted"
+    for attempt in range(1, 4):
+        success, origin_result, resolved_url = download_and_verify_file(host, self.headers, download_path, local_path, artifact_hash, timeout=10, debug=self.debug)
+        if success:
+            #logger.info(origin_result)
+            return success, origin_result
+        logger.error(f"Failed to download and verify file from origin (attempt {attempt}/3): {origin_result}")
+        if self.debug:
+            print(f"  [DEBUG] Origin attempt {attempt}/3 failed: {origin_result}")
+        if attempt < 3:
+            logger.info(f"Sleeping 15s before retry {attempt+1}/3...")
+            if self.debug:
+                print(f"  Sleeping 15s before retry {attempt+1}/3...")
+            time.sleep(15)
+    return success, origin_result
```
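The cache-resolution step mirrors `curl -L --head -w "%{url_effective}"`. A stdlib-only sketch of the same idea follows, using `urllib` in place of the `requests.head` call in the PR; `resolve_effective_url`, `_Redirector`, and the local test server are all illustrative, not part of the codebase (note that `urllib` may reissue a redirected HEAD as a GET, so the demo server answers both verbs):

```python
import http.server
import threading
import urllib.request

def resolve_effective_url(url, timeout=5):
    """Follow redirects with a HEAD request and return the final URL,
    mirroring `curl -L --head -w "%{url_effective}"`; on any failure,
    fall back to the original URL."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.geturl()  # urlopen follows redirects automatically
    except Exception:
        return url

# Tiny local server standing in for an OSDF redirector: /a 302-redirects to /b.
class _Redirector(http.server.BaseHTTPRequestHandler):
    def do_HEAD(self):
        if self.path == "/a":
            self.send_response(302)
            self.send_header("Location", "/b")
            self.end_headers()
        else:
            self.send_response(200)
            self.end_headers()
    do_GET = do_HEAD  # urllib may downgrade the redirected HEAD to a GET
    def log_message(self, *args):  # keep the demo quiet
        pass

srv = http.server.HTTPServer(("127.0.0.1", 0), _Redirector)
threading.Thread(target=srv.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{srv.server_address[1]}"
final = resolve_effective_url(f"{base}/a")
srv.shutdown()
```

Resolving once up front means the retry loop hits the concrete cache server directly instead of re-walking the redirect chain on every attempt.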
Comment on lines 55 to 75

```diff
@@ -65,61 +67,67 @@ def download_and_verify_file(host, headers, remote_file_path, local_path, artifa
         local_path (str): Logical or original path/name used for logging and status messages.
         artifact_hash (str): Expected MD5 hash of the artifact, used for integrity verification.
         timeout (float | int): Timeout (in seconds) for the HTTP request.
+        debug (bool): When True, print per-file size, timing, and rate to console.

     Returns:
         tuple[bool, str]: A pair ``(success, message)`` where ``success`` indicates whether the
             download and MD5 verification succeeded, and ``message`` describes the outcome.
     """
```
Comment on lines +79 to +80

```python
response = requests.get(host, headers=headers, timeout=timeout, verify=True, stream=True)  # This should be made True, otherwise this will produce an Insecure SSL Warning
resolved_url = response.url.split('?')[0]  # final URL after any redirects, strip auth params
```
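`response.url.split('?')[0]` drops everything from the first `?`, which is what removes the `?authz=` token before the URL is printed. An equivalent, slightly more explicit sketch with `urllib.parse` (the helper name and example URL are made up for illustration):

```python
from urllib.parse import urlsplit, urlunsplit

def strip_query(url):
    """Drop the query and fragment (e.g. a ?authz=... token) before logging a URL."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

safe = strip_query("https://cache.example.org/osdf/file.bin?authz=secret-token#frag")
```

Either form keeps bearer tokens out of logs and debug output while leaving the host and path visible.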
Comment on lines +90 to +117

```python
try:
    start_dl = time.time()
    total_bytes = 0
    md5 = hashlib.md5()
    with open(remote_file_path, 'wb') as f:
        for chunk in response.iter_content(chunk_size=65536):
            if chunk:
                f.write(chunk)
                md5.update(chunk)
                total_bytes += len(chunk)
    elapsed = time.time() - start_dl
except Exception as e:
    return False, f"An error occurred while writing to the file: {e}", resolved_url

if total_bytes == 0:
    return False, "No data received from the server.", resolved_url

md5_hash = md5.hexdigest()
#logger.debug(f"MD5 hash of the downloaded file is: {md5_hash}")
#logger.debug(f"Artifact hash from MLMD records is: {artifact_hash}")
if artifact_hash == md5_hash:
    #logger.debug("MD5 hash of the downloaded file matches the hash in MLMD records.")
    stmt = f"object {local_path} downloaded at {remote_file_path} and matches MLMD records."
    success = True
else:
    #logger.error("Error: MD5 hash of the downloaded file does not match the hash in MLMD records.")
    stmt = f"object {local_path} downloaded at {remote_file_path} does NOT match MLMD records."
    success = False
```
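The debug output derives a rate in KB/s or MB/s from `total_bytes` and `elapsed`. One possible shape for that formatting, as a hedged sketch (`format_rate` is an invented helper; the PR's actual thresholds and precision may differ):

```python
def format_rate(total_bytes, elapsed_seconds):
    """Return a human-readable transfer rate, switching from KB/s to MB/s
    once the rate reaches one mebibyte per second."""
    if elapsed_seconds <= 0:
        return "n/a"
    bps = total_bytes / elapsed_seconds
    if bps >= 1024 * 1024:
        return f"{bps / (1024 * 1024):.2f} MB/s"
    return f"{bps / 1024:.2f} KB/s"

r1 = format_rate(512 * 1024, 1.0)       # half a mebibyte in one second
r2 = format_rate(8 * 1024 * 1024, 2.0)  # eight mebibytes in two seconds
```

Guarding against `elapsed_seconds <= 0` matters for very small files, where the timer can register effectively zero elapsed time.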
Comment on lines +42 to 54

```python
# NOTE: calculate_md5_from_file is superseded by in-chunk MD5 calculation during streaming
# in download_and_verify_file(). Kept here for reference.
# def calculate_md5_from_file(file_path, chunk_size=8192):
#     md5 = hashlib.md5()
#     try:
#         with open(file_path, 'rb') as f:
#             while chunk := f.read(chunk_size):
#                 md5.update(chunk)
#     except Exception as e:
#         logger.error(f"[calculate_md5_from_file] An error occurred while reading the file: {e}")
#         return None
#     return md5.hexdigest()
```