
fix: preserve streamed SBOM metadata and exclude HF cache files #673

Merged
mldangelo merged 2 commits into main from fix/streamed-sbom-artifacts on Mar 10, 2026


Conversation


@mldangelo (Member) commented on Mar 10, 2026

Follow-up to #672.

Summary

  • preserve per-component size and SHA-256 metadata for streamed SBOM entries even after streamed files are deleted
  • keep generating streamed SBOMs from the scanned streamed artifact list
  • exclude Hugging Face cache bookkeeping files like .metadata and .gitignore from remote SBOMs and asset lists
  • preserve real downloaded artifacts such as merges.txt in non-streamed Hugging Face SBOMs
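In sketch form, the preservation in the first bullet amounts to capturing size and hash while the streamed file is still on disk. The function name and metadata keys ("file_size", "sha256") here are illustrative assumptions, not modelaudit's actual internals:

```python
import hashlib
import os

def record_streamed_metadata(path: str, metadata: dict) -> dict:
    """Capture size and SHA-256 while the streamed file still exists,
    so SBOM components stay populated after the file is deleted.
    (Hypothetical helper; names are illustrative, not modelaudit's API.)"""
    with open(path, "rb") as fh:
        data = fh.read()
    metadata[path] = {
        "file_size": os.path.getsize(path),
        "sha256": hashlib.sha256(data).hexdigest(),
    }
    return metadata
```

Once the streamed file is deleted, the SBOM generator can still read the recorded size and digest from this map instead of re-statting a path that no longer exists.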

Validation

  • uv run ruff format modelaudit/ tests/
  • uv run ruff check --fix modelaudit/ tests/
  • uv run mypy modelaudit/
  • uv run pytest -n auto -m "not slow and not integration" --maxfail=1

End-to-end checks

  • uv run modelaudit scan --stream --quiet --format sarif --output .../recheck-gpt2-stream.sarif --sbom .../recheck-gpt2-stream.sbom.json hf://openai-community/gpt2
    • result: 24 SBOM components, no zero-size components, no missing hashes, SARIF results = 0
  • uv run modelaudit scan --quiet --format json --output .../recheck-tiny-full.json --sbom .../recheck-tiny-full.sbom.json hf://sshleifer/tiny-gpt2
    • result: merges.txt present in the SBOM and no .metadata / .gitignore cache components
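The "no zero-size components, no missing hashes" result above can be spot-checked programmatically. This sketch assumes a CycloneDX-style component layout (a size property plus a hashes list), matching the tests in this PR; it is a validation aid, not part of modelaudit:

```python
def check_sbom(sbom: dict) -> bool:
    """Assert every SBOM component has a nonzero size property and a
    SHA-256 hash entry; return True when all components pass."""
    components = sbom.get("components", [])
    assert components, "SBOM has no components"
    for component in components:
        # Properties are name/value pairs; collect them into a dict.
        props = {p["name"]: p["value"] for p in component.get("properties", [])}
        assert int(props.get("size", "0")) > 0, f"zero-size component: {component.get('name')}"
        # Match the hash by algorithm rather than by list position.
        algs = {h["alg"] for h in component.get("hashes", [])}
        assert "SHA-256" in algs, f"missing SHA-256 hash: {component.get('name')}"
    return True
```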

Notes

The full pytest command hit an unrelated, one-off timing failure in tests/test_cache_optimizations.py::TestCacheOptimizationPerformance::test_configuration_extraction_performance on the first run. The isolated test passed immediately, and the full command passed on rerun.

Summary by CodeRabbit

  • Bug Fixes

    • Streamed artifacts are now correctly included in SBOMs when using streaming scans.
    • HuggingFace download cache bookkeeping files are excluded from SBOMs, asset lists, and directory scans.
    • SBOM components now reliably surface accurate file sizes and SHA-256 hashes.
  • Tests

    • Expanded tests to cover streamed SBOMs, per-file metadata, and exclusion of cache bookkeeping files.

Fix streamed SBOM generation so deleted streamed artifacts still retain component size and SHA-256 metadata.

Also exclude Hugging Face cache bookkeeping files from remote SBOMs and asset lists while preserving real downloaded artifacts like merges.txt.

coderabbitai bot commented Mar 10, 2026

Walkthrough

This PR fixes SBOM generation with --stream by selecting streamed asset paths for SBOMs, skipping HuggingFace cache bookkeeping files during scans, enriching per-file metadata with size and sha256, and adding helpers to resolve component size/hash and filter irrelevant SBOM files.

Changes

Changes by cohort / file(s):

  • CLI SBOM path selection (modelaudit/cli.py): use deduplicated streamed asset paths as the SBOM paths when assets exist and final_scan_and_delete is true; otherwise fall back to the scanned/expanded paths.
  • Core scanning & streaming merge (modelaudit/core.py): skip HuggingFace download-cache bookkeeping files during directory iteration (with debug logs). When merging streamed scan results, inject per-file file_size, ensure sha256 exists in file_hashes, and attach file_metadata to merged results.
  • SBOM generation helpers & filtering (modelaudit/integrations/sbom_generator.py): add _resolve_component_size_and_sha256(path, metadata) to prefer on-disk size/hash with metadata fallback, and _should_skip_sbom_file(path) to filter cache/metadata/lock/git-ref files. Use these helpers in both the Pydantic and non-Pydantic component generation flows and skip irrelevant files early.
  • Tests (tests/test_cli.py): extend the mock scan helper to accept file_metadata; add and adjust tests to validate that streamed assets appear as SBOM components with correct size/sha256 and that HuggingFace cache bookkeeping files are excluded in streamed and directory scans.
  • Changelog (CHANGELOG.md): document the fixes: include streamed artifacts as SBOM components with --stream --sbom, and exclude HuggingFace download cache bookkeeping files from remote SBOMs and asset lists.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I hopped through streams and chased each file,

sizes and sha256s now sitting in a pile.
Cache crumbs I skipped with a twinkle and grin,
SBOMs now list the true artifacts within.
Hooray — the rabbit fixed the pipeline spin! 🥕

🚥 Pre-merge checks: ✅ 5 passed
  • Description Check: passed (check skipped; CodeRabbit's high-level summary is enabled).
  • Title Check: passed. The title clearly describes the main changes: preserving streamed SBOM metadata and excluding HuggingFace cache files, which are the core fixes in this PR.
  • Linked Issues Check: passed. The code changes fully address the issue #671 requirements: preserve streamed artifact metadata with size/SHA-256, generate SBOMs from scanned assets, and exclude HuggingFace cache bookkeeping files.
  • Out of Scope Changes Check: passed. All changes are directly related to fixing the SBOM streaming issue and HuggingFace cache handling; no unrelated functionality was introduced.
  • Docstring Coverage: passed. Docstring coverage is 81.25%, which meets the required threshold of 80.00%.



@coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelaudit/core.py`:
- Around line 666-670: The skip for HuggingFace cache files is applied too
broadly because _is_huggingface_cache_file(file_path) matches names like "main"
or "HEAD" regardless of location; change the check so we only skip when the file
is actually inside a HuggingFace cache directory. Update the branch where
file_path is tested (the code calling _is_huggingface_cache_file) to require
both that _is_huggingface_cache_file(file_path) is true and that the path is
inside a HF cache root (e.g., inspect pathlib.Path(file_path).parents for known
cache markers like ".cache/huggingface" or implement a helper
_is_within_hf_cache_dir(file_path) and call that together with
_is_huggingface_cache_file), or alternatively modify _is_huggingface_cache_file
to perform the parent-directory check itself; only then log and continue.

In `@modelaudit/integrations/sbom_generator.py`:
- Around line 394-395: The SBOM skip predicate (_should_skip_sbom_file) is only
applied when walking directories, so the flat list paths_for_sbom and the
single-file/else branches still include .metadata/.gitignore/lock files; update
the code paths that iterate over paths_for_sbom and the single-file emission
branch to call _should_skip_sbom_file(fp) (or the local file variable) and
continue when it returns True, ensuring the same skip behavior for streamed/flat
inputs as for directory walks.

In `@tests/test_cli.py`:
- Around line 791-793: The test function
test_scan_huggingface_streaming_sbom_includes_streamed_assets is missing the
required typing: add a return type annotation "-> None" and annotate the
tmp_path parameter as "tmp_path: Path" (import Path from pathlib if not
already). Apply the same change (add "-> None" and type tmp_path as Path or
monkeypatch as pytest.MonkeyPatch where appropriate) to the other test functions
flagged in the review so all tests follow the repo convention.
- Around line 855-861: The test currently assumes SHA-256 is at index 0 via
component["hashes"][0], which breaks if hashes are reordered or expanded; update
the loop in tests/test_cli.py that iterates streamed_files (using variables
file_path, component, properties) to locate the hash entry with alg == "SHA-256"
(e.g. use a generator/filter or next(...) to find the matching dict) and assert
its "content" equals file_hashes[str(file_path)]; also fail the test with a
clear message if no SHA-256 entry is found instead of indexing into position 0.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: b93b86ae-ea1f-47ea-9d1b-1aed3d2b97e2

📥 Commits

Reviewing files that changed from the base of the PR and between 9d77d50 and 00bb261.

📒 Files selected for processing (5)
  • CHANGELOG.md
  • modelaudit/cli.py
  • modelaudit/core.py
  • modelaudit/integrations/sbom_generator.py
  • tests/test_cli.py

Comment on lines +666 to +670
# HuggingFace cache bookkeeping files should never surface as
# scan assets or SBOM components for downloaded models.
if _is_huggingface_cache_file(file_path):
    logger.debug(f"Skipping HuggingFace cache file: {file_path}")
    continue

⚠️ Potential issue | 🟠 Major

Don't apply the Hugging Face cache skip globally.

_is_huggingface_cache_file() matches names like main and HEAD regardless of location. Because this new branch runs for every directory scan, a normal local artifact with one of those names now disappears from scanning and SBOM generation.

Proposed fix
-                    if _is_huggingface_cache_file(file_path):
+                    if is_hf_cache and _is_huggingface_cache_file(file_path):
                         logger.debug(f"Skipping HuggingFace cache file: {file_path}")
                         continue

Comment on lines +394 to +395
if _should_skip_sbom_file(fp):
    continue

⚠️ Potential issue | 🟠 Major

Apply the SBOM skip filter to single-file inputs too.

This only filters directory walks. In the new streamed flow, paths_for_sbom is a flat list of asset files, so the else branches below still emit .metadata/.gitignore/lock files if one reaches the asset list.

Proposed fix
         else:
+            if _should_skip_sbom_file(input_path):
+                continue
             meta_model = file_meta.get(input_path)
             # Convert Pydantic model to dict if needed
             if meta_model is not None and hasattr(meta_model, "model_dump"):
                 meta = meta_model.model_dump()
             else:
                 meta = meta_model or {}
             component = _component_for_file(input_path, meta, issues_dicts)
             bom.components.add(component)
         else:
+            if _should_skip_sbom_file(input_path):
+                continue
             metadata = file_metadata.get(input_path)
             component = _component_for_file_pydantic(input_path, metadata, issues)
             bom.components.add(component)

Also applies to: 436-437


Comment on lines +791 to +793
def test_scan_huggingface_streaming_sbom_includes_streamed_assets(
    mock_scan_streaming, mock_download_streaming, mock_is_hf_url, tmp_path
):

🛠️ Refactor suggestion | 🟠 Major

Add the required test annotations.

These new tests are missing the repo-required -> None return annotations and tmp_path: Path parameter typing.

As per coding guidelines, "Use type hints -> None on all test methods and tmp_path: Path / monkeypatch: pytest.MonkeyPatch on test parameters".

Also applies to: 868-870, 919-919


Comment on lines +855 to +861
for file_path in streamed_files:
    component = components[file_path.name]
    properties = {prop["name"]: prop["value"] for prop in component["properties"]}

    assert properties["size"] == str(file_sizes[str(file_path)])
    assert component["hashes"][0]["alg"] == "SHA-256"
    assert component["hashes"][0]["content"] == file_hashes[str(file_path)]

⚠️ Potential issue | 🟡 Minor

Don't couple the test to hash ordering.

component["hashes"][0] makes this test fail if SBOM generation adds another hash or reorders the list while still emitting the correct SHA-256. Match the hash by alg instead.

Suggested assertion change
-        assert component["hashes"][0]["alg"] == "SHA-256"
-        assert component["hashes"][0]["content"] == file_hashes[str(file_path)]
+        sha256_hash = next(hash_entry for hash_entry in component["hashes"] if hash_entry["alg"] == "SHA-256")
+        assert sha256_hash["content"] == file_hashes[str(file_path)]

@mldangelo changed the title from "fix: preserve Hugging Face artifacts in SBOM output" to "fix: preserve streamed SBOM metadata and exclude HF cache files" on Mar 10, 2026

@coderabbitai bot left a comment


♻️ Duplicate comments (2)
tests/test_cli.py (2)

854-860: ⚠️ Potential issue | 🟡 Minor

Don't couple this assertion to hash ordering.

component["hashes"][0] will fail if SBOM generation adds another hash or reorders the list while still emitting the correct SHA-256. Match the entry by alg instead.

Suggested assertion change
-        assert component["hashes"][0]["alg"] == "SHA-256"
-        assert component["hashes"][0]["content"] == file_hashes[str(file_path)]
+        sha256_hash = next(
+            hash_entry for hash_entry in component["hashes"] if hash_entry["alg"] == "SHA-256"
+        )
+        assert sha256_hash["content"] == file_hashes[str(file_path)]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_cli.py` around lines 854 - 860, The assertion is brittle because
it assumes the SHA-256 hash is always at index 0; change the check to locate the
hash entry by its "alg" value instead of relying on ordering: inside the loop
over streamed_files and using component = components[file_path.name], find the
dict in component["hashes"] where hash_entry["alg"] == "SHA-256" (e.g., with a
generator/loop or list comprehension) and then assert that that entry's
"content" equals file_hashes[str(file_path)]; keep the existing size assertion
unchanged.

791-793: 🛠️ Refactor suggestion | 🟠 Major

Add the required test annotations.

These new tests are still missing the repo-required -> None return annotation and tmp_path: Path typing.

As per coding guidelines, "Use type hints -> None on all test methods and tmp_path: Path / monkeypatch: pytest.MonkeyPatch on test parameters".

Also applies to: 867-869, 918-918

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_cli.py` around lines 791 - 793, The test function
test_scan_huggingface_streaming_sbom_includes_streamed_assets is missing type
hints: add a return annotation "-> None" and annotate the tmp_path parameter as
"tmp_path: Path"; ensure Path is imported (from pathlib import Path) at the top
of the test module. Apply the same pattern to the other test functions that lack
annotations (also missing "-> None" and tmp_path: Path / monkeypatch:
pytest.MonkeyPatch where applicable) so all test signatures follow the repo
guideline.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d9291b04-d81e-4727-8062-f3ffe4e9e785

📥 Commits

Reviewing files that changed from the base of the PR and between 00bb261 and 89e6aeb.

📒 Files selected for processing (2)
  • modelaudit/cli.py
  • tests/test_cli.py

@mldangelo merged commit 49c7eca into main on Mar 10, 2026
28 checks passed
@mldangelo deleted the fix/streamed-sbom-artifacts branch on March 10, 2026 at 16:35