
fix: preserve streamed SBOM metadata and exclude HF cache files #673

Merged
mldangelo merged 2 commits into main from fix/streamed-sbom-artifacts on Mar 10, 2026


Conversation


@mldangelo (Member) commented on Mar 10, 2026

Follow-up to #672.

Summary

  • preserve per-component size and SHA-256 metadata for streamed SBOM entries even after streamed files are deleted
  • keep generating streamed SBOMs from the scanned streamed artifact list
  • exclude Hugging Face cache bookkeeping files like .metadata and .gitignore from remote SBOMs and asset lists
  • preserve real downloaded artifacts such as merges.txt in non-streamed Hugging Face SBOMs
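In sketch form, the preservation in the first bullet amounts to capturing size and hash while the streamed file is still on disk. The function name and metadata keys ("file_size", "sha256") here are illustrative assumptions, not modelaudit's actual internals:

```python
import hashlib
import os

def record_streamed_metadata(path: str, metadata: dict) -> dict:
    """Capture size and SHA-256 while the streamed file still exists,
    so SBOM components stay populated after the file is deleted.
    (Hypothetical helper; names are illustrative, not modelaudit's API.)"""
    with open(path, "rb") as fh:
        data = fh.read()
    metadata[path] = {
        "file_size": os.path.getsize(path),
        "sha256": hashlib.sha256(data).hexdigest(),
    }
    return metadata
```

Once the streamed file is deleted, the SBOM generator can still read the recorded size and digest from this map instead of re-statting a path that no longer exists.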

Validation

  • uv run ruff format modelaudit/ tests/
  • uv run ruff check --fix modelaudit/ tests/
  • uv run mypy modelaudit/
  • uv run pytest -n auto -m "not slow and not integration" --maxfail=1

End-to-end checks

  • uv run modelaudit scan --stream --quiet --format sarif --output .../recheck-gpt2-stream.sarif --sbom .../recheck-gpt2-stream.sbom.json hf://openai-community/gpt2
    • result: 24 SBOM components, no zero-size components, no missing hashes, SARIF results = 0
  • uv run modelaudit scan --quiet --format json --output .../recheck-tiny-full.json --sbom .../recheck-tiny-full.sbom.json hf://sshleifer/tiny-gpt2
    • result: merges.txt present in the SBOM and no .metadata / .gitignore cache components
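The "no zero-size components, no missing hashes" result above can be spot-checked programmatically. This sketch assumes a CycloneDX-style component layout (a size property plus a hashes list), matching the tests in this PR; it is a validation aid, not part of modelaudit:

```python
def check_sbom(sbom: dict) -> bool:
    """Assert every SBOM component has a nonzero size property and a
    SHA-256 hash entry; return True when all components pass."""
    components = sbom.get("components", [])
    assert components, "SBOM has no components"
    for component in components:
        # Properties are name/value pairs; collect them into a dict.
        props = {p["name"]: p["value"] for p in component.get("properties", [])}
        assert int(props.get("size", "0")) > 0, f"zero-size component: {component.get('name')}"
        # Match the hash by algorithm rather than by list position.
        algs = {h["alg"] for h in component.get("hashes", [])}
        assert "SHA-256" in algs, f"missing SHA-256 hash: {component.get('name')}"
    return True
```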

Notes

The full pytest command hit an unrelated, one-off timing failure in tests/test_cache_optimizations.py::TestCacheOptimizationPerformance::test_configuration_extraction_performance on the first run. The isolated test passed immediately, and the full command passed on rerun.

Summary by CodeRabbit

  • Bug Fixes

    • Streamed artifacts are now correctly included in SBOMs when using streaming scans.
    • HuggingFace download cache bookkeeping files are excluded from SBOMs, asset lists, and directory scans.
    • SBOM components now reliably surface accurate file sizes and SHA-256 hashes.
  • Tests

    • Expanded tests to cover streamed SBOMs, per-file metadata, and exclusion of cache bookkeeping files.

Fix streamed SBOM generation so deleted streamed artifacts still retain component size and SHA-256 metadata.

Also exclude Hugging Face cache bookkeeping files from remote SBOMs and asset lists while preserving real downloaded artifacts like merges.txt.

coderabbitai bot commented Mar 10, 2026

Walkthrough

This PR fixes SBOM generation with --stream by selecting streamed asset paths for SBOMs, skipping HuggingFace cache bookkeeping files during scans, enriching per-file metadata with size and sha256, and adding helpers to resolve component size/hash and filter irrelevant SBOM files.

Changes

Changes by cohort / file(s):

  • CLI SBOM path selection (modelaudit/cli.py): use deduplicated streamed asset paths as the SBOM paths when assets exist and final_scan_and_delete is true; otherwise fall back to the scanned/expanded paths.
  • Core scanning & streaming merge (modelaudit/core.py): skip HuggingFace download-cache bookkeeping files during directory iteration (with debug logs). When merging streamed scan results, inject per-file file_size, ensure sha256 exists in file_hashes, and attach file_metadata to merged results.
  • SBOM generation helpers & filtering (modelaudit/integrations/sbom_generator.py): add _resolve_component_size_and_sha256(path, metadata) to prefer on-disk size/hash with metadata fallback, and _should_skip_sbom_file(path) to filter cache/metadata/lock/git-ref files. Use these helpers in both the Pydantic and non-Pydantic component generation flows and skip irrelevant files early.
  • Tests (tests/test_cli.py): extend the mock scan helper to accept file_metadata; add and adjust tests to validate that streamed assets appear as SBOM components with correct size/sha256 and that HuggingFace cache bookkeeping files are excluded in streamed and directory scans.
  • Changelog (CHANGELOG.md): document the fixes: include streamed artifacts as SBOM components with --stream --sbom, and exclude HuggingFace download cache bookkeeping files from remote SBOMs and asset lists.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I hopped through streams and chased each file,

sizes and sha256s now sitting in a pile.
Cache crumbs I skipped with a twinkle and grin,
SBOMs now list the true artifacts within.
Hooray — the rabbit fixed the pipeline spin! 🥕

🚥 Pre-merge checks: ✅ 5 passed
  • Description Check: passed (check skipped; CodeRabbit's high-level summary is enabled).
  • Title Check: passed. The title clearly describes the main changes: preserving streamed SBOM metadata and excluding HuggingFace cache files, which are the core fixes in this PR.
  • Linked Issues Check: passed. The code changes fully address the issue #671 requirements: preserve streamed artifact metadata with size/SHA-256, generate SBOMs from scanned assets, and exclude HuggingFace cache bookkeeping files.
  • Out of Scope Changes Check: passed. All changes are directly related to fixing the SBOM streaming issue and HuggingFace cache handling; no unrelated functionality was introduced.
  • Docstring Coverage: passed. Docstring coverage is 81.25%, which meets the required threshold of 80.00%.



@coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelaudit/core.py`:
- Around line 666-670: The skip for HuggingFace cache files is applied too
broadly because _is_huggingface_cache_file(file_path) matches names like "main"
or "HEAD" regardless of location; change the check so we only skip when the file
is actually inside a HuggingFace cache directory. Update the branch where
file_path is tested (the code calling _is_huggingface_cache_file) to require
both that _is_huggingface_cache_file(file_path) is true and that the path is
inside a HF cache root (e.g., inspect pathlib.Path(file_path).parents for known
cache markers like ".cache/huggingface" or implement a helper
_is_within_hf_cache_dir(file_path) and call that together with
_is_huggingface_cache_file), or alternatively modify _is_huggingface_cache_file
to perform the parent-directory check itself; only then log and continue.

In `@modelaudit/integrations/sbom_generator.py`:
- Around line 394-395: The SBOM skip predicate (_should_skip_sbom_file) is only
applied when walking directories, so the flat list paths_for_sbom and the
single-file/else branches still include .metadata/.gitignore/lock files; update
the code paths that iterate over paths_for_sbom and the single-file emission
branch to call _should_skip_sbom_file(fp) (or the local file variable) and
continue when it returns True, ensuring the same skip behavior for streamed/flat
inputs as for directory walks.

In `@tests/test_cli.py`:
- Around line 791-793: The test function
test_scan_huggingface_streaming_sbom_includes_streamed_assets is missing the
required typing: add a return type annotation "-> None" and annotate the
tmp_path parameter as "tmp_path: Path" (import Path from pathlib if not
already). Apply the same change (add "-> None" and type tmp_path as Path or
monkeypatch as pytest.MonkeyPatch where appropriate) to the other test functions
flagged in the review so all tests follow the repo convention.
- Around line 855-861: The test currently assumes SHA-256 is at index 0 via
component["hashes"][0], which breaks if hashes are reordered or expanded; update
the loop in tests/test_cli.py that iterates streamed_files (using variables
file_path, component, properties) to locate the hash entry with alg == "SHA-256"
(e.g. use a generator/filter or next(...) to find the matching dict) and assert
its "content" equals file_hashes[str(file_path)]; also fail the test with a
clear message if no SHA-256 entry is found instead of indexing into position 0.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: b93b86ae-ea1f-47ea-9d1b-1aed3d2b97e2

📥 Commits

Reviewing files that changed from the base of the PR and between 9d77d50 and 00bb261.

📒 Files selected for processing (5)
  • CHANGELOG.md
  • modelaudit/cli.py
  • modelaudit/core.py
  • modelaudit/integrations/sbom_generator.py
  • tests/test_cli.py

Comment on lines +666 to +670
# HuggingFace cache bookkeeping files should never surface as
# scan assets or SBOM components for downloaded models.
if _is_huggingface_cache_file(file_path):
    logger.debug(f"Skipping HuggingFace cache file: {file_path}")
    continue

⚠️ Potential issue | 🟠 Major

Don't apply the Hugging Face cache skip globally.

_is_huggingface_cache_file() matches names like main and HEAD regardless of location. Because this new branch runs for every directory scan, a normal local artifact with one of those names now disappears from scanning and SBOM generation.

Proposed fix
-                    if _is_huggingface_cache_file(file_path):
+                    if is_hf_cache and _is_huggingface_cache_file(file_path):
                         logger.debug(f"Skipping HuggingFace cache file: {file_path}")
                         continue

Comment on lines +394 to +395
if _should_skip_sbom_file(fp):
    continue

⚠️ Potential issue | 🟠 Major

Apply the SBOM skip filter to single-file inputs too.

This only filters directory walks. In the new streamed flow, paths_for_sbom is a flat list of asset files, so the else branches below still emit .metadata/.gitignore/lock files if one reaches the asset list.

Proposed fix
         else:
+            if _should_skip_sbom_file(input_path):
+                continue
             meta_model = file_meta.get(input_path)
             # Convert Pydantic model to dict if needed
             if meta_model is not None and hasattr(meta_model, "model_dump"):
                 meta = meta_model.model_dump()
             else:
                 meta = meta_model or {}
             component = _component_for_file(input_path, meta, issues_dicts)
             bom.components.add(component)
         else:
+            if _should_skip_sbom_file(input_path):
+                continue
             metadata = file_metadata.get(input_path)
             component = _component_for_file_pydantic(input_path, metadata, issues)
             bom.components.add(component)

Also applies to: 436-437


Comment on lines +791 to +793
def test_scan_huggingface_streaming_sbom_includes_streamed_assets(
    mock_scan_streaming, mock_download_streaming, mock_is_hf_url, tmp_path
):

🛠️ Refactor suggestion | 🟠 Major

Add the required test annotations.

These new tests are missing the repo-required -> None return annotations and tmp_path: Path parameter typing.

As per coding guidelines, "Use type hints -> None on all test methods and tmp_path: Path / monkeypatch: pytest.MonkeyPatch on test parameters".

Also applies to: 868-870, 919-919


Comment on lines +855 to +861
for file_path in streamed_files:
    component = components[file_path.name]
    properties = {prop["name"]: prop["value"] for prop in component["properties"]}

    assert properties["size"] == str(file_sizes[str(file_path)])
    assert component["hashes"][0]["alg"] == "SHA-256"
    assert component["hashes"][0]["content"] == file_hashes[str(file_path)]

⚠️ Potential issue | 🟡 Minor

Don't couple the test to hash ordering.

component["hashes"][0] makes this test fail if SBOM generation adds another hash or reorders the list while still emitting the correct SHA-256. Match the hash by alg instead.

Suggested assertion change
-        assert component["hashes"][0]["alg"] == "SHA-256"
-        assert component["hashes"][0]["content"] == file_hashes[str(file_path)]
+        sha256_hash = next(hash_entry for hash_entry in component["hashes"] if hash_entry["alg"] == "SHA-256")
+        assert sha256_hash["content"] == file_hashes[str(file_path)]

@mldangelo changed the title from "fix: preserve Hugging Face artifacts in SBOM output" to "fix: preserve streamed SBOM metadata and exclude HF cache files" on Mar 10, 2026

@coderabbitai bot left a comment


♻️ Duplicate comments (2)
tests/test_cli.py (2)

854-860: ⚠️ Potential issue | 🟡 Minor

Don't couple this assertion to hash ordering.

component["hashes"][0] will fail if SBOM generation adds another hash or reorders the list while still emitting the correct SHA-256. Match the entry by alg instead.

Suggested assertion change
-        assert component["hashes"][0]["alg"] == "SHA-256"
-        assert component["hashes"][0]["content"] == file_hashes[str(file_path)]
+        sha256_hash = next(
+            hash_entry for hash_entry in component["hashes"] if hash_entry["alg"] == "SHA-256"
+        )
+        assert sha256_hash["content"] == file_hashes[str(file_path)]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_cli.py` around lines 854 - 860, The assertion is brittle because
it assumes the SHA-256 hash is always at index 0; change the check to locate the
hash entry by its "alg" value instead of relying on ordering: inside the loop
over streamed_files and using component = components[file_path.name], find the
dict in component["hashes"] where hash_entry["alg"] == "SHA-256" (e.g., with a
generator/loop or list comprehension) and then assert that that entry's
"content" equals file_hashes[str(file_path)]; keep the existing size assertion
unchanged.

791-793: 🛠️ Refactor suggestion | 🟠 Major

Add the required test annotations.

These new tests are still missing the repo-required -> None return annotation and tmp_path: Path typing.

As per coding guidelines, "Use type hints -> None on all test methods and tmp_path: Path / monkeypatch: pytest.MonkeyPatch on test parameters".

Also applies to: 867-869, 918-918

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_cli.py` around lines 791 - 793, The test function
test_scan_huggingface_streaming_sbom_includes_streamed_assets is missing type
hints: add a return annotation "-> None" and annotate the tmp_path parameter as
"tmp_path: Path"; ensure Path is imported (from pathlib import Path) at the top
of the test module. Apply the same pattern to the other test functions that lack
annotations (also missing "-> None" and tmp_path: Path / monkeypatch:
pytest.MonkeyPatch where applicable) so all test signatures follow the repo
guideline.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d9291b04-d81e-4727-8062-f3ffe4e9e785

📥 Commits

Reviewing files that changed from the base of the PR and between 00bb261 and 89e6aeb.

📒 Files selected for processing (2)
  • modelaudit/cli.py
  • tests/test_cli.py

@mldangelo merged commit 49c7eca into main on Mar 10, 2026
28 checks passed
@mldangelo deleted the fix/streamed-sbom-artifacts branch on March 10, 2026 at 16:35