fix: preserve streamed SBOM metadata and exclude HF cache files #673
Conversation
Fix streamed SBOM generation so deleted streamed artifacts still retain component size and SHA-256 metadata. Also exclude Hugging Face cache bookkeeping files from remote SBOMs and asset lists while preserving real downloaded artifacts like merges.txt.
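The first fix can be pictured with a short sketch. This is illustrative only — the helper name and return shape are hypothetical, not modelaudit's actual API: the point is to compute size and SHA-256 while the streamed file still exists, so the SBOM component keeps both fields after cleanup.

```python
import hashlib
from pathlib import Path

def capture_stream_metadata(path: Path) -> dict:
    # Record size and SHA-256 *before* the streamed artifact is deleted,
    # so SBOM components can still carry the metadata afterwards.
    data = path.read_bytes()
    return {
        "size": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }
```

The dict survives independently of the file, so deleting the streamed artifact afterwards no longer loses the component metadata.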
Walkthrough

This PR fixes SBOM generation.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelaudit/core.py`:
- Around line 666-670: The skip for HuggingFace cache files is applied too
broadly because _is_huggingface_cache_file(file_path) matches names like "main"
or "HEAD" regardless of location; change the check so we only skip when the file
is actually inside a HuggingFace cache directory. Update the branch where
file_path is tested (the code calling _is_huggingface_cache_file) to require
both that _is_huggingface_cache_file(file_path) is true and that the path is
inside a HF cache root (e.g., inspect pathlib.Path(file_path).parents for known
cache markers like ".cache/huggingface" or implement a helper
_is_within_hf_cache_dir(file_path) and call that together with
_is_huggingface_cache_file), or alternatively modify _is_huggingface_cache_file
to perform the parent-directory check itself; only then log and continue.
In `@modelaudit/integrations/sbom_generator.py`:
- Around line 394-395: The SBOM skip predicate (_should_skip_sbom_file) is only
applied when walking directories, so the flat list paths_for_sbom and the
single-file/else branches still include .metadata/.gitignore/lock files; update
the code paths that iterate over paths_for_sbom and the single-file emission
branch to call _should_skip_sbom_file(fp) (or the local file variable) and
continue when it returns True, ensuring the same skip behavior for streamed/flat
inputs as for directory walks.
In `@tests/test_cli.py`:
- Around line 791-793: The test function
test_scan_huggingface_streaming_sbom_includes_streamed_assets is missing the
required typing: add a return type annotation "-> None" and annotate the
tmp_path parameter as "tmp_path: Path" (import Path from pathlib if not
already). Apply the same change (add "-> None" and type tmp_path as Path or
monkeypatch as pytest.MonkeyPatch where appropriate) to the other test functions
flagged in the review so all tests follow the repo convention.
- Around line 855-861: The test currently assumes SHA-256 is at index 0 via
component["hashes"][0], which breaks if hashes are reordered or expanded; update
the loop in tests/test_cli.py that iterates streamed_files (using variables
file_path, component, properties) to locate the hash entry with alg == "SHA-256"
(e.g. use a generator/filter or next(...) to find the matching dict) and assert
its "content" equals file_hashes[str(file_path)]; also fail the test with a
clear message if no SHA-256 entry is found instead of indexing into position 0.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: b93b86ae-ea1f-47ea-9d1b-1aed3d2b97e2
📒 Files selected for processing (5)
- CHANGELOG.md
- modelaudit/cli.py
- modelaudit/core.py
- modelaudit/integrations/sbom_generator.py
- tests/test_cli.py
```python
# HuggingFace cache bookkeeping files should never surface as
# scan assets or SBOM components for downloaded models.
if _is_huggingface_cache_file(file_path):
    logger.debug(f"Skipping HuggingFace cache file: {file_path}")
    continue
```
Don't apply the Hugging Face cache skip globally.
_is_huggingface_cache_file() matches names like main and HEAD regardless of location. Because this new branch runs for every directory scan, a normal local artifact with one of those names now disappears from scanning and SBOM generation.
Proposed fix:

```diff
- if _is_huggingface_cache_file(file_path):
+ if is_hf_cache and _is_huggingface_cache_file(file_path):
      logger.debug(f"Skipping HuggingFace cache file: {file_path}")
      continue
```
```python
if _should_skip_sbom_file(fp):
    continue
```
Apply the SBOM skip filter to single-file inputs too.
This only filters directory walks. In the new streamed flow, paths_for_sbom is a flat list of asset files, so the else branches below still emit .metadata/.gitignore/lock files if one reaches the asset list.
Proposed fix
else:
+ if _should_skip_sbom_file(input_path):
+ continue
meta_model = file_meta.get(input_path)
# Convert Pydantic model to dict if needed
if meta_model is not None and hasattr(meta_model, "model_dump"):
meta = meta_model.model_dump()
else:
meta = meta_model or {}
component = _component_for_file(input_path, meta, issues_dicts)
bom.components.add(component) else:
+ if _should_skip_sbom_file(input_path):
+ continue
metadata = file_metadata.get(input_path)
component = _component_for_file_pydantic(input_path, metadata, issues)
bom.components.add(component)Also applies to: 436-437
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelaudit/integrations/sbom_generator.py` around lines 394 - 395, The SBOM
skip predicate (_should_skip_sbom_file) is only applied when walking
directories, so the flat list paths_for_sbom and the single-file/else branches
still include .metadata/.gitignore/lock files; update the code paths that
iterate over paths_for_sbom and the single-file emission branch to call
_should_skip_sbom_file(fp) (or the local file variable) and continue when it
returns True, ensuring the same skip behavior for streamed/flat inputs as for
directory walks.
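The intent of applying one skip rule to both directory walks and flat streamed lists can be sketched as follows. The skip names and suffixes here are assumptions based on the files named in the review, and `should_skip_sbom_file` is a stand-in for the project's actual `_should_skip_sbom_file`:

```python
from pathlib import Path

SKIP_NAMES = {".gitignore"}              # assumed bookkeeping file names
SKIP_SUFFIXES = {".metadata", ".lock"}   # assumed bookkeeping suffixes

def should_skip_sbom_file(path: str) -> bool:
    p = Path(path)
    return p.name in SKIP_NAMES or p.suffix in SKIP_SUFFIXES

def filter_sbom_paths(paths: list[str]) -> list[str]:
    # Apply the same predicate to streamed/flat inputs as to directory walks.
    return [p for p in paths if not should_skip_sbom_file(p)]
```

Routing every emission path through one predicate keeps the directory-walk and single-file branches from drifting apart again.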
```python
def test_scan_huggingface_streaming_sbom_includes_streamed_assets(
    mock_scan_streaming, mock_download_streaming, mock_is_hf_url, tmp_path
):
```
🛠️ Refactor suggestion | 🟠 Major
Add the required test annotations.
These new tests are missing the repo-required `-> None` return annotations and `tmp_path: Path` parameter typing.
As per coding guidelines, "Use type hints `-> None` on all test methods and `tmp_path: Path` / `monkeypatch: pytest.MonkeyPatch` on test parameters."
Also applies to: 868-870, 919-919
```python
for file_path in streamed_files:
    component = components[file_path.name]
    properties = {prop["name"]: prop["value"] for prop in component["properties"]}

    assert properties["size"] == str(file_sizes[str(file_path)])
    assert component["hashes"][0]["alg"] == "SHA-256"
    assert component["hashes"][0]["content"] == file_hashes[str(file_path)]
```
Don't couple the test to hash ordering.
component["hashes"][0] makes this test fail if SBOM generation adds another hash or reorders the list while still emitting the correct SHA-256. Match the hash by alg instead.
Suggested assertion change:

```diff
- assert component["hashes"][0]["alg"] == "SHA-256"
- assert component["hashes"][0]["content"] == file_hashes[str(file_path)]
+ sha256_hash = next(hash_entry for hash_entry in component["hashes"] if hash_entry["alg"] == "SHA-256")
+ assert sha256_hash["content"] == file_hashes[str(file_path)]
```
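Combining both parts of the suggestion (order-independent lookup plus a clear failure message when no SHA-256 entry exists), a hedged sketch with a hypothetical helper name:

```python
def find_sha256(hashes: list[dict]) -> str:
    # Locate the SHA-256 entry by algorithm instead of by list position,
    # and fail with a readable message rather than an IndexError.
    entry = next((h for h in hashes if h["alg"] == "SHA-256"), None)
    assert entry is not None, f"no SHA-256 hash entry in {hashes!r}"
    return entry["content"]
```

Using `next(..., None)` plus an explicit assertion avoids both the brittle `[0]` index and an opaque `StopIteration` in the test report.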
♻️ Duplicate comments (2)
tests/test_cli.py (2)
854-860: ⚠️ Potential issue | 🟡 Minor

Don't couple this assertion to hash ordering. `component["hashes"][0]` will fail if SBOM generation adds another hash or reorders the list while still emitting the correct SHA-256. Match the entry by `alg` instead.

Suggested assertion change:

```diff
- assert component["hashes"][0]["alg"] == "SHA-256"
- assert component["hashes"][0]["content"] == file_hashes[str(file_path)]
+ sha256_hash = next(
+     hash_entry for hash_entry in component["hashes"] if hash_entry["alg"] == "SHA-256"
+ )
+ assert sha256_hash["content"] == file_hashes[str(file_path)]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_cli.py` around lines 854 - 860, The assertion is brittle because it assumes the SHA-256 hash is always at index 0; change the check to locate the hash entry by its "alg" value instead of relying on ordering: inside the loop over streamed_files and using component = components[file_path.name], find the dict in component["hashes"] where hash_entry["alg"] == "SHA-256" (e.g., with a generator/loop or list comprehension) and then assert that that entry's "content" equals file_hashes[str(file_path)]; keep the existing size assertion unchanged.
791-793: 🛠️ Refactor suggestion | 🟠 Major

Add the required test annotations. These new tests are still missing the repo-required `-> None` return annotation and `tmp_path: Path` typing. As per coding guidelines, "Use type hints `-> None` on all test methods and `tmp_path: Path` / `monkeypatch: pytest.MonkeyPatch` on test parameters."

Also applies to: 867-869, 918-918
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_cli.py` around lines 791 - 793, The test function test_scan_huggingface_streaming_sbom_includes_streamed_assets is missing type hints: add a return annotation "-> None" and annotate the tmp_path parameter as "tmp_path: Path"; ensure Path is imported (from pathlib import Path) at the top of the test module. Apply the same pattern to the other test functions that lack annotations (also missing "-> None" and tmp_path: Path / monkeypatch: pytest.MonkeyPatch where applicable) so all test signatures follow the repo guideline.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: d9291b04-d81e-4727-8062-f3ffe4e9e785
📒 Files selected for processing (2)
- modelaudit/cli.py
- tests/test_cli.py
Follow-up to #672.
Summary
- Preserve `size` and SHA-256 metadata for streamed SBOM entries even after streamed files are deleted
- Exclude `.metadata` and `.gitignore` from remote SBOMs and asset lists
- Keep `merges.txt` in non-streamed Hugging Face SBOMs

Validation
- `uv run ruff format modelaudit/ tests/`
- `uv run ruff check --fix modelaudit/ tests/`
- `uv run mypy modelaudit/`
- `uv run pytest -n auto -m "not slow and not integration" --maxfail=1`

End-to-end checks
- `uv run modelaudit scan --stream --quiet --format sarif --output .../recheck-gpt2-stream.sarif --sbom .../recheck-gpt2-stream.sbom.json hf://openai-community/gpt2`
- `uv run modelaudit scan --quiet --format json --output .../recheck-tiny-full.json --sbom .../recheck-tiny-full.sbom.json hf://sshleifer/tiny-gpt2`
- Verified `merges.txt` present in the SBOM and no `.metadata`/`.gitignore` cache components

Notes
The full pytest command hit an unrelated one-off timing failure in `tests/test_cache_optimizations.py::TestCacheOptimizationPerformance::test_configuration_extraction_performance` on the first run; the isolated test passed immediately and the full command passed on rerun.

Summary by CodeRabbit
Bug Fixes
Tests