Filter out HNS directories from listing #149
Conversation
Greptile SummaryThis PR fixes a bug where ADLS/HNS directory stubs — blobs without a trailing
Confidence Score: 5/5Safe to merge; the new filtering logic is correct, the None-safety guard is properly short-circuited, and the prior self-parameter crash has been resolved. The core change — threading include=[metadata] into both listing calls and checking hdi_isfolder via a module-level helper — is logically sound and well-tested. The previously flagged self parameter crash is fixed, and the None guard is correct. The only remaining note is a defensive one about the size == 0 co-condition, which is not a current defect given that ADLS directory stubs reliably have size 0. No files require special attention; both changed files are straightforward and the test coverage for the new helper is adequate. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[list_files called] --> B{recursive?}
B -- yes --> C[list_blobs with include=metadata]
B -- no --> D[walk_blobs with include=metadata]
C --> E[for each item]
D --> F[for each item]
E --> G{hasattr name AND NOT _is_adls_directory}
F --> H{hasattr name AND NOT endswith slash AND NOT _is_adls_directory}
G -- pass --> I[append to paths]
H -- pass --> I
G -- filtered --> J[skip item]
H -- filtered --> J
I --> K[_filter_ignore paths asterisk-slash]
K --> L{allow_pattern?}
L -- yes --> M[_filter_allow]
L -- no --> N{ignore_pattern?}
M --> N
N -- yes --> O[_filter_ignore]
N -- no --> P[return results]
O --> P
Reviews (6): Last reviewed commit: "addressed comments" | Re-trigger Greptile |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
Walkthroughlist_files now requests blob metadata for both recursive and non-recursive listings and excludes ADLS directory stubs by checking zero-byte blobs with ChangesADLS Directory Filtering in File Listing
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
py/runai_model_streamer_azure/runai_model_streamer_azure/files/files.py (1)
176-190: ⚡ Quick winAdd regression coverage for zero-byte files vs. HNS directory stubs.
This branch now changes download eligibility based on
size == 0plus metadata. Please add focused tests for both recursive and non-recursive listing so we lock in that ordinary zero-byte files are kept whilehdi_isfolder=truestubs are excluded.Also applies to: 225-235
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@py/runai_model_streamer_azure/runai_model_streamer_azure/files/files.py` around lines 176 - 190, The listing logic now excludes ADLS directory stubs based on zero size plus metadata; add regression tests that cover both recursive (code path using container_client.list_blobs) and non‑recursive (container_client.walk_blobs) flows to assert that ordinary zero‑byte files (size == 0 without hdi_isfolder metadata) are returned while ADLS directory stubs (size == 0 with metadata hdi_isfolder=true) are excluded; create fixtures/mocked BlobProperties and BlobPrefix items for both cases and assert behavior for the functions that call _is_adls_directory and the two listing branches (the block referencing container_client.list_blobs and the similar block around container_client.walk_blobs, also mirrored in the second occurrence around the 225-235 region).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@py/runai_model_streamer_azure/runai_model_streamer_azure/files/files.py`:
- Around line 225-235: The helper function _is_adls_directory is defined with an
extra self parameter but is called with a single blob argument (e.g., from
glob() and pull_files()), causing a TypeError; change the signature of
_is_adls_directory to accept only a single parameter (blob: BlobProperties) and
keep the existing body intact so callers like glob() and pull_files() can pass
the blob without error.
---
Nitpick comments:
In `@py/runai_model_streamer_azure/runai_model_streamer_azure/files/files.py`:
- Around line 176-190: The listing logic now excludes ADLS directory stubs based
on zero size plus metadata; add regression tests that cover both recursive (code
path using container_client.list_blobs) and non‑recursive
(container_client.walk_blobs) flows to assert that ordinary zero‑byte files
(size == 0 without hdi_isfolder metadata) are returned while ADLS directory
stubs (size == 0 with metadata hdi_isfolder=true) are excluded; create
fixtures/mocked BlobProperties and BlobPrefix items for both cases and assert
behavior for the functions that call _is_adls_directory and the two listing
branches (the block referencing container_client.list_blobs and the similar
block around container_client.walk_blobs, also mirrored in the second occurrence
around the 225-235 region).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: a1db58a2-5a33-4e8f-b3aa-b0a9b8d7c75c
📒 Files selected for processing (1)
py/runai_model_streamer_azure/runai_model_streamer_azure/files/files.py
|
Just more of a heads up on this PR @noa-neria @hasethuraman ... @wonwuakpa-msft is on my team and this PR is aimed to address this issue encountered when trying to load from Azure Blob Storage accounts that have Azure Data Lake Storage (ADLS) Gen 2 enabled on them. Effectively, vLLM + model streamer crashes currently because ADLS Gen 2 returns directories back that look like files when listing blobs and will attempt to write those directories as files. I'll take lead at reviewing this and let you both know when it is ready for review. |
kyleknap
left a comment
There was a problem hiding this comment.
This is looking good. I just had a few smaller suggestions.
Looking more through the code, I'm thinking we can add tests in this file: https://github.com/run-ai/runai-model-streamer/blob/master/py/runai_model_streamer_azure/runai_model_streamer_azure/files/tests/test_files.py for now. Unfortunately, because azurite does not support ADLS: Azure/Azurite#553 we cannot necessarily spin up azurite enabled with ADLS to get the test coverage here. So, the next best option I'm thinking is we mock the BlobServiceClient and inject it into list_files to simulate the different test cases/behaviors of ADLS.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
py/runai_model_streamer_azure/runai_model_streamer_azure/files/tests/test_files.py (1)
60-72: ⚡ Quick winConsider testing the capitalized "Hdi_isfolder" metadata variant.
The production code's
_is_adls_directoryfunction checks bothhdi_isfolderandHdi_isfolder(capitalized) metadata keys, but this test only covers the lowercase variant. Adding a test blob with{"Hdi_isfolder": "true"}would ensure complete coverage of the ADLS directory detection logic.📋 Suggested test case addition
Add a test blob to the test_blobs list:
test_blobs = [ self.make_blob("file1.txt"), self.make_blob("file2.txt"), self.make_blob("dir1", size=0, metadata={"hdi_isfolder": "true"}), # ADLS directory stub + self.make_blob("dir1_cap", size=0, metadata={"Hdi_isfolder": "true"}), # ADLS directory stub (capitalized) self.make_blob("dir2"), self.make_blob("dir3/", size=0, metadata={"hdi_isfolder": "false"}), ]The expected result remains
["file1.txt", "file2.txt", "dir2"]sincedir1_capshould also be filtered out.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@py/runai_model_streamer_azure/runai_model_streamer_azure/files/tests/test_files.py` around lines 60 - 72, Add a test blob exercising the capitalized ADLS metadata key so the ADLS-directory detection logic is covered: in test_listfiles_recursive add a blob via make_blob (e.g. name "dir1_cap" or similar) with size=0 and metadata={"Hdi_isfolder":"true"} to the test_blobs list, keep client = self.make_mock_blob_client(..., recursive=True) and the call to files.list_files(..., recursive=True) unchanged, and keep the expected assertion result as ["file1.txt", "file2.txt", "dir2"] so the new ADLS-stub is also filtered out by the production _is_adls_directory logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In
`@py/runai_model_streamer_azure/runai_model_streamer_azure/files/tests/test_files.py`:
- Around line 60-72: Add a test blob exercising the capitalized ADLS metadata
key so the ADLS-directory detection logic is covered: in
test_listfiles_recursive add a blob via make_blob (e.g. name "dir1_cap" or
similar) with size=0 and metadata={"Hdi_isfolder":"true"} to the test_blobs
list, keep client = self.make_mock_blob_client(..., recursive=True) and the call
to files.list_files(..., recursive=True) unchanged, and keep the expected
assertion result as ["file1.txt", "file2.txt", "dir2"] so the new ADLS-stub is
also filtered out by the production _is_adls_directory logic.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 08636b09-9d6d-4a87-8d5a-d4c47c089d25
📒 Files selected for processing (2)
py/runai_model_streamer_azure/runai_model_streamer_azure/files/files.pypy/runai_model_streamer_azure/runai_model_streamer_azure/files/tests/test_files.py
🚧 Files skipped from review as they are similar to previous changes (1)
- py/runai_model_streamer_azure/runai_model_streamer_azure/files/files.py
kyleknap
left a comment
There was a problem hiding this comment.
Looks good. I just had a few more smaller comments.
kyleknap
left a comment
There was a problem hiding this comment.
Just one last small comment. Otherwise, this looks good to me. I was also able to confirm that we are now able to load weights from an Azure storage account with ADLS Gen 2 enabled.
kyleknap
left a comment
There was a problem hiding this comment.
Looks great! 🚢 @noa-neria this should be ready for review on your end.
Description
Previously, directory stubs (that did not end with a trailing slash) were not being filtered properly. The API lists directories as blobs, which causes the model streamer to attempt to download directories as if they were files. This results in errors such as
IsADirectoryErrorwhen the streamer tries to create a file where a directory already existsThis PR introduces a mechanism to rely on the blob directory metatdata
hdi_isfolderto determine the entity type (either directory or file)Testing:
Added unit tests,
Manually imported the module for validation
Summary by CodeRabbit
Bug Fixes
Tests