fix: route dataset manifest and file preview through service for private buckets #795
KeitaW wants to merge 3 commits into NVIDIA:main from …
Conversation
…t private buckets (NVIDIA#793)

The UI's fetchManifest and file preview proxy performed unsigned fetch() against S3 HTTPS URLs, which fails with 403 on private buckets. Added two service-side proxy endpoints:

- GET /{bucket}/dataset/{name}/manifest — reads manifest JSON from storage using bucket credentials (supports S3, GCS, Azure, Swift, TOS)
- GET /{bucket}/dataset/{name}/file-content — streams individual file content with storage_path validation against the dataset container

Updated the UI to call these endpoints through the existing /api catch-all proxy instead of direct unsigned fetch. Removed unused fetchManifest server action files.
The file preview panel sends a HEAD request before GET to check content-type and access. FastAPI's @router.get does not handle HEAD, returning 405 Method Not Allowed. Changed to @router.api_route with both GET and HEAD methods.
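The underlying pitfall — a GET handler does not automatically answer HEAD — is not unique to FastAPI and can be demonstrated with the standard library alone. A minimal sketch (not the service's actual code; the handler classes and the `/manifest` path are illustrative):

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class GetOnlyHandler(BaseHTTPRequestHandler):
    # Only GET is implemented; HEAD is NOT derived from it automatically.
    def do_GET(self):
        body = b'{"ok": true}'
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence request logging

class GetAndHeadHandler(GetOnlyHandler):
    # Explicit HEAD support: same status and headers as GET, no body.
    def do_HEAD(self):
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()

def probe(handler_cls):
    """Start a throwaway server, send one HEAD request, return the status."""
    server = HTTPServer(("127.0.0.1", 0), handler_cls)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    conn = http.client.HTTPConnection(*server.server_address)
    conn.request("HEAD", "/manifest")
    status = conn.getresponse().status
    conn.close()
    server.shutdown()
    return status

print(probe(GetOnlyHandler))     # error status: HEAD is unsupported
print(probe(GetAndHeadHandler))  # 200
```

The same asymmetry is what produced the 405 here: declaring the route for GET only left HEAD unhandled until the route was registered for both methods.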
fetchDatasetFiles used a relative fetch('/api/bucket/...') which fails
during SSR because the request goes through the proxy without browser
auth cookies, resulting in 403 from the API gateway.
Re-introduced the server action pattern: fetchManifest now calls the
backend service directly using getServerApiBaseUrl() (internal URL),
bypassing the auth gateway. Works for both SSR and client hydration.
📝 Walkthrough

The PR adds backend API endpoints for fetching dataset manifests and file content, shifting manifest and file access from direct unsigned frontend fetches against storage to authenticated, service-proxied requests that resolve private-bucket access failures.
Sequence Diagram

```mermaid
sequenceDiagram
    actor Browser as Browser/UI
    participant Proxy as Frontend<br/>Proxy Route
    participant Service as Backend<br/>Service
    participant Storage as Object<br/>Storage (S3)
    Note over Browser,Storage: New Flow: Service-Proxied Authenticated Access
    Browser->>Proxy: GET /proxy/dataset/file?bucket=X&name=Y&storagePath=...
    Proxy->>Service: GET /api/bucket/X/dataset/Y/file-content?storage_path=...
    Note over Service: Validate storage_path container<br/>matches dataset hash_location
    Service->>Storage: fetch(signed_url_or_creds)
    Storage-->>Service: File bytes
    Service-->>Proxy: StreamingResponse<br/>(inferred mime-type)
    Proxy-->>Browser: Forwarded response<br/>with headers
    rect rgba(100, 200, 100, 0.5)
    Note over Browser,Storage: Manifest Flow
    Browser->>Proxy: Manifest request
    Proxy->>Service: GET /api/bucket/X/dataset/Y/manifest?version=V
    Service->>Storage: Load manifest from version location
    Storage-->>Service: JSON manifest
    Service-->>Browser: Parsed manifest
    end
```
Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning — unresolved merge conflicts)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/service/core/data/data_service.py`:
- Around line 1032-1036: The current check only compares
requested_backend.container to dataset_backend.container (constructed via
storage.construct_storage_backend), which allows any path in the same container;
instead validate that the supplied storage_path is within the dataset's storage
prefix by ensuring storage_path (or the backend's full path) starts with
dataset_info.hash_location (or its normalized prefix) before proceeding; if it
does not, raise osmo_errors.OSMOUserError with a clear message. Normalize both
paths (resolve trailing slashes, case/URL encoding as appropriate) prior to the
startswith check and keep the container equality check as a fast precondition.
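The normalization the prompt asks for — trailing slashes and URL encoding resolved before a segment-aware prefix check — can be sketched with the standard library. The function name and the exact normalization steps are hypothetical; the real check lives in data_service.py and also compares backend containers first:

```python
from urllib.parse import unquote

def storage_path_in_dataset(storage_path: str, hash_location: str) -> bool:
    """Check that storage_path falls under the dataset's storage prefix.

    Illustrative helper only. Normalizes URL encoding and trailing
    slashes, then requires the match to land on a path-segment boundary
    so that 'bucket/ds-v2' does not pass for prefix 'bucket/ds'.
    """
    requested = unquote(storage_path).rstrip("/")
    prefix = unquote(hash_location).rstrip("/")
    return requested == prefix or requested.startswith(prefix + "/")

print(storage_path_in_dataset("s3://b/ds/abc/files/x.bin", "s3://b/ds/abc/"))  # True
print(storage_path_in_dataset("s3://b/ds/abcd/x.bin", "s3://b/ds/abc"))        # False
```

The segment-boundary check (`prefix + "/"`) is the detail a bare `startswith` misses: without it, a sibling dataset whose prefix shares a leading substring would validate successfully.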
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e0ab25ae-7964-4301-9a03-e53a0d59233a
📒 Files selected for processing (10)

- src/service/core/data/data_service.py
- src/ui/next.config.ts
- src/ui/src/app/proxy/dataset/file/route.impl.production.ts
- src/ui/src/features/datasets/detail/components/dataset-detail-content.tsx
- src/ui/src/features/datasets/detail/components/file-preview-panel.tsx
- src/ui/src/lib/api/adapter/datasets-hooks.ts
- src/ui/src/lib/api/adapter/datasets.ts
- src/ui/src/lib/api/server/dataset-actions.production.ts
- src/ui/src/lib/api/server/dataset-actions.ts
- src/ui/src/mocks/handlers.ts
💤 Files with no reviewable changes (2)
- src/ui/next.config.ts
- src/ui/src/lib/api/server/dataset-actions.ts
```python
requested_backend = storage.construct_storage_backend(storage_path)
dataset_backend = storage.construct_storage_backend(dataset_info.hash_location)
if requested_backend.container != dataset_backend.container:
    raise osmo_errors.OSMOUserError(
        'Storage path does not belong to this dataset.')
```
Container-only validation may be overly permissive.
The validation only checks that requested_backend.container matches dataset_backend.container. This allows access to any file within the same container, not just files belonging to this specific dataset's storage prefix. Consider validating that storage_path starts with dataset_info.hash_location to restrict access to only the dataset's files.
🛡️ Proposed stricter path validation

```diff
 # Validate that the storage path belongs to this dataset's storage prefix
 requested_backend = storage.construct_storage_backend(storage_path)
 dataset_backend = storage.construct_storage_backend(dataset_info.hash_location)
-if requested_backend.container != dataset_backend.container:
+if requested_backend.container != dataset_backend.container or \
+        not storage_path.startswith(dataset_info.hash_location.rstrip('/')):
     raise osmo_errors.OSMOUserError(
         'Storage path does not belong to this dataset.')
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
requested_backend = storage.construct_storage_backend(storage_path)
dataset_backend = storage.construct_storage_backend(dataset_info.hash_location)
if requested_backend.container != dataset_backend.container or \
        not storage_path.startswith(dataset_info.hash_location.rstrip('/')):
    raise osmo_errors.OSMOUserError(
        'Storage path does not belong to this dataset.')
```
Summary
Fixes #793 — dataset file browser and preview broken on private S3 buckets.
- GET /{bucket}/dataset/{name}/manifest — service endpoint that reads manifest JSON from storage using bucket_config.default_credential (supports S3, GCS, Azure, Swift, TOS)
- GET /{bucket}/dataset/{name}/file-content — service endpoint that streams individual file content with storage path validation
- …storagePath is available

Test plan
- pnpm type-check && pnpm lint && pnpm test --run — all passing (751/751)
- pnpm dev:mock → navigate to dataset detail → file browser loads

Summary by CodeRabbit
Release Notes
New Features
Improvements