fix: route dataset manifest and file preview through service for private buckets#795

Open
KeitaW wants to merge 3 commits into NVIDIA:main from KeitaW:fix/private-bucket-manifest-793

Conversation


@KeitaW KeitaW commented Apr 3, 2026

Summary

Fixes #793 — dataset file browser and preview broken on private S3 buckets.

  • Added GET /{bucket}/dataset/{name}/manifest service endpoint that reads manifest JSON from storage using bucket_config.default_credential (supports S3, GCS, Azure, Swift, TOS)
  • Added GET /{bucket}/dataset/{name}/file-content service endpoint that streams individual file content with storage path validation
  • Updated UI to fetch manifests via server action calling the backend directly (bypasses API gateway auth, works during SSR)
  • Updated file preview proxy to route through the service endpoint when storagePath is available

Test plan

  • Verify file browser loads on a dataset with a private S3 bucket
  • Verify file preview (image, video, text) works on private bucket files
  • Verify file browser still works on public bucket datasets (backward compat)
  • UI checks: pnpm type-check && pnpm lint && pnpm test --run — all passing (751/751)
  • Mock mode: pnpm dev:mock → navigate to dataset detail → file browser loads

Summary by CodeRabbit

Release Notes

  • New Features

    • Added API endpoints for fetching dataset manifests and streaming file content.
    • Enhanced file preview panel with improved dataset access handling and validation.
  • Improvements

    • Streamlined dataset identification using bucket, name, and version for more consistent access patterns across the platform.

KeitaW added 3 commits April 3, 2026 00:58
…t private buckets (NVIDIA#793)

The UI's fetchManifest and file preview proxy performed unsigned fetch()
against S3 HTTPS URLs, which fails with 403 on private buckets.

Added two service-side proxy endpoints:
- GET /{bucket}/dataset/{name}/manifest — reads manifest JSON from storage
  using bucket credentials (supports S3, GCS, Azure, Swift, TOS)
- GET /{bucket}/dataset/{name}/file-content — streams individual file
  content with storage_path validation against dataset container

Updated UI to call these endpoints through the existing /api catch-all
proxy instead of direct unsigned fetch. Removed unused fetchManifest
server action files.
The file preview panel sends a HEAD request before GET to check
content-type and access. FastAPI's @router.get does not handle HEAD,
returning 405 Method Not Allowed. Changed to @router.api_route with
both GET and HEAD methods.
fetchDatasetFiles used a relative fetch('/api/bucket/...') which fails
during SSR because the request goes through the proxy without browser
auth cookies, resulting in 403 from the API gateway.

Re-introduced the server action pattern: fetchManifest now calls the
backend service directly using getServerApiBaseUrl() (internal URL),
bypassing the auth gateway. Works for both SSR and client hydration.
@KeitaW KeitaW requested a review from a team as a code owner April 3, 2026 00:59
@coderabbitai

coderabbitai bot commented Apr 3, 2026

📝 Walkthrough

The PR adds backend API endpoints for fetching dataset manifests and file content, shifting manifest and file access from direct unsigned frontend fetches against storage to authenticated, service-proxied requests, which resolves access failures on private buckets.

Changes

Cohort / File(s) Summary
Backend API Endpoints
src/service/core/data/data_service.py
Added GET /{bucket}/dataset/{name}/manifest endpoint returning manifest JSON from validated READY version, and GET/HEAD /{bucket}/dataset/{name}/file-content endpoint streaming file bytes with storage path container validation.
Frontend Proxy Route
src/ui/src/app/proxy/dataset/file/route.impl.production.ts
Changed request parameter parsing from url-only to bucket/name/storagePath; replaced direct storage fetch with fetchUpstream service call; added response header forwarding via forwardResponseHeaders.
Frontend Components & Hooks
src/ui/src/features/datasets/detail/components/dataset-detail-content.tsx, src/ui/src/features/datasets/detail/components/file-preview-panel.tsx, src/ui/src/lib/api/adapter/datasets-hooks.ts, src/ui/src/lib/api/adapter/datasets.ts
Resolved and passed dataset identity (bucket, name, version) through component hierarchy and hook calls; refactored manifest fetching to accept these parameters instead of deriving from location alone; updated proxy URL construction to use bucket/name/storagePath.
Production Manifest Fetcher
src/ui/src/lib/api/server/dataset-actions.production.ts
Updated fetchManifest signature from (url) to (bucket, name, version); constructs and calls backend API endpoint using getServerApiBaseUrl instead of performing unsigned fetch.
Configuration & Cleanup
src/ui/next.config.ts, src/ui/src/lib/api/server/dataset-actions.ts
Removed production alias redirect for dataset-actions in Turbopack config; deleted legacy dataset-actions wrapper module (was conditional mock/production selector).
Mock Handlers
src/ui/src/mocks/handlers.ts
Added MSW handler for /api/bucket/:bucket/dataset/:name/manifest returning generated flat manifest; updated /proxy/dataset/file handler to parse both new (storagePath, name) and legacy (url) parameter schemes.

Sequence Diagram

sequenceDiagram
    actor Browser as Browser/UI
    participant Proxy as Frontend<br/>Proxy Route
    participant Service as Backend<br/>Service
    participant Storage as Object<br/>Storage (S3)

    Note over Browser,Storage: New Flow: Service-Proxied Authenticated Access
    
    Browser->>Proxy: GET /proxy/dataset/file?bucket=X&name=Y&storagePath=...
    Proxy->>Service: GET /api/bucket/X/dataset/Y/file-content?storage_path=...
    Note over Service: Validate storage_path container<br/>matches dataset hash_location
    Service->>Storage: fetch(signed_url_or_creds)
    Storage-->>Service: File bytes
    Service-->>Proxy: StreamingResponse<br/>(inferred mime-type)
    Proxy-->>Browser: Forwarded response<br/>with headers
    
    rect rgba(100, 200, 100, 0.5)
        Note over Browser,Storage: Manifest Flow
        Browser->>Proxy: Manifest request
        Proxy->>Service: GET /api/bucket/X/dataset/Y/manifest?version=V
        Service->>Storage: Load manifest from version location
        Storage-->>Service: JSON manifest
        Service-->>Browser: Parsed manifest
    end

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 From unsigned fetches fraught with 403s,
We hop through service gates with verified keys.
Manifests bloom, files stream with grace,
Private buckets now find their safe place. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

• Docstring Coverage ⚠️ Warning: Docstring coverage is 45.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
• Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
• Title check ✅ Passed: The pull request title accurately describes the main change: routing dataset manifest and file preview through the service to support private buckets.
• Linked Issues check ✅ Passed: The PR fully addresses issue #793 by implementing service endpoints for manifest and file-content retrieval that bypass unsigned fetches, and updating the UI to route through these authenticated endpoints.
• Out of Scope Changes check ✅ Passed: All changes are directly scoped to fixing issue #793: adding backend service endpoints, updating UI routing, and removing obsolete client-side code. No unrelated modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/service/core/data/data_service.py`:
- Around line 1032-1036: The current check only compares
requested_backend.container to dataset_backend.container (constructed via
storage.construct_storage_backend), which allows any path in the same container;
instead validate that the supplied storage_path is within the dataset's storage
prefix by ensuring storage_path (or the backend's full path) starts with
dataset_info.hash_location (or its normalized prefix) before proceeding; if it
does not, raise osmo_errors.OSMOUserError with a clear message. Normalize both
paths (resolve trailing slashes, case/URL encoding as appropriate) prior to the
startswith check and keep the container equality check as a fast precondition.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e0ab25ae-7964-4301-9a03-e53a0d59233a

📥 Commits

Reviewing files that changed from the base of the PR and between 762ea59 and 2b0c696.

📒 Files selected for processing (10)
  • src/service/core/data/data_service.py
  • src/ui/next.config.ts
  • src/ui/src/app/proxy/dataset/file/route.impl.production.ts
  • src/ui/src/features/datasets/detail/components/dataset-detail-content.tsx
  • src/ui/src/features/datasets/detail/components/file-preview-panel.tsx
  • src/ui/src/lib/api/adapter/datasets-hooks.ts
  • src/ui/src/lib/api/adapter/datasets.ts
  • src/ui/src/lib/api/server/dataset-actions.production.ts
  • src/ui/src/lib/api/server/dataset-actions.ts
  • src/ui/src/mocks/handlers.ts
💤 Files with no reviewable changes (2)
  • src/ui/next.config.ts
  • src/ui/src/lib/api/server/dataset-actions.ts

Comment on lines +1032 to +1036
    requested_backend = storage.construct_storage_backend(storage_path)
    dataset_backend = storage.construct_storage_backend(dataset_info.hash_location)
    if requested_backend.container != dataset_backend.container:
        raise osmo_errors.OSMOUserError(
            'Storage path does not belong to this dataset.')

⚠️ Potential issue | 🟡 Minor

Container-only validation may be overly permissive.

The validation only checks that requested_backend.container matches dataset_backend.container. This allows access to any file within the same container, not just files belonging to this specific dataset's storage prefix. Consider validating that storage_path starts with dataset_info.hash_location to restrict access to only the dataset's files.

🛡️ Proposed stricter path validation
     # Validate that the storage path belongs to this dataset's storage prefix
     requested_backend = storage.construct_storage_backend(storage_path)
     dataset_backend = storage.construct_storage_backend(dataset_info.hash_location)
-    if requested_backend.container != dataset_backend.container:
+    if requested_backend.container != dataset_backend.container or \
+       not storage_path.startswith(dataset_info.hash_location.rstrip('/')):
         raise osmo_errors.OSMOUserError(
             'Storage path does not belong to this dataset.')
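Even the stricter `startswith` check proposed above has a subtle pitfall: the prefix `ds/abc` would also match `ds/abcdef/file`. A standalone sketch of a normalized, segment-boundary prefix check (the helper name is illustrative, not part of the service code):

```python
from urllib.parse import unquote

def path_belongs_to_dataset(storage_path: str, hash_location: str) -> bool:
    """Return True if storage_path sits inside the dataset's storage prefix.

    Normalizes URL encoding and trailing slashes, and compares on path
    segment boundaries so 'ds/abc' does not match 'ds/abcdef/file'.
    """
    requested = unquote(storage_path).rstrip('/')
    prefix = unquote(hash_location).rstrip('/')
    return requested == prefix or requested.startswith(prefix + '/')

# A path under the dataset prefix is accepted...
print(path_belongs_to_dataset('s3://bkt/ds/abc/img/1.png', 's3://bkt/ds/abc/'))  # True
# ...but a sibling prefix that merely shares leading characters is rejected.
print(path_belongs_to_dataset('s3://bkt/ds/abcdef/1.png', 's3://bkt/ds/abc'))    # False
```

Keeping the container equality comparison as a fast precondition before this check, as the review suggests, is still reasonable.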
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test the code to ensure it meets the requirements.

Suggested change:
    requested_backend = storage.construct_storage_backend(storage_path)
    dataset_backend = storage.construct_storage_backend(dataset_info.hash_location)
    if requested_backend.container != dataset_backend.container or \
       not storage_path.startswith(dataset_info.hash_location.rstrip('/')):
        raise osmo_errors.OSMOUserError(
            'Storage path does not belong to this dataset.')


Development

Successfully merging this pull request may close these issues.

Dataset file browser and preview broken on private storage buckets — unsigned fetch
