fix: route dataset manifest and file preview through service for private buckets#795

Open
KeitaW wants to merge 3 commits into NVIDIA:main from KeitaW:fix/private-bucket-manifest-793

Conversation


@KeitaW KeitaW commented Apr 3, 2026

Summary

Fixes #793 — dataset file browser and preview broken on private S3 buckets.

  • Added GET /{bucket}/dataset/{name}/manifest service endpoint that reads manifest JSON from storage using bucket_config.default_credential (supports S3, GCS, Azure, Swift, TOS)
  • Added GET /{bucket}/dataset/{name}/file-content service endpoint that streams individual file content with storage path validation
  • Updated UI to fetch manifests via server action calling the backend directly (bypasses API gateway auth, works during SSR)
  • Updated file preview proxy to route through the service endpoint when storagePath is available

Test plan

  • Verify file browser loads on a dataset with a private S3 bucket
  • Verify file preview (image, video, text) works on private bucket files
  • Verify file browser still works on public bucket datasets (backward compat)
  • UI checks: pnpm type-check && pnpm lint && pnpm test --run — all passing (751/751)
  • Mock mode: pnpm dev:mock → navigate to dataset detail → file browser loads

Summary by CodeRabbit

Release Notes

  • New Features

    • Added API endpoints for fetching dataset manifests and streaming file content.
    • Enhanced file preview panel with improved dataset access handling and validation.
  • Improvements

    • Streamlined dataset identification using bucket, name, and version for more consistent access patterns across the platform.

KeitaW added 3 commits April 3, 2026 00:58
…t private buckets (NVIDIA#793)

The UI's fetchManifest and file preview proxy performed unsigned fetch()
against S3 HTTPS URLs, which fails with 403 on private buckets.

Added two service-side proxy endpoints:
- GET /{bucket}/dataset/{name}/manifest — reads manifest JSON from storage
  using bucket credentials (supports S3, GCS, Azure, Swift, TOS)
- GET /{bucket}/dataset/{name}/file-content — streams individual file
  content with storage_path validation against dataset container

Updated UI to call these endpoints through the existing /api catch-all
proxy instead of direct unsigned fetch. Removed unused fetchManifest
server action files.
The file preview panel sends a HEAD request before GET to check
content-type and access. FastAPI's @router.get does not handle HEAD,
returning 405 Method Not Allowed. Changed to @router.api_route with
both GET and HEAD methods.
fetchDatasetFiles used a relative fetch('/api/bucket/...') which fails
during SSR because the request goes through the proxy without browser
auth cookies, resulting in 403 from the API gateway.

Re-introduced the server action pattern: fetchManifest now calls the
backend service directly using getServerApiBaseUrl() (internal URL),
bypassing the auth gateway. Works for both SSR and client hydration.
@KeitaW KeitaW requested a review from a team as a code owner April 3, 2026 00:59
@coderabbitai

coderabbitai bot commented Apr 3, 2026

📝 Walkthrough

The PR adds backend API endpoints for fetching dataset manifests and file content, shifting manifest and file access from direct unsigned frontend fetches against storage to authenticated, service-proxied requests, which resolves access failures on private buckets.

Changes

Cohort / File(s) Summary
Backend API Endpoints
src/service/core/data/data_service.py
Added GET /{bucket}/dataset/{name}/manifest endpoint returning manifest JSON from validated READY version, and GET/HEAD /{bucket}/dataset/{name}/file-content endpoint streaming file bytes with storage path container validation.
Frontend Proxy Route
src/ui/src/app/proxy/dataset/file/route.impl.production.ts
Changed request parameter parsing from url-only to bucket/name/storagePath; replaced direct storage fetch with fetchUpstream service call; added response header forwarding via forwardResponseHeaders.
Frontend Components & Hooks
src/ui/src/features/datasets/detail/components/dataset-detail-content.tsx, src/ui/src/features/datasets/detail/components/file-preview-panel.tsx, src/ui/src/lib/api/adapter/datasets-hooks.ts, src/ui/src/lib/api/adapter/datasets.ts
Resolved and passed dataset identity (bucket, name, version) through component hierarchy and hook calls; refactored manifest fetching to accept these parameters instead of deriving from location alone; updated proxy URL construction to use bucket/name/storagePath.
Production Manifest Fetcher
src/ui/src/lib/api/server/dataset-actions.production.ts
Updated fetchManifest signature from (url) to (bucket, name, version); constructs and calls backend API endpoint using getServerApiBaseUrl instead of performing unsigned fetch.
Configuration & Cleanup
src/ui/next.config.ts, src/ui/src/lib/api/server/dataset-actions.ts
Removed production alias redirect for dataset-actions in Turbopack config; deleted legacy dataset-actions wrapper module (was conditional mock/production selector).
Mock Handlers
src/ui/src/mocks/handlers.ts
Added MSW handler for /api/bucket/:bucket/dataset/:name/manifest returning generated flat manifest; updated /proxy/dataset/file handler to parse both new (storagePath, name) and legacy (url) parameter schemes.

Sequence Diagram

sequenceDiagram
    actor Browser as Browser/UI
    participant Proxy as Frontend<br/>Proxy Route
    participant Service as Backend<br/>Service
    participant Storage as Object<br/>Storage (S3)

    Note over Browser,Storage: New Flow: Service-Proxied Authenticated Access
    
    Browser->>Proxy: GET /proxy/dataset/file?bucket=X&name=Y&storagePath=...
    Proxy->>Service: GET /api/bucket/X/dataset/Y/file-content?storage_path=...
    Note over Service: Validate storage_path container<br/>matches dataset hash_location
    Service->>Storage: fetch(signed_url_or_creds)
    Storage-->>Service: File bytes
    Service-->>Proxy: StreamingResponse<br/>(inferred mime-type)
    Proxy-->>Browser: Forwarded response<br/>with headers
    
    rect rgba(100, 200, 100, 0.5)
        Note over Browser,Storage: Manifest Flow
        Browser->>Proxy: Manifest request
        Proxy->>Service: GET /api/bucket/X/dataset/Y/manifest?version=V
        Service->>Storage: Load manifest from version location
        Storage-->>Service: JSON manifest
        Service-->>Browser: Parsed manifest
    end

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 From unsigned fetches fraught with 403s,
We hop through service gates with verified keys.
Manifests bloom, files stream with grace,
Private buckets now find their safe place. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

• Docstring Coverage ⚠️ Warning: Docstring coverage is 45.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
• Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
• Title check ✅ Passed: The pull request title accurately describes the main change: routing dataset manifest and file preview through the service to support private buckets.
• Linked Issues check ✅ Passed: The PR fully addresses issue #793 by implementing service endpoints for manifest and file-content retrieval that bypass unsigned fetches, and updating the UI to route through these authenticated endpoints.
• Out of Scope Changes check ✅ Passed: All changes are directly scoped to fixing issue #793: adding backend service endpoints, updating UI routing, and removing obsolete client-side code. No unrelated modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/service/core/data/data_service.py`:
- Around line 1032-1036: The current check only compares
requested_backend.container to dataset_backend.container (constructed via
storage.construct_storage_backend), which allows any path in the same container;
instead validate that the supplied storage_path is within the dataset's storage
prefix by ensuring storage_path (or the backend's full path) starts with
dataset_info.hash_location (or its normalized prefix) before proceeding; if it
does not, raise osmo_errors.OSMOUserError with a clear message. Normalize both
paths (resolve trailing slashes, case/URL encoding as appropriate) prior to the
startswith check and keep the container equality check as a fast precondition.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e0ab25ae-7964-4301-9a03-e53a0d59233a

📥 Commits

Reviewing files that changed from the base of the PR and between 762ea59 and 2b0c696.

📒 Files selected for processing (10)
  • src/service/core/data/data_service.py
  • src/ui/next.config.ts
  • src/ui/src/app/proxy/dataset/file/route.impl.production.ts
  • src/ui/src/features/datasets/detail/components/dataset-detail-content.tsx
  • src/ui/src/features/datasets/detail/components/file-preview-panel.tsx
  • src/ui/src/lib/api/adapter/datasets-hooks.ts
  • src/ui/src/lib/api/adapter/datasets.ts
  • src/ui/src/lib/api/server/dataset-actions.production.ts
  • src/ui/src/lib/api/server/dataset-actions.ts
  • src/ui/src/mocks/handlers.ts
💤 Files with no reviewable changes (2)
  • src/ui/next.config.ts
  • src/ui/src/lib/api/server/dataset-actions.ts

Comment on lines +1032 to +1036
    requested_backend = storage.construct_storage_backend(storage_path)
    dataset_backend = storage.construct_storage_backend(dataset_info.hash_location)
    if requested_backend.container != dataset_backend.container:
        raise osmo_errors.OSMOUserError(
            'Storage path does not belong to this dataset.')

⚠️ Potential issue | 🟡 Minor

Container-only validation may be overly permissive.

The validation only checks that requested_backend.container matches dataset_backend.container. This allows access to any file within the same container, not just files belonging to this specific dataset's storage prefix. Consider validating that storage_path starts with dataset_info.hash_location to restrict access to only the dataset's files.

🛡️ Proposed stricter path validation
     # Validate that the storage path belongs to this dataset's storage prefix
     requested_backend = storage.construct_storage_backend(storage_path)
     dataset_backend = storage.construct_storage_backend(dataset_info.hash_location)
-    if requested_backend.container != dataset_backend.container:
+    if requested_backend.container != dataset_backend.container or \
+       not storage_path.startswith(dataset_info.hash_location.rstrip('/')):
         raise osmo_errors.OSMOUserError(
             'Storage path does not belong to this dataset.')
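Even the stricter `startswith` check proposed above has a subtle pitfall: the prefix `ds/abc` would also match `ds/abcdef/file`. A standalone sketch of a normalized, segment-boundary prefix check (the helper name is illustrative, not part of the service code):

```python
from urllib.parse import unquote

def path_belongs_to_dataset(storage_path: str, hash_location: str) -> bool:
    """Return True if storage_path sits inside the dataset's storage prefix.

    Normalizes URL encoding and trailing slashes, and compares on path
    segment boundaries so 'ds/abc' does not match 'ds/abcdef/file'.
    """
    requested = unquote(storage_path).rstrip('/')
    prefix = unquote(hash_location).rstrip('/')
    return requested == prefix or requested.startswith(prefix + '/')

# A path under the dataset prefix is accepted...
print(path_belongs_to_dataset('s3://bkt/ds/abc/img/1.png', 's3://bkt/ds/abc/'))  # True
# ...but a sibling prefix that merely shares leading characters is rejected.
print(path_belongs_to_dataset('s3://bkt/ds/abcdef/1.png', 's3://bkt/ds/abc'))    # False
```

Keeping the container equality comparison as a fast precondition before this check, as the review suggests, is still reasonable.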
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test the code to ensure it meets the requirements.

Suggested change:
    requested_backend = storage.construct_storage_backend(storage_path)
    dataset_backend = storage.construct_storage_backend(dataset_info.hash_location)
    if requested_backend.container != dataset_backend.container or \
       not storage_path.startswith(dataset_info.hash_location.rstrip('/')):
        raise osmo_errors.OSMOUserError(
            'Storage path does not belong to this dataset.')


Development

Successfully merging this pull request may close these issues.

Dataset file browser and preview broken on private storage buckets — unsigned fetch
