
chore(autorag+automl): improvements - S3 timeout error handling + FileExplorer UX #7219

Merged
openshift-merge-bot[bot] merged 20 commits into opendatahub-io:main from GAUNSD:gmurcia/file-explorer-timeout-airgapped
Apr 21, 2026

Conversation

Contributor

@GAUNSD GAUNSD commented Apr 13, 2026

Issue

Description

This PR addresses two items: one major, one minor.

🔴 Major: Unreachable S3 connections in air-gapped environments

  • Testing in an air-gapped environment showed that S3 connections pointing to endpoints unreachable on the network eventually just time out when trying to list S3 files.
  • The cause: the BFF's use of the S3 SDK relied on its defaults: 3 retry attempts and no timeout. This loses the race against the general 30-second timeout typically found in OpenShift clusters.
  • The fix: give the S3 SDK's HTTP client configured defaults so it does not retry on disconnect errors and catches specific connection errors with a specific message. The UI layer catches these as well and shows a clear error message.
  • More details on logs, root cause, and evidence are in the issue.

🔵 Minor: UX improvements for 'View details' CTA in the FileExplorer

  • When FileA is selected and its details are in view, the overflow menu in the selected file area still shows 'View details'
  • The better UX is to simply hide that button since the details of that file are already in view

How Has This Been Tested?

  • Manually
  • Test cases added

Test Impact

  • New test cases added (x2 for each autorag/automl):
    • BFF behaviour
    • UI S3FileExplorer
    • UI FileExplorer

Request review criteria:

Self checklist (all need to be checked):

  • The developer has manually tested the changes and verified that the changes work
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has added tests or explained why testing cannot be added (unit or cypress tests for related changes)
  • The code follows our Best Practices (React coding standards, PatternFly usage, performance considerations)

If you have UI changes:

  • Included any necessary screenshots or gifs if it was a UI change.
  • Included tags to the UX team if it was a UI/UX change.

After the PR is posted & before it merges:

  • The developer has tested their solution on a cluster by using the image produced by the PR to main

Summary by CodeRabbit

  • Bug Fixes

    • S3/storage connectivity failures now return clear Service Unavailable (503) responses with bucket-scoped messaging.
    • Read-only S3 metadata operations now use timeouts to avoid long hangs and surface failures sooner.
    • S3 client connection behavior is more fail-fast; upload connectivity failures are surfaced as service-unavailable errors.
  • New Features

    • UI: S3 connectivity maps to a dedicated "S3 endpoint unreachable" empty state with guidance.
    • File explorer hides "View details" for files already being viewed.
  • Tests

    • Expanded coverage for connectivity classification, timeout propagation, client transport behavior, and related handler responses.

Functional changes:
- Add 10s connect/TLS timeout to S3 HTTP client via custom transport
  so unreachable endpoints (air-gapped, misconfigured) fail fast
  instead of hanging until the OpenShift route 30s gateway timeout.
- Set RetryMaxAttempts=1 on the AWS config to avoid compounding
  the timeout on dead endpoints.
- Add isS3ConnectivityError() to classify net.Error timeouts,
  net.OpError, and net.DNSError as connectivity failures.
- Add s3ConnectivityErrorMessage() to return a user-facing 503
  message explaining the likely cause and remediation steps.
- Wire connectivity error detection into all S3 handlers
  (GetS3File, PostS3File, GetS3Files, GetS3FileSchema) so they
  return 503 with the actionable message instead of a generic 500.

Test coverage:
- isS3ConnectivityError: 9-case table-driven test covering timeout,
  OpError, DNSError, wrapped variants, nil, and non-connectivity
  errors (access denied).
- s3ConnectivityErrorMessage: asserts bucket name and key phrases
  appear in the message.
- Handler-level 503 tests: GetS3File, GetS3Files, PostS3File
  (on key resolution), and PostS3File (on upload) all return 503
  with the connectivity error message when the S3 endpoint is
  unreachable.
- s3ConnectTimeout constant: asserts the value is 10s.
- NewRealS3Client: verifies client construction succeeds with
  the new timeout-aware HTTP transport.

All changes are mirrored across both automl and autorag BFFs.

Generated-by: Claude <noreply@anthropic.com>
Co-authored-by: Claude <noreply@anthropic.com>
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress This PR is in WIP state label Apr 13, 2026
Contributor

openshift-ci Bot commented Apr 13, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Contributor

coderabbitai Bot commented Apr 13, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough


This PR adds S3 connectivity classification and bucket-scoped 503 responses, introduces a 15s metadata deadline for read-only S3 calls, and configures a connect/TLS handshake timeout plus reduced retry attempts in the S3 HTTP transport. Backend S3 handlers in automl and autorag now detect network/DNS/timeout errors via isS3ConnectivityError and return serviceUnavailableResponseWithMessage(bucket). S3 client construction uses s3ConnectTimeout and RetryMaxAttempts: 1. Frontend updates map connectivity errors to an “S3 endpoint unreachable” empty state and FileExplorer now accepts filesToView to suppress “View details.”

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Security & Code Quality Issues

  1. Reduced retry policy may affect availability under transient network faults. Action: confirm intent; if not deliberate, make retries configurable or use exponential backoff. (Availability risk) — Related concern: operational resilience (no specific CWE), consider design change rather than purely security fix.

  2. User-facing messages expose bucket/endpoint details (information exposure). Action: sanitize or limit client-visible details; log full diagnostics server-side under restricted access. (CWE-200: Information Exposure)

  3. Error-classification uses concrete error matching and string checks (fragile). Action: use errors.As / errors.Unwrap to detect net.Error and context errors across wrappers; add integration tests exercising real SDK error shapes. (CWE-665: Improper Initialization — related to fragile error handling patterns)

  4. Frontend depends on substring matching of error messages (brittle). Action: add a machine-readable API error code (e.g., "S3_CONNECTIVITY") and update UI to rely on that code instead of message text. (CWE-20: Improper Input Validation — user-visible parsing of unstructured error text)

  5. Duplicate S3 helper/config logic across packages risks drift. Action: extract shared S3 config/helpers to a single internal package to centralize behavior and tests. (Maintainability; reduces risk of inconsistent behavior)

  6. Hardcoded timeouts and TLS/connect settings are not configurable (operational inflexibility). Action: surface s3ConnectTimeout and s3MetadataTimeout via configuration or environment variables and document defaults. (CWE-15: External Control of System or Configuration Setting — if later exposed incorrectly)

  7. Tests use unit mocks that may not reflect SDK wrapping/transport behavior. Action: add integration or contract tests against a controlled S3-compatible endpoint (or recorded SDK errors) to validate classification, retry, and timeout behavior. (CWE-1006: Use of Hard-coded Credentials/Mocks in tests can mask real-world failures)

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed Title clearly summarizes the two main changes: S3 timeout error handling and FileExplorer UX improvements across autorag and automl packages.
Description check ✅ Passed Description includes linked issues, detailed explanation of both changes (major S3 air-gap fix and minor UX improvement), testing approach, and completed self-checklist items including manual testing and test coverage.




Generated-by: Claude <noreply@anthropic.com>
Co-authored-by: Claude <noreply@anthropic.com>
GAUNSD and others added 2 commits April 13, 2026 16:39
Generated-by: Claude <noreply@anthropic.com>
Co-authored-by: Claude <noreply@anthropic.com>
@GAUNSD GAUNSD marked this pull request as ready for review April 13, 2026 20:54
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress This PR is in WIP state label Apr 13, 2026
@GAUNSD GAUNSD changed the title chore(autorag+automl): S3 timeout error handling improvements chore(autorag+automl): improvements - S3 timeout error handling + FileExplorer UX Apr 13, 2026
Contributor Author

GAUNSD commented Apr 13, 2026

/cc @nickmazzi @daniduong

@openshift-ci openshift-ci Bot requested review from daniduong and nickmazzi April 13, 2026 20:56
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (2)
packages/autorag/frontend/src/app/components/common/FileExplorer/__tests__/FileExplorer.spec.tsx (1)

233-243: Scope dropdown menu assertions to the opened menu container.

Lines 234–235 and 240–241 use global screen.queryByText() and screen.getByText() assertions. When multiple menu instances exist in the DOM (especially if menus remain mounted but hidden after closing), these assertions can become brittle and match unintended elements. Scope assertions to the active dropdown using within(screen.getByRole('menu')).

Proposed test hardening
       // file-2 is currently being viewed — its kebab should NOT have "View details"
       fireEvent.click(within(selectedFilesList).getByLabelText('file-2.json overflow menu'));
-      expect(screen.queryByText('View details')).not.toBeInTheDocument();
-      expect(screen.getByText('Remove selection')).toBeInTheDocument();
+      const viewedFileMenu = screen.getByRole('menu');
+      expect(
+        within(viewedFileMenu).queryByRole('menuitem', { name: 'View details' }),
+      ).not.toBeInTheDocument();
+      expect(
+        within(viewedFileMenu).getByRole('menuitem', { name: 'Remove selection' }),
+      ).toBeInTheDocument();

       // Close the dropdown by clicking the toggle again
       fireEvent.click(within(selectedFilesList).getByLabelText('file-2.json overflow menu'));

       // file-1 is NOT being viewed — its kebab SHOULD have "View details"
       fireEvent.click(within(selectedFilesList).getByLabelText('file-1.json overflow menu'));
-      expect(screen.getByText('View details')).toBeInTheDocument();
-      expect(screen.getByText('Remove selection')).toBeInTheDocument();
+      const otherFileMenu = screen.getByRole('menu');
+      expect(
+        within(otherFileMenu).getByRole('menuitem', { name: 'View details' }),
+      ).toBeInTheDocument();
+      expect(
+        within(otherFileMenu).getByRole('menuitem', { name: 'Remove selection' }),
+      ).toBeInTheDocument();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@packages/autorag/frontend/src/app/components/common/FileExplorer/__tests__/FileExplorer.spec.tsx`
around lines 233 - 243, The assertions in FileExplorer.spec.tsx are too broad
because they use global screen.getByText()/queryByText() after toggling kebab
menus for items like 'file-2.json' and 'file-1.json'; narrow these checks to the
active dropdown container by using within on the currently opened menu (e.g.,
within(screen.getByRole('menu')) or within(theOpenedMenuElement) after calling
fireEvent.click on within(selectedFilesList).getByLabelText('file-2.json
overflow menu') and the 'file-1.json overflow menu') so that the expectations
(presence/absence of 'View details' and 'Remove selection') target the correct
menu instance rather than any matching element in the DOM.
packages/automl/bff/internal/api/s3_handler.go (1)

177-187: Don’t make the UI branch on this English message.

PR context says the frontend detects the unreachable-endpoint state by substring-matching the text produced here. That contract is fragile: any copy edit, localization, or automl/autorag drift will silently break the UX. Add a stable machine-readable reason in the error payload and have the UI switch on that instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/automl/bff/internal/api/s3_handler.go` around lines 177 - 187, The
UI should not rely on English text from s3ConnectivityErrorMessage for
branching; add a stable machine-readable reason to the error payload (e.g.
constant "ReasonS3EndpointUnreachable" or "s3_endpoint_unreachable") and return
it alongside the human-facing string. Update the response/error construction
code that currently calls s3ConnectivityErrorMessage so it populates the new
reason field (modify the error response struct or the API response wrapper used
by the S3 handlers), export the reason constant, and ensure callers still send
the original message for display while the frontend switches to the new reason
value for logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/automl/bff/internal/api/s3_handler_test.go`:
- Around line 2198-2334: Add a new table-style unit test mirroring the existing
connectivity tests that targets GetS3FileSchemaHandler: create
TestGetS3FileSchemaHandler_ConnectivityError_Returns503 that sets up a mock
secret (mockS3Secret), mock Kubernetes factory
(mockKubernetesClientFactoryForSecrets), s3 client factory with
connectivityErrorS3Client
(s3mocks.NewMockClientFactory().SetMockClient(&connectivityErrorS3Client{})),
and identity, then call setupS3ApiTestWithBody with http.MethodGet against the
schema endpoint (e.g.
"/api/v1/s3/file/schema?namespace=test-namespace&secretName=aws-secret-1&bucket=my-bucket&key=file.csv"),
assert res.StatusCode == http.StatusServiceUnavailable, unmarshal into
ErrorEnvelope and assert env.Error.Code == "503" and the error message contains
the bucket name and the expected connectivity phrase (e.g. "Unable to connect")
to exercise GetS3FileSchemaHandler's 503 branch.

In `@packages/automl/bff/internal/api/s3_handler.go`:
- Around line 164-175: The connectivity classifier is missing a check for the
net.ErrClosed sentinel, so isS3ConnectivityError currently returns false for
closed keep-alive sockets; update isS3ConnectivityError to treat errors.Is(err,
net.ErrClosed) as a connectivity error (return true), in addition to the
existing timeout/*net.OpError*//*net.DNSError*/ checks, and apply the same
change to the autorag copy of isS3ConnectivityError so closed-socket errors map
to 503 as intended.

In `@packages/automl/bff/internal/integrations/s3/client_test.go`:
- Around line 140-155: The test NewRealS3Client_SetsRetryMaxAttemptsToOne is
network-dependent and doesn't actually assert RetryMaxAttempts; update it to
remove DNS by using a literal IP for EndpointURL (e.g., replace
"https://s3.amazonaws.com" with a literal https://<public-ip>) and then either
(A) make NewRealS3Client return or expose the underlying aws.Config so the test
can directly assert that RetryMaxAttempts == 1 (e.g., return config alongside
the client or add an accessor to inspect the built config), or (B) if you don't
want to change production signatures, rename the test to not claim it verifies
RetryMaxAttempts and instead only asserts successful construction without DNS
dependency; reference NewRealS3Client, S3Credentials, S3ClientOptions, and
RetryMaxAttempts when making the changes.

In `@packages/automl/bff/internal/integrations/s3/client.go`:
- Around line 103-119: The transport only limits TCP connect and TLS handshake
(t.DialContext and t.TLSHandshakeTimeout) but leaves the response phase
unbounded; update the HTTP transport created via
awshttp.NewBuildableClient().WithTransportOptions (the
httpClient/WithTransportOptions block) to also set a response-header deadline
(e.g., set t.ResponseHeaderTimeout to a new s3ResponseHeaderTimeout constant) or
ensure every S3 call using this cfg enforces a request context deadline; modify
the transport in the httpClient creation or add caller-side context timeouts so
responses cannot hang indefinitely.

In `@packages/autorag/bff/internal/api/s3_handler.go`:
- Around line 33-44: The isS3ConnectivityError helper currently checks for
net.Error, *net.OpError, and *net.DNSError but misses the sentinel
net.ErrClosed; update isS3ConnectivityError to also return true when
errors.Is(err, net.ErrClosed) so closed-connection errors are classified as
connectivity issues (locate and modify the isS3ConnectivityError function to add
the errors.Is(err, net.ErrClosed) check alongside the existing checks).

In `@packages/autorag/bff/internal/integrations/s3/client_test.go`:
- Around line 166-181: The test NewRealS3Client_SetsRetryMaxAttemptsToOne is
DNS-dependent and doesn't actually verify RetryMaxAttempts; change the test to
avoid DNS by using a literal IP for EndpointURL (e.g., an S3 IP) or a loopback
test server, and then either (a) modify NewRealS3Client to optionally return or
expose the constructed aws.Config (so the test can assert cfg.Retry.MaxAttempts
== 1) or (b) if changing signatures is undesirable, rename the test to reflect
it only verifies client creation and add a new unit test that inspects the built
config by calling the internal builder (e.g., the function that builds
aws.Config used by NewRealS3Client) to assert RetryMaxAttempts == 1; reference
NewRealS3Client, S3Credentials and the RetryMaxAttempts config value in the
change.

In `@packages/autorag/bff/internal/integrations/s3/client.go`:
- Around line 83-99: The transport only caps TCP connect and TLS handshake
timeouts (see httpClient created via awshttp.NewBuildableClient and the
transport options) but does not bound the response phase; update the transport
to set a response-header deadline (e.g., Transport.ResponseHeaderTimeout) so the
client fails fast if headers never arrive, or ensure callers use a context with
an overall request deadline when invoking the S3 client; modify the transport
configuration in the httpClient creation (the WithTransportOptions closure) to
include the response timeout or add request-deadline enforcement at the S3 call
sites.

---

Nitpick comments:
In `@packages/automl/bff/internal/api/s3_handler.go`:
- Around line 177-187: The UI should not rely on English text from
s3ConnectivityErrorMessage for branching; add a stable machine-readable reason
to the error payload (e.g. constant "ReasonS3EndpointUnreachable" or
"s3_endpoint_unreachable") and return it alongside the human-facing string.
Update the response/error construction code that currently calls
s3ConnectivityErrorMessage so it populates the new reason field (modify the
error response struct or the API response wrapper used by the S3 handlers),
export the reason constant, and ensure callers still send the original message
for display while the frontend switches to the new reason value for logic.

In
`@packages/autorag/frontend/src/app/components/common/FileExplorer/__tests__/FileExplorer.spec.tsx`:
- Around line 233-243: The assertions in FileExplorer.spec.tsx are too broad
because they use global screen.getByText()/queryByText() after toggling kebab
menus for items like 'file-2.json' and 'file-1.json'; narrow these checks to the
active dropdown container by using within on the currently opened menu (e.g.,
within(screen.getByRole('menu')) or within(theOpenedMenuElement) after calling
fireEvent.click on within(selectedFilesList).getByLabelText('file-2.json
overflow menu') and the 'file-1.json overflow menu') so that the expectations
(presence/absence of 'View details' and 'Remove selection') target the correct
menu instance rather than any matching element in the DOM.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited), Organization UI (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 4e006577-50b0-48f2-a95b-6ce2d788d542

📥 Commits

Reviewing files that changed from the base of the PR and between 38e3af3 and 53b49f6.

📒 Files selected for processing (16)
  • packages/automl/bff/internal/api/s3_handler.go
  • packages/automl/bff/internal/api/s3_handler_test.go
  • packages/automl/bff/internal/integrations/s3/client.go
  • packages/automl/bff/internal/integrations/s3/client_test.go
  • packages/automl/frontend/src/app/components/common/FileExplorer/FileExplorer.tsx
  • packages/automl/frontend/src/app/components/common/FileExplorer/__tests__/FileExplorer.spec.tsx
  • packages/automl/frontend/src/app/components/common/S3FileExplorer/S3FileExplorer.tsx
  • packages/automl/frontend/src/app/components/common/S3FileExplorer/__tests__/S3FileExplorer.spec.tsx
  • packages/autorag/bff/internal/api/s3_handler.go
  • packages/autorag/bff/internal/api/s3_handler_test.go
  • packages/autorag/bff/internal/integrations/s3/client.go
  • packages/autorag/bff/internal/integrations/s3/client_test.go
  • packages/autorag/frontend/src/app/components/common/FileExplorer/FileExplorer.tsx
  • packages/autorag/frontend/src/app/components/common/FileExplorer/__tests__/FileExplorer.spec.tsx
  • packages/autorag/frontend/src/app/components/common/S3FileExplorer/S3FileExplorer.tsx
  • packages/autorag/frontend/src/app/components/common/S3FileExplorer/__tests__/S3FileExplorer.spec.tsx

Comment thread packages/automl/bff/internal/api/s3_handler_test.go
Comment thread packages/automl/bff/internal/api/s3_handler.go
Comment thread packages/automl/bff/internal/integrations/s3/client_test.go Outdated
Comment thread packages/automl/bff/internal/integrations/s3/client.go Outdated
Comment thread packages/autorag/bff/internal/api/s3_handler.go
Comment thread packages/autorag/bff/internal/integrations/s3/client_test.go Outdated
Comment thread packages/autorag/bff/internal/integrations/s3/client.go Outdated
automl: Add TestGetS3FileSchemaHandler_ConnectivityError_Returns503 to
exercise the 503 branch when GetCSVSchema hits an unreachable S3
endpoint. Added GetCSVSchema override to connectivityErrorS3Client.

automl+autorag: Extract buildS3AWSConfig() from NewRealS3Client so
RetryMaxAttempts can be directly asserted in tests. The AWS SDK does
not expose RetryMaxAttempts on the constructed *s3.Client, so the
previous test could only verify client creation — not the config value.
The extracted function returns the raw aws.Config, enabling a real
assertion (cfg.RetryMaxAttempts == 1) without DNS or network
dependencies.

Rename TestNewRealS3Client_SetsRetryMaxAttemptsToOne to
TestNewRealS3Client_CreatesClientWithValidCredentials and switch
endpoint to a literal IP to remove DNS dependency.

Generated-by: Claude <noreply@anthropic.com>
Co-authored-by: Claude <noreply@anthropic.com>

codecov Bot commented Apr 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 64.98%. Comparing base (ab20759) to head (08eb0d1).

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #7219      +/-   ##
==========================================
+ Coverage   63.92%   64.98%   +1.06%     
==========================================
  Files        2502     2447      -55     
  Lines       77696    76159    -1537     
  Branches    19756    19216     -540     
==========================================
- Hits        49664    49489     -175     
+ Misses      28032    26670    -1362     

see 79 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.


GAUNSD and others added 2 commits April 13, 2026 17:38
Add net.ErrClosed sentinel check to isS3ConnectivityError() in both
automl and autorag BFFs. A closed keep-alive socket can surface as
net.ErrClosed without a *net.OpError wrapper, causing it to fall
through to a 500 instead of the intended 503. The new check catches
both direct and wrapped net.ErrClosed errors.

Includes unit tests for both direct and wrapped net.ErrClosed cases
in both packages.

Generated-by: Claude <noreply@anthropic.com>
Co-authored-by: Claude <noreply@anthropic.com>
Add per-call context deadlines (s3MetadataTimeout = 15s) on read-only
S3 handler operations to bound the response-header phase. net/http's
WriteTimeout sets a conn deadline but does NOT cancel r.Context(), so
if an S3 endpoint accepts the TCP connection but never sends response
headers, metadata calls could hang indefinitely. File-transfer handlers
(GetObject, UploadObject) are intentionally excluded because legitimate
large payloads can exceed any static timeout.

Handlers protected:
- autorag: GetS3FilesHandler (ListObjects)
- automl: GetS3FilesHandler (ListObjects), GetS3FileSchemaHandler (GetCSVSchema)

Tests added:
- Positive: verify ListObjects/GetCSVSchema receive a deadline-aware context
- Negative: verify GetObject does NOT get a handler-imposed deadline

Generated-by: Claude <noreply@anthropic.com>
Co-authored-by: Claude <noreply@anthropic.com>
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
packages/automl/bff/internal/api/s3_handler.go (1)

316-324: ⚠️ Potential issue | 🟠 Major

Time out the pre-upload HeadObject/collision check.

resolveNonCollidingS3Key() is still running on r.Context(), so the ObjectExists/HeadObject preflight can hang until the route timeout if the endpoint accepts the socket and then stalls. That leaves the upload path outside the new fail-fast behavior, and this 503 branch will never execute until much later.

Suggested patch
-	ctx := r.Context()
 	bucket := s3.bucket
-	resolvedKey, err := resolveNonCollidingS3Key(ctx, s3.client, bucket, key, app.effectivePostS3CollisionAttempts())
+	metadataCtx, cancel := context.WithTimeout(r.Context(), s3MetadataTimeout)
+	defer cancel()
+	resolvedKey, err := resolveNonCollidingS3Key(metadataCtx, s3.client, bucket, key, app.effectivePostS3CollisionAttempts())

Then keep the actual upload on the request context:

if err := s3.client.UploadObject(r.Context(), bucket, resolvedKey, limitedFile, contentType); err != nil {

As per coding guidelines, "HTTP clients to external services must set timeouts and use TLS verification."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/automl/bff/internal/api/s3_handler.go` around lines 316 - 324, The
pre-upload collision check is using the request context (r.Context()) and can
hang until the route timeout; change resolveNonCollidingS3Key to run with a
short, bounded context (create ctxPreflight, cancel :=
context.WithTimeout(r.Context(), <reasonableDuration>); defer cancel()) and pass
that ctxPreflight into resolveNonCollidingS3Key so HeadObject/ObjectExists calls
time out quickly; afterwards keep the actual upload on the original request
context (call s3.client.UploadObject with r.Context() and the resolvedKey) so
the upload path observes the request's fail-fast behavior and the 503 branch
(isS3ConnectivityError) can trigger promptly.
packages/autorag/bff/internal/api/s3_handler.go (1)

332-340: ⚠️ Potential issue | 🟠 Major

Bound the upload preflight with s3MetadataTimeout.

The new timeout only covers list operations. resolveNonCollidingS3Key() still does ObjectExists/HeadObject calls on r.Context(), so an endpoint that stalls after connect can still block uploads until the route timeout before any 503 handling runs.

Suggested patch
-	ctx := r.Context()
 	bucket := s3.bucket
-	resolvedKey, err := resolveNonCollidingS3Key(ctx, s3.client, bucket, key, app.effectivePostS3CollisionAttempts())
+	metadataCtx, cancel := context.WithTimeout(r.Context(), s3MetadataTimeout)
+	defer cancel()
+	resolvedKey, err := resolveNonCollidingS3Key(metadataCtx, s3.client, bucket, key, app.effectivePostS3CollisionAttempts())

Then call UploadObject with r.Context() so the upload itself stays unbounded for legitimate large transfers.

As per coding guidelines, "HTTP clients to external services must set timeouts and use TLS verification."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/autorag/bff/internal/api/s3_handler.go` around lines 332 - 340,
resolveNonCollidingS3Key is currently using r.Context(), so its
ObjectExists/HeadObject calls can hang until the route timeout; wrap the
preflight/key-resolution in a context with the s3MetadataTimeout (create ctx,
cancel := context.WithTimeout(r.Context(), s3MetadataTimeout); defer cancel())
and pass that bounded ctx into resolveNonCollidingS3Key (and any downstream
ObjectExists/HeadObject calls) to ensure quick 503s on metadata stalls; keep the
actual UploadObject call using the original r.Context() so large uploads remain
unbounded, and verify the S3 HTTP client used by s3.client respects timeouts and
TLS verification per guidelines.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/automl/bff/internal/integrations/s3/client_test.go`:
- Around line 143-147: The test uses AWS-like credential-shaped literals in the
S3Credentials passed to NewRealS3Client (AccessKeyID and SecretAccessKey), which
trigger secret scanners; update the test to use neutral, non-credential
placeholders instead (e.g., "test-access-key" and "test-secret-key" or similar)
when constructing the S3Credentials in the NewRealS3Client call so the
client/err setup remains the same but no AWS-formatted keys are present.

In `@packages/autorag/bff/internal/api/s3_handler_test.go`:
- Around line 1666-1678: The helper newTestS3Secret currently seeds a
realistic-looking AWS access key ("AKIA...") which triggers secret scanners;
change the AWS_ACCESS_KEY_ID value to a non-AWS-looking placeholder (e.g.,
"AKIAEXAMPLE" or "ACCESS_KEY_PLACEHOLDER") or derive it from the input name so
it is not a real-format credential, and update
AWS_SECRET_ACCESS_KEY/AWS_DEFAULT_REGION/AWS_S3_ENDPOINT similarly to benign
placeholders if needed; locate and edit the newTestS3Secret function to replace
those Data map entries accordingly without adding any new AKIA-style fixtures.

In `@packages/autorag/bff/internal/integrations/s3/client_test.go`:
- Around line 169-173: The test uses credential-shaped fixture values in the
S3Credentials passed to NewRealS3Client (AccessKeyID, SecretAccessKey, Region,
EndpointURL) which triggers secret scanners; replace them with clearly
synthetic/non-credential-shaped placeholders (e.g., "TEST_ACCESS_KEY",
"TEST_SECRET", "test-region", "https://example.invalid") in the S3Credentials
struct used in the client_test.go to avoid false-positive scanner alerts while
keeping the test semantics the same.

---

Outside diff comments:
In `@packages/automl/bff/internal/api/s3_handler.go`:
- Around line 316-324: The pre-upload collision check is using the request
context (r.Context()) and can hang until the route timeout; change
resolveNonCollidingS3Key to run with a short, bounded context (create
ctxPreflight, cancel := context.WithTimeout(r.Context(), <reasonableDuration>);
defer cancel()) and pass that ctxPreflight into resolveNonCollidingS3Key so
HeadObject/ObjectExists calls time out quickly; afterwards keep the actual
upload on the original request context (call s3.client.UploadObject with
r.Context() and the resolvedKey) so the upload path observes the request's
fail-fast behavior and the 503 branch (isS3ConnectivityError) can trigger
promptly.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited), Organization UI (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 6c7c8c69-e22f-477f-92c7-4b7dc96cf68c

📥 Commits

Reviewing files that changed from the base of the PR and between 53b49f6 and 82cf7ab.

📒 Files selected for processing (8)
  • packages/automl/bff/internal/api/s3_handler.go
  • packages/automl/bff/internal/api/s3_handler_test.go
  • packages/automl/bff/internal/integrations/s3/client.go
  • packages/automl/bff/internal/integrations/s3/client_test.go
  • packages/autorag/bff/internal/api/s3_handler.go
  • packages/autorag/bff/internal/api/s3_handler_test.go
  • packages/autorag/bff/internal/integrations/s3/client.go
  • packages/autorag/bff/internal/integrations/s3/client_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/automl/bff/internal/integrations/s3/client.go

Comment thread packages/automl/bff/internal/integrations/s3/client_test.go
Comment thread packages/autorag/bff/internal/api/s3_handler_test.go
Comment thread packages/autorag/bff/internal/integrations/s3/client_test.go
@GAUNSD GAUNSD marked this pull request as draft April 13, 2026 23:24
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress This PR is in WIP state label Apr 13, 2026
@GAUNSD
Contributor Author

GAUNSD commented Apr 13, 2026

What was going to be a simple tweak to enhance error handling for inaccessible S3 connections turned out to be a larger change that carries higher risk for this release.

Marking it as draft for now. Will still address review feedback.

GAUNSD and others added 2 commits April 14, 2026 12:41
Generated-by: Claude <noreply@anthropic.com>
Co-authored-by: Claude <noreply@anthropic.com>
@GAUNSD GAUNSD marked this pull request as ready for review April 17, 2026 16:29
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress This PR is in WIP state label Apr 17, 2026
@GAUNSD
Contributor Author

GAUNSD commented Apr 17, 2026

Ready for review!

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
packages/autorag/bff/internal/api/s3_handler.go (2)

29-36: Nit: timer not released early; relies on 15s expiry.

s3MetadataTimeout doc is fine. Side note on the consumer at Line 336–338: defer metadataCancel() keeps the child context (and its timer) alive for the entire PostS3FileHandler body even though the deadline is only needed for resolveNonCollidingS3Key. Since UploadObject uses r.Context() directly, you can call metadataCancel() right after the resolve call to release the timer sooner. Bounded by 15s anyway, so purely cosmetic.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/autorag/bff/internal/api/s3_handler.go` around lines 29 - 36, The
child context and its timer created in PostS3FileHandler for calling
resolveNonCollidingS3Key are kept alive by a deferred metadataCancel(), so
change the flow to call metadataCancel() immediately after
resolveNonCollidingS3Key returns (instead of deferring) to release the timer
earlier; locate the metadataCancel/resolveNonCollidingS3Key invocation in
PostS3FileHandler and move the cancel call to directly follow the
resolveNonCollidingS3Key result before proceeding to UploadObject which should
continue to use r.Context().

72-82: Cross-package duplication — symmetric copy lives in packages/automl/bff/internal/api/s3_handler.go.

isS3ConnectivityError, s3ConnectivityErrorMessage, and s3MetadataTimeout are being maintained in two BFF packages with no shared module between them. Any future change (e.g., widening the net.OpError classification, or adjusting the user-facing message) now has two landing sites and will drift. Not a blocker — the modules are intentionally independent — but consider extracting to a shared bff-common module when module consolidation is on the table.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/autorag/bff/internal/api/s3_handler.go` around lines 72 - 82, The
three functions isS3ConnectivityError, s3ConnectivityErrorMessage, and
s3MetadataTimeout are duplicated across BFF packages; extract them into a single
shared bff-common module (or package) and update both packages to import and
call the shared implementations instead of maintaining separate copies; ensure
the shared package exposes clearly named functions (e.g., IsS3ConnectivityError,
S3ConnectivityErrorMessage, S3MetadataTimeout) and update any references in the
local s3_handler.go files to use the new package to avoid drift on future
changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/autorag/bff/internal/api/s3_handler.go`:
- Around line 48-70: Update the docstring for isS3ConnectivityError to
explicitly state it only detects pre-request network failures (e.g., context
deadline, DNS failure, dial timeout, connection refused) and does NOT cover
post-dial errors like TLS handshake failures or mid-request connection resets;
replace the current ambiguous comment with the suggested clearer wording so
future readers know this function intentionally only maps those pre-dial
failures to 503 while other errors remain 500.

---

Nitpick comments:
In `@packages/autorag/bff/internal/api/s3_handler.go`:
- Around line 29-36: The child context and its timer created in
PostS3FileHandler for calling resolveNonCollidingS3Key are kept alive by a
deferred metadataCancel(), so change the flow to call metadataCancel()
immediately after resolveNonCollidingS3Key returns (instead of deferring) to
release the timer earlier; locate the metadataCancel/resolveNonCollidingS3Key
invocation in PostS3FileHandler and move the cancel call to directly follow the
resolveNonCollidingS3Key result before proceeding to UploadObject which should
continue to use r.Context().
- Around line 72-82: The three functions isS3ConnectivityError,
s3ConnectivityErrorMessage, and s3MetadataTimeout are duplicated across BFF
packages; extract them into a single shared bff-common module (or package) and
update both packages to import and call the shared implementations instead of
maintaining separate copies; ensure the shared package exposes clearly named
functions (e.g., IsS3ConnectivityError, S3ConnectivityErrorMessage,
S3MetadataTimeout) and update any references in the local s3_handler.go files to
use the new package to avoid drift on future changes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited), Organization UI (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 83b9eef6-b0ed-4388-b2bc-7a88c2794f57

📥 Commits

Reviewing files that changed from the base of the PR and between 82cf7ab and 966724e.

📒 Files selected for processing (10)
  • packages/automl/bff/internal/api/s3_handler.go
  • packages/automl/bff/internal/api/s3_handler_test.go
  • packages/automl/bff/internal/integrations/s3/client.go
  • packages/automl/bff/internal/integrations/s3/client_test.go
  • packages/automl/frontend/src/app/components/common/FileExplorer/__tests__/FileExplorer.spec.tsx
  • packages/autorag/bff/internal/api/s3_handler.go
  • packages/autorag/bff/internal/api/s3_handler_test.go
  • packages/autorag/bff/internal/integrations/s3/client.go
  • packages/autorag/bff/internal/integrations/s3/client_test.go
  • packages/autorag/frontend/src/app/components/common/FileExplorer/__tests__/FileExplorer.spec.tsx
✅ Files skipped from review due to trivial changes (2)
  • packages/automl/frontend/src/app/components/common/FileExplorer/__tests__/FileExplorer.spec.tsx
  • packages/automl/bff/internal/integrations/s3/client_test.go
🚧 Files skipped from review as they are similar to previous changes (6)
  • packages/autorag/frontend/src/app/components/common/FileExplorer/__tests__/FileExplorer.spec.tsx
  • packages/autorag/bff/internal/integrations/s3/client_test.go
  • packages/automl/bff/internal/integrations/s3/client.go
  • packages/autorag/bff/internal/integrations/s3/client.go
  • packages/automl/bff/internal/api/s3_handler.go
  • packages/automl/bff/internal/api/s3_handler_test.go

Comment thread packages/autorag/bff/internal/api/s3_handler.go Outdated
GAUNSD and others added 2 commits April 17, 2026 12:51
Generated-by: Claude <noreply@anthropic.com>
Co-authored-by: Claude <noreply@anthropic.com>
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/automl/bff/internal/api/s3_handler.go`:
- Around line 334-337: The new isS3ConnectivityError branch is indented with
spaces causing gofmt/golangci-lint failures; update the indentation of the block
containing isS3ConnectivityError(err) { ... } so it uses tabs to match the
surrounding code (the block that calls app.serviceUnavailableResponseWithMessage
and s3ConnectivityErrorMessage), then run gofmt/gofmt -w or golangci-lint to
verify formatting; ensure the isS3ConnectivityError,
app.serviceUnavailableResponseWithMessage, and s3ConnectivityErrorMessage calls
remain unchanged except for indentation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited), Organization UI (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 1653f460-ed3a-4db5-b351-47076738ae4c

📥 Commits

Reviewing files that changed from the base of the PR and between 966724e and 3997eaa.

📒 Files selected for processing (6)
  • packages/automl/bff/internal/api/s3_handler.go
  • packages/automl/bff/internal/api/s3_handler_test.go
  • packages/automl/bff/internal/integrations/s3/client.go
  • packages/autorag/bff/internal/api/s3_handler.go
  • packages/autorag/bff/internal/api/s3_handler_test.go
  • packages/autorag/bff/internal/integrations/s3/client.go
✅ Files skipped from review due to trivial changes (1)
  • packages/autorag/bff/internal/api/s3_handler.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • packages/automl/bff/internal/integrations/s3/client.go
  • packages/autorag/bff/internal/integrations/s3/client.go
  • packages/automl/bff/internal/api/s3_handler_test.go

Comment thread packages/automl/bff/internal/api/s3_handler.go Outdated
@chrjones-rh chrjones-rh self-assigned this Apr 20, 2026
@chrjones-rh
Contributor

Automated tests clean:

image

AI-assisted review clean:

Code Review

No blocking, high, or medium severity issues.

The PR has two well-scoped changes:

S3 timeout/connectivity error handling (major):
  - RetryMaxAttempts: 1 — prevents the SDK from retrying 3x against unreachable endpoints, cutting timeout from ~90s to ~10s
  - s3ConnectTimeout = 10 * time.Second — explicit connect timeout well under OpenShift's 30s route timeout
  - s3MetadataTimeout = 15 * time.Second — context deadline for metadata operations (list, schema, exists) ensuring fast failure
  - isS3ConnectivityError() — comprehensive check covering context.DeadlineExceeded, net.Error.Timeout(), net.OpError{Op: "dial"}, net.DNSError, and net.ErrClosed
  - Returns 503 with user-friendly message instead of generic 500 — correct HTTP semantics for unreachable backends
  - Applied consistently across all S3 handlers (GetFile, GetFileSchema, GetFiles, PostFile) in both packages
  - Good test coverage with dedicated connectivityErrorS3Client mock

FileExplorer UX fix (minor):
  - Hides "View details" kebab action when the file's details are already displayed
  - Clean conditional based on whether the file is currently being viewed
  - Tests cover both cases (file being viewed vs not)

Clean, well-tested PR. No issues to address.

Manual testing underway.

@chrjones-rh
Contributor

@GAUNSD do we also want to disable or hide the view details action from the main file list component when the details are already active for that specific file?

image

@GAUNSD
Contributor Author

GAUNSD commented Apr 20, 2026

@GAUNSD do we also want to disable or hide the view details action from the main file list component when the details are already active for that specific file?

Good catch @chrjones-rh! I'll add that now

@GAUNSD
Contributor Author

GAUNSD commented Apr 20, 2026

To avoid a situation where the overflow menu for a table row is empty ("View details" is hidden for files that are being viewed, and "Remove selection" is only shown for selected files):
image

I have three options:

  • A: Remove the ... overflow menu CTA (don't like this one; it's awkward for just one row to be missing it)
  • B: Simply disable the item instead of removing it
  • C: Add a new option called "Select file"

Will go with C; that way we never render an empty overflow menu, and we gain another useful way of selecting a file.

@GAUNSD
Contributor Author

GAUNSD commented Apr 20, 2026

Actually, @chrjones-rh, I'll get it added in a future PR.

GAUNSD and others added 2 commits April 20, 2026 17:46
- Hide "View details" kebab action when file is already being viewed
- Add "Select file"/"Select folder" kebab action for unselected items
- Delegate onRemoveSelection to parent handler to clear filesToView
- Add table row kebab actions test suite (7 tests)
- Fix eye icon test that relied on redundant "View details" click

Generated-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
@GAUNSD
Contributor Author

GAUNSD commented Apr 20, 2026

With the other related PR now merged in I just pushed some fixes and addressed your earlier point @chrjones-rh.

@chrjones-rh
Contributor

Automated tests are clean:
image

Manual tests look good:
image

image image

/lgtm
/approve

@GAUNSD
Contributor Author

GAUNSD commented Apr 21, 2026

/retest

@openshift-ci openshift-ci Bot removed the lgtm label Apr 21, 2026
@chrjones-rh
Contributor

/lgtm
/approve

@openshift-ci openshift-ci Bot added the lgtm label Apr 21, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Apr 21, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chrjones-rh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot Bot merged commit b87f1ee into opendatahub-io:main Apr 21, 2026
54 of 55 checks passed
@GAUNSD GAUNSD deleted the gmurcia/file-explorer-timeout-airgapped branch April 21, 2026 20:09