fix(ci): fix RHDH OCP Orchestrator Helm e2e job failures #3929
base: release-1.8
|
/ok-to-test |
The orchestrator workflows table selector was looking for "WorkflowsNameCategoryLast" but the actual UI only displays columns: Name, Workflow Status, Last run, Last run status, Description, Actions. The "Category" column does not exist in the release-1.8 UI, causing the orchestrator RBAC tests to fail with element not found errors.

This fix updates the selector to match the actual table header text "Workflows", which is present in the UI.

Backported from commit f17d95b (PR redhat-developer#3406) in main branch.

Fixes failing test:
- Test Orchestrator RBAC > Test global orchestrator workflow access is allowed

Related: FLPATH-2798
… install

Add --wait --timeout=5m flags to the greeting workflow helm install command to ensure workflow pods are ready before tests execute.

Without --wait, the helm command returns immediately while pods are still initializing, which can cause:
- Tests to run before workflows are available
- Race conditions between workflow deployment and test execution
- Pods experiencing CreateContainerConfigError during startup

With --wait, helm monitors the release and only returns success when all pods are Running and pass readiness probes. The 5-minute timeout provides ample time for the pods to start (observed ready time: ~90 seconds).

This ensures tests only run against fully-initialized infrastructure and provides clearer failure messages if pods cannot start.

Related: FLPATH-2798
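A minimal sketch of the install call described in this commit, not the actual pipeline code. The function name and its three arguments (release, chart, namespace) are hypothetical placeholders; only the `--wait --timeout=5m` flags come from the commit message.

```shell
# Hypothetical helper: release name, chart reference and namespace are
# placeholders supplied by the caller, not values taken from the pipeline.
install_workflow_chart() {
  local release="$1" chart="$2" namespace="$3"
  # --wait blocks until the release's resources report ready;
  # --timeout bounds that wait so a stuck pod fails the step instead of hanging.
  helm install "$release" "$chart" \
    --namespace "$namespace" \
    --wait --timeout=5m
}
```

Because the function returns helm's exit status, a caller can stop the job early, for example `install_workflow_chart greeting ./chart showcase-rbac || exit 1`.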
…se creation

Add manual database creation workaround for showcase-rbac deployment to handle SSL-required connections to external Crunchy Data PostgreSQL clusters.

The helm chart's create-sonataflow-database job does not inject PGSSLMODE environment variable, causing authentication failures when connecting to external PostgreSQL instances that require SSL (Crunchy Data operator).

This fix adds:
- create_sonataflow_database_with_ssl() helper function
- Temporary pod that runs psql with PGSSLMODE=require
- Proper SSL configuration from postgres-cred secret

Without SSL configuration:
FATAL: no pg_hba.conf entry for host "X.X.X.X", user "janus-idp", database "postgres", no encryption

This resolves CrashLoopBackOff issues in showcase-rbac namespace for:
- greeting workflow
- user-onboarding workflow
- sonataflow-platform-data-index-service
- sonataflow-platform-jobs-service

Related: FLPATH-2798
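The core of the workaround is that psql honors the libpq `PGSSLMODE` environment variable. The sketch below is a simplified illustration that calls psql directly; the PR's helper runs it inside a temporary pod and reads credentials from the postgres-cred secret, which is not reproduced here. The function name and its arguments are hypothetical.

```shell
# Hypothetical simplification: host, user and database name are caller-supplied.
# Credentials (e.g. via PGPASSWORD) are assumed to be set up elsewhere.
create_database_over_ssl() {
  local host="$1" user="$2" new_db="$3"
  # PGSSLMODE=require makes libpq refuse an unencrypted connection,
  # which is what a server rejecting "no encryption" clients expects.
  PGSSLMODE=require psql -h "$host" -U "$user" -d postgres \
    -c "CREATE DATABASE \"$new_db\""
}
```

Setting the variable as a command prefix keeps it scoped to the psql invocation rather than exporting it for the rest of the script.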
- Increase timeout from 2 minutes to 5 minutes to handle image pull delays and rate limiting
- Add database verification step to confirm successful creation
- Improve status reporting during pod creation with status change logging
- Add wait for jobs-service rollout before deploying workflows to prevent race conditions
- Better error handling and logging throughout the process

This addresses issues where the manual database creation pod was timing out due to ImagePullBackOff delays (QPS exceeded) in the CI environment.
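The timeout and status-change logging described above can be sketched as a small polling loop. This is an illustration under stated assumptions, not the PR's code: `wait_for_pod_phase` is a hypothetical name, and the first argument is assumed to be a caller-supplied command that prints the pod's current phase.

```shell
# Hypothetical polling helper. $1: command printing the current phase,
# $2: timeout in seconds, $3: poll interval in seconds (default 5).
wait_for_pod_phase() {
  local get_phase="$1" timeout_s="$2" interval_s="${3:-5}"
  local elapsed=0 phase="" last=""
  while [ "$elapsed" -lt "$timeout_s" ]; do
    phase="$($get_phase)"
    # Log only when the phase changes, to keep CI output readable.
    if [ "$phase" != "$last" ]; then
      echo "status: ${phase:-unknown}"
      last="$phase"
    fi
    case "$phase" in
      Succeeded) return 0 ;;
      Failed) echo "ERROR: pod failed" >&2; return 1 ;;
    esac
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
  echo "ERROR: timed out after ${timeout_s}s" >&2
  return 1
}
```

With the 5-minute limit from this commit, a call would look like `wait_for_pod_phase my_phase_command 300`, where `my_phase_command` is whatever the caller uses to query the pod.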
Separate variable declarations from assignments to avoid masking return values.

This resolves ShellCheck warnings in:
- create_sonataflow_database_with_ssl() function (line 889)
- verify_sonataflow_database() function (lines 983, 992)
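To illustrate why the declaration and assignment are split: `local` returns its own exit status, so a failing command substitution on the same line is hidden. The functions below are hypothetical stand-ins, not code from the pipeline script.

```shell
# Hypothetical command that prints partial output and then fails.
fetch_value() { echo "partial"; return 3; }

masked() {
  # 'local' succeeds, so the exit status 3 of fetch_value is lost.
  local value="$(fetch_value)"
  local rc=$?
  echo "status=$rc value=$value"
}

unmasked() {
  # Declare first, assign second: the assignment carries the real status.
  local value rc=0
  value="$(fetch_value)" || rc=$?
  echo "status=$rc value=$value"
}
```

Calling `masked` prints `status=0 value=partial`, while `unmasked` prints `status=3 value=partial`, so only the second form lets the caller react to the failure.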
- Return error code 1 when database creation pod fails
- Return error code 1 when database creation times out
- Clean up pod and show logs before returning on failure
- Change WARNING to ERROR for actual failure cases
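The failure path listed above (show logs, clean up, then return 1) might look roughly like the following. This is a sketch with hypothetical names; the three arguments are assumed to be caller-supplied functions that create the database, print the pod logs, and delete the pod.

```shell
# Hypothetical wrapper; $1, $2 and $3 name caller-supplied functions.
create_or_fail() {
  local create_step="$1" show_logs="$2" cleanup="$3"
  if "$create_step"; then
    "$cleanup" || true
    return 0
  fi
  echo "ERROR: database creation failed" >&2
  "$show_logs" || true   # print logs before the pod is removed
  "$cleanup" || true     # cleanup problems must not mask the real failure
  return 1
}
```

Returning a non-zero status, rather than only logging a warning, lets the calling deployment step abort instead of continuing against a missing database.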
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
The securityContext with readOnlyRootFilesystem: true was preventing psql from working properly because it needs to write temporary files to /tmp during SSL connections to the external PostgreSQL database.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
The default 2Gi ephemeral volume for dynamic-plugins-root is insufficient when many plugins are enabled (orchestrator, kubernetes, tekton, techdocs, keycloak, etc.). The init container fails with "No space left on device" error during plugin extraction.

Increase the volume size to 5Gi for both showcase and RBAC namespaces using the deployment.patch field in the Backstage CR.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
The default 10-second actionTimeout was being exceeded when the Keycloak popup was slow to render, causing orchestrator RBAC tests to fail during authentication setup.

Add explicit waitFor with 30-second timeout before interacting with the Keycloak login form to handle slow responses.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
29a5eae to c6a9203
|
/test e2e-tests |
|
|
You are above your monthly Qodo Merge usage quota. If you are a paying user, please link your GitHub/GitLab/Bitbucket account with your qodo account here to claim your seat. To allow usage organization-wide without linking, please reach to Qodo. |
|
/test e2e-ocp-helm |
|
The image is available at:
/test e2e-ocp-helm |
|
/retest |
|
/test e2e-ocp-helm |
|
/test e2e-ocp-helm |
|
/test e2e-ocp-helm |
|
/test e2e-ocp-helm |
|
@christoph-jerolimov @subhashkhileri I'm working with @gustavolira to help stabilize the ocp helm/operator rhdh jobs (related to orchestrator). As @gustavolira is out until next year, can one of you take a look at this? |
|
The image is available at:
/test e2e-ocp-helm
Summary
Fix multiple issues causing RHDH OCP Orchestrator Helm e2e jobs (e2e-ocp-helm) to fail in the showcase-rbac namespace.
Root Cause: The helm chart's create-sonataflow-database job does not include the PGSSLMODE environment variable, causing database creation to fail when connecting to external PostgreSQL instances that require SSL (Crunchy Data PostgreSQL).
Fixes included:
Jira: RHDHBUGS-2449
Test plan
Test Results
✅ Tested 5 times - all helm deployments deployed without issue and all runs passed with zero failures.
Note: Variance in passed/skipped counts is due to conditional test skipping in `rbac.spec.ts` based on environment timing, not failures.

🤖 Generated with Claude Code