Refactor/ipa telco kpis prow migration rds compare test#462
Open
ccardenosa wants to merge 12 commits into openshift-kni:main from
Conversation
Implement role-based container image mirroring system for internal registry
management, supporting both mirror and removal operations with authentication.
## Problem
Telco-KPIs testing requires mirroring container images to internal registries
for disconnected environments and test image management. The previous approach used
inline playbook tasks, which could not be reused across jobs.
## Solution
Created dedicated `container_image_mirror` Ansible role with playbooks for both
mirroring and removal operations:
**Role: playbooks/roles/container_image_mirror/**
- Supports 'mirror' and 'remove' operations via parameter
- Uses skopeo for image operations
- Handles authentication with pull secrets
- Continues operation even if some images fail
- Comprehensive success/failure reporting with summary
**Playbooks:**
- `playbooks/mirror-images.yml` - Mirror images to internal registry
- `playbooks/remove-images.yml` - Remove images from registry storage
## Features
**Authentication:**
- Pull secret support for private registries (via pull_secret_string or pull_secret_path)
- System default auth when no pull secret provided
- Configurable auth file location (/tmp for bastion compatibility)
- use_pull_secret flag to control authentication method
**Registry Configuration:**
- Configurable registry host/port/namespace
- TLS verification control
- Source and destination registry support
**Operations:**
- Idempotent with existence checks
- Detailed mirror/removal summary
- Error handling continues operation on failures
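The continue-on-failure mirror loop described above could be sketched as follows; this is an illustrative sketch, and the variable names (`container_image_mirror_auth_file` in particular) are assumptions rather than the role's actual API:

```yaml
# Hypothetical sketch of the skopeo mirror loop with continue-on-failure
# semantics; variable names are illustrative, not the role's exact API.
- name: Mirror images to the internal registry
  ansible.builtin.command: >
    skopeo copy
    --authfile {{ container_image_mirror_auth_file }}
    --dest-tls-verify={{ container_image_mirror_tls_verify | bool | lower }}
    docker://{{ item.source }}
    docker://{{ item.dest }}
  loop: "{{ container_image_mirror_images }}"
  register: mirror_results
  failed_when: false  # keep going even if a single image fails

- name: Report mirror summary
  ansible.builtin.debug:
    msg: >-
      Mirrored {{ mirror_results.results | selectattr('rc', 'equalto', 0) | list | length }}
      of {{ mirror_results.results | length }} images
```

Registering the loop results and deferring failure evaluation is what makes the summary reporting possible after a partial failure.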
## Usage
**Mirror images:**
```bash
ansible-playbook playbooks/mirror-images.yml \
-e images='[{"source": "quay.io/image:tag", "dest": "registry.local/namespace/image:tag"}]' \
-e registry_host=registry.local \
-e pull_secret_string='{"auths": {...}}'
```
**Remove images:**
```bash
ansible-playbook playbooks/remove-images.yml \
-e images='[{"dest": "registry.local/namespace/image:tag"}]' \
-e registry_host=registry.local
```
## Implementation Details
**Role Structure:**
- `defaults/main.yaml` - Default variables
- `tasks/main.yaml` - Entry point with operation dispatch
- `tasks/mirror.yaml` - Mirror images using skopeo
- `tasks/remove.yaml` - Remove images from registry storage
- `meta/main.yaml` - Role metadata
- `README.md` - Comprehensive documentation
**Key Variables:**
- `container_image_mirror_operation`: "mirror" or "remove"
- `container_image_mirror_images`: List of image objects
- `container_image_mirror_registry_host`: Target registry hostname
- `container_image_mirror_pull_secret_string`: JSON pull secret
- `container_image_mirror_use_pull_secret`: Enable/disable authentication
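For illustration, a `defaults/main.yaml` for a role shaped like this might look as follows; the default values here are assumptions, not taken from the PR:

```yaml
# Illustrative defaults/main.yaml sketch; values are assumptions.
container_image_mirror_operation: mirror          # "mirror" or "remove"
container_image_mirror_images: []                 # list of {source, dest} objects
container_image_mirror_registry_host: registry.local
container_image_mirror_registry_port: 5000
container_image_mirror_use_pull_secret: false
container_image_mirror_pull_secret_string: ""
container_image_mirror_auth_file: /tmp/auth.json  # bastion-compatible location
```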
## Benefits
- Reusable role for both mirror and removal operations
- Cleaner separation of concerns
- Easier to test and maintain
- Follows eco-ci-cd role patterns (like ocp_operator_mirror)
- Well-documented with examples
- Jenkins job compatible (uses same variable names)
## Jenkins Integration
Used by `telco-kpis-mirror-ran-test-images` Jenkins job for mirroring RAN test
images to internal registries.
Related: Telco-KPIs test infrastructure
Signed-off-by: Carlos Cardenosa <ccardeno@redhat.com>
Add parse-lockdown.yml playbook that extracts deployment parameters from
lockdown JSON files, enabling decoupled parameter management for reproducible
deployments.
## Problem
Telco-KPIs testing requires exact software versions (OCP releases, operator
channels, catalogs) for reproducible deployments. Parsing lockdown JSON inline
within deployment playbooks creates tight coupling and makes parameter reuse
across multiple jobs difficult.
## Solution
Implement standalone parser playbook that runs before deployment jobs:
1. Downloads and parses lockdown JSON from URI
2. Auto-detects lockdown type (hub vs spoke) from JSON structure
3. Extracts deployment parameters
4. Outputs in shell env and JSON formats for downstream consumption
## Changes
**New playbook: playbooks/telco-kpis/parse-lockdown.yml**
- Auto-detection logic: checks for 'hub' vs 'deployment' key in JSON
- Hub parsing: extracts OCP_RELEASE_IMAGE, ACM_CHANNEL, MCE_CHANNEL, catalogs
- Spoke parsing: extracts OCP_PULL_SPEC, ZTP_PULL_SPEC, operator configurations
- SSL certificate bypass for internal GitLab instances (validate_certs: false)
- Dynamic artifact naming using lockdown filename from URI
- Outputs three artifacts per run:
- `{lockdown-name}.json`: Original lockdown file
- `{lockdown-name}-params.env`: Shell environment variables
- `{lockdown-name}-params.json`: Structured JSON parameters
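The three artifacts could be produced along these lines (a sketch; `artifact_dir` and `lockdown_params` are assumed names for the output directory and the extracted parameter dict):

```yaml
# Sketch: emit all three artifacts using the lockdown-derived base name.
- name: Save original lockdown JSON
  ansible.builtin.copy:
    content: "{{ lockdown_data | to_nice_json }}"
    dest: "{{ artifact_dir }}/{{ lockdown_filename }}.json"

- name: Write shell environment variables
  ansible.builtin.copy:
    content: |
      {% for key, value in lockdown_params.items() %}
      {{ key }}={{ value }}
      {% endfor %}
    dest: "{{ artifact_dir }}/{{ lockdown_filename }}-params.env"

- name: Write structured JSON parameters
  ansible.builtin.copy:
    content: "{{ lockdown_params | to_nice_json }}"
    dest: "{{ artifact_dir }}/{{ lockdown_filename }}-params.json"
```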
**New role: playbooks/telco-kpis/roles/lockdown_hub_config/**
- tasks/main.yml: Download, validate, and parse hub lockdown JSON
- defaults/main.yml: Default configuration values
- README.md: Comprehensive role documentation
- Used by both parse-lockdown.yml and deploy-ocp-operators.yml
## Usage Workflow
**Step 1: Parse lockdown file**
```bash
ansible-playbook playbooks/telco-kpis/parse-lockdown.yml \
-e lockdown_uri=https://gitlab.cee.redhat.com/.../lockdown-hub-x86_64.json
```
**Step 2: Use extracted parameters**
```bash
# Source env file
source lockdown-hub-x86_64-params.env
# Use in deployment
ansible-playbook playbooks/deploy-ocp-sno.yml \
-e release="${OCP_RELEASE_IMAGE}"
```
## Benefits
**Decoupling:**
- Parsing separated from deployment logic
- Parameters extracted once, reused across multiple jobs
- Easier debugging with explicit parameter artifacts
**Flexibility:**
- Supports multiple lockdown formats (hub, spoke, baseline)
- Self-documenting artifacts with actual lockdown names
- Both shell and JSON output formats
**Prow-ready:**
- Clean separation aligns with Prow step registry architecture
- Parser step can run independently, output shared via SHARED_DIR
## Key Features
**Auto-detection:**
```yaml
lockdown_type: "{{ 'hub' if ('hub' in lockdown_data) else 'spoke' }}"
```
**Dynamic artifact naming:**
```yaml
lockdown_filename: "{{ lockdown_uri | regex_replace('.*/', '') | regex_replace('.json$', '') }}"
# Result: lockdown-hub-x86_64-params.env (not generic lockdown-params.env)
```
**Hub channel transformations:**
```yaml
hub_acm_channel: "release-{{ lockdown_data.hub.acm.version_override }}"
hub_mce_channel: "{{ lockdown_data.hub.acm.mce_override | regex_replace('^v', 'stable-') | regex_replace('\\.\\d+$', '') }}"
```
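A worked example of the two filter chains above, assuming hypothetical lockdown values:

```yaml
# Assuming lockdown_data.hub.acm.mce_override == "v2.8.1":
#   regex_replace('^v', 'stable-')  -> "stable-2.8.1"
#   regex_replace('\.\d+$', '')     -> "stable-2.8"
# And with version_override == "2.13":
#   hub_acm_channel                 -> "release-2.13"
```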
## Example Artifacts
**lockdown-hub-x86_64-params.env:**
```bash
LOCKDOWN_TYPE=hub
OCP_RELEASE_IMAGE=quay.io/openshift-release-dev/ocp-release:4.20.4-x86_64
OCP_VERSION=4.20
ACM_CHANNEL=release-2.13
MCE_CHANNEL=stable-2.8
TALM_CATALOG=quay.io/.../talm-index:v4.20
GITOPS_CATALOG=quay.io/.../gitops-index:v1.15
```
**lockdown-hub-x86_64-params.json:**
```json
{
"LOCKDOWN_TYPE": "hub",
"OCP_RELEASE_IMAGE": "quay.io/openshift-release-dev/ocp-release:4.20.4-x86_64",
"OCP_VERSION": "4.20",
"ACM_CHANNEL": "release-2.13",
"MCE_CHANNEL": "stable-2.8",
"TALM_CATALOG": "quay.io/.../talm-index:v4.20",
"GITOPS_CATALOG": "quay.io/.../gitops-index:v1.15"
}
```
## Verification
Tested with both lockdown types:
- Hub lockdown: Successfully extracted OCP 4.20.4 pull spec and operator channels
- Spoke lockdown: Successfully extracted spoke deployment parameters
Artifacts correctly named with lockdown filename and contain expected parameters.
Related: Telco-KPIs reproducible deployment system
Signed-off-by: Carlos Cardenosa <ccardeno@redhat.com>
Add comprehensive troubleshooting guides for common Telco-KPIs deployment and
testing issues.
## Added Documentation
**playbooks/telco-kpis/docs/troubleshooting/**
1. **prometheus-pod-stuck-reboot-test-blocker.md**
   - Issue: Prometheus pod stuck in Init:0/1, causing reboot tests to skip
   - Root cause: Corrupted alertmanager-main-generated ConfigMap (OCPBUGS-65953, OCPBUGS-70352)
   - Impact: CNF-gotests BeforeEach health check fails
   - Workaround: Fix ConfigMap data and restart the pod
   - Prevention: Automated fix option for post-deployment playbooks
2. **k8s-exec-ipv6-fallback-issue.md**
   - Issue: kubernetes.core.k8s_exec fails with "No route to host" in dual-stack environments
   - Root cause: The Python websocket-client library does not fall back to IPv4
   - Impact: Blocks pod exec operations (BIOS/microcode collection, hardware info)
   - Solution: Use `oc exec` via ansible.builtin.shell instead
   - Verification: Tested on spree-02 cluster (2026-04-30)
## Benefits
- Faster troubleshooting with documented solutions
- Reduces repeated investigation of known issues
- Provides context (bug IDs, verification dates) for future reference
- Includes both workarounds and permanent solutions
Related: Telco-KPIs test infrastructure reliability
Implement Gitea deployment and report publishing infrastructure for hosting
Telco-KPIs test reports with vault integration and retention policies.
## Problem
Telco-KPIs test reports need centralized hosting accessible to test engineers
and stakeholders. Reports generated on bastion hosts need automated publishing
to a Git-based repository system with retention management.
## Solution
Created `gitea` Ansible role for deploying Gitea server and publishing test
reports as Markdown files with compressed artifacts.
**Role: playbooks/telco-kpis/roles/gitea/**
## Features
**Deployment:**
- Podman-based Gitea server deployment on bastion
- SQLite database with automatic migration handling
- Firewall configuration (port 3000)
- Deployment state decided by an accessibility check rather than container existence
- Admin user creation with API token management
**Report Publishing:**
- Creates organization and repositories automatically
- Publishes Markdown reports via Gitea API
- Uploads compressed tarball as release artifact
- Updates repository README with latest report links
- Retention policy: keeps last 15 reports, removes older ones
**Vault Integration:**
- Gitea credentials stored in Ansible vault
- Secure API token management
- Credential validation before operations
## Implementation Details
**Role Structure:**
- `tasks/main.yml` - Entry point with operation dispatch
- `tasks/deploy.yml` - Gitea server deployment
- `tasks/initialize.yml` - Initial configuration and admin setup
- `tasks/validate-credentials.yml` - Vault credential validation
- `tasks/create-repository.yml` - Repository creation
- `tasks/publish-report.yml` - Report publishing workflow
- `defaults/main.yml` - Default variables
- `templates/README.md.j2` - Repository README template
**Task Operations:**
- `gitea_operation: deploy` - Deploy and initialize Gitea server
- `gitea_operation: publish` - Publish test report
- `gitea_operation: validate` - Validate vault credentials
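The dispatch in `tasks/main.yml` could be sketched like this, mapping each operation to the task files listed above (an illustrative sketch, not the actual file):

```yaml
# Illustrative operation dispatch for tasks/main.yml.
- name: Validate requested operation
  ansible.builtin.assert:
    that: gitea_operation in ['deploy', 'publish', 'validate']
    fail_msg: "Unknown gitea_operation: {{ gitea_operation }}"

- name: Dispatch to the requested operation
  ansible.builtin.include_tasks: "{{ gitea_task_map[gitea_operation] }}"
  vars:
    gitea_task_map:
      deploy: deploy.yml
      publish: publish-report.yml
      validate: validate-credentials.yml
```

An explicit map keeps the operation names stable even when the underlying task filenames differ from them.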
## Usage
**Deploy Gitea:**
```yaml
- name: Deploy Gitea server
ansible.builtin.include_role:
name: gitea
vars:
gitea_operation: deploy
gitea_vault_org: telco-kpis
```
**Publish report:**
```yaml
- name: Publish test report
ansible.builtin.include_role:
name: gitea
vars:
gitea_operation: publish
gitea_vault_org: telco-kpis
gitea_vault_repo: hlxcl7-reports
gitea_report_file: /path/to/report.md
gitea_artifact_file: /path/to/artifacts.tar.gz
```
## Key Features
**Firewall Management:**
- Detects firewalld vs. iptables
- Adds port 3000 rule if not present
- Handles both firewall backends
**Database Migration:**
- Waits for database initialization on first run
- Handles migration errors gracefully
- Retries admin user creation after migration
**Repository Retention:**
- Keeps last 15 reports per repository
- Automatically deletes older reports
- Prevents unbounded repository growth
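The retention policy could be implemented against the Gitea releases API roughly as follows; endpoint paths follow the public Gitea API, but the variable names are assumptions:

```yaml
# Sketch: keep the newest 15 releases, delete the rest via the Gitea API.
- name: List existing releases (newest first)
  ansible.builtin.uri:
    url: "{{ gitea_url }}/api/v1/repos/{{ gitea_vault_org }}/{{ gitea_vault_repo }}/releases?limit=50"
    headers:
      Authorization: "token {{ gitea_api_token }}"
  register: releases

- name: Delete releases beyond the newest 15
  ansible.builtin.uri:
    url: "{{ gitea_url }}/api/v1/repos/{{ gitea_vault_org }}/{{ gitea_vault_repo }}/releases/{{ item.id }}"
    method: DELETE
    headers:
      Authorization: "token {{ gitea_api_token }}"
    status_code: 204
  loop: "{{ releases.json[15:] }}"
```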
**Error Handling:**
- Comprehensive API error checking
- Retries for transient failures
- Detailed error messages
## Benefits
- Centralized test report hosting
- Automated report publishing workflow
- Secure credential management via vault
- Retention policy prevents storage bloat
- Accessible web UI for stakeholders
- Git-based versioning of reports
## Integration
Used by `playbooks/telco-kpis/generate-report.yml` to publish aggregated test
reports from all Telco-KPIs tests (node-info, BIOS validation, performance
tests, deployment timeline).
Related: Telco-KPIs test infrastructure, report generation system
Signed-off-by: Carlos Cardenosa <ccardeno@redhat.com>
Creates a scalable test execution framework supporting multiple test types across spoke clusters, replacing dedicated per-test playbooks with a task-based architecture for better maintainability.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Carlos Cardenosa <ccardeno@redhat.com>
Add OSLAT (OS Latency) performance test for measuring operating-system-induced latency on Telco-KPIs spoke clusters. Implements the run-oslat-test.yml task with podman-based execution, artifact collection, and JUnit XML report generation for CI integration.
Related: Telco-KPIs performance testing
Add Cyclictest performance test for measuring real-time kernel latency on Telco-KPIs spoke clusters. Implements the run-cyclictest-test.yml task with podman-based execution, artifact collection to the /artifacts directory, and JUnit XML report generation.
Related: Telco-KPIs performance testing
Add PTP (Precision Time Protocol) test for validating time synchronization on Telco-KPIs spoke clusters. Implements the run-ptp-test.yml task with podman-based execution, artifact collection, and JUnit XML report generation for CI integration.
Related: Telco-KPIs performance testing
Add CPU utilization and reboot tests using the cnf-gotests framework for Telco-KPIs spoke cluster validation.
**CPU Utilization Test:**
- Supports baseline mode for establishing performance baselines
- Measures CPU usage patterns under load
- Generates JUnit XML reports
**Reboot Test:**
- Validates cluster stability across node reboots
- Uses the cnf-gotests framework
- Verifies cluster health after reboot cycles
Both tests use podman-based execution with artifact collection.
Related: Telco-KPIs performance testing
Add RFC2544 network performance test using the ran-integration framework for Telco-KPIs spoke cluster throughput and latency validation.
## Features
**Spirent Configuration:**
- Loads configuration from ran-integration inventory files
- Supports CHASSIS, STCWEB, PORT1, PORT2 parameters
- Cluster FQDN read from vault
- Fallback to hardcoded defaults
**DPDK TestPMD Pod Deployment:**
- Deploys dpdk-testpmd pod for packet processing
- Uses kubernetes.core modules for pod management
- Architecture-aware hugepages configuration (ARM64 vs x86_64)
- Pod deletion verification after test completion
**Test Execution:**
- Clones the ran-integration repository (configurable branch)
- Runs RFC2544 test scripts via podman
- Collects JUnit XML artifacts
## Implementation Details
- Spirent config parsing using shell commands (grep/cut)
- ARM detection: `is_arm` variable for architecture-specific settings
- Proper Jinja2 boolean handling (False → false in templates)
- Error handling with default() filters for missing variables
Related: Telco-KPIs performance testing, ran-integration
Add comprehensive BIOS settings validation and hardware information collection
for Telco-KPIs spoke clusters.
## Playbooks
**playbooks/telco-kpis/collect-node-info.yml:**
- Collects hardware metadata (CPU, BIOS, firmware, NIC details)
- Queries Redfish BMC for BIOS settings
- Outputs JSON artifact: node-info-{spoke}.json
**playbooks/telco-kpis/run-bios-validation.yml:**
- Validates BIOS settings against expected profile
- Supports x86_64 (TelcoOptimizedProfile) and ARM64 architectures
- Generates JUnit XML test results
## Implementation Details
**Tasks:**
- `collect-node-info.yml` - Hardware info collection task
- `run-bios-validation-test.yml` - BIOS validation task
**Key Features:**
- BMC credentials read from vault file
- Architecture normalization (amd64 → x86_64)
- Uses `oc exec` instead of kubernetes.core.k8s_exec (IPv6 fallback workaround)
- BIOS profile parsing from Redfish API
- Microcode version collection via dmidecode
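The `oc exec` workaround noted above can be sketched as follows; the pod, namespace, and command are illustrative examples, not the playbook's exact implementation:

```yaml
# Sketch: use `oc exec` instead of kubernetes.core.k8s_exec, which cannot
# fall back to IPv4 in dual-stack environments (websocket-client limitation).
- name: Collect microcode version via oc exec
  ansible.builtin.shell: >
    oc --kubeconfig {{ kubeconfig_path }}
    exec -n {{ debug_namespace }} {{ debug_pod }} --
    chroot /host dmidecode -t processor
  register: dmidecode_output
  changed_when: false
```

Because `oc` resolves the API endpoint through the system resolver, it tries both address families, which is what the Python websocket client fails to do.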
**Expected BIOS Settings (x86_64):**
- WorkloadProfile: TelcoOptimizedProfile
- ProcPwrPerf: SysDbpmTelco
- ProcC States: Enabled/Disabled per setting
- SecureBoot: Enabled with DeployedMode
Related: Telco-KPIs hardware validation
Add RDS (Reference Design Spec) comparison test for validating spoke cluster
configuration against reference deployments.
## Features
**Playbook:** playbooks/telco-kpis/run-rds-compare.yml
**Task:** tasks/run-rds-compare-test.yml
**Implementation:**
- Ensures test-runner container image exists before execution
- Uses kubectl-cluster_compare tool (added to test-runner Containerfile)
- Mounts spoke-specific artifact directory for report generation
- Artifacts mounted to both /workspace/reports/podman-runs and /workspace/reports/rds-compare
- Generates JUnit XML test results
**Container Image Management:**
- Checks for kubectl-cluster_compare binary in test-runner image
- Uses telco-kpis-test-runner:latest container
**Artifact Structure:**
- Reports written to /workspace/reports/podman-runs/{spoke}/
- Dual mount paths for compatibility with report generator
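The dual artifact mounts could look like this in the podman invocation (a sketch; the host paths and the `kubectl-cluster_compare` arguments are assumptions):

```yaml
# Sketch: run the cluster-compare test with both report mount points so the
# report generator finds artifacts under either expected path.
- name: Run RDS compare in the test-runner container
  ansible.builtin.command: >
    podman run --rm
    -v {{ artifact_dir }}/{{ spoke_name }}:/workspace/reports/podman-runs/{{ spoke_name }}:Z
    -v {{ artifact_dir }}/{{ spoke_name }}:/workspace/reports/rds-compare:Z
    -v {{ kubeconfig_path }}:/kubeconfig:ro,Z
    telco-kpis-test-runner:latest
    kubectl-cluster_compare -r /reference/metadata.yaml
  register: rds_compare_result
  failed_when: false
```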
Related: Telco-KPIs configuration validation
Signed-off-by: Carlos Cardenosa <ccardeno@redhat.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED