Claude Project Context

This file provides project context for Claude-based development environments (Claude Code, claude.ai, etc.). For Cursor IDE users, see .cursor/rules/ for auto-loaded context rules.

For non-Cursor environments, see .cursor/prompts/ for task-specific guides:

run-tests.md - Run pytest with various options
troubleshoot-tests.md - Diagnose test failures
connect-cluster.md - Set up OpenShift cluster access
deploy-chart.md - Deploy the Helm chart
check-logs.md - View component logs
debug-e2e.md - Debug E2E test failures
download-ci-artifacts.md - Download CI artifacts from Prow/GCS

Project Overview

This is a Helm chart for deploying Red Hat Cost Management on-premise (cost-onprem). It includes comprehensive pytest-based testing infrastructure.

Key Directories

cost-onprem/ - Helm chart templates and values
tests/ - Pytest test suite
scripts/ - Deployment and testing scripts
docs/ - Documentation

Requirements

Python 3.10+ (CI uses Python 3.11)
OpenShift CLI (oc) with cluster access
Helm 3.x
gcloud CLI (for downloading CI artifacts)

Code Style

Python: Follow PEP 8, use type hints where helpful
Bash: Use set -euo pipefail, quote variables
YAML: 2-space indentation for Helm templates
Tests: Use descriptive names, include docstrings

Testing Infrastructure

OpenShift CI Execution Flow

In OpenShift CI, tests are executed via the insights-onprem-cost-onprem-chart-e2e step:

CI Step Registry: insights-onprem/cost-onprem-chart/e2e/
├── insights-onprem-cost-onprem-chart-e2e-commands.sh  # Main CI script
└── insights-onprem-cost-onprem-chart-e2e-ref.yaml     # Step definition

CI Execution Sequence:

1. Dependencies: Installs yq, kubectl, helm, oc
2. S4 Setup: Reads config from insights-onprem-s4-deploy step
3. Cost Management Metrics Operator: Installs via OLM (stable channel)
4. Helm Wrapper: Injects S4 storage config for cost-onprem chart
5. Deploy & Test: Runs scripts/deploy-test-cost-onprem.sh --namespace cost-onprem --verbose

Default CI Test Run:

# What CI executes (via deploy-test-cost-onprem.sh):
NAMESPACE=cost-onprem ./scripts/run-pytest.sh

# Equivalent to:
pytest -m "not extended" --junit-xml=reports/junit.xml

CI runs ~88 tests in ~3 minutes (excludes extended tests that require ODF/S3).
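
Because the CI run writes a JUnit report to reports/junit.xml, the pass/fail counts can be read back from that file. The following is a minimal sketch, assuming the usual pytest JUnit layout (one or more `testsuite` elements carrying `tests`, `failures`, `errors`, and `skipped` attributes). The function name `summarize_junit` and the sample report are hypothetical and are not part of this repository.

```python
import xml.etree.ElementTree as ET


def summarize_junit(xml_text: str) -> dict:
    """Sum the test counters across every <testsuite> element in a JUnit report."""
    root = ET.fromstring(xml_text)
    totals = {"tests": 0, "failures": 0, "errors": 0, "skipped": 0}
    # root.iter() also yields the root itself, so a bare <testsuite> root works too.
    for suite in root.iter("testsuite"):
        for key in totals:
            totals[key] += int(suite.get(key, 0))
    return totals


# Illustrative report only, not real CI output.
SAMPLE = """<testsuites>
  <testsuite name="pytest" tests="3" failures="1" errors="0" skipped="1"/>
</testsuites>"""

print(summarize_junit(SAMPLE))
```

For a real run, pass the contents of reports/junit.xml instead of `SAMPLE`, for example `summarize_junit(open("reports/junit.xml").read())`.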

Running Tests Locally

# Run all tests (including UI)
NAMESPACE=cost-onprem ./scripts/run-pytest.sh

# Run all tests except UI
NAMESPACE=cost-onprem ./scripts/run-pytest.sh --no-ui

# Specific suites
./scripts/run-pytest.sh --helm
./scripts/run-pytest.sh --auth
./scripts/run-pytest.sh --e2e
./scripts/run-pytest.sh --infrastructure
./scripts/run-pytest.sh --ros
./scripts/run-pytest.sh --ui

Test Markers

component - Single-component tests
integration - Multi-component tests
extended - Long-running tests (skipped by default in CI)
smoke - Quick validation tests

Test Cleanup

E2E_CLEANUP_BEFORE=true   # Clean before tests (default)
E2E_CLEANUP_AFTER=true    # Clean after tests (default)
E2E_RESTART_SERVICES=false # Restart Valkey/listener (optional)
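
As an illustration of how these switches and their defaults could be resolved, here is a small sketch. It is not the test suite's actual parsing code: the helper name `read_cleanup_flags` is hypothetical, and it assumes that only the literal string "true" (in any case) enables a flag.

```python
import os
from typing import Mapping, Optional

# Defaults as listed above.
DEFAULTS = {
    "E2E_CLEANUP_BEFORE": True,
    "E2E_CLEANUP_AFTER": True,
    "E2E_RESTART_SERVICES": False,
}


def read_cleanup_flags(env: Optional[Mapping[str, str]] = None) -> dict:
    """Resolve the three cleanup switches, falling back to the documented defaults."""
    env = os.environ if env is None else env
    flags = {}
    for name, default in DEFAULTS.items():
        raw = env.get(name)
        # Assumption: an unset variable keeps its default; otherwise only "true" enables it.
        flags[name] = default if raw is None else raw.strip().lower() == "true"
    return flags
```

Passing an explicit mapping instead of relying on `os.environ` keeps the helper easy to exercise in unit tests.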

Kubernetes Label Conventions

IMPORTANT: The chart uses app.kubernetes.io/component for pod selection, NOT app.kubernetes.io/name.

Component	Label Selector
Database	`app.kubernetes.io/component=database`
Ingress	`app.kubernetes.io/component=ingress`
Kruize	`app.kubernetes.io/component=ros-optimization`
ROS API	`app.kubernetes.io/component=ros-api`
ROS Processor	`app.kubernetes.io/component=ros-processor`
Cache (Valkey)	`app.kubernetes.io/component=cache`
Koku Listener	`app.kubernetes.io/component=listener`
MASU	`app.kubernetes.io/component=cost-processor`
Celery Workers	`app.kubernetes.io/component=cost-worker`
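
The table above can be captured as a lookup so that scripts never hand-type a selector. This is only a sketch: the names `COMPONENT_LABELS`, `label_selector`, and `kubectl_logs_args` are hypothetical helpers, while the label values are copied from the table.

```python
# Component name -> value of the app.kubernetes.io/component label (from the table above).
COMPONENT_LABELS = {
    "Database": "database",
    "Ingress": "ingress",
    "Kruize": "ros-optimization",
    "ROS API": "ros-api",
    "ROS Processor": "ros-processor",
    "Cache (Valkey)": "cache",
    "Koku Listener": "listener",
    "MASU": "cost-processor",
    "Celery Workers": "cost-worker",
}


def label_selector(component: str) -> str:
    """Build the selector string for a component; raises KeyError for unknown names."""
    return f"app.kubernetes.io/component={COMPONENT_LABELS[component]}"


def kubectl_logs_args(component: str, namespace: str = "cost-onprem", tail: int = 100) -> list:
    """Return an argv list suitable for subprocess.run(); nothing is executed here."""
    return ["kubectl", "logs", "-n", namespace, "-l", label_selector(component), f"--tail={tail}"]
```

For example, `kubectl_logs_args("Koku Listener")` yields the same arguments as the listener log command in the next section.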

Common Commands

# Check pod status
kubectl get pods -n cost-onprem -l app.kubernetes.io/instance=cost-onprem

# View logs by component
kubectl logs -n cost-onprem -l app.kubernetes.io/component=listener --tail=100
kubectl logs -n cost-onprem -l app.kubernetes.io/component=ros-processor --tail=100

Troubleshooting Guide

Tests Skipping with "Database pod not found"

Tests locate the database pod with the app.kubernetes.io/component=database label selector, so they skip when no pod carries that label. Check with:

kubectl get pods -n cost-onprem -l app.kubernetes.io/component=database

S3 SignatureDoesNotMatch Errors

Root Cause: boto3 defaults to virtual-hosted-style URLs, which don't work with NooBaa/Ceph RGW.
Fix: The chart configures boto3 for path-style S3 addressing via:

cost-onprem-aws-config ConfigMap (sets addressing_style = path)
AWS_CONFIG_FILE=/etc/aws/config environment variable
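
To confirm that a given AWS config file really requests path-style addressing, the file can be parsed with the standard library. The sketch below assumes the common AWS config layout with a nested `s3` block under a `[default]` profile. The exact contents of the cost-onprem-aws-config ConfigMap are not shown here, so `SAMPLE_CONFIG` is an illustrative guess and `s3_addressing_style` is a hypothetical helper.

```python
import configparser
from typing import Optional

# Assumed file contents; the real ConfigMap may differ.
SAMPLE_CONFIG = """\
[default]
s3 =
    addressing_style = path
"""


def s3_addressing_style(config_text: str, profile: str = "default") -> Optional[str]:
    """Return the s3 addressing_style set for a profile, or None if it is not set."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    if not parser.has_section(profile):
        return None
    # The indented lines under "s3 =" are read as a multi-line value.
    nested = parser.get(profile, "s3", fallback="")
    for line in nested.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "addressing_style":
            return value.strip()
    return None
```

Inside a pod, the text to check would be whatever file the AWS_CONFIG_FILE variable points to.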

ROS Processor TLS Certificate Errors

Symptom: "x509: certificate signed by unknown authority" or CSV parse errors.
Root Cause: Go's x509.SystemCertPool() doesn't include the OpenShift service CA.
Fix: The chart uses initContainer.prepareCABundle to combine CAs.
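
Conceptually, combining CAs means concatenating several PEM files into one bundle that the process then trusts. The sketch below shows that idea only. It is not the chart's init container, which may be implemented quite differently, and the function name `combine_ca_bundles` and any file paths passed to it are hypothetical.

```python
from pathlib import Path
from typing import Iterable


def combine_ca_bundles(sources: Iterable[Path], destination: Path) -> int:
    """Concatenate PEM files into one bundle; return the number of certificates written."""
    parts = []
    for source in sources:
        text = Path(source).read_text().strip()
        if text:  # skip empty files so the bundle has no blank blocks
            parts.append(text)
    bundle = "\n".join(parts) + "\n"
    Path(destination).write_text(bundle)
    return bundle.count("-----BEGIN CERTIFICATE-----")
```

A return value lower than expected is a quick hint that one of the source bundles was empty or not mounted.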

Summary Tables Not Populated (test_06)

Root Cause: manifest.json must include start and end date fields.
Fix: tests/utils.py create_upload_package() includes these fields.

Kruize Experiments Not Created (test_07)

Possible Causes:

TLS certificate issues
S3 URL encoding issues
Wrong data format - NISE must use --ros-ocp-info flag

Verification:

kubectl logs -n cost-onprem -l app.kubernetes.io/component=ros-processor --tail=50

Helm Upgrade "field is immutable"

Root Cause: Label changes require a fresh install.

helm uninstall cost-onprem -n cost-onprem
helm install cost-onprem ./cost-onprem -n cost-onprem -f openshift-values.yaml --wait
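
The symptom strings named in this guide can be matched against captured log or command output for a first-pass triage. The following is a rough sketch: `KNOWN_SYMPTOMS` and `triage` are hypothetical names, the hints merely paraphrase the sections above, and real output may word these errors differently.

```python
# Symptom substring -> short pointer back to the relevant section of this guide.
KNOWN_SYMPTOMS = {
    "Database pod not found": "Check the app.kubernetes.io/component=database label selector",
    "SignatureDoesNotMatch": "Confirm path-style S3 addressing (cost-onprem-aws-config, AWS_CONFIG_FILE)",
    "certificate signed by unknown authority": "Check the CA bundle built by initContainer.prepareCABundle",
    "field is immutable": "Label change detected: uninstall and reinstall the release",
}


def triage(log_text: str) -> list:
    """Return the hints whose symptom appears in the text (case-insensitive), in table order."""
    lowered = log_text.lower()
    return [hint for symptom, hint in KNOWN_SYMPTOMS.items() if symptom.lower() in lowered]
```

An empty list means none of the known symptoms matched, not that the deployment is healthy.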

NISE Data Generation

The E2E tests use NISE (koku-nise) to generate proper OCP cost data.

Basic Usage

nise report ocp --ros-ocp-info --static-report-file static.yml --write-monthly

Key Flags

--ros-ocp-info - Generate container-level data for ROS processor (REQUIRED for ROS tests)
--write-monthly - Organize output by month
--static-report-file - Use predefined workload configuration

Manifest Structure

The upload tarball must include a valid manifest.json:

{
  "uuid": "...",
  "cluster_id": "...",
  "version": "...",
  "date": "2026-01-22",
  "start": "2026-01-20",
  "end": "2026-01-22",
  "files": ["pod_usage.csv"],
  "resource_optimization_files": ["ros_usage.csv"]
}

Important Notes

files array: Pod-level data for Koku processing
resource_optimization_files array: Container-level data for ROS processor
Both start and end dates are REQUIRED for summary table population
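
The notes above can be turned into a quick pre-upload check. This sketch only verifies what is stated here (start and end dates plus the two file arrays) and makes no claim about which other fields are mandatory. `manifest_problems` is a hypothetical helper and is not part of tests/utils.py.

```python
import json


def manifest_problems(manifest_json: str, expect_ros: bool = True) -> list:
    """Return a list of human-readable problems; an empty list means the checks passed."""
    manifest = json.loads(manifest_json)
    problems = []
    # Both dates are required for summary table population.
    for key in ("start", "end"):
        if not manifest.get(key):
            problems.append(f"missing {key} date")
    # Pod-level data for Koku processing.
    if not manifest.get("files"):
        problems.append("files array is missing or empty")
    # Container-level data, only needed when the ROS processor is under test.
    if expect_ros and not manifest.get("resource_optimization_files"):
        problems.append("resource_optimization_files array is missing or empty")
    return problems
```

Set `expect_ros=False` for uploads that are not meant to exercise the ROS processor.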

Cluster Access

Required Information

To connect to an OpenShift cluster for testing/troubleshooting, you need:

Cluster API URL: e.g., api.ocp-edge94.qe.lab.redhat.com:6443
Username: e.g., kubeadmin
Password: The cluster admin password

Login Command

oc login -s <CLUSTER_API_URL> -u <USERNAME> --password <PASSWORD>

Verify Connection

oc whoami                                    # Check logged in user
oc cluster-info                              # Check cluster info
kubectl get pods -n cost-onprem              # Check deployment

Environment Variables

export NAMESPACE="cost-onprem"               # Target namespace
export KEYCLOAK_NAMESPACE="keycloak"         # Keycloak namespace
export HELM_RELEASE_NAME="cost-onprem"       # Helm release name

Quick Reference Commands

Run Tests

# CI mode (~88 tests, ~3 min)
NAMESPACE=cost-onprem ./scripts/run-pytest.sh

# Extended tests (~15 min, requires ODF)
NAMESPACE=cost-onprem ./scripts/run-pytest.sh --extended

# Specific suite
./scripts/run-pytest.sh --e2e

Check Logs

# Koku listener
kubectl logs -n cost-onprem -l app.kubernetes.io/component=listener --tail=100

# ROS processor
kubectl logs -n cost-onprem -l app.kubernetes.io/component=ros-processor --tail=100

# Kruize
kubectl logs -n cost-onprem -l app.kubernetes.io/component=ros-optimization --tail=100

Deploy Chart

# Full deployment + chart tests
./scripts/deploy-test-cost-onprem.sh --namespace cost-onprem --verbose

# Deploy only — skip chart tests
./scripts/deploy-test-cost-onprem.sh --skip-chart-tests

# Chart tests only (skip deployment)
./scripts/deploy-test-cost-onprem.sh --skip-deploy

# Dry run to preview what would execute
./scripts/deploy-test-cost-onprem.sh --dry-run --verbose

After modifying flag parsing in deploy-test-cost-onprem.sh, validate all permutations:

./scripts/qe/test-gh-workflow-locally.sh .github/workflows/validate-deploy-test-script.yml

Troubleshoot

# All pods
kubectl get pods -n cost-onprem -l app.kubernetes.io/instance=cost-onprem

# Recent events
kubectl get events -n cost-onprem --sort-by='.lastTimestamp' | tail -20

# Search for errors
kubectl logs -n cost-onprem -l app.kubernetes.io/component=ros-processor | grep -i error

Download CI Artifacts

# Download from Prow URL
./scripts/download-ci-artifacts.sh --url "<PROW_URL>"

# Download by PR and build ID
./scripts/download-ci-artifacts.sh <PR_NUMBER> <BUILD_ID>

Note: Downloaded artifacts are saved to ci-artifacts-pr<PR>-<BUILD_ID>/ and should NOT be deleted unless explicitly requested by the user.

IQE Integration Testing

IQE (Insights QE) tests provide comprehensive integration testing for cost-management functionality.

Prerequisites

Red Hat Network: Must be on VPN for repository access
Quay.io Access: For containerized tests, you need access to quay.io/cloudservices/iqe-tests
  - Requires a user file in the app-interface repo: data/teams/insights/users/<username>.yml
Local Repositories: For local tests, clone adjacent to this repo:
```
../iqe-core/
../iqe-cost-management-plugin/
```

Running IQE Tests

# IQE only — skip deploy + chart tests, boost listener CPU (recommended)
./scripts/deploy-test-cost-onprem.sh --iqe-only \
    --listener-cpu max --iqe-profile smoke

# Containerized standalone (no CPU boost)
./scripts/run-iqe-tests.sh --profile smoke

# Local (requires VPN + local repos)
./scripts/run-iqe-tests-local.sh --setup              # First time
./scripts/run-iqe-tests-local.sh --profile smoke       # Run tests

Test Profiles

Profile	Tests	Duration	Use Case
`smoke`	~43	~17 min	PR checks
`extended`	~2100	~33 min	Daily CI
`stable`	~2350	~40 min	Weekly CI
`full`	~3324	~60 min	Release validation

Tests are I/O-bound waiting for backend data processing. Use --listener-cpu max to boost the listener deployment's CPU during the run (~40-50% faster ingestion).

Known Issues

Tests are organized into skip groups with SKIP_* env vars. Key blockers:

COST-7179 — GPU/MIG schema mismatch blocks ~90 tests + cascading failures
90-day data — NISE generates ~60 days; 90-day range tests fail (~228 tests)
FLPATH-3423 — Source CRUD update returns 500 (1 test)

See docs/development/skipped-iqe-tests.md for full details on all skip groups, pytest markers, and test profiles.

See docs/development/test-impact-map.md for the automated test recommendation system that maps component changes to IQE profiles (scripts/qe/test-impact-map.yaml).

See docs/development/iqe-testing-setup.md for full setup guide.
