release-tests/.claude/commands/release/drive.md at 46662718495768b37a5befb1eb0cec3e2e0328e3 · openshift/release-tests

description
Drive OpenShift z-stream release orchestration through the complete Konflux release workflow

You are helping the user drive an OpenShift z-stream release through the complete release workflow from creation to final approval.

Purpose: This command orchestrates all release tasks for a z-stream version (e.g., 4.20.1), executing tasks sequentially and managing async operations, with the goal of reaching final QE approval (advisory status changes from QE to REL_PREP).

The user has provided a release version: {{args}}

CRITICAL: Validate Release Version

BEFORE doing anything else, you MUST validate that a release version was provided:

release_version = "{{args}}".strip()

if not release_version:
    # No release version provided - ASK THE USER
    print("Error: No release version provided.")
    print("Usage: /release:drive <release-version>")
    print("Example: /release:drive 4.20.1")
    print()
    print("Please specify which z-stream release you want to drive.")
    STOP - Do not proceed, wait for user input

IMPORTANT: Never default to a hardcoded release version like 4.20.1. Always require explicit user input.
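
The validation step above can be sketched as a small runnable helper. This is only an illustration: the function name `validate_release_version` and the stricter `X.Y.Z` format check are assumptions that go beyond the empty-string check the command requires.

```python
import re

# Assumption: a z-stream version looks like "4.20.1" (major.minor.patch, digits only).
_ZSTREAM_RE = re.compile(r"\d+\.\d+\.\d+")


def validate_release_version(raw_args: str) -> str:
    """Return the trimmed version, or raise ValueError if it is missing or malformed."""
    version = raw_args.strip()
    if not version:
        # Mirrors the command's rule: never fall back to a default version.
        raise ValueError(
            "No release version provided. Usage: /release:drive <release-version>"
        )
    if not _ZSTREAM_RE.fullmatch(version):
        raise ValueError(
            f"'{version}' does not look like a z-stream version such as 4.20.1"
        )
    return version
```

Raising instead of returning a default keeps the "always require explicit user input" rule enforceable: the caller has to stop and ask the user.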

Complete Workflow Specification

IMPORTANT: The complete, authoritative workflow specification is defined in: docs/KONFLUX_RELEASE_FLOW.md

You MUST read and follow that document for:

Task execution order and dependencies
Build promotion checkpoint logic
Test result evaluation (candidate vs promoted builds)
Gate check criteria
Async task orchestration
Error handling and retry strategies
All MCP tool usage patterns

Quick Reference

Before executing ANY tasks, you MUST:

1. Retrieve Release State from StateBox

ALWAYS start by retrieving release state (contains ALL data you need):

# Get complete release context (StateBox primary, Google Sheets fallback)
state = oar_get_release_status(release=release_version)
state_data = json.loads(state)

# StateBox contains EVERYTHING for workflow resumption:
# - metadata: advisories, jira_ticket, builds, shipment_mr, release_date
# - tasks: status, results, timestamps
# - issues: blockers, resolutions

# Extract metadata from StateBox (NO separate call to oar_get_release_metadata needed!)
metadata = state_data.get("metadata", {})
advisory_ids = metadata.get("advisory_ids", {})
jira_ticket = metadata.get("jira_ticket", "")
release_date = metadata.get("release_date", "")
candidate_builds = metadata.get("candidate_builds", {})
shipment_mr = metadata.get("shipment_mr", "")

Data Source Priority:

StateBox (Primary): Complete state with metadata, tasks with results, and blocking issues
Google Sheets (Fallback): Task status only (if StateBox doesn't exist)

IMPORTANT: StateBox should already exist (initialized by release detector when release is announced). If oar_get_release_status returns "source": "worksheet", StateBox doesn't exist and you're running in limited mode:

Limited Mode (Google Sheets fallback):

✅ Can still execute workflow using task status from Google Sheets
✅ Task status updates via oar_update_task_status work (updates Google Sheets)
❌ No access to task execution results (can't extract Jenkins build numbers)
❌ No issue tracking (can't use oar_add_issue, oar_resolve_issue)
❌ No metadata access from StateBox

How to handle limited mode:

state = oar_get_release_status(release=release_version)
state_data = json.loads(state)

if state_data.get("source") == "worksheet":
    Log: "⚠️ Running in LIMITED MODE - StateBox not found, using Google Sheets"
    Log: "Some features unavailable:"
    Log: "  - Cannot resume async tasks (no build numbers in results)"
    Log: "  - Cannot track blocking issues"
    Log: "  - Task results not available for context"
    Log: "  - No metadata in StateBox"

    # Get metadata separately (ONLY in limited mode)
    metadata_result = oar_get_release_metadata(release=release_version)
    metadata = json.loads(metadata_result)

    # Continue workflow with limitations
    # - Skip async task resumption (treat as "Not Started")
    # - Skip blocker checks (no issue tracking)
    # - Execute tasks normally, status updates still work

Key Point: When StateBox exists (normal case), you have ALL data in one call. DO NOT call oar_get_release_metadata separately - it's redundant and slower!
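
As a rough sketch of the mode detection described above, the helper below takes the JSON string returned by oar_get_release_status and reports whether the session is in limited mode. The helper name `load_release_context` and its return shape are hypothetical; only the `source` and `metadata` keys come from this document.

```python
import json


def load_release_context(raw_state: str) -> dict:
    """Split the JSON string from oar_get_release_status into mode and metadata."""
    state_data = json.loads(raw_state)
    # "source": "worksheet" means StateBox is missing (Google Sheets fallback).
    limited = state_data.get("source") == "worksheet"
    return {
        "limited_mode": limited,
        # In limited mode the metadata is not in the state and must be fetched
        # separately with oar_get_release_metadata.
        "metadata": {} if limited else state_data.get("metadata", {}),
        "state": state_data,
    }
```

A caller would issue the separate metadata request only when `limited_mode` is true, which keeps the normal path to a single call.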

2. Determine Current Phase

Based on StateBox task status, identify which phase the release is in:

Phase 1: Initialization

Take ownership
Check CVE tracker bugs
Check RHCOS security alerts (Konflux only)
Trigger push-to-cdn-staging (async)
Start candidate build analysis (parallel)

Phase 2: Waiting for Build Promotion (if build not promoted)

Check Release Controller API for promotion status
Report to user and ask them to re-invoke /release:drive later

Phase 3: Async Task Triggering and Test Evaluation (ENHANCED - if promoted)

TRIGGER async tasks immediately after promotion detected:
- image-consistency-check
- stage-testing
In parallel with async tasks:
- Wait for test result file creation
- Wait for test aggregation
- Analyze test results (if accepted == false)
- Perform gate check

Phase 4: Final Sync Point (if gate passed and async tasks complete)

Wait for all 3 async tasks to complete:
- push-to-cdn-staging (from Phase 1)
- image-consistency-check (from Phase 3)
- stage-testing (from Phase 3)

Phase 5: Final Approval (if all async tasks passed)

Run image-signed-check
Run change-advisory-status
Call the MCP tool oar_is_release_shipped to verify that all release resources are in the correct state

3. Task Execution Pattern

For EACH task you execute:

# Execute MCP tool
result = mcp_tool(release=release, ...)

# Parse stdout for status
if "status is changed to [Pass]" in result:
    Log success and proceed to next task
elif "status is changed to [Fail]" in result:
    Report failure, STOP pipeline
elif "status is changed to [In Progress]" in result:
    Report to user, ask to check back later
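
The status check above can be made concrete with a small parser. This is a sketch under assumptions: the helper name `parse_task_status` is hypothetical, and choosing the last marker when several appear is a design guess, not documented tool behavior.

```python
import re
from typing import Optional

# The three markers come from the execution pattern above.
_STATUS_RE = re.compile(r"status is changed to \[(Pass|Fail|In Progress)\]")


def parse_task_status(tool_output: str) -> Optional[str]:
    """Return 'Pass', 'Fail', 'In Progress', or None if no marker is present."""
    matches = _STATUS_RE.findall(tool_output)
    # Assumption: if several markers appear, the last one reflects the final state.
    return matches[-1] if matches else None
```

Returning `None` for output without a marker lets the caller treat it as an unexpected result rather than silently assuming success.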

4. Build Test Analysis Tasks (Special Handling)

Why These Tasks Are Different:

Tasks analyze-candidate-build and analyze-promoted-build have NO dedicated OAR commands. They require AI-driven analysis because:

Test results stored externally in GitHub (_releases/ocp-test-result-*.json on record branch)
Complex decision logic needed (BO3 verification, failure categorization, waiver assessment)
Only AI can evaluate whether accepted: false should be waived or rejected

CRITICAL: Read Full Execution Steps

For complete step-by-step logic, read docs/KONFLUX_RELEASE_FLOW.md:

Section 6: analyze-candidate-build (lines 471-548)
Section 7: analyze-promoted-build (lines 550-643)

Decision Flow:

1. Fetch test result JSON from GitHub record branch (see KONFLUX_RELEASE_FLOW.md)
2. Check "aggregated": true (tests completed)
3. Check "accepted" field:

   IF accepted == true:
       → All tests passed BO3 verification
       → oar_update_task_status(release, task_name, "Pass",
             result="All blocking tests passed BO3 verification")
       → Continue pipeline

   IF accepted == false:
       → Trigger: /ci:analyze-build-test-results {build}
       → Present AI analysis to user

       IF AI recommendation == "RECOMMEND ACCEPT":
           → Present: "AI Analysis: Failures appear waivable (flaky, infra, known issues)"
           → Present: "Details: {AI summary}"
           → Ask user: "Accept this build? (y/n)"

           IF user accepts:
               → oar_update_task_status(release, task_name, "Pass",
                     result="Waivable failures accepted: {AI summary}")
               → Continue pipeline

           IF user rejects:
               → oar_add_issue(release,
                     issue="Test failures rejected by release lead: {summary}",
                     blocker=True,
                     related_tasks=[task_name])
               → oar_update_task_status(release, task_name, "Fail",
                     result="Rejected by release lead: {AI summary}")
               → STOP pipeline

       IF AI recommendation == "RECOMMEND REJECT":
           → Present: "⚠️ AI Analysis: Critical blockers detected"
           → Present: "Details: {AI summary}"
           → Ask user: "Override AI recommendation and accept anyway? (y/n)"

           IF user overrides (accepts):
               → Ask user: "Please provide justification for override:"
               → User provides: {justification}
               → oar_update_task_status(release, task_name, "Pass",
                     result="OVERRIDE: {justification}\n\nAI Analysis: {AI summary}")
               → Continue pipeline

           IF user confirms rejection:
               → oar_add_issue(release,
                     issue="Release blocker: {AI summary}",
                     blocker=True,
                     related_tasks=[task_name])
               → oar_update_task_status(release, task_name, "Fail",
                     result="Release blocker confirmed: {AI summary}")
               → STOP pipeline

Evaluation Criteria (from /ci:analyze-build-test-results):

CAN WAIVE if: ✅ Flaky tests, ✅ Infrastructure issues, ✅ Test automation bugs, ✅ Known OCPBUGS, ✅ Platform-specific non-critical

CANNOT WAIVE if: ❌ Product bugs, ❌ Cross-platform failures, ❌ Critical features affected, ❌ New unknown failures

Important:

AI recommendation is advisory only - release lead makes final decision
Release lead may have additional context not visible to AI
Only create issues for actual blockers (user confirms rejection)
Store all analysis + decisions in task.result for audit trail
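
The decision flow above reduces to a small table, sketched here as a pure function. The function name `decide_build_outcome` and the returned dictionary shape are hypothetical; the statuses, recommendation strings, and blocker rule follow the flow as written.

```python
from typing import Optional


def decide_build_outcome(
    accepted: bool,
    ai_recommendation: Optional[str] = None,
    user_accepts: Optional[bool] = None,
    justification: str = "",
) -> dict:
    """Map the decision-flow inputs to a task status and a blocker flag."""
    if accepted:
        return {"status": "Pass", "add_blocker": False,
                "result": "All blocking tests passed BO3 verification"}

    if ai_recommendation not in ("RECOMMEND ACCEPT", "RECOMMEND REJECT"):
        raise ValueError("AI analysis is required when accepted is false")
    if user_accepts is None:
        # The AI recommendation is advisory; the release lead always decides.
        raise ValueError("The release lead must make the final decision")

    if ai_recommendation == "RECOMMEND ACCEPT":
        if user_accepts:
            return {"status": "Pass", "add_blocker": False,
                    "result": "Waivable failures accepted"}
        return {"status": "Fail", "add_blocker": True,
                "result": "Rejected by release lead"}

    # RECOMMEND REJECT: accepting is an override and needs a justification.
    if user_accepts:
        if not justification.strip():
            raise ValueError("An override needs a justification")
        return {"status": "Pass", "add_blocker": False,
                "result": f"OVERRIDE: {justification.strip()}"}
    return {"status": "Fail", "add_blocker": True,
            "result": "Release blocker confirmed"}
```

A blocker is flagged only on the two rejection paths, matching the rule that issues are created only when the user confirms rejection.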

StateBox Integration:

Task status + result: oar_update_task_status(release, task_name, status, result)
Blocking issues: oar_add_issue(blocker=True) only when build rejected

Async Task Monitoring:

Re-execute the same MCP tool to check status
Example: oar_image_consistency_check(release, job_id="uuid") to check progress

Key Decision Points

Build Promotion Checkpoint

IMPORTANT: The WebFetch tool does not work with the OpenShift Release Dashboard API. Use the Bash tool with a curl command instead.

Check promotion status:

curl -s "https://amd64.ocp.releases.ci.openshift.org/api/v1/releasestream/4-stable/release/{release}" | jq -r '.phase'

Expected output:

"Accepted" → Build is promoted, proceed to Phase 3 (trigger async tasks)
Other values (e.g., "Pending", "Rejected") → Build not yet promoted

Decision logic:

IF phase != "Accepted":
    Report: "Build not yet promoted (current phase: {phase}), check again in 30 minutes"
    Ask user to re-invoke /release:drive later
    RETURN

IF phase == "Accepted":
    Report: "Build promoted successfully (phase: Accepted)"
    Proceed to Phase 3: Trigger async tasks (image-consistency-check, stage-testing)
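
For illustration, the checkpoint logic can be written as a function over the JSON text that curl returns, without making any network call itself. The name `promotion_decision` and the "Unknown" fallback for a missing `phase` field are assumptions.

```python
import json


def promotion_decision(api_response: str) -> dict:
    """Interpret the release JSON fetched with curl from the promotion endpoint."""
    # Assumption: a missing or empty phase is reported as "Unknown".
    phase = json.loads(api_response).get("phase") or "Unknown"
    if phase == "Accepted":
        return {
            "promoted": True,
            "phase": phase,
            "next": "Proceed to Phase 3: trigger image-consistency-check and stage-testing",
        }
    return {
        "promoted": False,
        "phase": phase,
        "next": "Ask the user to re-invoke /release:drive later",
    }
```

Only the exact value "Accepted" counts as promoted; every other phase, including an absent one, falls through to the wait-and-retry path.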

Gate Check (Critical - ENHANCED)

# ENHANCED: Async tasks already triggered after build promotion
# Gate check now happens in parallel with async task execution

# After promoted build test analysis completes
if promoted_build_analysis == "Pass":
    # Async tasks already running - proceed to wait for completion
    Wait for all 3 async tasks: push-to-cdn-staging, image-consistency-check, stage-testing
else:
    Report blocking failures
    Update overall status to "Red"
    Note: Async tasks may still be running but pipeline cannot proceed
    STOP pipeline

User Communication

When tasks are running:

Tell user which tasks completed successfully
Tell user which tasks are in progress
Tell user estimated time to check back

When waiting for external events:

Clearly explain what we're waiting for (build promotion, test aggregation, etc.)
Provide estimated wait time
Ask user to re-invoke /release:drive {release} later

When errors occur:

Report specific error details
Indicate whether it's transient (retry) or permanent (manual intervention)
Provide next steps for resolution

Example Invocation

/release:drive 4.20.1

AI will:

Check current state via oar_get_release_status
Determine which phase we're in
Execute next pending tasks
Report progress and next steps to user

Important Notes

Read the full spec: All detailed logic is in docs/KONFLUX_RELEASE_FLOW.md
Don't assume: Always check actual task status before executing
Be transparent: Tell user exactly what you're doing and why
Handle failures gracefully: Provide clear error messages and recovery steps
Respect async operations: Don't block on long-running tasks, tell user to check back

Error Recovery

If you encounter errors:

Check docs/KONFLUX_RELEASE_FLOW.md Troubleshooting Guide
Report error details to user
Suggest manual intervention steps if needed
Don't retry destructive operations without user confirmation

StateBox Workflow Resumption

IMPORTANT: StateBox provides AI-driven workflow resumption across multiple sessions.

State Retrieval

Always retrieve StateBox state at the start:

# Get complete release state (metadata, tasks with results, issues)
state = oar_get_release_status(release=release)

StateBox state structure:

{
  "release": "4.20.1",
  "created_at": "2025-01-15T10:30:00Z",
  "updated_at": "2025-01-15T14:45:00Z",
  "metadata": {
    "jira_ticket": "ART-12345",
    "advisory_ids": {"rpm": 12345, "rhcos": 12346},
    "release_date": "2025-Nov-04",
    "candidate_builds": {"x86_64": "4.20.0-0.nightly-..."},
    "shipment_mr": "https://gitlab.com/..."
  },
  "tasks": [
    {
      "name": "take-ownership",
      "status": "Pass",
      "started_at": "2025-01-15T10:35:00Z",
      "completed_at": "2025-01-15T10:36:00Z",
      "result": "Ownership assigned to rioliu@redhat.com..."
    },
    {
      "name": "image-consistency-check",
      "status": "In Progress",
      "started_at": "2025-01-15T14:00:00Z",
      "completed_at": null,
      "result": "Prow job triggered..."
    }
  ],
  "issues": [
    {
      "issue": "CVE-2024-12345 not covered in advisory",
      "blocker": true,
      "related_tasks": ["check-cve-tracker-bug"],
      "reported_at": "2025-01-15T12:00:00Z",
      "resolved": false,
      "resolution": null
    }
  ]
}

Task Resumption Logic

For EACH task in the workflow, apply this decision tree:

# Parse state to get task information
state_data = json.loads(state)

# Handle limited mode (Google Sheets fallback)
if state_data.get("source") == "worksheet":
    # Limited mode - only have task status, no results or issues
    tasks = state_data.get("tasks", {})
    task_status = tasks.get(task_name, "Not Started")

    if task_status == "Pass":
        Log: f"✓ {task_name} already completed"
        Continue to next task
    elif task_status == "In Progress":
        # No build numbers available - treat as interrupted
        Log: f"⚠ {task_name} was interrupted (limited mode), retrying..."
        Execute task_name
    elif task_status == "Fail":
        # No blocker tracking - ask user
        Log: f"✗ {task_name} previously failed"
        Ask user: "Retry failed task? (y/n)"
        if yes: Execute task_name
        else: STOP
    else:
        # Not started - execute normally
        Execute task_name

    RETURN

# Full mode (StateBox) - complete decision tree
tasks = {t["name"]: t for t in state_data["tasks"]}
task = tasks.get(task_name)

if not task or task["status"] == "Not Started":
    # Check for general blockers before starting
    issues = [i for i in state_data.get("issues", [])
              if i.get("blocker") and not i.get("resolved")
              and not i.get("related_tasks")]

    if issues:
        Log: "✗ Release blocked by general issues:"
        for issue in issues:
            Log: f"  - {issue['issue']}"
        Ask user to resolve blockers
        STOP pipeline

    # Execute task normally
    Execute task_name

elif task["status"] == "Pass":
    # Skip completed tasks
    Log: f"✓ {task_name} already completed at {task['completed_at']}"
    Continue to next task

elif task["status"] == "In Progress":
    # Check if async task (Prow/Jenkins jobs)
    if task_name in ["image-consistency-check", "stage-testing"]:
        # Extract job ID from task result
        # image-consistency-check uses Prow job ID, stage-testing uses Jenkins build number
        job_id = extract_from_result(task["result"], r"job ID: (\S+)") or extract_from_result(task["result"], r"Build number: (\d+)")

        if not job_id:
            Log: f"⚠ {task_name} in progress but no job ID found, retrying..."
            Execute task_name
        else:
            # Query job status (Prow or Jenkins depending on task)
            result = execute_mcp_tool(task_name, job_id=job_id)

            if "status is changed to [Pass]" in result:
                Log: f"✓ {task_name} completed successfully"
                Continue to next task
            elif "status is changed to [Fail]" in result:
                Log: f"✗ {task_name} failed"
                STOP pipeline
            else:
                Log: f"⏳ {task_name} still running (job {job_id})"
                Ask user to check back later
                RETURN
    else:
        # Non-async task stuck in progress - retry
        Log: f"⚠ {task_name} was interrupted, retrying..."
        Execute task_name

elif task["status"] == "Fail":
    # Check if task has unresolved blocker
    task_issues = [i for i in state_data.get("issues", [])
                   if i.get("blocker") and not i.get("resolved")
                   and task_name in i.get("related_tasks", [])]

    if task_issues:
        Log: f"✗ {task_name} blocked by:"
        for issue in task_issues:
            Log: f"  - {issue['issue']}"
        Ask user to resolve blockers and re-run /release:drive
        STOP pipeline
    else:
        # Blocker resolved or no blocker - retry task
        Log: f"↻ Retrying {task_name} (previous failure)..."
        Execute task_name
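
The decision tree above calls `extract_from_result` without defining it. One possible sketch follows, using the two patterns from the tree; the wrapper `extract_job_id` and the sample result strings in the usage note are illustrative assumptions, not real tool output.

```python
import re
from typing import Optional


def extract_from_result(result: Optional[str], pattern: str) -> Optional[str]:
    """Return the first capture group of pattern in result, or None."""
    if not result:
        return None
    match = re.search(pattern, result)
    return match.group(1) if match else None


def extract_job_id(result: Optional[str]) -> Optional[str]:
    """Try the Prow job ID pattern first, then the Jenkins build number pattern."""
    return (
        extract_from_result(result, r"job ID: (\S+)")
        or extract_from_result(result, r"Build number: (\d+)")
    )
```

For example, `extract_job_id("Prow job triggered, job ID: abc-123-def")` yields `"abc-123-def"`, while a result with neither pattern yields `None`, which the tree treats as "no job ID found, retry".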

Async Task Monitoring

For long-running async tasks:

# Initial trigger (when task doesn't exist or has no job ID)
result = oar_image_consistency_check(release=release)

if "Prow job" in result:
    job_id = extract_job_id(result)
    Log: f"⏳ Prow job {job_id} triggered"
    Log: f"Check back in 20-30 minutes with: /release:drive {release}"
    RETURN

# Status check on resume (when task has job ID in result)
result = oar_image_consistency_check(release=release, job_id=job_id)

if "status is changed to [Pass]" in result:
    Log: f"✓ Job {job_id} completed successfully"
    Continue to next task
elif "status is changed to [Fail]" in result:
    # Add issue to StateBox
    oar_add_issue(
        release=release,
        issue=f"image-consistency-check Prow job {job_id} failed: {extract_failure_reason(result)}",
        blocker=True,
        related_tasks=["image-consistency-check"]
    )
    Log: "✗ Job failed, blocker added to StateBox"
    STOP pipeline
else:
    Log: f"⏳ Job {job_id} still running..."
    RETURN

Issue Tracking Integration

Adding blocking issues:

# When you encounter a blocking problem during execution
oar_add_issue(
    release=release,
    issue="CVE-2024-12345 not covered in advisory",
    blocker=True,
    related_tasks=["check-cve-tracker-bug"]
)

Resolving issues (typically done by user manually):

# User fixes the problem, then resolves via MCP tool
oar_resolve_issue(
    release=release,
    issue="CVE-2024-12345",  # Supports partial/fuzzy matching
    resolution="Added CVE to advisory #12345, ART confirmed coverage"
)

# Next /release:drive invocation will retry the task

Checking for blockers before starting workflow:

state = oar_get_release_status(release=release)
state_data = json.loads(state)  # the tool returns a JSON string

# Check for unresolved blocking issues
blockers = [i for i in state_data.get("issues", [])
            if i.get("blocker") and not i.get("resolved")]

if blockers:
    Log: "⚠ Found unresolved blocking issues:"
    for issue in blockers:
        related = issue.get("related_tasks", [])
        if related:
            Log: f"  - {issue['issue']} (affects: {', '.join(related)})"
        else:
            Log: f"  - {issue['issue']} (GENERAL BLOCKER - affects entire release)"

    Ask user: "Some tasks are blocked. Continue anyway? (y/n)"
    if user says no:
        STOP
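
The blocker check can be sketched as a helper that separates general blockers from task-specific ones, using the issue fields from the StateBox state structure. The function name `split_blockers` and its return shape are hypothetical.

```python
from typing import Dict, List, Tuple


def split_blockers(issues: List[dict]) -> Tuple[List[dict], Dict[str, List[dict]]]:
    """Return (general blockers, task name -> blockers) for unresolved blocking issues."""
    unresolved = [i for i in issues if i.get("blocker") and not i.get("resolved")]
    # An issue with no related tasks blocks the entire release.
    general = [i for i in unresolved if not i.get("related_tasks")]
    by_task: Dict[str, List[dict]] = {}
    for issue in unresolved:
        for task_name in issue.get("related_tasks") or []:
            by_task.setdefault(task_name, []).append(issue)
    return general, by_task
```

Resolved issues and non-blocking issues are dropped up front, so both outputs contain only items that should stop or delay the pipeline.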

Workflow Phase Detection

Use StateBox task statuses to determine current phase:

state = oar_get_release_status(release=release)
state_data = json.loads(state)  # the tool returns a JSON string
tasks = {t["name"]: t["status"] for t in state_data["tasks"]}

# Determine phase based on task completion
if tasks.get("take-ownership") != "Pass":
    phase = "PHASE 1: Initialization"
    next_steps = ["take-ownership", "check-cve-tracker-bug", ...]

elif not is_build_promoted(release):
    phase = "PHASE 2: Waiting for Build Promotion"
    next_steps = ["Check Release Controller API in 30 min"]

elif tasks.get("analyze-promoted-build") != "Pass":
    phase = "PHASE 3: Test Evaluation & Async Task Triggering"
    next_steps = ["Trigger async tasks", "Analyze test results"]

elif not all_async_tasks_pass(tasks):
    phase = "PHASE 4: Waiting for Async Tasks"
    pending = [t for t in ["push-to-cdn-staging", "image-consistency-check", "stage-testing"]
               if tasks.get(t) != "Pass"]
    next_steps = [f"Wait for {', '.join(pending)}"]

else:
    phase = "PHASE 5: Final Approval"
    next_steps = ["image-signed-check", "change-advisory-status"]

Log: f"Current Phase: {phase}"
Log: f"Next Steps: {next_steps}"
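
A runnable version of the phase detection above might look like the following. It assumes the promotion state has already been determined elsewhere and is passed in as a boolean; the function name `detect_phase` is hypothetical.

```python
from typing import Dict

# The three async tasks that must pass before final approval.
ASYNC_TASKS = ("push-to-cdn-staging", "image-consistency-check", "stage-testing")


def detect_phase(tasks: Dict[str, str], build_promoted: bool) -> str:
    """Return the current phase label from a task-name -> status mapping."""
    if tasks.get("take-ownership") != "Pass":
        return "PHASE 1: Initialization"
    if not build_promoted:
        return "PHASE 2: Waiting for Build Promotion"
    if tasks.get("analyze-promoted-build") != "Pass":
        return "PHASE 3: Test Evaluation & Async Task Triggering"
    if any(tasks.get(name) != "Pass" for name in ASYNC_TASKS):
        return "PHASE 4: Waiting for Async Tasks"
    return "PHASE 5: Final Approval"
```

The checks run in workflow order, so the first unmet condition determines the phase even when later tasks already carry a status.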

Multi-Session Resumption Example

Session 1 (interrupted after triggering async tasks):

User: /release:drive 4.20.1
AI: Loading StateBox state for 4.20.1...
AI: Current Phase: PHASE 1 - Initialization
AI: ✓ take-ownership completed
AI: ✓ check-cve-tracker-bug completed
AI: ⏳ push-to-cdn-staging triggered (job #456)
AI: Build not yet promoted, check back in 30 minutes

Session 2 (hours later, build promoted):

User: /release:drive 4.20.1
AI: Loading StateBox state for 4.20.1...
AI: Resuming from PHASE 2...
AI: ✓ Skipping 2 completed tasks (take-ownership, check-cve-tracker-bug)
AI: ⏳ push-to-cdn-staging still running (job #456)
AI: ✓ Build promoted! Phase: PHASE 3 - Test Evaluation
AI: ⏳ image-consistency-check triggered (Prow job abc-123-def)
AI: ⏳ stage-testing triggered (Jenkins job #790)
AI: Waiting for test results, check back in 1 hour

Session 3 (after async tasks complete):

User: /release:drive 4.20.1
AI: Loading StateBox state for 4.20.1...
AI: Resuming from PHASE 4...
AI: ✓ Skipping 4 completed tasks
AI: ✓ push-to-cdn-staging completed (job #456)
AI: ✓ image-consistency-check completed (Prow job abc-123-def)
AI: ✓ stage-testing completed (Jenkins job #790)
AI: Analyzing promoted build test results...
AI: ✓ All tests passed, proceeding to PHASE 5
AI: ✓ image-signed-check completed
AI: ✓ change-advisory-status completed
AI: 🎉 Release 4.20.1 approved!

Error Recovery

When task fails with error:

try:
    result = execute_mcp_tool(task_name, release=release)

    if "status is changed to [Fail]" in result:
        # Task failed - determine if blocker should be added
        if is_permanent_failure(result):
            # Add blocking issue
            oar_add_issue(
                release=release,
                issue=f"{task_name} failed: {extract_error(result)}",
                blocker=True,
                related_tasks=[task_name]
            )
            Log: f"✗ {task_name} failed, blocker added"
            Log: "Please investigate and resolve, then re-run /release:drive"
            STOP
        else:
            # Transient failure - will retry on next invocation
            Log: f"⚠ {task_name} failed (transient), will retry on next invocation"
            RETURN

except Exception as e:
    # Unexpected error - add general blocker
    Log: f"✗ Unexpected error in {task_name}: {e}"
    oar_add_issue(
        release=release,
        issue=f"Unexpected error in {task_name}: {str(e)}",
        blocker=True,
        related_tasks=[task_name]
    )
    STOP

Key Principles

Idempotency: Re-running /release:drive multiple times is safe
- Completed tasks (Pass) are skipped
- In-progress async tasks are checked, not re-triggered
- Failed tasks are retried only after blockers resolved
StateBox vs Google Sheets:
- StateBox: Primary source of truth for AI
  - Complete state (tasks + results + issues)
  - AI-readable task results for context
  - Issue tracking for blockers
- Google Sheets: Still updated for backwards compatibility
  - Task status only (Pass/Fail/In Progress)
  - Human-readable format for manual review
State Priority:
- Always check StateBox first (should exist for all active releases)
- Fall back to Google Sheets only if StateBox doesn't exist (abnormal)
Session Independence:
- Never rely on previous conversation context
- Always load StateBox state at session start
- StateBox persists across days, weeks, or machine restarts

Remember: StateBox enables true workflow resumption. Always check state first, respect task statuses, and track blockers properly.
