[CP 1300] Add Claude Code knowledge base and component-specific pytest skills by ci-penbot-01 · Pull Request #518 · ROCm/gpu-operator

ci-penbot-01 · 2026-04-10T08:27:30Z

cp of pensando/gpu-operator#1300

Source PR Description (pensando/gpu-operator#1300):

Summary

Adds comprehensive Claude Code AI assistance infrastructure for GPU operator testing, including knowledge base organization and component-specific pytest development skills.

Key Features

1. Knowledge Base Structure (`.claude/knowledge/`)

Product Knowledge: GPU operator architecture, component behavior, deployment patterns
Testing Knowledge: Component-specific test patterns and debugging guides
PRDs: Product requirements documents for feature development

2. Component-Specific pytest Skills (`.claude/skills/`)

Created 6 specialized skills for implementing testcases from approved test plans:

pytest-dcm-dev: Device Config Manager partition testing (SPX/DPX/QPX/CPX profiles)
pytest-dme-dev: Device Metrics Exporter validation (AMD-SMI, Prometheus endpoints)
pytest-npd-dev: Node Problem Detector integration testing
pytest-anr-dev: Auto Node Remediation workflows (Argo-based remediation)
pytest-driver-dev: ROCm/AMDGPU driver version management and upgrades
pytest-upgrade-dev: GPU operator and operand upgrade testing (base → RC)

3. Test Development Workflow

PRD → /test-plan-dev → Approval → /pytest-{component}-dev → Implementation

Each component skill includes:

DeviceConfig CR patterns
Test structure and key functions
Debugging common issues
Platform differences (K8s vs OpenShift)
Knowledge base references

Testing

Skills tested with existing pytest infrastructure under tests/pytests/k8/
Knowledge base validated against current test implementations
Workflow tested with metrics-exporter and DCM partition features

Documentation

All skills include usage examples and quick reference sections
.claude/skills/README.md documents component-to-skill mapping
.claude/knowledge/MIGRATION.md tracks organization history

🤖 Generated with Claude Code

Cherrypick triggered by: ACP-Automation

…#1300) * Add KB structure and pytest-dev skill Create knowledge base structure for GPU operator: - kb_source/common/skills/ - Common skills for all agents - Added pytest-dev.md - Specialized pytest development agent - Added README.md documenting skills and structure The pytest-dev skill provides: - Test writing following project patterns - Debugging from CI job logs - Test infrastructure navigation - Cross-platform testing (K8s + OpenShift) - Supports gpu-operator, exporter, gpu-dra, and amd-metrics-exporter Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Enhance pytest-dev skill with component-specific sections Add detailed component-specific testing patterns for: - GPU Operator Helm Chart - Overall deployment and lifecycle - Config Manager - GPU partitioning (SPX, DPX, QPX, CPX, NPS) - Metrics Exporter - Prometheus integration and metric validation - Device Plugin - GPU resource allocation to pods - Node Problem Detector (NPD) - Health monitoring and conditions - Node Remediation (ANR) - Argo Workflows for auto-remediation - Test Runner - RVS validation suite execution - DRA Driver - Dynamic Resource Allocation - Node Labeller - Automatic GPU-based node labeling - Standalone Tests - Debian and Docker deployments Each section includes: - File locations and key utilities - Component purpose and features tested - Key concepts and patterns - Common test structure examples - Integration points with other components This makes the skill more actionable for component-specific development and debugging. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Restructure testing skills: separate test planning from implementation Split pytest development into two-stage workflow: 1. **test-plan-dev** skill - PRD analysis and test plan generation - Analyzes PRDs and extracts testable requirements - Creates structured test plans with scenario mapping - Defines test coverage matrix and priorities - Generates reviewable documents for stakeholder approval - Output: Test plan (NO code implementation) 2. **pytest-dev** skill - Testcase implementation - Implements actual pytest code from approved test plans - Writes test functions, fixtures, and utilities - Debugs test failures and maintains test infrastructure - Updated to clarify it works AFTER plan approval Benefits: - Clear separation of concerns (planning vs implementation) - Stakeholder review before coding begins - Better alignment with development workflow - Prevents wasted effort on unapproved test strategies Workflow: PRD → /test-plan-dev → review/approve → /pytest-dev → implement Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Add test plan for ECC deferred error metrics feature Generated comprehensive test plan from PRD-GPU-20260406-01 using test-plan-dev skill approach. Test Plan Highlights: - 20 test scenarios covering all PRD requirements - Requirements traceability matrix (16 testable requirements) - Test priorities: 8 P0 (blocker), 8 P1 (critical), 4 P2 (normal) - Platform coverage: MI2xx, MI3xx, baremetal, SR-IOV - Test coverage matrix by component, platform, deployment Key Test Areas: - Functional: All 19 metrics collection and naming validation - Accuracy: Exact match validation vs AMD-SMI (zero tolerance) - Configuration: Enable/disable and selective field filtering - Platform: MI210, MI250X, MI300X, MI325X support - Error Injection: metricsclient-based validation - Multi-GPU: Independent metrics per GPU/partition - Negative: Unsupported platforms, invalid configs Test Execution Strategy: - P0 tests (8 scenarios) in pre-merge CI - 10-15 min - Full suite (20 scenarios) in nightly regression - 30-45 min - Automated pytest tests in gpu-operator test infrastructure Next Steps: 1. Stakeholder review and approval of this test plan 2. Use /pytest-dev skill to implement approved test scenarios 3. Execute tests and validate feature Status: Draft - Pending Review Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Generate fresh test plan for ECC deferred error metrics Created comprehensive test plan from PRD-GPU-20260406-01: - 20 test scenarios covering all PRD requirements - 17 requirements mapped via traceability matrix - 3 priority levels: 13 P0, 6 P1, 1 P2 - 100% requirements coverage target Test areas: - Functional validation (metrics export, naming, labels) - Accuracy validation (AMD-SMI ground truth) - Configuration testing (enable/disable, auto-reload) - Platform testing (MI2xx, MI3xx, SR-IOV, partitions) - Multi-GPU and partitioned GPU scenarios - Static metric behavior and error injection - Health service integration (negative test) - Documentation validation Key features: - All tests designed for automation (7-9 hours execution) - References to kb_source/exporter/gpu-metrics-details.md - Partition 0 limitation properly addressed - metricsclient for safe ECC error injection Ready for stakeholder review and approval. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Add knowledge base for device-metrics-exporter testing Created two knowledge base documents to support test plan development: 1. device-metrics-exporter.md - Testing infrastructure - Metrics endpoint ports: K8s ClusterIP (5000), NodePort (32500), Bare metal (2112) - Test workflow for metric validation - AMD-SMI ground truth: Use 'amd-smi metric --json' (not '-e') - Test architecture: gpu-agent and amd-smi are INSIDE exporter pod - collect_metrics_samples() function usage from metric_util.py - Validation workflow: exec into pod for AMD-SMI, HTTP GET for exporter metrics 2. platform-support.md - Platform coverage requirements - Supported GPU models: MI355X, MI350X, MI325X, MI300X, MI250, MI210 - Kubernetes versions: 1.29 - 1.35 - OpenShift versions: 4.20, 4.21 (docs show 4.16-4.20) - OS support: Ubuntu 22.04/24.04 LTS, Debian 12, RHCOS - Test coverage matrix for test plan creation - Reference: https://instinct.docs.amd.com/projects/gpu-operator/en/latest/ These knowledge base files will be used by test-plan-dev and pytest-dev skills to ensure accurate test plan generation and pytest implementation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Update test plan with correct testing infrastructure and platform support Updated PRD-GPU-20260406-01-TEST-PLAN.md to align with actual testing practices: **Port Corrections**: - Changed localhost:2112 → <node-ip>:32500 (Kubernetes NodePort) - Added note about ClusterIP port 5000 for kubectl port-forward - Updated all curl commands to use correct K8s ports **AMD-SMI Command Corrections**: - Changed 'amd-smi metric -e' → 'amd-smi metric --json' - Added kubectl exec commands to run AMD-SMI inside exporter pod - Updated data flow to show AMD-SMI runs inside pod (not on host) **Test Architecture Updates**: - Documented that gpu-agent and amd-smi are INSIDE exporter pod - Added reference to collect_metrics_samples() in metric_util.py - Updated test workflow to use kubectl exec for AMD-SMI commands - Added network access requirements (kubectl/oc, K8s API server) **Platform Support Updates**: - Added all 6 GPU models: MI355X, MI350X, MI325X, MI300X, MI250/MI250X, MI210 - Updated Kubernetes support: 1.29-1.35 (was 1.28+) - Updated OpenShift support: 4.20, 4.21 (was 4.14+) - Added platform coverage matrix with all GPU generations - Added Kubernetes/OpenShift version coverage table - Reference: https://instinct.docs.amd.com/projects/gpu-operator/en/latest/ **Knowledge Base References**: - Added references to kb_source/common/device-metrics-exporter.md - Added references to kb_source/common/platform-support.md - Referenced collect_metrics_samples() function pattern These updates ensure the test plan accurately reflects: - Kubernetes deployment architecture (not standalone/bare metal) - Actual AMD-SMI usage patterns from existing tests - Complete platform support matrix for GPU operator Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Add AMD-SMI JSON structure and metrics mapping to knowledge base Updated knowledge base and skills with AMD-SMI ground truth validation: **device-metrics-exporter.md**: - Added AMD-SMI JSON output structure and hierarchy - ECC deferred error metrics locations in JSON: * Total: gpu_data[N].ecc.total_deferred_count * Per-block: gpu_data[N].ecc_blocks.<BLOCK>.deferred_count - Documented source of truth files location: * Sample: /home/srivatsa/jobd-logs/<job-id>/logs/idle_<GPU>_smi_metrics_*.json * Reference: idle_MI210_smi_metrics_4.json (lines 287-325) - Added metrics-support.json mapping structure: * Location: tests/pytests/lib/files/metrics-support.json * Maps: exporter metric → AMD-SMI JSON path + GPU Agent proto field * Example: GPU_ECC_DEFERRED_UMC → ecc_blocks.UMC.deferred_count - Documented usage in tests for metric validation **test-plan-dev.md**: - Added AMD-SMI Integration section - Documented AMD-SMI JSON output structure for reference - Added metrics-support.json mapping explanation - Defined test validation pattern: 1. Exec into exporter pod for AMD-SMI JSON 2. Parse JSON value from documented path 3. Query exporter metrics endpoint 4. Compare values (must match exactly) - Reference to device-metrics-exporter.md for detailed structure **pytest-dev.md**: - Added Metrics Testing Patterns section - Documented collect_metrics_samples() function usage - Explained metrics-support.json structure and purpose - When adding new metrics checklist: 1. Add entry to metrics-support.json 2. Specify AMD-SMI JSON path 3. Specify GPU Agent proto field 4. List supported GPU models 5. Set skip-validation flag - Added metric validation workflow pattern - Updated task approach to include metrics mapping check These updates ensure test plans and pytest implementation use correct: - AMD-SMI command (metric --json not -e) - JSON path for extracting values - Metrics mapping file for validation - Ground truth sample data from job logs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * docs: reorganize knowledge base to separate product and testing knowledge Restructured kb_source/ to clearly separate three types of knowledge: 1. Product knowledge (how features work) - products/ 2. Testing knowledge (how to test features) - testing/ 3. Skills (complete workflows) - skills/ New Structure: - products/ # Product behavior and architecture (placeholders) - dcm/ # Device Config Manager product knowledge - npd/ # Node Problem Detector product knowledge - testing/ # Test patterns and debugging - dcm/ # DCM test KB (4 entries migrated) - npd/ # NPD test KB (to be added) - common/ # Generic patterns (to be added) - skills/ # Complete workflows - pytest-dcm-dev.md # DCM development skill - pytest-npd-dev.md # NPD development skill Migrated Content: - common/kb/dcm/* → testing/dcm/ - cleanup-on-failure.md - driver-reload-timing.md - partition-profile-files.md - verify-label-multi-node.md - common/skills/* → skills/ - pytest-dcm-dev.md - pytest-npd-dev.md Documentation Added: - README.md - Main overview with navigation guide - MIGRATION.md - What changed and how to use new structure - products/README.md - Product knowledge overview (placeholder) - testing/README.md - Testing knowledge overview - Feature-specific READMEs in subdirectories Benefits: - Clear separation of concerns (product vs test vs workflow) - Easy to find relevant information by type - Scalable for new features (DRA, metrics-exporter, etc.) - Cross-referenced for easy navigation Cherry-picked from ca3c5239f01e4f24f5410733dc512b26c069348c Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * refactor: reorganize kb_source to .claude for standard Claude Code structure Moved all Claude Code configuration from kb_source/ to .claude/ following standard conventions for project-specific Claude Code organization. New Structure: - .claude/skills/ # Executable workflows (invoked with /skill-name) - pytest-dev.md # Generic pytest development - pytest-dcm-dev.md # DCM-specific testing - pytest-npd-dev.md # NPD integration testing - test-plan-dev.md # Test plan generation from PRDs - README.md # Skills documentation - .claude/knowledge/ # Reference documentation (not executable) - products/ # Product behavior and architecture - testing/ # Test patterns and debugging guides - prds/ # Product requirements documents - device-metrics-exporter.md - platform-support.md - README.md # Knowledge base overview - MIGRATION.md # Migration history - .claude/agents/ # Custom agent definitions (placeholder) Benefits: - Standard .claude/ location aligns with global ~/.claude/ structure - Clear separation: executable skills vs reference knowledge - Easier discovery and integration with Claude Code - Follows established conventions for project configuration All content from kb_source/ has been preserved and reorganized. Skills are now invocable with /skill-name syntax. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com> * feat: add component-specific pytest skills for DME and ANR Created specialized pytest development skills for all GPU Operator components: New Skills: - pytest-dme-dev.md (7.7K) - Device Metrics Exporter testing * Metrics collection and validation against AMD-SMI * Endpoint connectivity (ClusterIP vs NodePort) * Prometheus format validation * Test files: test_metrics_exporter.py, test_metrics_values.py - pytest-anr-dev.md (12K) - Auto Node Remediation testing * Argo Workflow execution and monitoring * NHC (Node Health Check) integration * Custom ConfigMap templates * K8s-only (not supported on OpenShift) * Test files: test_node_remediation.py, test_anr_deployment.py Complete Component Coverage: - DCM (Device Config Manager) → pytest-dcm-dev - DME (Device Metrics Exporter) → pytest-dme-dev - NPD (Node Problem Detector) → pytest-npd-dev - ANR (Auto Node Remediation) → pytest-anr-dev - Generic / Multi-component → pytest-dev Updated README.md: - Component-to-skill mapping table - Enhanced workflow with component selection - Test file locations for each operand - Full workflow example Skills are designed to work with test plans generated by /test-plan-dev. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com> * feat: add ROCm/AMDGPU driver version management skill Created pytest-driver-dev skill for ROCm/AMDGPU driver testing and version management: Key Features: - Driver deployment testing (DeviceConfig vs Inbox modes) - Driver version upgrades and downgrades - KMM (Kernel Module Management) workflow debugging - Driver blacklist configuration validation - Driver-deviceplugin integration testing Driver Version Management: - Driver spec files in tests/pytests/lib/files/ - Symlink: amd-deviceconfig-default-driver-spec.json → current default - Update process documented for new ROCm releases - ROCm-to-driver version mapping table Test Coverage: - test_driver_deviceplugin.py (primary test file) - test_node_driver_version() - Version validation - test_driver_upgrade_cycle() - Upgrade testing - test_driver_blacklist_* - Blacklist configuration - test_upgrade_driver_using_label() - Node-specific upgrades Important for Maintenance: - Update symlink when new ROCm version is released - Create new driver spec file for each ROCm version - Keep alternative-versions list updated - Document ROCm-to-driver version mapping Updated README.md: - Added driver skill to component mapping - Documented driver spec file management Complete skill coverage: DCM, DME, NPD, ANR, Driver + Generic Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com> * feat: add GPU operator and operand upgrade testing skill Created pytest-upgrade-dev skill for GPU operator upgrade testing: Upgrade Testing Coverage: - GPU operator helm upgrade (base version → RC) - Operand (component) upgrade workflows - RollingUpdate vs OnDelete strategies - Upgrade hooks and CRD patching - Multi-version upgrade matrix testing Supported Base Versions: - v1.4.1, v1.4.0, v1.3.1, v1.3.0 - v1.2.2, v1.2.1, v1.2.0 - v1.1.0, v1.0.0 Job Configuration: - tests/jobs/upgrade/.job.yml defines upgrade scenarios - One job per base version → RC upgrade path - All operand dependencies specified per job Test Phases: 1. Deploy base version operator + DeviceConfig 2. Validate base deployment 3. Perform helm upgrade to RC 4. Validate RC deployment 5. Test operand-only upgrades (optional) Upgrade Strategies: - RollingUpdate: Progressive pod replacement (default) - OnDelete: Manual pod deletion triggers upgrade - Configurable maxUnavailable for RollingUpdate Key Components: - Pre-upgrade hooks (validate, block if driver operations active) - CRD auto-upgrade hooks - Base/RC version fixtures - Operand version information from lib/amdgpu.py Documentation: - https://instinct.docs.amd.com/projects/gpu-operator/en/latest/upgrades/upgrade.html - https://instinct.docs.amd.com/projects/gpu-operator/en/latest/upgrades/componentupgrades.html Updated README.md: - Added upgrade skill to component mapping - Documented upgrade workflow and strategies - Listed supported base versions Complete skill coverage: DCM, DME, NPD, ANR, Driver, Upgrade + Generic Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com> * refactor: emphasize test plan implementation as primary use in all component skills Updated all component-specific pytest skills to consistently emphasize implementing testcases from approved test plans as the primary use case. Changes to all skills: - Added 'Primary Use' statement in Purpose section - First capability listed as implementing from test plans - Clarified workflow: /test-plan-dev → approval → /pytest-*-dev Updated skills: - pytest-dcm-dev.md (Device Config Manager) - pytest-dme-dev.md (Device Metrics Exporter) - pytest-npd-dev.md (Node Problem Detector) - pytest-anr-dev.md (Auto Node Remediation) - pytest-driver-dev.md (ROCm/AMDGPU Driver) - pytest-upgrade-dev.md (Operator & Upgrades) This ensures consistency across all component skills and reinforces the intended workflow where test plans are generated first, then implemented using component-specific skills. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com> * fix: resolve markdown linting errors in skill documentation Fixed all markdown linting errors in skill README and test-plan-dev: - Added blank lines around lists (MD032) - Added blank lines around fenced code blocks (MD031) - Added language specifiers to code blocks (MD040) - Added blank lines around headings (MD022) - Fixed table column alignment (MD060) - Fixed ordered list numbering (MD029) All doc-lint errors resolved. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com> fix: resolve all remaining markdown linting errors in skills Fixed markdown linting errors across all skill documentation files: pytest-upgrade-dev.md: - Added blank lines around headings (MD022) - Added blank lines around fenced code blocks (MD031) - Added blank lines around lists (MD032) README.md: - Added blank lines around all headings (MD022) - Added blank lines around all lists (MD032) - Added blank lines around fenced code blocks (MD031) - Added language specifiers to code blocks (MD040) - Fixed table column alignment (MD060) - Fixed ordered list numbering (MD029) - Removed multiple consecutive blank lines (MD012) README.common-skills.md: - Added blank lines around headings (MD022) - Added blank lines around lists (MD032) - Added blank lines around fenced code blocks (MD031) - Added language specifiers to code blocks (MD040) - Fixed ordered list numbering (MD029) All doc-lint errors in .claude/skills/ are now resolved. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com> fix: resolve final markdown linting errors in all skill files Fixed remaining markdown linting errors across all skill documentation: pytest-npd-dev.md: - Added blank lines around all headings (MD022) - Added blank lines around all lists (MD032) - Added blank lines around code blocks (MD031) pytest-upgrade-dev.md: - Added blank lines around all headings (MD022) - Added blank lines around all lists (MD032) - Added blank lines around all code blocks (MD031) - Fixed "OnDelete" from emphasis to proper heading (MD036) README.common-skills.md: - Removed multiple consecutive blank lines (MD012) README.md: - Added blank line before code block (MD031) All .claude/skills/ files now pass markdown linting with zero errors. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com> fix: resolve markdown linting errors in pytest-driver-dev and pytest-npd-dev Fixed markdown linting errors in driver and NPD skill documentation: pytest-driver-dev.md: - Added blank lines around headings (MD022) - Added blank lines around lists (MD032) - Added blank lines around code blocks (MD031) - Fixed table column alignment (MD060) pytest-npd-dev.md: - Added blank lines around all headings (MD022) - Added blank lines around all lists (MD032) - Added blank lines around all code blocks (MD031) - Removed multiple consecutive blank lines (MD012) All skill documentation now passes markdown linting. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com> fix: auto-fix markdown linting errors across all .claude/ documentation Applied automated fixes for common markdown linting patterns using fix-markdown-lint-v2.py script: Fixes applied: - MD022: Added blank lines around all headings - MD032: Added blank lines around all lists - MD031: Added blank lines around code blocks - MD040: Added language specifiers to code blocks (with smart detection for json/yaml/python/bash) - MD012: Removed multiple consecutive blank lines Files fixed (24 total): - All .claude/skills/*.md (9 files) - All .claude/knowledge/**/*.md (14 files) - .claude/README.md This fixes approximately 700+ markdown linting errors automatically. Remaining errors (if any) will be table alignment (MD060) which require manual fixes. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> (cherry picked from commit 6fb139f2e2ca2b728ab48bbee429d9f38de59b24)

ci-penbot-01 · 2026-04-10T08:27:33Z

AI-Assisted Cherry-Pick

Source PR: #1300
Target Branch: main

The cherry-pick operation encountered merge conflicts which were resolved automatically using AI assistance.

Files with conflicts (resolved by AI):

.claude/skills/pytest-npd-dev.md:1-239

Original conflict in .claude/skills/pytest-npd-dev.md

File was being added by cherry-pick commit. No conflict markers present - this was an 'added by them' conflict where the file was being added by the incoming commit 6fb139f2. The file was in .gitignore, requiring 'git add -f' to resolve. Resolution: Accepted the incoming version (entire file).

Cherry-pick triggered by: ACP-Automation

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CP 1300] Add Claude Code knowledge base and component-specific pytest skills#518

[CP 1300] Add Claude Code knowledge base and component-specific pytest skills#518
ci-penbot-01 wants to merge 1 commit intoROCm:mainfrom
ci-penbot-01:CP.O2O.pensando.gpu-operator.1300.rocm.gpu-operator.main

ci-penbot-01 commented Apr 10, 2026

Uh oh!

ci-penbot-01 commented Apr 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ci-penbot-01 commented Apr 10, 2026

Summary

Key Features

1. Knowledge Base Structure (.claude/knowledge/)

2. Component-Specific pytest Skills (.claude/skills/)

3. Test Development Workflow

Testing

Documentation

Uh oh!

ci-penbot-01 commented Apr 10, 2026

AI-Assisted Cherry-Pick

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

1. Knowledge Base Structure (`.claude/knowledge/`)

2. Component-Specific pytest Skills (`.claude/skills/`)