Skip to content

Latest commit

 

History

History
509 lines (397 loc) · 15.7 KB

File metadata and controls

509 lines (397 loc) · 15.7 KB

Configuration Guide

This guide covers the configuration file format and options for ISV NCP Validation Suite.

Overview

Configuration files define what tests to run and how to run them. They use YAML format with:

  • Step-based execution - Scripts perform operations and output JSON
  • Schema validation - Output is validated against auto-detected or explicit schemas
  • Advanced validations - Field checks, state verification, cross-step comparisons
  • Phase ordering - Define custom phases that execute in order
  • Centralized validations - All validations in tests.validations section
  • Template variables - Reference step outputs and settings via Jinja2

Architecture

┌──────────────────────────────────────────────────────────────────┐
│                     Step-Based Execution                         │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Config (YAML)          Scripts (Any Language)    Validations    │
│  ┌──────────────┐      ┌──────────────────┐     ┌────────────┐   │
│  │ phases: [...]│      │ provision.py     │     │ Check JSON │   │
│  │ steps:       │─────▶│ create_vpc.py    │────▶│ output for │   │
│  │   - name: x  │      │ launch_vm.sh     │     │ success    │   │
│  │     phase    │      │ check_api.py     │     │            │   │
│  │     command  │      └──────────────────┘     └────────────┘   │
│  └──────────────┘              │                      │          │
│                                │                      │          │
│                                ▼                      ▼          │
│                         JSON Output              Pass/Fail       │
│                         {"success": true,        assertions      │
│                          "vpc_id": "vpc-xxx"}                    │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

Benefits:

  • Language-agnostic - Scripts can be Python, Bash, Go, etc.
  • Simple validations - Just check JSON output fields
  • Reusable scripts - Same script works across configs
  • Easy debugging - Run scripts manually, inspect JSON

Example Configs

Pre-built configs are provided in isvctl/configs/:

Config Description
providers/aws/control-plane.yaml AWS API health, access key lifecycle, tenant management
providers/aws/network.yaml AWS VPC network validation (6 test suites)
providers/aws/vm.yaml AWS EC2 GPU instance tests
providers/aws/iam.yaml AWS IAM user lifecycle
providers/aws/eks.yaml AWS EKS with GPU nodes
tests/k8s.yaml Standard Kubernetes cluster
tests/slurm.yaml Slurm HPC cluster

Basic Usage

# Run a config
isvctl test run -f isvctl/configs/providers/aws/control-plane.yaml

# Merge multiple configs (later files override earlier ones)
isvctl test run -f isvctl/configs/providers/aws/eks.yaml -f my-overrides.yaml

# Verbose output (shows script output on failure)
isvctl test run -f config.yaml -v

# Validate config without running
isvctl test run -f config.yaml --dry-run

# Pass pytest arguments after --
isvctl test run -f config.yaml -- -v -s -k "NodeCount"

Config Structure

Complete Example

version: "1.0"

commands:
  network:
    # Phases execute in this order
    phases: ["setup", "test", "teardown"]

    steps:
      # Step 1: Create VPC (setup phase)
      - name: create_network
        phase: setup
        command: "python ./scripts/create_vpc.py"
        args:
          - "--name"
          - "test-vpc"
          - "--region"
          - "{{region}}"
        timeout: 300

      # Step 2: Run tests (test phase)
      - name: test_connectivity
        phase: test
        command: "python ./scripts/test_connectivity.py"
        args:
          - "--vpc-id"
          - "{{steps.create_network.network_id}}"
        timeout: 600

      # Step 3: Cleanup (teardown phase)
      - name: teardown
        phase: teardown
        command: "python ./scripts/teardown.py"
        args:
          - "--vpc-id"
          - "{{steps.create_network.network_id}}"
        timeout: 300

tests:
  platform: network
  cluster_name: "aws-network-test"

  settings:
    region: "us-west-2"

  # Centralized validations grouped by category
  validations:
    network:
      - NetworkProvisionedCheck:
          step: create_network
      - StepSuccessCheck:
          step: test_connectivity
          check_success: true

    teardown_checks:
      - StepSuccessCheck:
          step: teardown

Platform Configuration

Each platform defines phases and steps:

commands:
  network:
    phases: ["setup", "test", "teardown"]    # Execution order
    steps: [...]                              # Steps grouped by phase
Field Required Description
phases No Ordered list of phases (default: ["setup", "test", "teardown"])
steps Yes List of step configurations
skip No Skip this entire platform

Important: If a step's phase is not in the phases list, an error is raised.

Step Configuration

Each step defines a command to execute:

- name: create_network
  phase: setup
  command: "python ./scripts/create_vpc.py"
  args: ["--region", "{{region}}"]
  timeout: 300
  env:
    AWS_PROFILE: "production"
  skip: false
  continue_on_failure: false
  output_schema: vpc
Field Required Description
name Yes Unique step identifier (used for output references)
phase No Phase this step belongs to (default: setup)
command Yes Script/command to execute
args No Arguments (supports Jinja2 templates)
timeout No Timeout in seconds (default: 300)
env No Environment variables
skip No Skip this step
continue_on_failure No Continue even if this step fails
output_schema No Schema name for output validation

Validation Configuration

Validations are centralized in tests.validations, grouped by category:

tests:
  validations:
    # Category name (any meaningful name)
    network:
      - NetworkProvisionedCheck:
          step: create_network

    # Validations can specify when to run
    teardown_checks:
      - StepSuccessCheck:
          step: teardown

Validation Timing (phase):

Value When it runs
(not set) After setup phase (default)
teardown After teardown phase
<phase> After the specified phase

Test Variants

A validation check can be run multiple times with different parameters by appending a dash-separated suffix to the class name. The dash (-) is the only accepted variant separator.

validations:
  k8s_workloads:
    checks:
      - K8sNimHelmWorkload-1b:
          model: "meta/llama-3.2-1b-instruct"
          gpu_count: 1
          timeout: 900
      - K8sNimHelmWorkload-3b:
          model: "meta/llama-3.2-3b-instruct"
          gpu_count: 4
          timeout: 1800

  slurm:
    checks:
      - SlurmPartition-cpu:
          partition_name: "cpu"
      - SlurmPartition-gpu:
          partition_name: "gpu"

The part before the dash must match an existing validation class name (e.g., K8sNimHelmWorkload, SlurmPartition). The suffix after the dash is a label -- it can be any descriptive string. Each variant runs as a separate test case with its own parameters and appears independently in test results and coverage.

Rules:

  • Validation class names cannot contain dashes, so the first dash always marks the start of a variant suffix.
  • The suffix is free-form: K8sNimHelmWorkload-small, SlurmPartition-cpu, SlurmGpuAllocation-1gpu are all valid.
  • Each variant is a distinct test entry in coverage tracking.

Template Variables

Referencing Step Outputs

Use {{steps.step_name.field}} to reference previous step outputs:

steps:
  - name: create_instance
    command: "python launch.py"
    # Output: {"instance_id": "i-xxx", "public_ip": "54.1.2.3"}

  - name: test_ssh
    command: "python test_ssh.py"
    args:
      - "--host"
      - "{{steps.create_instance.public_ip}}"

Other Variables

Variable Description
{{setting_name}} From tests.settings
{{env.VAR_NAME}} Environment variable
{{steps.name.field}} Step output field

Script Output and Schema Validation

Scripts must output valid JSON to stdout. The output is validated against schemas defined in output_schemas.py.

Schema Auto-Detection

The schema is automatically detected from the step name using a mapping system:

Step Name Pattern Schema Required Fields
create_cluster, provision_cluster cluster success, platform, cluster_name, node_count
create_network, create_vpc network success, platform
launch_instance, create_vm instance success, platform, instance_id
run_workload, run_test workload_result success, platform, status
teardown, cleanup, destroy teardown success, platform
check_api, test_api api_health success, platform
create_access_key access_key success, platform, access_key_id
create_tenant tenant success, platform, tenant_name
vpc_crud, vpc_crud_test vpc_crud success, platform
(unrecognized) generic success, platform

Common Required Fields

All schemas require these common fields:

{
  "success": true,
  "platform": "network"
}
  • success: Boolean indicating operation success
  • platform: Platform type (e.g., "network", "vm", "iam", "control_plane")

Explicit Schema Override

You can override auto-detection using the output_schema field:

steps:
  - name: my_custom_step
    command: "python ./my_script.py"
    output_schema: cluster  # Force cluster schema validation

Example Output by Schema Type

Network schema (create_network, create_vpc):

{
  "success": true,
  "platform": "network",
  "network_id": "vpc-0123456789",
  "cidr": "10.0.0.0/16",
  "subnets": [
    {"subnet_id": "subnet-aaa", "cidr": "10.0.1.0/24", "availability_zone": "us-west-2a"}
  ],
  "region": "us-west-2"
}

Cluster schema (provision_cluster):

{
  "success": true,
  "platform": "kubernetes",
  "cluster_name": "my-cluster",
  "node_count": 3,
  "endpoint": "https://cluster.example.com",
  "gpu_count": 8,
  "driver_version": "570.195.03"
}

Teardown schema (teardown, cleanup):

{
  "success": true,
  "platform": "network",
  "resources_deleted": ["vpc-123", "subnet-456"],
  "message": "Cleanup completed"
}

Exit codes:

  • 0 = Success
  • Non-zero = Failure

Python Script Example

#!/usr/bin/env python3
"""Create VPC and output JSON."""

import argparse
import json
import sys
import boto3

def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--name", required=True)
    parser.add_argument("--region", default="us-west-2")
    args = parser.parse_args()

    result = {
        "success": False,
        "platform": "network",
    }

    try:
        ec2 = boto3.client("ec2", region_name=args.region)
        vpc = ec2.create_vpc(CidrBlock="10.0.0.0/16")
        result["network_id"] = vpc["Vpc"]["VpcId"]
        result["success"] = True
    except Exception as e:
        result["error"] = str(e)

    print(json.dumps(result, indent=2))
    return 0 if result["success"] else 1

if __name__ == "__main__":
    sys.exit(main())

Bash Script Example

#!/bin/bash
# Create VPC and output JSON

set -e

NAME="${1:-test-vpc}"
REGION="${AWS_REGION:-us-west-2}"

VPC_ID=$(aws ec2 create-vpc \
  --cidr-block 10.0.0.0/16 \
  --region "$REGION" \
  --query 'Vpc.VpcId' \
  --output text)

cat <<EOF
{
  "success": true,
  "platform": "network",
  "network_id": "$VPC_ID",
  "region": "$REGION"
}
EOF

Available Validations

Generic Validations

Validation Parameters Description
FieldExistsCheck step, field or fields Check field(s) exist in output
FieldValueCheck step, field, expected, operator Check field value (eq, gt, gte, lt, lte)
StepSuccessCheck step Check success: true (auto-detects teardown)

Network Validations

Validation Parameters Description
NetworkProvisionedCheck step Check network created with ID
VpcCrudCheck step Validate VPC CRUD operations
SubnetConfigCheck step Validate subnet configuration
VpcIsolationCheck step Validate VPC isolation
SecurityBlockingCheck step Validate security blocking
NetworkConnectivityCheck step Validate connectivity tests
TrafficFlowCheck step Validate traffic flow tests

Instance Validations

Validation Parameters Description
InstanceStateCheck step, expected_state Check instance state
InstanceCreatedCheck step Check instance was created

IAM Validations

Validation Parameters Description
AccessKeyCreatedCheck step Check access key was created
AccessKeyAuthenticatedCheck step Check access key authenticated
AccessKeyDisabledCheck step Check access key was disabled
AccessKeyRejectedCheck step Check disabled key was rejected
TenantCreatedCheck step Check tenant was created
TenantListedCheck step Check tenant in list
TenantInfoCheck step Check tenant info retrieved

Kubernetes Validations

Validation Parameters Description
K8sNodeCountCheck count Verify node count
K8sNodeReadyCheck - Verify all nodes are Ready
K8sGpuOperatorPodsCheck namespace Verify GPU Operator pods running
K8sGpuCapacityCheck expected_per_node, expected_total Verify GPU capacity

Slurm Validations

Validation Parameters Description
SlurmInfoAvailable - Verify sinfo command works
SlurmPartition partition_name, expected_nodes Verify partition exists
SlurmJobSubmission partition Test job submission

Test Markers

Filter tests using pytest markers:

# Run only specific tests
isvctl test run -f config.yaml -- -k "vpc_crud"

# Run by marker
isvctl test run -f config.yaml -- -m kubernetes

Available markers: bare_metal, kubernetes, slurm, gpu, network, hardware, software, workload, l2, slow

Related Documentation