Giant Swarm 1.33 AI conformance #22
     Open
      
      
pipo02mix wants to merge 1 commit into cncf:main from giantswarm:main
  
      
      
   
  
    
  
  
  
 
  
      
    
```yaml
# Kubernetes AI Conformance Checklist
# Notes: This checklist is based on the Kubernetes AI Conformance document.
# Participants should fill in the 'status', 'evidence', and 'notes' fields for each requirement.

metadata:
  kubernetesVersion: v1.33
  platformName: Giant Swarm Platform
  platformVersion: 1.33.0
  vendorName: Giant Swarm
  websiteUrl: https://www.giantswarm.io/
  documentationUrl: https://docs.giantswarm.io/
  productLogoUrl: https://www.giantswarm.io/assets/img/logo.svg
  description: "Giant Swarm Platform is an enterprise-grade managed Kubernetes platform for containerized applications, including stateful and stateless, AI and ML, Linux and Windows, complex and simple web apps, API, and backend services."

spec:
  accelerators:
    - id: dra_support
      state: TBD
      description: "Support Dynamic Resource Allocation (DRA) APIs to enable more flexible and fine-grained resource requests beyond simple counts."
      level: SHOULD
      status: "Implemented"
      evidence:
        - "https://docs.giantswarm.io/tutorials/fleet-management/cluster-management/dynamic-resource-allocation/"
      notes: ""
  networking:
    - id: ai_inference
      state: TBD
      description: "Support the Kubernetes Gateway API with an implementation for advanced traffic management for inference services, which enables capabilities like weighted traffic splitting, header-based routing (for OpenAI protocol headers), and optional integration with service meshes."
      level: MUST
      status: "Implemented"
      evidence:
        - "https://docs.giantswarm.io/tutorials/connectivity/gateway-api/"
      notes: ""
  schedulingOrchestration:
    - id: gang_scheduling
      state: TBD
      description: "The platform must allow for the installation and successful operation of at least one gang scheduling solution that ensures all-or-nothing scheduling for distributed AI workloads (e.g. Kueue, Volcano, etc.) To be conformant, the vendor must demonstrate that their platform can successfully run at least one such solution."
      level: MUST
      status: "Implemented"
      evidence:
        - "https://docs.giantswarm.io/tutorials/fleet-management/job-management/kueue/"
      notes: ""
    - id: cluster_autoscaling
      state: TBD
      description: "If the platform provides a cluster autoscaler or an equivalent mechanism, it must be able to scale up/down node groups containing specific accelerator types based on pending pods requesting those accelerators."
      level: MUST
      status: "Implemented"
      evidence:
        - "https://docs.giantswarm.io/tutorials/fleet-management/cluster-management/aws-cluster-scaling/"
        - "https://docs.giantswarm.io/tutorials/fleet-management/cluster-management/cluster-autoscaler/"
        - "https://karpenter.sh/docs/concepts/scheduling/#acceleratorsgpu-resources"
      notes: ""
    - id: pod_autoscaling
      state: TBD
      description: "If the platform supports the HorizontalPodAutoscaler, it must function correctly for pods utilizing accelerators. This includes the ability to scale these Pods based on custom metrics relevant to AI/ML workloads."
      level: MUST
      status: "Implemented"
      evidence:
        - "https://docs.giantswarm.io/tutorials/fleet-management/scaling-workloads/scaling-based-on-custom-metrics"
      notes: ""
  observability:
    - id: accelerator_metrics
      state: TBD
      description: "For supported accelerator types, the platform must allow for the installation and successful operation of at least one accelerator metrics solution that exposes fine-grained performance metrics via a standardized, machine-readable metrics endpoint. This must include a core set of metrics for per-accelerator utilization and memory usage. Additionally, other relevant metrics such as temperature, power draw, and interconnect bandwidth should be exposed if the underlying hardware or virtualization layer makes them available. The list of metrics should align with emerging standards, such as OpenTelemetry metrics, to ensure interoperability. The platform may provide a managed solution, but this is not required for conformance."
      level: MUST
      status: "Implemented"
      evidence:
        - "https://docs.giantswarm.io/tutorials/fleet-management/cluster-management/gpu/#monitoring"
        - "https://docs.giantswarm.io/overview/observability/configuration/"
      notes: ""
    - id: ai_service_metrics
      state: TBD
      description: "Provide a monitoring system capable of discovering and collecting metrics from workloads that expose them in a standard format (e.g. Prometheus exposition format). This ensures easy integration for collecting key metrics from common AI frameworks and servers."
      level: MUST
      status: "Implemented"
      evidence:
        - "https://docs.giantswarm.io/getting-started/observe-your-clusters-and-apps/"
        - "https://docs.giantswarm.io/overview/observability/data-management/data-ingestion/"
      notes: ""
  security:
    - id: secure_accelerator_access
      state: TBD
      description: "Ensure that access to accelerators from within containers is properly isolated and mediated by the Kubernetes resource management framework (device plugin or DRA) and container runtime, preventing unauthorized access or interference between workloads."
      level: MUST
      status: "Implemented"
      evidence:
        - "secure_accelerator_access_tests.md"
      notes: ""
  operator:
    - id: robust_controller
      state: TBD
      description: "The platform must prove that at least one complex AI operator with a CRD (e.g., Ray, Kubeflow) can be installed and functions reliably. This includes verifying that the operator's pods run correctly, its webhooks are operational, and its custom resources can be reconciled."
      level: MUST
      status: "Implemented"
      evidence:
        - "https://docs.giantswarm.io/tutorials/fleet-management/job-management/kuberay"
      notes: ""
```
  
    
### Giant Swarm Platform

Giant Swarm Platform is a managed Kubernetes platform developed by [Giant Swarm](https://www.giantswarm.io).

### How to Reproduce

#### Create Cluster

First, access the [Giant Swarm Platform](https://docs.giantswarm.io/getting-started/) and log in to the platform API.
After a successful login, [create a cluster](https://docs.giantswarm.io/getting-started/provision-your-first-workload-cluster/) with the DRA-specific values:
```yaml
global:
  connectivity:
    availabilityZoneUsageLimit: 3
    network: {}
    topology: {}
  controlPlane: {}
  metadata:
    name: $CLUSTER
    organization: fer
    preventDeletion: false
  nodePools:
    nodepool0:
      instanceType: m5.xlarge
      maxSize: 2
      minSize: 1
      rootVolumeSizeGB: 8
    nodepool1:
      instanceType: p4d.24xlarge
      maxSize: 2
      minSize: 1
      rootVolumeSizeGB: 15
      instanceWarmup: 600
      minHealthyPercentage: 90
      customNodeTaints:
        - key: "nvidia.com/gpu"
          value: "Exists"
          effect: "NoSchedule"
  providerSpecific: {}
  release:
    version: 33.0.0
cluster:
  internal:
    advancedConfiguration:
      controlPlane:
        apiServer:
          featureGates:
            - name: DynamicResourceAllocation
              enabled: true
        controllerManager:
          featureGates:
            - name: DynamicResourceAllocation
              enabled: true
        scheduler:
          featureGates:
            - name: DynamicResourceAllocation
              enabled: true
      kubelet:
        featureGates:
          - name: DynamicResourceAllocation
            enabled: true
```
#### AI platform components

The following components should be installed to complete the AI setup:

## 1. NVIDIA GPU Operator

**Purpose**: Manages NVIDIA GPU resources in Kubernetes clusters.

**Installation via Giant Swarm App Platform**:

```sh
kubectl gs template app \
  --catalog giantswarm \
  --name gpu-operator \
  --cluster-name $CLUSTER \
  --target-namespace kube-system \
  --version 1.0.1 \
  --organization $ORGANIZATION | kubectl apply -f -
```
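
As a quick check that the operator's device plugin is advertising GPUs, a pod like the following can be scheduled onto the GPU node pool. This is a minimal sketch: the CUDA image tag is only an example, and the toleration matches the `nvidia.com/gpu` taint configured on `nodepool1` above.

```yaml
# Hypothetical smoke test: requests one GPU via the device plugin and prints nvidia-smi output.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  tolerations:
    # Matches the customNodeTaints configured for the GPU node pool.
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # example image/tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```

If the pod completes and `kubectl logs gpu-smoke-test` lists the GPU, the operator is functioning.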
## 2. NVIDIA DRA Driver GPU

**Purpose**: Provides Dynamic Resource Allocation (DRA) support for NVIDIA GPUs.

**Installation via Flux HelmRelease**:

```sh
# First create the NVIDIA Helm Repository
kubectl apply -f - <<EOF
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: nvidia
  namespace: org-$ORGANIZATION
spec:
  interval: 1h
  url: https://helm.ngc.nvidia.com/nvidia
EOF

# Then create the HelmRelease
kubectl apply -f - <<EOF
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: $CLUSTER-nvidia-dra-driver-gpu
  namespace: org-$ORGANIZATION
spec:
  interval: 5m
  chart:
    spec:
      chart: nvidia-dra-driver-gpu
      version: "25.3.0"
      sourceRef:
        kind: HelmRepository
        name: nvidia
  targetNamespace: kube-system
  kubeConfig:
    secretRef:
      name: $CLUSTER-kubeconfig
      key: value
  values:
    nvidiaDriverRoot: "/"
    resources:
      gpus:
        enabled: false
EOF
```
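
To confirm DRA works end to end, a workload can request a GPU through a `ResourceClaimTemplate` instead of a plain `nvidia.com/gpu` count. The following is a sketch only, assuming the driver publishes a `gpu.nvidia.com` device class and the cluster serves the `resource.k8s.io/v1beta1` API:

```yaml
# Hypothetical DRA smoke test: one claim template and a pod that consumes it.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
  namespace: default
spec:
  spec:
    devices:
      requests:
        - name: gpu
          deviceClassName: gpu.nvidia.com   # device class assumed to be published by the DRA driver
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-smoke-test
  namespace: default
spec:
  restartPolicy: Never
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  resourceClaims:
    - name: gpu
      resourceClaimTemplateName: single-gpu
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # example image/tag
      command: ["nvidia-smi"]
      resources:
        claims:
          - name: gpu
```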
## 3. Kuberay Operator

**Purpose**: Manages Ray clusters for distributed AI/ML workloads.

**Installation via Giant Swarm App Platform**:

```sh
kubectl gs template app \
  --catalog giantswarm \
  --name kuberay-operator \
  --cluster-name $CLUSTER \
  --target-namespace kube-system \
  --version 1.0.0 \
  --organization $ORGANIZATION | kubectl apply -f -
```
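
A small `RayCluster` custom resource can be applied to verify that the operator reconciles its CRDs. This is a sketch; the Ray version and image are examples only:

```yaml
# Hypothetical RayCluster used only to check that the KubeRay operator reconciles correctly.
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-smoke-test
  namespace: default
spec:
  rayVersion: "2.9.0"               # example version
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.9.0
  workerGroupSpecs:
    - groupName: workers
      replicas: 1
      minReplicas: 1
      maxReplicas: 2
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.9.0
```

`kubectl get raycluster raycluster-smoke-test` should eventually report the head and worker pods as running.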
## 4. Kueue

**Purpose**: Provides job queueing and resource management for batch workloads.

**Installation via Giant Swarm App Platform**:

```sh
kubectl gs template app \
  --catalog=giantswarm \
  --cluster-name=$CLUSTER \
  --organization=$ORGANIZATION \
  --name=kueue \
  --target-namespace=kueue-system \
  --version=0.1.0 | kubectl apply -f -
```
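
To exercise the queueing path, Kueue needs a `ResourceFlavor`, a `ClusterQueue`, and a `LocalQueue`; a suspended Job labeled with the queue name is then admitted by Kueue. A minimal sketch with placeholder quota values:

```yaml
# Hypothetical Kueue setup and a test Job; quotas and names are placeholders.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {}          # admit workloads from all namespaces
  resourceGroups:
    - coveredResources: ["cpu", "memory"]
      flavors:
        - name: default-flavor
          resources:
            - name: "cpu"
              nominalQuota: 16
            - name: "memory"
              nominalQuota: 64Gi
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: user-queue
  namespace: default
spec:
  clusterQueue: cluster-queue
---
apiVersion: batch/v1
kind: Job
metadata:
  name: kueue-smoke-test
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  suspend: true                  # Kueue unsuspends the Job once it is admitted
  parallelism: 2
  completions: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: busybox:1.36
          command: ["sleep", "30"]
          resources:
            requests:
              cpu: "1"
```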
## 5. Gateway API

**Purpose**: Provides advanced traffic management capabilities for inference services.

**Installation via Giant Swarm App Platform**:

```sh
kubectl gs template app \
  --catalog giantswarm \
  --name gateway-api-bundle \
  --cluster-name $CLUSTER \
  --target-namespace kube-system \
  --version 0.5.1 \
  --organization $ORGANIZATION | kubectl apply -f -
```
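
With the Gateway API CRDs in place, weighted traffic splitting between two model versions can be expressed with an `HTTPRoute`. In the sketch below, the Gateway and the `model-v1`/`model-v2` Services are assumed to exist already:

```yaml
# Hypothetical canary split for an inference endpoint: 90% to model-v1, 10% to model-v2.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: inference-route
  namespace: default
spec:
  parentRefs:
    - name: inference-gateway          # assumed Gateway created separately
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1/completions
      backendRefs:
        - name: model-v1
          port: 8000
          weight: 90
        - name: model-v2
          port: 8000
          weight: 10
```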
## 6. AWS EFS CSI Driver

**Purpose**: Enables persistent storage using AWS Elastic File System for shared AI model storage.

**Installation via Giant Swarm App Platform**:

```sh
kubectl gs template app \
  --catalog giantswarm \
  --name aws-efs-csi-driver \
  --cluster-name $CLUSTER \
  --target-namespace kube-system \
  --version 2.1.5 \
  --organization $ORGANIZATION | kubectl apply -f -
```
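
A shared `ReadWriteMany` model cache can then be provisioned through a StorageClass backed by the EFS CSI driver. A sketch, with the EFS file system ID as a placeholder:

```yaml
# Hypothetical shared model storage; replace the fileSystemId with your EFS file system.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-models
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: fs-0123456789abcdef0   # placeholder
  directoryPerms: "700"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-model-cache
  namespace: default
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-models
  resources:
    requests:
      storage: 100Gi
```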
## 7. JobSet

**Purpose**: Manages sets of Jobs for distributed training workloads.

**Installation via Flux HelmRelease**:

```sh
kubectl apply -f - <<EOF
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: $CLUSTER-jobset
  namespace: org-$ORGANIZATION
spec:
  interval: 5m
  chart:
    spec:
      chart: oci://registry.k8s.io/jobset/charts/jobset
      version: "0.10.1"
  targetNamespace: kube-system
  kubeConfig:
    secretRef:
      name: $CLUSTER-kubeconfig
      key: value
EOF
```
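
A small `JobSet` can confirm that the controller reconciles replicated Jobs. This is a sketch with a placeholder training command:

```yaml
# Hypothetical JobSet: one replicated Job with two pods standing in for training workers.
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: distributed-training
  namespace: default
spec:
  replicatedJobs:
    - name: workers
      replicas: 1
      template:
        spec:
          parallelism: 2
          completions: 2
          template:
            spec:
              restartPolicy: Never
              containers:
                - name: trainer
                  image: busybox:1.36
                  command: ["sh", "-c", "echo training step && sleep 10"]
```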
## 8. Prometheus Adapter

**Purpose**: Enables custom metrics for Horizontal Pod Autoscaler, including AI/ML specific metrics.

**Installation via Giant Swarm App Platform**:

```sh
kubectl gs template app \
  --catalog=giantswarm \
  --cluster-name=$CLUSTER \
  --organization=$ORGANIZATION \
  --name=keda \
  --target-namespace=keda-system \
  --version=3.1.0 | kubectl apply -f -
```
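
With KEDA installed, inference deployments can scale on Prometheus queries via a `ScaledObject`. The sketch below assumes a Deployment named `inference-server`, an in-cluster Prometheus endpoint, and a request-rate metric exposed by the model server; all three names are illustrative:

```yaml
# Hypothetical autoscaling rule: scale the inference Deployment on a Prometheus request-rate query.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-server-scaler
  namespace: default
spec:
  scaleTargetRef:
    name: inference-server                                   # assumed Deployment serving the model
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090  # assumed in-cluster Prometheus endpoint
        query: sum(rate(vllm_requests_total[2m]))             # assumed metric name
        threshold: "10"
```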
## 9. Sonobuoy Configuration

**Purpose**: Applies PolicyExceptions and configurations needed for AI conformance testing.

**Installation**: Applied directly to the workload cluster using the kubeconfig:

```sh
# Download and apply the configuration
kubectl --kubeconfig=/path/to/workload-cluster-kubeconfig apply -f https://gist.githubusercontent.com/pipo02mix/80415c1182a5920af46a85c7adf90a8a/raw/d75d7593194fb2a3beba0549f946cb6f8a5a5f46/sonobuoy-rews.yaml
```

All these components work together to provide a complete AI/ML platform on Kubernetes with GPU support, workload management, monitoring, and conformance testing capabilities.
#### Run conformance tests with Sonobuoy

Log in to the control plane of the cluster created by the Giant Swarm Platform.

Start the conformance tests:

```sh
sonobuoy run --plugin https://raw.githubusercontent.com/pipo02mix/ai-conformance/c0f5f45e131445e1cf833276ca66e251b1b200e9/sonobuoy-plugin.yaml
```

Monitor the conformance tests by tailing the Sonobuoy logs and wait for the line "no-exit was specified, sonobuoy is now blocking":

```sh
stern -n sonobuoy sonobuoy
```

Retrieve the results:

```sh
outfile=$(sonobuoy retrieve)
sonobuoy results $outfile
```
      
    
The `state` field isn't needed (https://github.com/cncf/ai-conformance/blob/main/docs/AIConformance-1.33.yaml). You can just remove them. I see you have the `status` field filled out.