
🔧 How to modify Karpenter configuration to correctly provision H100 GPU instances on AWS EKS? #7796

@dxu104

Description:

We have successfully deployed the following models using KubeAI on our AWS EKS cluster:

  • llama-3.1-8b-instruct-fp8-l4
  • deepseek-r1-distill-llama-8b-l4

However, when attempting to deploy the DeepSeek R1 671B model (defined in DeepSeek-R1.yaml), we hit scheduling failures because Karpenter does not provision a suitable instance type.


🔍 Issue Details

1️⃣ AWS Quota and Instance Type Availability

We have verified that our Running On-Demand P instances quota is correctly set to 800 vCPUs using:

aws service-quotas list-service-quotas --service-code ec2 --region us-west-2 --query "Quotas[?contains(QuotaName, 'P instance')]"

Output:

{
    "QuotaName": "Running On-Demand P instances",
    "Value": 800.0
}
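
For a quicker check, the same quota can be fetched directly by quota code (to our knowledge, L-417A185B is the code for Running On-Demand P instances; verify with list-service-quotas if it does not resolve):

aws service-quotas get-service-quota --service-code ec2 \
    --quota-code L-417A185B --region us-west-2 \
    --query "Quota.{Name:QuotaName,Value:Value}"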

We have also confirmed that H100-backed instances (the p5 family) are offered in the region. Since describe-instance-type-offerings only accepts the location and instance-type filters, the check goes through the instance type directly:

aws ec2 describe-instance-type-offerings --location-type availability-zone --filters Name=instance-type,Values=p5.48xlarge --region us-west-2
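
As a cross-check from the EC2 side, the instance types that actually carry H100 GPUs can be listed via describe-instance-types (the JMESPath filter assumes the GPU name is reported as "H100"; in us-west-2 this should return the p5 family):

aws ec2 describe-instance-types --region us-west-2 \
    --query "InstanceTypes[?GpuInfo.Gpus[?Name=='H100']].InstanceType" --output text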

2️⃣ Karpenter NodePool and EC2NodeClass Configuration

NodePool configuration for GPU nodes:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["g", "p"]
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ["h100"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu
      expireAfter: 720h
      taints:
      - key: nvidia.com/gpu
        value: "true"
        effect: NoSchedule
  limits:
    cpu: 1000
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
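
As a diagnostic (not a fix), the category/gpu-name pair can be temporarily replaced with an explicit instance type to see whether the gpu-name requirement is what filters every candidate out. This sketch assumes p5.48xlarge (8x H100 80GB) is the intended target:

      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["p5.48xlarge"]  # replaces instance-category/instance-gpu-name while testing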

EC2NodeClass Configuration:

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu
spec:
  amiFamily: AL2
  role: "eksctl-KarpenterNodeRole-${CLUSTER_NAME}"
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"
  amiSelectorTerms:
    - id: "${GPU_AMI_ID}"
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 300Gi
        volumeType: gp3
        encrypted: true
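
When Karpenter rejects a pod with "no instance type met all requirements", the controller logs usually spell out which requirement eliminated the candidates. Assuming the standard Helm chart labels and a karpenter namespace:

kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --tail=500 \
    | grep -iE "incompatible|no instance type"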

3️⃣ Model Deployment Configuration (DeepSeek-R1.yaml)

apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: deepseek-617b-h100
spec:
  features: [TextGeneration]
  url: hf://deepseek-ai/DeepSeek-R1
  engine: VLLM
  args:
    - --max-model-len=65536
    - --max-num-batched-tokens=65536
    - --gpu-memory-utilization=0.9
    - --tensor-parallel-size=8
    - --enable-prefix-caching
    - --disable-log-requests
    - --max-num-seqs=1024
    - --kv-cache-dtype=fp8
  targetRequests: 500
  resourceProfile: nvidia-gpu-h100:8
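
For context on the numbers in the scheduling error below: the :8 suffix on resourceProfile: nvidia-gpu-h100:8 is, as we understand KubeAI's behavior, a multiplier applied to the profile's per-unit resources, which would explain the nvidia.com/gpu: 8 request the scheduler reports.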

Resource Profile in values-eks.yaml:

resourceProfiles:
  nvidia-gpu-h100:
    nodeSelector:
      karpenter.k8s.aws/instance-gpu-name: "h100"
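
As defined, this profile only sets a nodeSelector. A fuller sketch, modeled on the GPU profiles in KubeAI's bundled values.yaml (field names are assumptions to verify against your chart version), would also pin the per-unit GPU request and, if the pod does not already tolerate it, the NodePool's nvidia.com/gpu taint:

resourceProfiles:
  nvidia-gpu-h100:
    limits:
      nvidia.com/gpu: "1"  # multiplied by the ":8" suffix in the Model spec
    requests:
      nvidia.com/gpu: "1"
    nodeSelector:
      karpenter.k8s.aws/instance-gpu-name: "h100"
    tolerations:  # assumed supported here, as in KubeAI's other GPU profiles
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"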

4️⃣ Scheduling Errors

We are encountering the following scheduling errors:

Warning  FailedScheduling  7m57s (x2 over 12m)  karpenter  
Failed to schedule pod, incompatible with nodepool "gpu", daemonset overhead={"cpu":"150m","pods":"5"}, 
no instance type satisfied resources {"cpu":"150m","nvidia.com/gpu":"8","pods":"6"} 
and requirements karpenter.k8s.aws/instance-category In [g p], karpenter.k8s.aws/instance-gpu-name In [h100], 
karpenter.sh/capacity-type In [on-demand], karpenter.sh/nodepool In [gpu] (no instance type met all requirements)
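
One thing worth ruling out for the "no instance type met all requirements" error is an availability-zone mismatch: p5 capacity may be offered in only some us-west-2 zones, and Karpenter only considers offerings in the AZs of the subnets the EC2NodeClass selects. A cross-check:

# AZs where p5.48xlarge is actually offered
aws ec2 describe-instance-type-offerings --location-type availability-zone \
    --filters Name=instance-type,Values=p5.48xlarge --region us-west-2 \
    --query "InstanceTypeOfferings[].Location" --output text

# AZs of the subnets Karpenter discovers via the karpenter.sh/discovery tag
aws ec2 describe-subnets --region us-west-2 \
    --filters "Name=tag:karpenter.sh/discovery,Values=${CLUSTER_NAME}" \
    --query "Subnets[].AvailabilityZone" --output text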

Expected Behavior

  • Karpenter should provision any instance type that meets the following conditions:
    • Supports at least 8 H100 GPUs.
    • Meets the CPU and memory requests as specified in the deployment configuration.
    • Falls within the On-Demand P instance quota (800 vCPUs).

Actual Behavior

  • Despite sufficient AWS quotas and available H100 GPU instances in the region, the scheduler fails to provision nodes.
  • Errors suggest a mismatch between the pod's node-affinity/selector requirements and the instance types Karpenter is willing to launch.

Questions

  1. How should I modify the Karpenter configuration to correctly provision H100 GPU instances under the current quota and resource constraints?
  2. Are there additional configurations needed for the scheduler to recognize GPU node availability properly?

🙏 Request for Help

Any guidance or insights into modifying the Karpenter configuration for H100 GPU provisioning would be greatly appreciated. Thank you for your help!
