feat: Get Neuron device and core count from EC2 API for all trn* and inf* instance types #6510

Merged
15 commits merged into aws:main from the feat/neuron-core branch
Oct 31, 2024

Conversation

Member

@bryantbiggs bryantbiggs commented Jul 13, 2024

Fixes #3555

Description

  • Removes trn* static resource definitions and instead pulls directly from EC2 API to get device count, core count, and total device memory
  • Converts inf* resource requirement collection from InferenceAcceleratorInfo to NeuronInfo; this aligns with trn* resource information collection for all Neuron related resource information
    • You can compare the output of the following to see that NeuronInfo supersedes InferenceAcceleratorInfo:
      • aws ec2 describe-instance-types --query 'InstanceTypes[*].NeuronInfo'
      • aws ec2 describe-instance-types --query 'InstanceTypes[*].InferenceAcceleratorInfo'
  • Adds support for the aws.amazon.com/neuroncore resource, which is used to allocate Neuron cores to a container
  • Adds the Neuron device plugin to the e2e integration test suite for Neuron devices, similar to the NVIDIA and EFA device plugins
  • Adds a new region, us-east-2, to collect details for instance types on the reference instance types doc page that are not available in us-east-1 or us-west-2
  • Includes auto-generated updates from running make codegen and make docgen
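
As a rough illustration of the first and third bullets, the sketch below derives device and core counts from a NeuronInfo-shaped value. It is not the code from this PR: the structs are local stand-ins that assume the field layout returned by `aws ec2 describe-instance-types` (a list of devices, each with a count and a per-device core count), the function name `neuronResources` is hypothetical, and `aws.amazon.com/neuron` is assumed to be the resource name for whole Neuron devices. The sample values in `main` are illustrative only.

```go
package main

import "fmt"

// Local stand-ins that assume the NeuronInfo shape from the EC2
// DescribeInstanceTypes response. They are not the AWS SDK types.
type neuronCoreInfo struct {
	Count int64 // cores per device
}

type neuronDevice struct {
	Count    int64 // number of devices of this kind on the instance
	Name     string
	CoreInfo neuronCoreInfo
}

type neuronInfo struct {
	NeuronDevices                []neuronDevice
	TotalNeuronDeviceMemoryInMiB int64
}

const (
	// Assumed resource name for whole Neuron devices.
	resourceNeuron = "aws.amazon.com/neuron"
	// Resource name added by this PR for individual Neuron cores.
	resourceNeuronCore = "aws.amazon.com/neuroncore"
)

// neuronResources (hypothetical helper) sums device counts and
// device-count times cores-per-device across all device entries.
// A nil input yields zero for both resources.
func neuronResources(info *neuronInfo) map[string]int64 {
	out := map[string]int64{resourceNeuron: 0, resourceNeuronCore: 0}
	if info == nil {
		return out
	}
	for _, d := range info.NeuronDevices {
		out[resourceNeuron] += d.Count
		out[resourceNeuronCore] += d.Count * d.CoreInfo.Count
	}
	return out
}

func main() {
	// Illustrative values only: 16 devices with 2 cores each.
	sample := &neuronInfo{
		NeuronDevices: []neuronDevice{
			{Count: 16, Name: "Trainium", CoreInfo: neuronCoreInfo{Count: 2}},
		},
	}
	res := neuronResources(sample)
	fmt.Println(resourceNeuron, res[resourceNeuron])
	fmt.Println(resourceNeuronCore, res[resourceNeuronCore])
}
```

Deriving both numbers from the API response, rather than from a static per-instance-type table, means newly launched trn* and inf* instance types would not need a code change to report capacity, which appears to be the motivation for removing the static definitions.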

How was this change tested?

  • make test

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

netlify bot commented Jul 13, 2024

Deploy Preview for karpenter-docs-prod ready!

Name Link
🔨 Latest commit 30e265d
🔍 Latest deploy log https://app.netlify.com/sites/karpenter-docs-prod/deploys/6724036cf5be5a0008448173
😎 Deploy Preview https://deploy-preview-6510--karpenter-docs-prod.netlify.app

@bryantbiggs bryantbiggs force-pushed the feat/neuron-core branch 2 times, most recently from 3d1b8da to e894f8a Compare July 13, 2024 16:16
Contributor

This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.

@engedaam
Contributor

Any update here?

@bryantbiggs bryantbiggs force-pushed the feat/neuron-core branch 3 times, most recently from d48abe9 to f5fdc84 Compare September 11, 2024 21:28
@coveralls
coveralls commented Sep 11, 2024

Pull Request Test Coverage Report for Build 11620786418

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 23 of 23 (100.0%) changed or added relevant lines in 2 files are covered.
  • 54 unchanged lines in 6 files lost coverage.
  • Overall coverage decreased (-0.3%) to 82.812%

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| pkg/controllers/controllers.go | 2 | 0.0% |
| pkg/test/environment.go | 4 | 96.0% |
| pkg/providers/instancetype/types.go | 5 | 96.63% |
| pkg/operator/operator.go | 5 | 9.15% |
| pkg/providers/instancetype/instancetype.go | 13 | 93.47% |
| pkg/providers/instance/instance.go | 25 | 89.12% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 11407413769 | -0.3% |
| Covered Lines | 5642 |
| Relevant Lines | 6813 |

💛 - Coveralls

@bryantbiggs bryantbiggs marked this pull request as ready for review September 11, 2024 22:36
@bryantbiggs bryantbiggs requested a review from a team as a code owner September 11, 2024 22:36
Contributor

This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.

@bryantbiggs bryantbiggs force-pushed the feat/neuron-core branch 2 times, most recently from 06c6bbb to 964090e Compare October 1, 2024 18:08
Contributor

@engedaam engedaam left a comment

Going to let the integration tests run, and then we should be good to merge

/karpenter snapshot

Contributor

Snapshot successfully published to oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter:0-8b93d2d7e5a7d793dc4ba33409059615a6f04020.
To install, you must log in to the ECR repo with an AWS account:

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 021119463062.dkr.ecr.us-east-1.amazonaws.com

helm upgrade --install karpenter oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter --version "0-8b93d2d7e5a7d793dc4ba33409059615a6f04020" --namespace "kube-system" --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait

@bryantbiggs bryantbiggs force-pushed the feat/neuron-core branch 3 times, most recently from 18c3450 to d5b11a9 Compare October 18, 2024 23:07
Contributor

@engedaam engedaam left a comment

/karpenter snapshot

Contributor

Snapshot successfully published to oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter:0-45c3e77b5dba6fb9fd30249bf0ff0894e0074a82.
To install, you must log in to the ECR repo with an AWS account:

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 021119463062.dkr.ecr.us-east-1.amazonaws.com

helm upgrade --install karpenter oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter --version "0-45c3e77b5dba6fb9fd30249bf0ff0894e0074a82" --namespace "kube-system" --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait

Contributor

@engedaam engedaam left a comment

LGTM 🚀

Contributor

@jmdeal jmdeal left a comment

Sorry for the delay. One suggested update, but other than that this looks good to me

Co-authored-by: Jason Deal <[email protected]>
Contributor

@jmdeal jmdeal left a comment

LGTM 🚀

Contributor

@engedaam engedaam left a comment

LGTM 🚀

@jmdeal jmdeal merged commit c2f019d into aws:main Oct 31, 2024
19 checks passed
@bryantbiggs bryantbiggs deleted the feat/neuron-core branch November 1, 2024 00:51
Development

Successfully merging this pull request may close these issues.

Support for aws.amazon.com/neuroncore
6 participants