modified lws chart to support aws efa #212

Open
masoodfaisal wants to merge 2 commits into awslabs:main from masoodfaisal:main

Conversation

@masoodfaisal

What does this PR do?

For distributed inference with LWS, Amazon EFA provides a low-latency inter-node connection that reduces inference latency. This PR adds an option to use Amazon EFA together with the Amazon DLC container images.
DLC images come with the EFA libraries pre-baked, and these are also included in the example.
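
A minimal sketch of how the option might be switched on in a values file. The key names are inferred from the template condition quoted under Additional Notes (`.Values.inference.accelerator` and `.Values.inference.awsEfa`); the surrounding layout of the chart's values file is an assumption.

```yaml
# Hypothetical values override; key paths are inferred from the chart's
# template condition, not copied from the chart's values.yaml.
inference:
  accelerator: gpu   # the EFA env vars are only rendered when this is "gpu"
  awsEfa: true       # opt in to the EFA settings (compared against boolean true)
```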

Motivation

LWS with Amazon EFA provides a way for multiple model layers to communicate over a low-latency network, resulting in better inference latency.

More

I have added the unit test for my changes in the repo too.

  • Yes, I have tested the PR using my local account setup (Provide any test evidence report under Additional Notes)
  • Mandatory for new blueprints. Yes, I have added an example to support my blueprint PR
  • Mandatory for new blueprints. Yes, I have updated the website/docs or website/blog section for this feature
  • Yes, I ran pre-commit run -a with this PR. Link for installing pre-commit locally

For Moderators

  • E2E Test successfully complete before merge?

Additional Notes

{{- if and ( eq .Values.inference.accelerator "gpu" ) ( eq .Values.inference.awsEfa true ) }}
- name: FI_PROVIDER
  value: "efa"
- name: FI_EFA_USE_DEVICE_RDMA

@erezzarum (Contributor) Oct 31, 2025

Why is this needed?
I believe TP/PP uses NCCL; in that case we should use the tuner plugin to set the correct parameters rather than forcing anything.
Please test with NCCL_TUNER_PLUGIN=ofi when using our DLCs.
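
One way to read this suggestion, as a sketch only: keep the chart's existing GPU/EFA condition but set the NCCL tuner plugin instead of forcing libfabric settings. The closing `{{- end }}` and the indentation are assumptions about the surrounding template; the value `ofi` is taken from the comment above.

```yaml
{{- if and ( eq .Values.inference.accelerator "gpu" ) ( eq .Values.inference.awsEfa true ) }}
# Let the tuner plugin choose parameters; FI_* variables are left unset here.
- name: NCCL_TUNER_PLUGIN
  value: "ofi"
{{- end }}
```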

modelServer:
  image:
    # refer to https://github.com/aws/deep-learning-containers/blob/master/vllm/CHANGELOG.md#0110---2025-10-08
    repository: 763104351884.dkr.ecr.us-east-1.amazonaws.com/vllm

For the blueprints, we should use the public ECR and not force a region. We can give an example of how to optimize pulls by using the DLC private ECR repo, but the general examples should default to the public ECR.
https://gallery.ecr.aws/deep-learning-containers/vllm
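
A sketch of the values change this comment points toward, assuming the gallery page above maps to the usual `public.ecr.aws` pull path. The image tag is left out because none is named here.

```yaml
modelServer:
  image:
    # Assumed pull path for the public ECR gallery entry linked above;
    # confirm the exact URI and available tags on the gallery page.
    repository: public.ecr.aws/deep-learning-containers/vllm
```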

omrishiv (Contributor) commented Nov 7, 2025

We now have the public ECR for DLC on all the vLLM images; please use that. Also, please reopen this PR in awslabs/ai-on-eks-charts, as we now have a proper Helm repository for this.

3 participants