Skip to content

update: add env variable for main container(vllm) when DRA is enabled for Intel-xe#189

Merged
kalantar merged 2 commits intollm-d-incubation:mainfrom
zdtsw:chore_dra_intel
Jan 29, 2026
Merged

update: add env variable for main container(vllm) when DRA is enabled for Intel-xe#189
kalantar merged 2 commits intollm-d-incubation:mainfrom
zdtsw:chore_dra_intel

Conversation

@zdtsw
Copy link
Copy Markdown
Contributor

@zdtsw zdtsw commented Jan 27, 2026

Description

  • currently when DRA is enabled it skip adding env VLLM_WORKER_MULTIPROC_METHOD for intel-xe

@zdtsw zdtsw changed the title update: add env variable for main container(vllm) when DRA is enabled for Intel update: add env variable for main container(vllm) when DRA is enabled for Intel-xe Jan 27, 2026
@kalantar
Copy link
Copy Markdown
Collaborator

@yuanwu2017 do you have any comments on the suggestion to add this environment variable as a default? I believe you contributed these originally. I am not familiar with what default env variable make sense.

@kalantar
Copy link
Copy Markdown
Collaborator

kalantar commented Jan 27, 2026

@zdtsw I don't think this is the root cause of llm-d/llm-d#620. Is modelservice being used? Remove any gpu resource request/limits from the values file and try without them (and set accelerator.type appropriately).

@yuanwu2017
Copy link
Copy Markdown
Contributor

yuanwu2017 commented Jan 28, 2026

@zdtsw I don't think this is the root cause of llm-d/llm-d#620. Is modelservice being used? Remove any gpu resource request/limits from the values file and try without them (and set accelerator.type appropriately).

PR518 and PR380 can fix it. Issue620 is caused by the llm-d-modelservice upgrade in llm-d. The type of accelerator changed from "intel" to "intel-i915". The DRA is enabling in llm-d-modelservice, so enabling the dra also can fix the issue620.
@zdtsw did discover a potential problem: using DRA caused the previous default environment variables to stop working, which could prevent vLLM from starting. Therefore, I added these environment variables back in PR380.

@yuanwu2017 do you have any comments on the suggestion to add this environment variable as a default? I believe you contributed these originally. I am not familiar with what default env variable make sense.

I think it is ok for adding a default envs for specific device. But I have not understood how this patch to fix it. If enabling the DRA device in values.yaml, these envs values should not work. @zdtsw @poussa

@zdtsw
Copy link
Copy Markdown
Contributor Author

zdtsw commented Jan 28, 2026

@zdtsw I don't think this is the root cause of llm-d/llm-d#620. Is modelservice being used? Remove any gpu resource request/limits from the values file and try without them (and set accelerator.type appropriately).

PR518 and PR380 can fix it. Issue620 is caused by the llm-d-modelservice upgrade in llm-d. The type of accelerator changed from "intel" to "intel-i915". The DRA is enabling in llm-d-modelservice, so enabling the dra also can fix the issue620. @zdtsw did discover a potential problem: using DRA caused the previous default environment variables to stop working, which could prevent vLLM from starting. Therefore, I added these environment variables back in PR380.

@yuanwu2017 do you have any comments on the suggestion to add this environment variable as a default? I believe you contributed these originally. I am not familiar with what default env variable make sense.

I think it is ok for adding a default envs for specific device. But I have not understood how this patch to fix it. If enabling the DRA device in values.yaml, these envs values should not work. @zdtsw @poussa

@kalantar Sorry for bad description in this PR.
Originally, I saw the open llm-d/llm-d#620 and started to look into the code in llm-d and later found this bug had been fixed in llm-d PR518 but the env variable is still missing due to the change a while ago when we split Intel-xe and Intel-i915 instead of old value Intel.
This PR is mainly to catch the env variable which is preferred by vllm for Intel-xe,
Thanks @yuanwu2017 for checking this up.

@poussa
Copy link
Copy Markdown
Contributor

poussa commented Jan 28, 2026

The env object is only for device plugins (accelerator object). In DRA, we don't have the env concept at all. Wait, that is true for the original DRA implementation. Since @kalantar dra refactoring was merged yesterday, I not sure anymore, but I think that is still the case.

Anyway, 1) this PR is against the old DRA implementation and 2) is not correct since it touches the device plugins, not DRA.

@zdtsw
Copy link
Copy Markdown
Contributor Author

zdtsw commented Jan 28, 2026

as for the env VLLM_WORKER_MULTIPROC_METHOD if it is needed or not:
https://vllm-dev.slack.com/archives/C07QP347J4D/p1769590188432399

@kalantar
Copy link
Copy Markdown
Collaborator

The env object is only for device plugins (accelerator object). In DRA, we don't have the env concept at all. Wait, that is true for the original DRA implementation. Since @kalantar dra refactoring was merged yesterday, I not sure anymore, but I think that is still the case.

Anyway, 1) this PR is against the old DRA implementation and 2) is not correct since it touches the device plugins, not DRA.

It should apply to DRA now too. As long as we are using the same keys.

Comment on lines +96 to +98
# @schema
# additionalProperties: true
# @schema
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this annotation needed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, it is left from my origin change before the dra refactor, let me remove it

@kalantar
Copy link
Copy Markdown
Collaborator

@zdtsw please also bump chart version in Chart.yaml and run make pre-commit-run to update the schema and make generate to update all the samples.

@zdtsw
Copy link
Copy Markdown
Contributor Author

zdtsw commented Jan 28, 2026

@zdtsw please also bump chart version in Chart.yaml and run make pre-commit-run to update the schema and make generate to update all the samples.

updated

@kalantar
Copy link
Copy Markdown
Collaborator

This looks good, let's resolve the conflict and we can merge.

- add VLLM_WORKER_MULTIPROC_METHOD: spawn for Intel-xe
- bump version
Signed-off-by: Wen Zhou <wenzhou@redhat.com>
Signed-off-by: Wen Zhou <wenzhou@redhat.com>
@zdtsw
Copy link
Copy Markdown
Contributor Author

zdtsw commented Jan 29, 2026

This looks good, let's resolve the conflict and we can merge.

thanks, it is rebased.

@kalantar kalantar merged commit c97db00 into llm-d-incubation:main Jan 29, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants