Update requester template to work with latest FMA #216
manoelmarques wants to merge 1 commit into llm-d-incubation:main
Conversation
Force-pushed from 1051709 to 24c0a8c
rubambiza
left a comment
LGTM. Leaving some general comments without explicit approval, to make sure others get a chance to weigh in.
VLLM_LOGGING_LEVEL: DEBUG
VLLM_NIXL_SIDE_CHANNEL_PORT: "5600"
VLLM_SERVER_DEV_MODE: "1"
VLLM_USE_V1: "1"
I presume the requester will always be launched on GPU-enabled nodes. However, in case it is launched on CPU, I wanted to flag that there is an upcoming PR from Jun that will set the VLLM_CPU_KV_CACHE_SPACE variable to keep the launcher pod from being OOM-killed. This is just an FYI in case we need to make changes in the near future.
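For context, a minimal sketch of what that upcoming change might add to the launcher pod's environment. The value is illustrative, not taken from Jun's PR; VLLM_CPU_KV_CACHE_SPACE caps the CPU KV cache size in GiB:

```yaml
env:
  # Cap vLLM's CPU KV cache so the pod stays under its memory
  # limit and is not OOM-killed. Value is in GiB (illustrative).
  - name: VLLM_CPU_KV_CACHE_SPACE
    value: "4"
```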
command:
  - /bin/bash
  - -c
image: ghcr.io/llm-d-incubation/llm-d-fast-model-actuation/launcher:latest
@aavarghese Just a general pondering: do I understand correctly that the tag the launcher images use follows whatever semver we are using for a (test) release that is made available to llmd-benchmark? 👇
If so, then I think whatever is output here should not necessarily be latest, right?
It would be great to make sure that all these discussions we are going through process-wise are actually useful.
CC: @diegocastanibm
I think latest is the best default right now?! And when Manoel runs the benchmark with a stable FMA release or release candidate, he can use values-requester.yaml to specify the launcher/requester tag, which will override latest...
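A hedged sketch of what such an override in values-requester.yaml could look like. The key names and the tag below are assumptions for illustration, not taken from the chart:

```yaml
# Hypothetical values-requester.yaml override pinning a release tag
# instead of the default "latest". Key names are assumed, not
# confirmed against the actual chart.
launcher:
  image:
    repository: ghcr.io/llm-d-incubation/llm-d-fast-model-actuation/launcher
    tag: v0.2.0  # example release/RC tag
```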
I lack understanding of the FMA approach. It appears to me that the modelServerConfig is similar in many ways to a prefill or decode pod; I see these defined as well. Could you explain the relationship, or point me to a document I could look at?
@kalantar We are working on updating the documentation for FMA as it evolves. For now, you can get an overview in this open PR: https://github.com/rubambiza/llm-d-fast-model-actuation/blob/202dc3691615f9677de9578d11e0b470815ee33d/README.md
Force-pushed from a7176d5 to f2b4ca8
Co-authored-by: aavarghese <avarghese@us.ibm.com>
Co-authored-by: manoelmarques <manoel.marques@ibm.com>
Signed-off-by: manoelmarques <manoel.marques@ibm.com>
Signed-off-by: aavarghese <aavarghese@us.ibm.com>
Added new resources to the requester template. They use custom CRDs that need to be pre-installed separately:
- InferenceServerConfig: https://raw.githubusercontent.com/llm-d-incubation/llm-d-fast-model-actuation/main/config/crd/fma.llm-d.ai_inferenceserverconfigs.yaml
- LauncherConfig: https://raw.githubusercontent.com/llm-d-incubation/llm-d-fast-model-actuation/main/config/crd/fma.llm-d.ai_launcherconfigs.yaml
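Assuming a standard kubectl workflow (the exact install procedure is not specified in this PR), the CRDs could be pre-installed with something like:

```
# Install the FMA CRDs before deploying the requester template
kubectl apply -f https://raw.githubusercontent.com/llm-d-incubation/llm-d-fast-model-actuation/main/config/crd/fma.llm-d.ai_inferenceserverconfigs.yaml
kubectl apply -f https://raw.githubusercontent.com/llm-d-incubation/llm-d-fast-model-actuation/main/config/crd/fma.llm-d.ai_launcherconfigs.yaml
```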