Feature Area
Other
Problem Statement
The maxSleepingInstances field of a LauncherConfig is declared at line 32 of llm-d-fast-model-actuation/api/fma/v1alpha1/launcherconfig_types.go (commit da4a87a) as
    MaxSleepingInstances int32 `json:"maxSleepingInstances,omitempty"`
That field, together with the design of LauncherPopulationPolicy, allows the authors of these objects to limit the amount of main memory used by vllm instances. However, two important gaps remain.
(1) These limits are expressed as a number of vllm instances, not as an amount of main memory. The main memory used by a vllm instance can vary a lot, depending on the model and on other vllm parameters.
(2) These limits apply only within the scope of a single LauncherConfig. When multiple LauncherConfig objects exist, nothing currently bounds the total main memory used across all of them, other than, of course, the sum of the per-LauncherConfig limits, which I suspect is not enough control.
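To make both gaps concrete, here is a small illustrative sketch. It assumes, hypothetically, that every sleeping instance of a given LauncherConfig occupies a fixed amount of main memory. The type launcherConfigSketch, its PerInstanceMainMemGiB field, the helper worstCaseGiB, and the numbers used are all invented for illustration and are not part of the real API. Only MaxSleepingInstances mirrors an actual field. The sketch shows that the same instance-count limit can permit very different amounts of memory (gap 1), and that nothing caps the sum across configs (gap 2).

```go
package main

import "fmt"

// launcherConfigSketch is a HYPOTHETICAL stand-in for a LauncherConfig.
// Only MaxSleepingInstances mirrors a real field; PerInstanceMainMemGiB is
// an assumed, illustrative value and is NOT part of the actual API.
type launcherConfigSketch struct {
	Name                  string
	MaxSleepingInstances  int32
	PerInstanceMainMemGiB int64
}

// worstCaseGiB returns the main memory that could be occupied if every
// allowed sleeping instance of this config were present at once.
func worstCaseGiB(lc launcherConfigSketch) int64 {
	return int64(lc.MaxSleepingInstances) * lc.PerInstanceMainMemGiB
}

func main() {
	// Same count limit, very different per-instance memory (illustrative numbers).
	configs := []launcherConfigSketch{
		{Name: "small-model", MaxSleepingInstances: 4, PerInstanceMainMemGiB: 2},
		{Name: "large-model", MaxSleepingInstances: 4, PerInstanceMainMemGiB: 60},
	}

	var total int64
	for _, lc := range configs {
		w := worstCaseGiB(lc)
		total += w
		fmt.Printf("%s: limit=%d instances, worst case=%d GiB\n",
			lc.Name, lc.MaxSleepingInstances, w)
	}

	// Nothing in the current design bounds this cross-config sum.
	fmt.Printf("total across configs: %d GiB\n", total)
}
```

With these made-up numbers, both configs allow 4 sleeping instances, yet one could hold 8 GiB and the other 240 GiB, for 248 GiB in total with no single limit covering it.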
Proposed Solution
TBD
Alternatives Considered
No response
Willingness to Contribute
Yes, I can submit a PR
Additional Context
No response