Memory pressure safety valve #1103
Conversation
@@ -7,6 +7,8 @@
 from inference.core.managers.base import Model, ModelManager
 from inference.core.managers.decorators.base import ModelManagerDecorator
 from inference.core.managers.entities import ModelDescription
+from inference.core.env import MEMORY_FREE_THRESHOLD
+import torch
torch is not always installed, so it shouldn't be imported directly
Fixed, moved it inside the function.
@@ -141,3 +143,15 @@ def _resolve_queue_id(
         self, model_id: str, model_id_alias: Optional[str] = None
     ) -> str:
         return model_id if model_id_alias is None else model_id_alias
+
+    def memory_pressure_detected(self) -> bool:
I would feature flag this and import torch locally here
Not sure how this would work in a multi-GPU env, but we can probably wait until someone actually uses it like that (it seems like we are probing the default device here).
Fixed this.
(The value of MEMORY_FREE_THRESHOLD is 0 by default, so the memory pressure check is not performed by default.)
ok
Thanks for the review @PawelPeczek-Roboflow. I now import torch within the memory pressure checking function. The memory pressure check only occurs when the environment variable MEMORY_FREE_THRESHOLD is non-zero (0 is the default value, so it will not trigger by default). I tested the functionality by building the GPU image locally and verifying that the CUDA memory is reported and that the memory pressure check behaves as expected. I ran the integration tests here: https://github.com/roboflow/inference/actions/runs/14063849882
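For readers following along, here is a minimal sketch of what such a check can look like; the exact body merged in this PR may differ, but it illustrates the two points raised in review: torch is imported lazily inside the method, and the whole check is skipped when MEMORY_FREE_THRESHOLD is 0.

```python
def memory_pressure_detected(self) -> bool:
    """Return True when the fraction of free CUDA memory drops below MEMORY_FREE_THRESHOLD."""
    if not MEMORY_FREE_THRESHOLD:
        # Feature-flagged off: MEMORY_FREE_THRESHOLD defaults to 0.
        return False
    try:
        import torch  # imported locally so torch stays an optional dependency

        if not torch.cuda.is_available():
            return False
        # mem_get_info() returns (free_bytes, total_bytes) for the current CUDA device.
        free, total = torch.cuda.mem_get_info()
        return free / total < MEMORY_FREE_THRESHOLD
    except Exception:
        # The safety valve should never break model loading if the probe fails.
        return False
```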
Description
We use MAX_ACTIVE_MODELS to manage the LRU cache when new models are loaded.
This is a static number and is not reliable when dealing with many heterogeneous (different-sized) models.
This PR cleans the LRU cache until at least a threshold fraction of memory is free before trying to load a new model (see the sketch below).
Currently this only works for CUDA-enabled devices such as GPUs.
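As a rough illustration of where this hooks in (a sketch only: the `_key_queue` attribute and the exact eviction calls are assumptions modeled on the existing fixed-size cache decorator, not taken verbatim from this PR), the manager keeps evicting the least-recently-used model while memory pressure is detected, then loads the new model as usual.

```python
def add_model(self, model_id: str, api_key: str, model_id_alias: str = None) -> None:
    # Evict least-recently-used models until enough GPU memory is free,
    # then fall through to the normal load path.
    while self._key_queue and self.memory_pressure_detected():
        lru_model_id = self._key_queue.popleft()
        super().remove(lru_model_id)
    super().add_model(model_id, api_key, model_id_alias=model_id_alias)
```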
How has this change been tested, please provide a testcase or example of how you tested the change?
Testing locally currently.
Any specific deployment considerations
For example, documentation changes, usability, usage/costs, secrets, etc.
Docs