
Memory pressure safety valve #1103

Merged

Conversation

bigbitbus
Contributor

@bigbitbus bigbitbus commented Mar 23, 2025

Description

We use MAX_ACTIVE_MODELS to manage the LRU cache when new models are loaded.
This is a static number, and it is not reliable when dealing with many models of heterogeneous (different) sizes.

This PR cleans the LRU cache until at least a threshold fraction of memory is free before attempting to load a new model.

Currently this only works on CUDA-enabled devices (GPUs).
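
For illustration, a minimal sketch of the kind of check this PR describes, assuming torch.cuda.mem_get_info is available on the device; the function name matches the one added in the diff below, but it is written here as a standalone function and the body is illustrative rather than the merged implementation:

import os

# In the PR this value comes from inference.core.env; 0 (the default) disables the check.
MEMORY_FREE_THRESHOLD = float(os.getenv("MEMORY_FREE_THRESHOLD", "0"))

def memory_pressure_detected() -> bool:
    # Sketch only: report pressure when less than MEMORY_FREE_THRESHOLD of CUDA memory is free.
    if MEMORY_FREE_THRESHOLD == 0:
        return False  # feature disabled by default
    try:
        import torch  # imported locally because torch is not always installed
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    free_bytes, total_bytes = torch.cuda.mem_get_info()  # (free, total) bytes for the current device
    return free_bytes / total_bytes < MEMORY_FREE_THRESHOLD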

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

How has this change been tested? Please provide a test case or an example of how you tested the change.

Tested locally so far.

Any specific deployment considerations

For example, documentation changes, usability, usage/costs, secrets, etc.

Docs

  • Docs updated? What were the changes:

@@ -7,6 +7,8 @@
 from inference.core.managers.base import Model, ModelManager
 from inference.core.managers.decorators.base import ModelManagerDecorator
 from inference.core.managers.entities import ModelDescription
+from inference.core.env import MEMORY_FREE_THRESHOLD
+import torch
Collaborator

torch is not always installed, so it shouldn't be imported directly.

Contributor Author

Fixed, moved it inside the function.

@@ -141,3 +143,15 @@ def _resolve_queue_id(
         self, model_id: str, model_id_alias: Optional[str] = None
     ) -> str:
         return model_id if model_id_alias is None else model_id_alias

+    def memory_pressure_detected(self) -> bool:
Collaborator

I would feature-flag this and import torch locally here.

Collaborator
Not sure how this would work in a multi-GPU environment, but we can probably wait until someone actually uses it like that (it seems we are probing the default device here).

Contributor Author
Fixed this.

Contributor Author

(The value of MEMORY_FREE_THRESHOLD is 0 by default, so memory pressure is not checked unless it is set.)

Collaborator

ok

@bigbitbus bigbitbus marked this pull request as ready for review March 25, 2025 15:28
@bigbitbus
Contributor Author

Thanks for the review @PawelPeczek-Roboflow

I now import torch within the memory pressure checking function.

The memory pressure check only occurs when the environment variable MEMORY_FREE_THRESHOLD is non-zero (0 is the default value, so the check does not trigger by default).

I tested the functionality by building the GPU image locally and verifying that the CUDA memory is reported and that the memory pressure function works as expected (it returns True under high memory pressure, triggering model LRU cache eviction).
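
For illustration, a hypothetical call site for this gate before a new model is loaded; the cache and eviction helper names below are made up, not the PR's actual code:

# Hypothetical: keep evicting least-recently-used models while memory pressure persists.
# With MEMORY_FREE_THRESHOLD=0 (the default) memory_pressure_detected() is never True,
# so this loop is a no-op.
while memory_pressure_detected() and model_cache_is_not_empty():  # assumed helpers
    evict_least_recently_used_model()  # assumed helper, not from the PR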

I ran the integration tests here: https://github.com/roboflow/inference/actions/runs/14063849882

@grzegorz-roboflow grzegorz-roboflow merged commit 1a14322 into main Mar 27, 2025
30 checks passed
@grzegorz-roboflow grzegorz-roboflow deleted the feat/evict-models-when-cuda-memory-free-below-threshold branch March 27, 2025 13:37