[Data] Move gauges out of streaming executor#60275
limarkdcunha wants to merge into ray-project:master
Conversation
@bveeramani please review this draft pull request. If the overall structure is good, I will take care of adding the tests. Thanks!
Code Review
This pull request effectively refactors the Prometheus metrics tracking for resource budgets out of StreamingExecutor and into a new ResourceAllocatorPrometheusCallback class. This is a good change that improves code organization and adheres to the single responsibility principle. The implementation in the new callback class is a correct translation of the original logic.
I have a couple of suggestions for the new ResourceAllocatorPrometheusCallback class to improve its design and consistency. One is about improving encapsulation by not accessing private members of StreamingExecutor directly. The other is a minor consistency improvement in how arguments are passed to function calls.
```python
topology = executor._topology
resource_manager = executor._resource_manager
dataset_id = executor._dataset_id

if topology is None or resource_manager is None:
    return

for i, op in enumerate(topology):
    tags = {
        "dataset": dataset_id,
        "operator": executor._get_operator_id(op, i),
    }
```
This callback accesses several private members of StreamingExecutor (_topology, _resource_manager, _dataset_id, and _get_operator_id). This breaks encapsulation and makes the callback tightly coupled to the implementation of StreamingExecutor.
To improve this, consider adding public properties or methods to StreamingExecutor to expose the necessary information. For example:
In StreamingExecutor:

```python
@property
def topology(self):
    return self._topology

# ... and so on for other members

def get_operator_id(self, op: PhysicalOperator, topology_index: int) -> str:
    # This can be a public method that wraps the private one.
    return self._get_operator_id(op, topology_index)
```

This would make the callback system more robust and maintainable.
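For illustration, here is a minimal sketch of how the callback could then consume only that public surface; the property and method names mirror the suggestion above and are assumptions, not the final API:

```python
# Hypothetical sketch: the callback reads only public members of the executor.
# `topology`, `resource_manager`, `dataset_id`, and `get_operator_id` are the
# assumed public counterparts of the private attributes used today.
class ResourceAllocatorPrometheusCallback:
    def on_execution_step(self, executor) -> None:
        topology = executor.topology
        resource_manager = executor.resource_manager
        if topology is None or resource_manager is None:
            return
        for i, op in enumerate(topology):
            tags = {
                "dataset": executor.dataset_id,
                "operator": executor.get_operator_id(op, i),
            }
            # ...update the Prometheus gauges for `op` using `tags`...
```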
I think this is valid, but I think it's okay if we address in a follow-up to keep this PR as just moving code around
python/ray/data/_internal/execution/resource_allocator_prometheus_callback.py
bveeramani left a comment
LGTM except for docstring nits.
| """ | ||
| Callback that monitors the StreamingExecutor and updates Prometheus | ||
| metrics related to resource allocation (CPU/GPU budgets, memory, etc.). | ||
| """ |
Nit: Ray follows Google docstring convention. For consistency with convention, could you write this as a one-line summary on the same line as the triple-quotes and add additional description below?
| """ | |
| Callback that monitors the StreamingExecutor and updates Prometheus | |
| metrics related to resource allocation (CPU/GPU budgets, memory, etc.). | |
| """ | |
| """Single-line summary on same line as triple-quotes. | |
| More description. | |
| """ |
| """ | ||
| Called by the executor after every scheduling loop step. | ||
| """ |
There was a problem hiding this comment.
| """ | |
| Called by the executor after every scheduling loop step. | |
| """ | |
| """Called by the executor after every scheduling loop step.""" |
```python
self._callbacks = [ResourceAllocatorPrometheusCallback()]
self._callbacks.extend(get_execution_callbacks(self._data_context))
```
This code special-cases ResourceAllocatorPrometheusCallback, and I think special cases will make the code harder to read.
In a follow-up PR, I think we should unify these.
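One possible shape for that follow-up, sketched under the assumption that get_execution_callbacks becomes the single registration point (how DataContext stores user callbacks is a placeholder here):

```python
# Hypothetical sketch: built-in callbacks are appended inside
# get_execution_callbacks, so StreamingExecutor no longer special-cases them.
def get_execution_callbacks(data_context):
    # Placeholder for however the real implementation stores user callbacks.
    callbacks = list(getattr(data_context, "execution_callbacks", []))
    callbacks.append(ResourceAllocatorPrometheusCallback())
    return callbacks

# StreamingExecutor.__init__ then reduces to:
#     self._callbacks = get_execution_callbacks(self._data_context)
```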
Hey, this is already tech debt, right? I'd like to avoid deferring part of it to some later point, so let's fix this in this PR itself. I'll spend some more time on it if that works for you.
I think the refactor will be a non-trivial change. To keep the diff for this PR small and easily reviewable, I think I'd prefer to land this as-is but address the refactor shortly in a follow-up. Just want to avoid a situation where we have a large PR with many different changes.
Here's an Issue describing the tech debt -- do you want to take a stab at it? #60279
Sure, that works. I will add a test for this new class and make minor changes per the previous comments, then take up the issue. You can assign it to me. Thanks!
Sweet. Would you mind commenting on the Issue so I can assign it to you?
Btw, lemme know when the docstring nits have been addressed on this PR, and I'll merge it.
@bveeramani I made the changes as discussed. Let me know if anything else is needed in this PR or if it's good to merge from my side. I'll start looking at the other issue; if possible, please assign it to me. Thanks!
```python
# Trigger callbacks to ensure resource gauges receive a final update.
for callback in self._callbacks:
    callback.on_execution_step(self)
```
I think we call after_execution_succeeds and after_execution_fails later during shutdown -- any reason we can't use those to ensure the gauges receive a final update? Seems unintuitive to call on_execution_step hooks during shutdown.
Yeah, that makes much more sense now that I think about it. Thanks, will make the changes and let you know.
@bveeramani I have updated it by wrapping on_execution_step inside after_execution_succeeds and after_execution_fails, so from the consumer's perspective they look like distinct hooks, as needed. Let me know if that makes sense, or I can look for another approach. Thanks!
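For reference, a minimal sketch of that wrapping, assuming the hook names from this thread (signatures and the omitted base class are approximations, not the merged code):

```python
# Hypothetical sketch: the per-step update doubles as the final flush, so the
# gauges are refreshed once more when execution ends, whatever the outcome.
class ResourceAllocatorPrometheusCallback:
    def on_execution_step(self, executor) -> None:
        self._update_gauges(executor)

    def after_execution_succeeds(self, executor) -> None:
        self._update_gauges(executor)  # final update on success

    def after_execution_fails(self, executor, error: Exception) -> None:
        self._update_gauges(executor)  # final update on failure

    def _update_gauges(self, executor) -> None:
        ...  # read the topology/resource manager and set the Prometheus gauges
```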
I closed this for now; as discussed, I will pick this up after #60279 is closed.
This PR moves the responsibility of tracking Prometheus metrics for resource budgets into a new class, ResourceAllocatorPrometheusCallback, in accordance with the single responsibility principle.
Closes #60269