[Data] Add PhysicalOperator.min_max_resource_usage_bounds
#52502
python/ray/data/_internal/execution/interfaces/physical_operator.py
PhysicalOperator.max_resource_usage → PhysicalOperator.min_max_resource_usage_bounds
def min_max_resource_usage_bounds(
    self,
) -> Tuple[ExecutionResources, ExecutionResources]:
    """Returns the min and max resources to start the operator and make progress.
Let's expand the docstring to say that these are derived from the operator's concurrency configuration multiplied by the resource requirements of a single task/actor.
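As a rough illustration of that derivation (a minimal sketch, not the code in this PR): `Resources`, `scale`, and `actor_pool_bounds` below are hypothetical stand-ins for `ExecutionResources` and the operator method, and the maximum is assumed to be unbounded.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Resources:
    """Hypothetical stand-in for ExecutionResources (not Ray's real class)."""

    cpu: float = 0.0
    gpu: float = 0.0

    def scale(self, factor: float) -> "Resources":
        # Multiply every resource dimension by the same factor.
        return Resources(cpu=self.cpu * factor, gpu=self.gpu * factor)


def actor_pool_bounds(
    per_actor: Resources, min_actors: int
) -> Tuple[Resources, Resources]:
    """Minimum = per-actor requirement x minimum pool size; maximum assumed unbounded."""
    min_usage = per_actor.scale(min_actors)
    max_usage = Resources(cpu=math.inf, gpu=math.inf)
    return min_usage, max_usage


# A GPU-only actor pool with at least 2 actors needs 2 GPUs and no CPUs to start.
min_usage, max_usage = actor_pool_bounds(Resources(cpu=0, gpu=1), min_actors=2)
```

With these assumed numbers the minimum contains no CPU at all, which matches the PR's goal of not reserving CPUs for a GPU-only actor pool.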
def base_resource_usage(self) -> ExecutionResources:
    """Returns the minimum amount of resources required for execution.
def min_max_resource_usage_bounds(
Sorry for the back and forth, but on second thought I've realized that min_max_resource_requirements might be a better fit here (will defer to you to decide).
min_max_resource_requirements sounds slightly better to me as well
    """Returns the minimum amount of resources required for execution.
def min_max_resource_usage_bounds(
    self,
) -> Tuple[ExecutionResources, ExecutionResources]:
Should be optional
How would we handle None in the resource manager? Would it be equivalent to the default implementation?
Yeah, it'd default to not knowing resource reqs
OK, I see that you're defaulting to [0, inf); that's fine too.
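For readers following this thread, a minimal sketch of what a [0, inf) default could look like. `Resources`, `Operator`, and `CustomOperator` are hypothetical names used only for illustration and are not Ray's actual classes; only the method name is taken from the PR.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Resources:
    """Hypothetical stand-in for ExecutionResources (not Ray's real class)."""

    cpu: float = 0.0
    gpu: float = 0.0
    object_store_memory: float = 0.0

    @classmethod
    def zero(cls) -> "Resources":
        return cls()

    @classmethod
    def inf(cls) -> "Resources":
        return cls(cpu=math.inf, gpu=math.inf, object_store_memory=math.inf)


class Operator:
    """Hypothetical base class; subclasses override the bounds when they know them."""

    def min_max_resource_usage_bounds(self) -> Tuple[Resources, Resources]:
        # Default: the requirements are unknown, so the bounds are [0, inf).
        return Resources.zero(), Resources.inf()


class CustomOperator(Operator):
    """A custom operator that does not override the method still works."""


lower, upper = CustomOperator().min_max_resource_usage_bounds()
```

Returning a pair rather than `None` means the caller never needs a special case for "unknown": clamping against [0, inf) leaves any reservation unchanged.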
python/ray/data/_internal/execution/operators/actor_pool_map_operator.py
def min_max_resource_usage_bounds(
    self,
) -> Tuple[ExecutionResources, ExecutionResources]:
    raise NotImplementedError
Let's make sure there's a default implementation (so that it doesn't break any custom operators)
This is the map op. There will only be two subclasses: the task pool and actor pool map ops.
See my other comment:
Responded in the thread for #52502 (comment)
def _min_resource_usage(self) -> ExecutionResources:
    # Make sure the reserved resources are at least to allow one task.
    return self.incremental_resource_usage()

def _max_resource_usage(self) -> ExecutionResources:
    if self._inputs_complete:
        # If the operator has already received all input data, we know it won't
        # launch more tasks. So, we only need to reserve resources for the tasks
        # that are currently running.
        num_cpus_per_task = self._ray_remote_args.get("num_cpus", 0)
        num_gpus_per_task = self._ray_remote_args.get("num_gpus", 0)
        object_store_memory_per_task = (
            self._metrics.obj_store_mem_max_pending_output_per_task or 0
        )
        resources = ExecutionResources.for_limits(
            cpu=num_cpus_per_task * self.num_active_tasks(),
            gpu=num_gpus_per_task * self.num_active_tasks(),
            object_store_memory=object_store_memory_per_task
            * self.num_active_tasks(),
        )
    else:
        resources = ExecutionResources.for_limits()

    return resources
Let's generalize this to be a base method in PhysicalOperator.
I don't think this is a sensible default for PhysicalOperator. The implementation makes sense for TaskPoolOperator, but I don't think we can assume it makes sense for all PhysicalOperator subclasses (e.g., the current all-to-all operator implementation)
min_resource_usage, max_resource_usage = op.min_max_resource_usage_bounds()
reserved_for_tasks = default_reserved.subtract(reserved_for_outputs)
reserved_for_tasks = reserved_for_tasks.max(min_resource_usage)
reserved_for_tasks = reserved_for_tasks.min(max_resource_usage)
This is much, much cleaner!
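A self-contained sketch of the clamping in the snippet above, for anyone who wants to try the arithmetic. `Resources` and `reserve_for_tasks` are hypothetical stand-ins, not Ray's `ExecutionResources` or resource manager; in particular, clamping the subtraction at zero is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """Hypothetical stand-in for ExecutionResources (not Ray's real class)."""

    cpu: float = 0.0
    gpu: float = 0.0

    def subtract(self, other: "Resources") -> "Resources":
        # Assumption of this sketch: never go below zero.
        return Resources(
            cpu=max(self.cpu - other.cpu, 0.0),
            gpu=max(self.gpu - other.gpu, 0.0),
        )

    def max(self, other: "Resources") -> "Resources":
        # Element-wise maximum (the builtin max is used inside the method).
        return Resources(cpu=max(self.cpu, other.cpu), gpu=max(self.gpu, other.gpu))

    def min(self, other: "Resources") -> "Resources":
        # Element-wise minimum.
        return Resources(cpu=min(self.cpu, other.cpu), gpu=min(self.gpu, other.gpu))


def reserve_for_tasks(
    default_reserved: Resources,
    reserved_for_outputs: Resources,
    min_usage: Resources,
    max_usage: Resources,
) -> Resources:
    """Clamp the task reservation into [min_usage, max_usage]."""
    reserved = default_reserved.subtract(reserved_for_outputs)
    reserved = reserved.max(min_usage)  # at least enough to make progress
    return reserved.min(max_usage)  # never more than the operator can use


# A task pool that has received all inputs and has two 1-CPU tasks still running.
reserved = reserve_for_tasks(
    default_reserved=Resources(cpu=4, gpu=1),
    reserved_for_outputs=Resources(),
    min_usage=Resources(cpu=1),
    max_usage=Resources(cpu=2, gpu=0),
)
```

In this example the default share of 4 CPUs and 1 GPU is trimmed to 2 CPUs and 0 GPUs, so the remainder stays available to other operators.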
0d4b23c to 8edacd4
Signed-off-by: Balaji Veeramani <[email protected]>
Add limit
Signed-off-by: Balaji Veeramani <[email protected]>
Update files
Signed-off-by: Balaji Veeramani <[email protected]>
Appease lint
Signed-off-by: Balaji Veeramani <[email protected]>
Fix test
Signed-off-by: Balaji Veeramani <[email protected]>
Address review comments
Signed-off-by: Balaji Veeramani <[email protected]>
Address review comments
Signed-off-by: Balaji Veeramani <[email protected]>
Update stuff
Signed-off-by: Balaji Veeramani <[email protected]>
6e9754f to 0babe2d
Signed-off-by: Balaji Veeramani <[email protected]>
        * min_actors,
    )

    return min_resource_usage, ExecutionResources.for_limits()
Suggested change:
- return min_resource_usage, ExecutionResources.for_limits()
+ return min_resource_usage, ExecutionResources.inf()
Signed-off-by: Balaji Veeramani <[email protected]>
## Why are these changes needed?

This change is necessary to ensure we don't over-reserve resources for operators. For example:

* If an actor-pool only uses GPU resources, we don't need to reserve CPU resources for it.
* If a task-pool has finished receiving inputs and launching tasks, we don't need to reserve more resources than required for the currently active tasks.

## Related issue number

## Checks

- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [ ] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

---------

Signed-off-by: Balaji Veeramani <[email protected]>
Signed-off-by: jhsu <[email protected]>
#52502 removed the cache in `update_reservation`. Now the low-resource warning will spam the console when cluster resources are low. Add a `log_once` check to fix this.

---------

Signed-off-by: Hao Chen <[email protected]>