[Docs] Add API server tuning guide #5176
base: master
Signed-off-by: Aylei <[email protected]>
Queuing requests and polling status asynchronously
--------------------------------------------------

There is no limit on the number of queued requests. So in addition to increasing the allocated resources to improve the maximum concurrency, you can also submit requests with ``--async`` flag and poll the status asynchronously to avoid blocking. For example:
The queue length is verified in #5175

sky api cancel <request_id>

Avoid concurrent logs requests
We can remove this chapter once the log optimization is completed.
Thanks @aylei! I like the doc. Left some comments.
limits:
  cpu: "4"
  memory: "8Gi"
Do we have to set it? Or do we just want to reduce the chance that the API server is killed by k8s for going slightly out of memory?
Explained in the latter dropdown
d9c95c7e-d248-4a7f-b72e-636511405357 alice sky.jobs.launch a few secs ago PENDING
767182fd-0202-4ae5-b2d7-ddfabea5c821 alice sky.jobs.launch a few secs ago PENDING
5667cff2-e953-4b80-9e5f-546cea83dc59 alice sky.jobs.launch a few secs ago RUNNING
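
When polling asynchronously, a listing like the one above may need to be filtered in a script, for instance to collect the IDs of requests that are still queued. The following is a minimal sketch, assuming each row has the layout shown (request ID, user, request name, relative creation time, status). The names `RequestRow` and `parse_rows` are hypothetical helpers for illustration and are not part of SkyPilot.

```python
import re
from typing import List, NamedTuple


class RequestRow(NamedTuple):
    """One row of the request listing (hypothetical helper type)."""
    request_id: str
    user: str
    name: str
    status: str


# Assumed row layout: UUID, user, request name, free-form "created" text,
# and an upper-case status as the last column.
_ROW = re.compile(
    r"^(?P<id>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})\s+"
    r"(?P<user>\S+)\s+(?P<name>\S+)\s+.*?\s+(?P<status>[A-Z]+)\s*$"
)


def parse_rows(text: str) -> List[RequestRow]:
    """Parse listing text into rows, skipping lines that do not match."""
    rows = []
    for line in text.splitlines():
        match = _ROW.match(line.strip())
        if match:
            rows.append(
                RequestRow(match["id"], match["user"], match["name"],
                           match["status"]))
    return rows


SAMPLE = """\
d9c95c7e-d248-4a7f-b72e-636511405357 alice sky.jobs.launch a few secs ago PENDING
767182fd-0202-4ae5-b2d7-ddfabea5c821 alice sky.jobs.launch a few secs ago PENDING
5667cff2-e953-4b80-9e5f-546cea83dc59 alice sky.jobs.launch a few secs ago RUNNING
"""

# IDs of requests that are still waiting in the queue.
pending = [row.request_id for row in parse_rows(SAMPLE) if row.status == "PENDING"]
print(pending)
```

Each collected ID could then be passed to the log and cancel commands discussed in this thread.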
Add a title: Check logs for a request
$ sky api logs <request_id>

If the request is stuck according to the log, e.g. retrying to launch VMs that are out of stock, you can cancel the request with:
Add a title: Cancel a request
Co-authored-by: Zhanghao Wu <[email protected]>
@Michaelvll @concretevitamin ping for another look, thanks!
Thanks @aylei, some comments.
* ``Long-running request``: request that takes long time and more resources to run, including ``launch``, ``exec``, ``jobs.launch``, etc.
* ``Short-running request``: request that takes short time or less resources to run, including ``status``, ``logs``, etc.
Suggested change:
* ``Long-running requests``: requests that take longer time and more resources to run, including ``launch``, ``exec``, ``jobs.launch``, etc.
* ``Short-running requests``: requests that take shorter time or less resources to run, including ``status``, ``logs``, etc.
.. note::

   Though a task (or job) can run for any length of time, concurrent tasks do not occupy the concurrency, because once a task is submitted to the cluster, it is detached and no longer takes any resources off the API server.
I think we should rephrase this somewhat, and pull it out of note box. How about something like:
Requests are queued and processed by the API server. Therefore, they only take resources off the API server when they are in queue or being processed. Once requests are processed and remote clusters start doing real work, they no longer require API server's resources or count against its concurrency limit.
For example, long-running requests for ``launch`` and ``exec`` no longer take resources off the API server once a cluster has been provisioned, or once a job has been submitted to a cluster, respectively.
Please check accuracy.

.. note::

   If you specify resources lower than the minimum recommended resources (4 CPUs with 8GB of memory) for team usage, an error will be raised on ``helm upgrade``. You can specify ``--set apiService.skipResourcesCheck=true`` to skip the check if performance and stability are not an issue for your scenario.
It's a bit confusing that " (4 CPUs with 8GB of memory)" is mentioned as our rec settings, but snippet above doesn't reflect this?
We should mention the rec setting in a 1-sentence paragraph.
Queuing requests and polling status asynchronously
--------------------------------------------------

There is no limit on the number of queued requests, i.e. despite increasing the allocated resources to improve the maximum concurrency, you can also submit requests with :ref:`async<async>` (``--async``) and poll the status asynchronously to avoid blocking. For example:
Suggested change:
There is no limit on the number of queued requests. To avoid request blocking, you can either (1) allocate more resources to increase the maximum concurrency (described above), or (2) :ref:`submit requests asynchronously <async>` (``--async``) and poll the status asynchronously.
For example:
Do we mean this?
I still find this revised section & the sec title confusing. What do we really want to say in this section? Lmk and I can try to rephrase.
It feels like "Use asynchronous requests as much as possible"?
767182fd-0202-4ae5-b2d7-ddfabea5c821 alice sky.jobs.launch a few secs ago PENDING
5667cff2-e953-4b80-9e5f-546cea83dc59 alice sky.jobs.launch a few secs ago RUNNING

Check logs for a request
Suggested change:
Checking the logs of a request
# Replace <request_id> with the actual request id from the ID column
$ sky api logs <request_id>

Cancel a request
Suggested change:
Canceling a request
Avoid concurrent logs requests
------------------------------

If you run ``sky logs`` to tail the logs of a task, the log tailing will keep taking off the resources of the API server as long as the task being tailed is running. So concurrent log requests will occupy the concurrency and make other requests to be delayed.
Suggested change:
If you run ``sky logs`` to tail the logs of a task, the log tailing will keep taking resources off the API server as long as the task being tailed is still running. Thus, concurrent log requests will occupy the concurrency limit and potentially delay other requests.
In this section, both 'task' and 'job' are used; can we keep one?
Setups tested:
The concurrency for the other resource levels is calculated based on our code. None of the setups showed issues with 1000 concurrent jobs launch requests (10 clients x 100 jobs launch per client) and 10 * 10K status requests.
Tested (run the relevant ones):
- `bash format.sh`
- `/smoke-test` (CI) or `pytest tests/test_smoke.py` (local)
- `/smoke-test -k test_name` (CI) or `pytest tests/test_smoke.py::test_name` (local)
- `/quicktest-core` (CI) or `pytest tests/smoke_tests/test_backward_compat.py` (local)