[Docs] Add API server tuning guide #5176

Merged
merged 10 commits into from
Apr 27, 2025

aylei
Collaborator

@aylei aylei commented Apr 10, 2025

Setups tested:

  • 4c8g (our internal API server also uses this setup)
  • 8c16g
  • 16c32g
  • 128c256g

The concurrency for other resource levels is calculated based on our code. None of the setups showed issues with 1,000 concurrent job launch requests (10 clients x 100 job launches per client) and 10 x 10K status requests.
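
For reference, the shape of that load (10 clients, 100 launches each) can be sketched as below. This is only an illustration of the fan-out: `run_load`, `run_client`, and `fake_submit` are hypothetical names, the submit function is a stub rather than a real SkyPilot call, and having each client submit its launches one after another is an assumption, not a description of the actual test harness.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Hypothetical signature: (client_id, job_index) -> some request identifier.
SubmitFn = Callable[[int, int], str]


def run_client(client_id: int, jobs_per_client: int, submit: SubmitFn) -> List[str]:
    """One client submits its launches sequentially (an assumption)."""
    return [submit(client_id, job) for job in range(jobs_per_client)]


def run_load(num_clients: int, jobs_per_client: int, submit: SubmitFn) -> List[str]:
    """Run all clients in parallel threads and collect every returned ID."""
    with ThreadPoolExecutor(max_workers=num_clients) as pool:
        futures = [
            pool.submit(run_client, client, jobs_per_client, submit)
            for client in range(num_clients)
        ]
        results: List[str] = []
        for future in futures:
            results.extend(future.result())
    return results


def fake_submit(client_id: int, job_index: int) -> str:
    """Stub standing in for a real launch request; returns a made-up ID."""
    return f"client{client_id}-job{job_index}"


if __name__ == "__main__":
    ids = run_load(num_clients=10, jobs_per_client=100, submit=fake_submit)
    print(f"submitted {len(ids)} launch requests")
```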

The 128c256g setup hit the ulimit issue caused by the high number of executors (#5174), which was resolved after tuning. This should be fixed in another PR, so we don't have to mention it in the doc here.

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

Queuing requests and polling status asynchronously
--------------------------------------------------

There is no limit on the number of queued requests. So in addition to increasing the allocated resources to raise the maximum concurrency, you can also submit requests with the ``--async`` flag and poll the status asynchronously to avoid blocking. For example:
Collaborator Author

@aylei aylei Apr 11, 2025

The queue length is verified in #5175


sky api cancel <request_id>
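
The submit-then-poll pattern described in this section could also be scripted. The following is a minimal sketch, not the doc's own example: the helper names (`run_cli`, `submit_async`, `request_status`, `cancel_request`) are hypothetical, and the use of `sky jobs launch -y` and `sky api status` is an assumption about the CLI; only the ``--async`` flag and `sky api cancel <request_id>` appear in the text above. The command runner is injectable so the flow can be exercised without an API server.

```python
import subprocess
from typing import Callable, List, Sequence

# A runner takes an argv list and returns the command's stdout.
Runner = Callable[[Sequence[str]], str]


def run_cli(argv: Sequence[str]) -> str:
    """Run a command and return its stdout; raises on a non-zero exit code."""
    result = subprocess.run(list(argv), check=True, capture_output=True, text=True)
    return result.stdout


def submit_async(yaml_path: str, count: int, runner: Runner = run_cli) -> List[str]:
    """Queue `count` launches without waiting for any of them to finish.

    Assumption: `sky jobs launch -y --async <yaml>` returns right after the
    request is queued.
    """
    return [
        runner(["sky", "jobs", "launch", "-y", "--async", yaml_path])
        for _ in range(count)
    ]


def request_status(runner: Runner = run_cli) -> str:
    """Fetch the status of queued requests (assumes `sky api status`)."""
    return runner(["sky", "api", "status"])


def cancel_request(request_id: str, runner: Runner = run_cli) -> str:
    """Cancel one queued or running request by its ID."""
    return runner(["sky", "api", "cancel", request_id])
```

Because nothing blocks on an individual launch, a client can queue many requests and then call `request_status` on whatever schedule suits it.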

Avoid concurrent logs requests
Collaborator Author

We can remove this chapter after the log optimization is completed.

@aylei aylei marked this pull request as ready for review April 11, 2025 10:55
@Michaelvll Michaelvll removed the request for review from romilbhardwaj April 17, 2025 02:17
Collaborator

@Michaelvll Michaelvll left a comment

Thanks @aylei! I like the doc. Left some comments.

aylei and others added 3 commits April 18, 2025 16:28
Signed-off-by: Aylei <[email protected]>
Signed-off-by: Aylei <[email protected]>
@aylei aylei requested a review from Michaelvll April 18, 2025 10:16
Signed-off-by: Aylei <[email protected]>
@aylei
Collaborator Author

aylei commented Apr 21, 2025

@Michaelvll @concretevitamin ping for another look, thanks!

Member

@concretevitamin concretevitamin left a comment

Thanks @aylei, some comments.

Member

@concretevitamin concretevitamin left a comment

Reminder on the few open items before merging ;)

@aylei
Collaborator Author

aylei commented Apr 27, 2025

> Reminder on the few open items before merging ;)

Thanks! I've been deep in log optimization these days. I just fixed all the issues we discussed, merging.

@aylei aylei merged commit 1bfa89c into master Apr 27, 2025
22 checks passed
@aylei aylei deleted the apiserver-tuning branch April 27, 2025 14:12
3 participants