More aggressive scaling up and procps-ng to get ps, top, free, vmstat.... #118
Add procps-ng (ps, top, free, vmstat, etc.), which is helpful for navigating and debugging a running pod. Make the scale-up of joshua more aggressive. It was anemic, starting only 5 pods per 'session'. Now we will start up to 100 per session (i.e., each time the script runs, currently every minute).
k8s/agent-scaler/agent-scaler.sh
Outdated
# Determine how many jobs this scaler instance will attempt to start in this cycle.
# We want to be aggressive but also bounded.
# Option 1: How many jobs would we need to clear the queue in one go (using batch_size)?
num_to_clear_queue=$(( (num_ensembles + batch_size - 1) / batch_size ))
This is confusing to me. num_to_clear_queue seems to mean the number of batches needed.
However, below we have the comparison num_to_clear_queue < slots_available_globally, where slots_available_globally is a number of jobs. So are we comparing batches to jobs?
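To illustrate the units concern, here is a minimal sketch of one way to keep every quantity in jobs. It is not the code from agent-scaler.sh: the function name jobs_to_start and the parameter max_jobs_per_cycle are hypothetical, and it assumes each job consumes exactly one batch of batch_size ensembles, so that "batches needed" and "jobs needed" are the same number.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration only; not taken from agent-scaler.sh.
# Assumption: one job processes one batch of batch_size ensembles.
# Arguments (names beyond those in the diff are illustrative):
#   $1 num_ensembles             ensembles waiting in the queue
#   $2 batch_size                ensembles handled per job
#   $3 slots_available_globally  free job slots across the cluster
#   $4 max_jobs_per_cycle        upper bound on jobs started per script run
jobs_to_start() {
    local num_ensembles=$1
    local batch_size=$2
    local slots_available_globally=$3
    local max_jobs_per_cycle=$4

    # Ceiling division: jobs needed to clear the queue in one go.
    local jobs_needed=$(( (num_ensembles + batch_size - 1) / batch_size ))

    # Every value below is a count of jobs, so the comparisons are unit-consistent.
    local n=$jobs_needed
    if (( n > slots_available_globally )); then
        n=$slots_available_globally
    fi
    if (( n > max_jobs_per_cycle )); then
        n=$max_jobs_per_cycle
    fi
    if (( n < 0 )); then
        n=0
    fi
    echo "$n"
}
```

For example, `jobs_to_start 250 10 40 100` prints 25: the queue needs 25 jobs, and neither the 40 free slots nor the cap of 100 per run is the limiting factor.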
It's confusing. Let me try to clean it up.
Address review comments. Clean up bucket vs. job. Try to explain them better in the script.
I deployed the new changes. Seems to be working fine.
spraza
left a comment
LGTM
@jzhou77 What do you think?
jzhou77
left a comment
LGTM
Make the scale-up of joshua more aggressive. It was anemic, starting only 5 pods per 'session'. Now we will start up to 100 per session (i.e., each time the script runs, currently every minute).
ps, etc., are helpful for navigating and debugging a running pod.
Here is some log of these fixes in operation when a big job is in place: