
Conversation

Contributor

@saintstack saintstack commented Jun 17, 2025

Make the scale-up of joshua run more aggressively. It was anemic, starting only 5 pods per 'session'. Now we will start up to 100 per session (i.e. each time the script runs -- currently every minute).

ps, etc., are helpful for navigating/debugging a running pod.

Here is a log excerpt of these fixes in operation when a big job is in place:

+ batch_size=5
+ max_jobs=10000
+ check_delay=60
+ use_k8s_ttl_controller=false
+ restart_agents_on_boot=false
++ cat /var/run/secrets/kubernetes.io/serviceaccount/namespace
+ namespace=default
+ export AGENT_NAME=joshua-rhel9-agent
+ AGENT_NAME=joshua-rhel9-agent
+ export AGENT_TAG=joshua-agent:latest
+ AGENT_TAG=joshua-agent:latest
+ export FDB_CLUSTER_FILE=/etc/foundationdb/fdb.cluster
+ FDB_CLUSTER_FILE=/etc/foundationdb/fdb.cluster
+ '[' false == true ']'
+ true
+ '[' false == false ']'
++ kubectl get jobs -n default --no-headers
++ grep -E -e '^joshua-rhel9-agent-[0-9]+(-[0-9]+)?\s'
++ awk '$3 == "1/1" {print $1}'
++ echo joshua-rhel9-agent
++ awk -F- '{print NF}'
+ num_hyphen_fields_in_agent_name=3
+ job_prefix_fields=4
++ kubectl get pods -n default --no-headers
++ grep -E '^joshua-rhel9-agent-[0-9]+(-[0-9]+)?-'
++ grep -E -e Completed -e Error
++ cut -f 1-4 -d -
++ true
+ '[' '!' -f /etc/foundationdb/fdb.cluster ']'
++ python3 /tools/ensemble_count.py -C /etc/foundationdb/fdb.cluster
+ num_ensembles=43369
+ echo '43369 ensembles in the queue (global)'
43369 ensembles in the queue (global)
++ kubectl get jobs -n default -o 'jsonpath={range .items[?(@.status.active > 0)]}{.metadata.name}{"\n"}{end}'
++ grep -Ec '^joshua-(rhel9-)?agent-[0-9]+(-[0-9]+)?$'
+ num_all_active_joshua_jobs=4375
+ echo '4375 total active joshua jobs any type are running. Global max_jobs: 10000.'
4375 total active joshua jobs any type are running. Global max_jobs: 10000.
+ new_jobs=0
+ '[' 43369 -gt 0 ']'
+ '[' 4375 -lt 10000 ']'
++ date +%y%m%d%H%M%S
+ current_timestamp=250617152422
+ slots_available_globally=5625
+ num_to_clear_queue=8674
+ num_to_attempt_this_cycle=5625
+ '[' -n 100 ']'
+ '[' 100 -lt 5625 ']'
+ num_to_attempt_this_cycle=100
+ '[' 100 -lt 5 ']'
+ '[' 100 -gt 5625 ']'
+ actual_new_jobs_for_this_scaler=100
+ '[' 100 -gt 0 ']'
+ new_jobs=100
+ idx=0
+ '[' 100 -gt 0 ']'
+ echo 'Starting 100 jobs'
Starting 100 jobs
+ '[' 0 -lt 100 ']'
+ '[' -e /tmp/joshua-agent.yaml ']'
+ rm -f /tmp/joshua-agent.yaml
+ i=0
+ '[' 0 -lt 5 ']'
+ export JOBNAME_SUFFIX=250617152422-0
+ JOBNAME_SUFFIX=250617152422-0
+ echo '=== Adding 250617152422-0 ==='
=== Adding 250617152422-0 ===
+ envsubst
+ echo ---
+ (( idx++ ))
+ (( i++ ))
+ '[' 1 -ge 100 ']'
"/tmp/o.log" 1250L, 32929B

@saintstack saintstack requested review from jzhou77 and spraza June 17, 2025 15:13
# Determine how many jobs this scaler instance will attempt to start in this cycle.
# We want to be aggressive but also bounded.
# Option 1: How many jobs would we need to clear the queue in one go (using batch_size)?
num_to_clear_queue=$(( (num_ensembles + batch_size - 1) / batch_size ))
Contributor

This is confusing to me. num_to_clear_queue seems to mean how many batches will be needed.
However, below there is a num_to_clear_queue < slots_available_globally comparison, where slots_available_globally is a number of jobs -- so aren't we comparing batches to jobs?

Contributor Author

It's confusing. Let me try to clean it up.

@saintstack
Contributor Author

Address review comments. Clean up bucket vs. job. Trying to explain them better in the script.
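
For illustration, here is a minimal sketch of what that bucket-vs-job cleanup could look like, keeping every sizing quantity in job units and reserving batch_size purely for manifest batching. This is a hypothetical reconstruction, not necessarily the code that was merged:

# Hypothetical consistent-units sizing (a sketch, not the merged code).
# Every quantity here counts jobs; batch_size only controls how many Job
# manifests are concatenated per kubectl apply and never enters the sizing.
slots_available_globally=$(( max_jobs - num_all_active_joshua_jobs ))
num_jobs_wanted=$num_ensembles   # assume at most one useful new job per queued ensemble
if [ "$num_jobs_wanted" -gt "$slots_available_globally" ]; then
  num_jobs_wanted=$slots_available_globally
fi
if [ "$num_jobs_wanted" -gt "$max_new_jobs_per_cycle" ]; then
  num_jobs_wanted=$max_new_jobs_per_cycle
fi
new_jobs=$num_jobs_wanted

With jobs compared to jobs throughout, the batches-versus-jobs mismatch the reviewer pointed out cannot arise.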

@saintstack
Contributor Author

I deployed the new changes. Seems to be working fine.

Contributor

@spraza spraza left a comment

LGTM

@saintstack
Contributor Author

@jzhou77 What do you think?

Contributor

@jzhou77 jzhou77 left a comment

LGTM

@jzhou77 jzhou77 merged commit ee3b5d6 into FoundationDB:main Jun 24, 2025
2 checks passed
