
Conversation

Contributor

@saintstack saintstack commented Jun 17, 2025

Make the scale-up of joshua run more aggressively. It was anemic, starting only 5 pods per 'session'. Now we will start up to 100 per session (i.e. each time the script runs -- currently every minute).

ps, etc., are helpful for navigating/debugging a running pod.

Here is a log excerpt of these fixes in operation when a big job is in place:

+ batch_size=5
+ max_jobs=10000
+ check_delay=60
+ use_k8s_ttl_controller=false
+ restart_agents_on_boot=false
++ cat /var/run/secrets/kubernetes.io/serviceaccount/namespace
+ namespace=default
+ export AGENT_NAME=joshua-rhel9-agent
+ AGENT_NAME=joshua-rhel9-agent
+ export AGENT_TAG=joshua-agent:latest
+ AGENT_TAG=joshua-agent:latest
+ export FDB_CLUSTER_FILE=/etc/foundationdb/fdb.cluster
+ FDB_CLUSTER_FILE=/etc/foundationdb/fdb.cluster
+ '[' false == true ']'
+ true
+ '[' false == false ']'
++ kubectl get jobs -n default --no-headers
++ grep -E -e '^joshua-rhel9-agent-[0-9]+(-[0-9]+)?\s'
++ awk '$3 == "1/1" {print $1}'
++ echo joshua-rhel9-agent
++ awk -F- '{print NF}'
+ num_hyphen_fields_in_agent_name=3
+ job_prefix_fields=4
++ kubectl get pods -n default --no-headers
++ grep -E '^joshua-rhel9-agent-[0-9]+(-[0-9]+)?-'
++ grep -E -e Completed -e Error
++ cut -f 1-4 -d -
++ true
+ '[' '!' -f /etc/foundationdb/fdb.cluster ']'
++ python3 /tools/ensemble_count.py -C /etc/foundationdb/fdb.cluster
+ num_ensembles=43369
+ echo '43369 ensembles in the queue (global)'
43369 ensembles in the queue (global)
++ kubectl get jobs -n default -o 'jsonpath={range .items[?(@.status.active > 0)]}{.metadata.name}{"\n"}{end}'
++ grep -Ec '^joshua-(rhel9-)?agent-[0-9]+(-[0-9]+)?$'
+ num_all_active_joshua_jobs=4375
+ echo '4375 total active joshua jobs any type are running. Global max_jobs: 10000.'
4375 total active joshua jobs any type are running. Global max_jobs: 10000.
+ new_jobs=0
+ '[' 43369 -gt 0 ']'
+ '[' 4375 -lt 10000 ']'
++ date +%y%m%d%H%M%S
+ current_timestamp=250617152422
+ slots_available_globally=5625
+ num_to_clear_queue=8674
+ num_to_attempt_this_cycle=5625
+ '[' -n 100 ']'
+ '[' 100 -lt 5625 ']'
+ num_to_attempt_this_cycle=100
+ '[' 100 -lt 5 ']'
+ '[' 100 -gt 5625 ']'
+ actual_new_jobs_for_this_scaler=100
+ '[' 100 -gt 0 ']'
+ new_jobs=100
+ idx=0
+ '[' 100 -gt 0 ']'
+ echo 'Starting 100 jobs'
Starting 100 jobs
+ '[' 0 -lt 100 ']'
+ '[' -e /tmp/joshua-agent.yaml ']'
+ rm -f /tmp/joshua-agent.yaml
+ i=0
+ '[' 0 -lt 5 ']'
+ export JOBNAME_SUFFIX=250617152422-0
+ JOBNAME_SUFFIX=250617152422-0
+ echo '=== Adding 250617152422-0 ==='
=== Adding 250617152422-0 ===
+ envsubst
+ echo ---
+ (( idx++ ))
+ (( i++ ))
+ '[' 1 -ge 100 ']'
"/tmp/o.log" 1250L, 32929B

@saintstack saintstack requested review from jzhou77 and spraza June 17, 2025 15:13
# Determine how many jobs this scaler instance will attempt to start in this cycle.
# We want to be aggressive but also bounded.
# Option 1: How many jobs would we need to clear the queue in one go (using batch_size)?
num_to_clear_queue=$(( (num_ensembles + batch_size - 1) / batch_size ))
Contributor

This is confusing to me. num_to_clear_queue seems to mean how many batches will be needed.
However, below there is a num_to_clear_queue < slots_available_globally comparison, where slots_available_globally is a number of jobs -- so aren't we comparing batches to jobs?

Contributor Author

It's confusing. Let me try to clean it up.

@saintstack
Contributor Author

Address review comments. Clean up bucket vs. job. Trying to explain them better in the script.
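
For illustration, here is a minimal sketch of what that bucket-vs-job cleanup could look like, keeping every sizing quantity in job units and reserving batch_size purely for manifest batching. This is a hypothetical reconstruction, not necessarily the code that was merged:

# Hypothetical consistent-units sizing (a sketch, not the merged code).
# Every quantity here counts jobs; batch_size only controls how many Job
# manifests are concatenated per kubectl apply and never enters the sizing.
slots_available_globally=$(( max_jobs - num_all_active_joshua_jobs ))
num_jobs_wanted=$num_ensembles   # assume at most one useful new job per queued ensemble
if [ "$num_jobs_wanted" -gt "$slots_available_globally" ]; then
  num_jobs_wanted=$slots_available_globally
fi
if [ "$num_jobs_wanted" -gt "$max_new_jobs_per_cycle" ]; then
  num_jobs_wanted=$max_new_jobs_per_cycle
fi
new_jobs=$num_jobs_wanted

With jobs compared to jobs throughout, the batches-versus-jobs mismatch the reviewer pointed out cannot arise.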

@saintstack
Contributor Author

I deployed the new changes. Seems to be working fine.

Contributor

@spraza spraza left a comment

LGTM

@saintstack
Contributor Author

@jzhou77 What do you think?

Contributor

@jzhou77 jzhou77 left a comment

LGTM

@jzhou77 jzhou77 merged commit ee3b5d6 into FoundationDB:main Jun 24, 2025
2 checks passed
