@saintstack saintstack commented Jan 6, 2026

  • Fix should_run_ensemble to check completed runs instead of started count
  • Add AGENT_TIMEOUT environment variable support with graceful shutdown
  • Add timestamped logging for better debugging

This was an interesting one. Some of the nightly jobs were failing: they were 'timing out' while running joshua tests. In the nightly report the jobs would usually show as 'in progress', and then much later, in the morning report, as joshua test runs that had timed out with NO results. Looking at the joshua cluster that runs the nightlies, joshua-agent pod counts had us pegged near the 10k limit. Poking around, it seemed odd: the pods didn't seem to be doing anything.

Turns out ensembles stick around in the database for a while after they finish (7 days), but some of them, even though they were 'done', would still report that they were 'alive' ("Can run: True"). This happened when the started count fell below the max_runs count, which could happen if an agent died for any of many reasons: when an agent dies, the started count is decremented in the cleanup, and the should_run_ensemble method in joshua_model.py would then return 'True'.
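
To make the failure mode concrete, here is a minimal sketch of the two checks side by side. It is not the actual joshua_model.py code: the `EnsembleCounters` class, its field names, and both function names are hypothetical, and it assumes a finished ensemble has an ended count that has reached max_runs.

```python
from dataclasses import dataclass


@dataclass
class EnsembleCounters:
    """Hypothetical stand-in for the per-ensemble counters kept in the database."""
    started: int   # runs handed out to agents (decremented when a dead agent is cleaned up)
    ended: int     # runs that actually completed
    max_runs: int  # total number of runs requested for the ensemble


def should_run_by_started(c: EnsembleCounters) -> bool:
    # Old-style check: an ensemble looks runnable whenever fewer than
    # max_runs runs have been started.
    return c.started < c.max_runs


def should_run_by_completed(c: EnsembleCounters) -> bool:
    # Fixed-style check: an ensemble is runnable only while fewer than
    # max_runs runs have completed.
    return c.ended < c.max_runs


# A finished ensemble whose started count was decremented once by
# dead-agent cleanup after all of its runs had already completed.
finished = EnsembleCounters(started=99, ended=100, max_runs=100)
print(should_run_by_started(finished))    # True: still advertised as having work
print(should_run_by_completed(finished))  # False: correctly reported as done
```

Under these assumptions, a single post-completion decrement is enough to make the started-based check report work that does not exist, while the completed-based check is unaffected.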

The cluster had 70-odd ensembles when I went to look at it, mostly left over from December 30th/31st. They were reporting that they still had work to be done, so agents and the scheduler would keep asking the ensembles for work... but there was none. This happened thousands of times and kept the pod count elevated.

joshua_model.py is used by two images: joshua-agent and agent-scaler. I deployed both with the fixes here to see how they do. With the fixes deployed, should_run_ensemble no longer returns 'True' for these old ensembles queued last year.

While in here, I updated logging to include a timestamp and made the agent read a timeout environment variable that was previously ignored (nit).
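
As a rough illustration of the timeout handling, the sketch below reads AGENT_TIMEOUT and stops between units of work once the deadline passes. The function names, the choice of seconds as the unit, and the "finish the current unit, then exit" interpretation of graceful shutdown are all assumptions for the example, not the code in this PR.

```python
import os
import time


def read_agent_timeout(env=None, default=None):
    """Return AGENT_TIMEOUT as a positive number of seconds, or `default`.

    `env` is injectable for testing; it falls back to os.environ.
    """
    env = os.environ if env is None else env
    raw = env.get("AGENT_TIMEOUT")
    if raw is None or raw.strip() == "":
        return default
    try:
        value = float(raw)
    except ValueError:
        # Unparseable value: fall back rather than crash the agent.
        return default
    return value if value > 0 else default


def run_agent(do_one_unit, timeout, clock=time.monotonic):
    """Run units of work until none are left or the timeout expires.

    The deadline is only checked between units, so a unit that has
    started is allowed to finish before the loop exits.
    """
    deadline = None if timeout is None else clock() + timeout
    completed = 0
    while deadline is None or clock() < deadline:
        if not do_one_unit():
            break  # no more work available
        completed += 1
    return completed
```

Checking the deadline only between units keeps a test run from being abandoned halfway, at the cost of overshooting the timeout by up to one unit's duration.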

@saintstack saintstack requested a review from johscheuer January 6, 2026 18:37
@johscheuer johscheuer closed this Jan 7, 2026
@johscheuer johscheuer reopened this Jan 7, 2026
@johscheuer johscheuer left a comment

One comment about the redundant import.

def log(outputText, newline=True):
    import datetime
    timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    message = f"[{timestamp}] {outputText}"
Member

Is there a reason not to use the logging package? That would add the timestamp. Probably more refactoring?

Contributor Author

Yeah, that would be better but was thinking minimal change. Should I go for it?

Member

We could do that in another PR if we think it's worth refactoring :)
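
For reference, a minimal sketch of what the logging-package version might look like, producing the same `[YYYY-MM-DD HH:MM:SS] message` shape as the hand-rolled `log` function. The `make_logger` helper and the logger name are hypothetical; only the standard-library `logging` calls are real.

```python
import logging
import sys


def make_logger(name="joshua_agent", stream=None):
    """Build a logger that prefixes every message with a timestamp."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
        handler.setFormatter(
            logging.Formatter(
                fmt="[%(asctime)s] %(message)s",
                datefmt="%Y-%m-%d %H:%M:%S",
            )
        )
        logger.addHandler(handler)
    logger.propagate = False  # keep output out of the root logger
    return logger


if __name__ == "__main__":
    log = make_logger()
    log.info("agent started")
```

The formatter owns the timestamp, so call sites pass only the message, which is the main thing a later refactoring PR would change.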

@saintstack
Contributor Author

Addressed the review comment (and added a few timestamps to agent-scaler.sh outputs).

@johscheuer johscheuer left a comment

LGTM 👍

@saintstack saintstack merged commit a282dab into FoundationDB:main Jan 8, 2026
2 checks passed