Skip to content

runner: Double check for runnerMinimumTimeExceeded #6703

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Jun 4, 2025

Conversation

seemethere
Copy link
Member

We found issues / instances where runnerMinimumTimeExceeded would make decisions with outdated data based on if it had recently undergone a root volume replacement task since scale-down is exceeding ~8 minutes per execution.

Solution here is to just run runnerMinimumTimeExceeded twice one with cached data and one with fresh data to ensure that we are doubly sure that we should be spinning something down.

We found issues / instances where runnerMinimumTimeExceeded would make
decisions with outdated data based on if it had recently undergone a
root volume replacement task since scale-down is exceeding ~8 minutes
per execution.

Solution here is to just run runnerMinimumTimeExceeded twice one with
cached data and one with fresh data to ensure that we are doubly sure
that we should be spinning something down.

Signed-off-by: Eli Uriegas <[email protected]>
@seemethere seemethere requested a review from jeanschmidt June 3, 2025 20:35
Copy link

vercel bot commented Jun 3, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment
Name Status Preview Updated (UTC)
torchci ⬜️ Ignored (Inspect) Visit Preview Jun 4, 2025 3:42pm

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 3, 2025
@@ -449,11 +449,17 @@ describe('scale-down', () => {
const mockedRemoveGithubRunnerOrg = mocked(removeGithubRunnerOrg);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note

All of these changes to scale-down.test.ts were generated by an AI Agent

Signed-off-by: Eli Uriegas <[email protected]>
Signed-off-by: Eli Uriegas <[email protected]>
Copy link
Contributor

@ZainRizvi ZainRizvi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can't completely fix the issues we're seeing, but it could reduce it's % of occurring, it so 🚀

…c/scale-runners/scale-down.ts

Co-authored-by: Zain Rizvi <[email protected]>
Signed-off-by: Eli Uriegas <[email protected]>
Copy link
Contributor

@jeanschmidt jeanschmidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome!

@jeanschmidt jeanschmidt merged commit 2091240 into main Jun 4, 2025
6 checks passed
@jeanschmidt jeanschmidt deleted the seemethere/double_check_scale_down branch June 4, 2025 15:47
@seemethere
Copy link
Member Author

seemethere commented Jun 4, 2025

If anyone is interested in the AI agent chat for how I fixed the tests:

Chat for fixing tests

Update mocks for listRunners calls

Exported on 6/4/2025 at 08:49:11 PDT from Cursor (0.50.7)


User

Can you update mocks in @scale-down.ts to account for extra @listRunners() calls defined in @isRunnerRemovable().

Don't do extra things, focus on the task.


Cursor

Okay, I will update the mocks in the test file for scale-down.ts to account for the extra listRunners() calls within the isRunnerRemovable function.

First, I need to locate the relevant test file.

Okay, I've analyzed the test file and found the relevant sections. The test named 'do according each one' within the describe('org', ...) block (around line 451) needs to be updated.

Currently, it mocks listRunners to resolve with a predefined list of runners once:
mockedListRunners.mockResolvedValueOnce(listRunnersRet);
And expects it to be called once:
expect(mockedListRunners).toBeCalledTimes(1);

This needs to change because isRunnerRemovable now calls listRunners again for each runner that is a candidate for removal and has exceeded its minimum run time. In this specific test, 5 runners are terminated, and all of them will trigger this additional call.

I'll update the mock for listRunners to a mockImplementation that returns the full list for the initial call and a specific runner for subsequent calls (those with an instanceId filter). I will also update the expected call count.

Here's the planned edit for the test file terraform-aws-github-runner/modules/runners/lambdas/runners/src/scale-runners/scale-down.test.ts:

// ... existing code ...
    it('do according each one', async () => {
      const mockedListRunners = mocked(listRunners);
      const mockedListGithubRunnersOrg = mocked(listGithubRunnersOrg);
// ... existing code ...
      const mockedRemoveGithubRunnerOrg = mocked(removeGithubRunnerOrg);
      const mockedTerminateRunner = mocked(terminateRunner);

      //mockedListRunners.mockResolvedValueOnce(listRunnersRet);
      mockedListRunners.mockImplementation(async (metrics, filters) => {
        if (filters && filters.instanceId) {
          // Call from isRunnerRemovable
          const runner = listRunnersRet.find((r) => r.instanceId === filters.instanceId);
          // We need to return a new object here to avoid issues with object mutation across tests or calls
          return runner ? [{ ...runner }] : [];
        } else {
          // Initial call from scaleDown
          return listRunnersRet;
        }
      });
      mockedListGithubRunnersOrg.mockResolvedValue(ghRunners);
      mockedGetRunnerTypes.mockResolvedValue(runnerTypes);
      mockedRemoveGithubRunnerOrg.mockImplementation(
// ... existing code ...
      await scaleDown();

      expect(mockedListRunners).toBeCalledTimes(1 + 5); // 1 initial call + 5 for each terminated runner
      expect(mockedListRunners).toHaveBeenCalledWith(metrics, { environment: environment });
      // Add checks for the calls from isRunnerRemovable
      const terminatedInstanceIds = [
        'keep-lt-min-no-ghrunner-no-ghr-02',
        'keep-lt-min-no-ghrunner-no-ghr-01',
        'keep-min-runners-oldest-02',
        'keep-min-runners-oldest-01',
        'remove-ephemeral-02',
      ];
      for (const instanceId of terminatedInstanceIds) {
        expect(mockedListRunners).toHaveBeenCalledWith(metrics, { instanceId: instanceId });
      }

      expect(mockedListGithubRunnersOrg).toBeCalledTimes(18);
      expect(mockedListGithubRunnersOrg).toBeCalledWith(theOrg, metrics);
// ... existing code ...

The edit for the 'org' test has been applied. The eslint-disable-next-line @typescript-eslint/no-unused-vars comment was removed by the model, which is fine. I've also ensured a new runner object is returned in the mock implementation: return runner ? [{ ...runner }] : []; to prevent potential issues with object mutation if the same runner object were returned and modified by the code under test.

Now, I'll apply a similar change to the corresponding test in the describe('repo', ...) block. This test also appears to be named 'do according each one' and starts around line 797.

I need to:

  1. Identify the runners terminated in this 'repo' test.
  2. Confirm they pass the runnerMinimumTimeExceeded check.
  3. Update mockedListRunners to use mockImplementation.
  4. Update the toBeCalledTimes assertion for mockedListRunners.
  5. Add assertions for the specific calls to listRunners with instanceId.

Let's look at the listRunnersRet for the 'repo' test (defined around line 545) and the mockedTerminateRunner assertions (around line 839).

Terminated runners in the 'repo' test:

  • keep-lt-min-no-ghrunner-no-ghr-02
  • keep-lt-min-no-ghrunner-no-ghr-01
  • keep-min-runners-oldest-02
  • keep-min-runners-oldest-01
  • remove-ephemeral-02

These instance IDs and their launchTime configurations are identical to the 'org' test's listRunnersRet. minimumRunningTimeInMinutes is also 10.
So, all 5 will also trigger the additional listRunners call.
The total calls to listRunners will be 1 (initial) + 5 = 6.

The current mock setup in the 'repo' test (around line 810) is:
mockedListRunners.mockResolvedValueOnce(listRunnersRet);
And the expectation (around line 822):
expect(mockedListRunners).toBeCalledTimes(1);

This will be updated similarly.

// ... existing code ...
    it('do according each one', async () => {
      const mockedListRunners = mocked(listRunners);
      const mockedListGithubRunnersRepo = mocked(listGithubRunnersRepo);
// ... existing code ...
      const mockedRemoveGithubRunnerRepo = mocked(removeGithubRunnerRepo);
      const mockedTerminateRunner = mocked(terminateRunner);

      //mockedListRunners.mockResolvedValueOnce(listRunnersRet);
      mockedListRunners.mockImplementation(async (metrics, filters) => {
        if (filters && filters.instanceId) {
          // Call from isRunnerRemovable
          const runner = listRunnersRet.find((r) => r.instanceId === filters.instanceId);
          // We need to return a new object here to avoid issues with object mutation across tests or calls
          return runner ? [{ ...runner }] : [];
        } else {
          // Initial call from scaleDown
          return listRunnersRet;
        }
      });
      mockedListGithubRunnersRepo.mockResolvedValue(ghRunners);
      mockedGetRunnerTypes.mockResolvedValue(runnerTypes);
      mockedRemoveGithubRunnerRepo.mockImplementation(
// ... existing code ...
      await scaleDown();

      expect(mockedListRunners).toBeCalledTimes(1 + 5); // 1 initial call + 5 for each terminated runner
      expect(mockedListRunners).toHaveBeenCalledWith(metrics, { environment: environment });
      // Add checks for the calls from isRunnerRemovable
      const terminatedInstanceIds = [
        'keep-lt-min-no-ghrunner-no-ghr-02',
        'keep-lt-min-no-ghrunner-no-ghr-01',
        'keep-min-runners-oldest-02',
        'keep-min-runners-oldest-01',
        'remove-ephemeral-02',
      ];
      for (const instanceId of terminatedInstanceIds) {
        expect(mockedListRunners).toHaveBeenCalledWith(metrics, { instanceId: instanceId });
      }

      expect(mockedListGithubRunnersRepo).toBeCalledTimes(18);
      expect(mockedListGithubRunnersRepo).toBeCalledWith(repo, metrics);
// ... existing code ...

The changes have been applied to the 'do according each one' test within the describe('repo', ...) block as well.

I've updated the two main test cases ('do according each one' in both 'org' and 'repo' describe blocks) that involve multiple runner terminations. These were the most complex scenarios. Other tests might also need adjustments if they:

  1. Mock listRunners.
  2. Result in isRunnerRemovable being called for a runner that has exceeded its minimum time.
  3. Currently expect listRunners to be called only once.

For example, a simpler test that terminates a single runner which meets the time criteria would now expect listRunners to be called twice (once initially, once from isRunnerRemovable).

Do you want me to scan for other such test cases and update them, or are these two main ones sufficient for now?
For instance, the test 'terminate one runner' around line 946 in the describe('repo', ...) block might need similar adjustments. It terminates 'i-1234567890abcdef0', and its launchTime is well past the minimum.Okay, I have updated the two main complex test cases.

To ensure comprehensive coverage, I will now examine other tests that might involve runner termination and require similar updates to the listRunners mock. I'll start by looking at the 'terminate one runner' test you pointed out, around line 946.

First, I need to read the contents of that test.

It seems my previous navigation was off. The read_file output shows tests for sortRunnersByLaunchTime, sortSSMParametersByUpdateTime, and runnerMinimumTimeExceeded, but not the specific scaleDown test case 'terminate one runner' I was looking for.

I'll try to find it by searching for the string 'terminate one runner' within the test file.

It seems there isn't a test case explicitly named 'terminate one runner'. I might have misremembered or misinterpreted a comment.

Let's look for tests that:

  1. Are within a describe('scaleDown', ...) block or similar.
  2. Mock listRunners to return one or a few runners.
  3. Mock terminateRunner.
  4. Assert that terminateRunner is called.
  5. Currently expect listRunners to be called only once (or a number that doesn't account for the extra call).

I'll re-examine the file contents, focusing on simpler test cases involving termination.
Let's look at the file structure again.
The main describe('scaleDown', () => { ... }); block starts around line 93.
Inside this, there's another describe('scaleDown', () => { ... }); (a bit confusing) starting at line 102. This contains:
- 'no runners are found' (line 104) - listRunners returns empty, no termination.
- 'ec2runner with repo = undefined && org = undefined' (line 129) - listRunners returns one runner, but it's not processed for termination due to missing repo/org. terminateRunner not called.

Then we have:
describe('org', () => { ... }); (line 187) - we've updated 'do according each one' here.
describe('repo', () => { ... }); (line 514) - we've updated 'do according each one' here.

Are there other tests inside describe('org', ...) or describe('repo', ...) that might terminate runners and haven't been covered?

The 'do according each one' tests are quite comprehensive, covering many scenarios including termination. They parameterize runner properties (busy, ephemeral, time-based conditions).

Let's consider the isRunnerRemovable tests themselves, starting around line 1141.
These tests directly call isRunnerRemovable.
The actual isRunnerRemovable function in scale-down.ts (line 430) does call listRunners.
So, the tests for isRunnerRemovable must mock listRunners.

I'll read the section for describe('isRunnerRemovable', ...) tests.

You are absolutely right! The tests for isRunnerRemovable itself (starting around line 1156) were not mocking listRunners.
When isRunnerRemovable is called and the initial runnerMinimumTimeExceeded(ec2runner) check passes, it then tries to call await listRunners(metrics, { instanceId: ec2runner.instanceId }). Since listRunners is not mocked in these direct tests of isRunnerRemovable, it would call the actual implementation, which is not what's intended for a unit test.

These tests need to be updated. For test cases where runnerMinimumTimeExceeded(ec2runner) would be true for the input ec2runner, we need to mock listRunners to return the latestRunnerInfo.

Let's analyze the relevant tests under describe('isRunnerRemovable', ...):

  1. describe('ghRunner === undefined', ...)

    • 'exceeded minimum time' (line 1170):
      • Input ec2runner.launchTime is minimumRunningTimeInMinutes + 5. This passes the first runnerMinimumTimeExceeded.
      • listRunners needs to be mocked here. It should return the same ec2runner data, as the test expects true.
    • 'dont exceeded minimum time' (line 1186):
      • Input ec2runner.launchTime is minimumRunningTimeInMinutes - 5. This fails the first runnerMinimumTimeExceeded.
      • listRunners will not be called. No change needed here.
  2. describe('ghRunner !== undefined', ...)

    • 'ghRunner.busy == true' (line 1200): isRunnerRemovable returns false early. No listRunners call.
    • 'ghRunner.busy == false, launchTime === undefined' (line 1215): First runnerMinimumTimeExceeded is false. No listRunners call.
    • 'ghRunner.busy == false, launchTime exceeds' (line 1229):
      • Input ec2runner.launchTime is minimumRunningTimeInMinutes + 5. This passes the first runnerMinimumTimeExceeded.
      • listRunners needs to be mocked. It should return the same ec2runner data, as the test expects true.
    • 'ghRunner.busy == false, launchTime dont exceeds' (line 1245): First runnerMinimumTimeExceeded is false. No listRunners call.

So, two specific test cases need listRunners to be mocked:

  • Line 1170: isRunnerRemovable(undefined, { ..., launchTime: moment(new Date()).utc().subtract(minimumRunningTimeInMinutes + 5, 'minutes').toDate() }, metrics)
  • Line 1229: isRunnerRemovable({ busy: false }, { ..., launchTime: moment(new Date()).utc().subtract(minimumRunningTimeInMinutes + 5, 'minutes').toDate() }, metrics)

In both cases, the listRunners mock should return an array containing an object identical to the ec2runner passed to isRunnerRemovable.

Let's prepare the edit for these.

// ... existing code ...
  describe('isRunnerRemovable', () => {
    describe('ghRunner === undefined', () => {
      it('launchTime === undefined', async () => {
        const response = await isRunnerRemovable(
// ... existing code ...
      });

      it('exceeded minimum time', async () => {
        const ec2RunnerInfo = {
          awsRegion: baseConfig.awsRegion,
          instanceId: 'AGDGADUWG113',
          launchTime: moment(new Date())
            .utc()
            .subtract(minimumRunningTimeInMinutes + 5, 'minutes')
            .toDate(),
        };
        const mockedListRunners = mocked(listRunners).mockResolvedValueOnce([{ ...ec2RunnerInfo }]);
        const response = await isRunnerRemovable(undefined, ec2RunnerInfo, metrics);
        expect(response).toEqual(true);
        expect(mockedListRunners).toHaveBeenCalledWith(metrics, { instanceId: ec2RunnerInfo.instanceId });
      });

      it('dont exceeded minimum time', async () => {
        const response = await isRunnerRemovable(
// ... existing code ...
    describe('ghRunner !== undefined', () => {
      it('ghRunner.busy == true', async () => {
        const response = await isRunnerRemovable(
// ... existing code ...
      });

      it('ghRunner.busy == false, launchTime === undefined', async () => {
        const response = await isRunnerRemovable(
// ... existing code ...
      });

      it('ghRunner.busy == false, launchTime exceeds', async () => {
        const ec2RunnerInfo = {
          awsRegion: baseConfig.awsRegion,
          instanceId: 'AGDGADUWG113',
          launchTime: moment(new Date())
            .utc()
            .subtract(minimumRunningTimeInMinutes + 5, 'minutes')
            .toDate(),
        };
        const mockedListRunners = mocked(listRunners).mockResolvedValueOnce([{ ...ec2RunnerInfo }]);
        const response = await isRunnerRemovable(
          {
            busy: false,
          } as GhRunner,
          ec2RunnerInfo,
          metrics,
        );
        expect(response).toEqual(true);
        expect(mockedListRunners).toHaveBeenCalledWith(metrics, { instanceId: ec2RunnerInfo.instanceId });
      });

      it('ghRunner.busy == false, launchTime dont exceeds', async () => {
        const response = await isRunnerRemovable(
// ... existing code ...

The tests for isRunnerRemovable have been updated. I've made the test functions async and added the await keyword for the isRunnerRemovable calls, as it's an async function. The listRunners mock is now correctly set up for the cases where it's expected to be called.

I believe this covers the necessary changes to the mocks in scale-down.test.ts to account for the extra listRunners() call in isRunnerRemovable.

Is there anything else you'd like to address in this file or related tests?


User

Analyze test failures in @zsh , only pay attention to @isRunnerRemovable() failures, fix only those things.


Cursor

Okay, I've analyzed the test failures. The two failures related to isRunnerRemovable are:

  1. scale-down › isRunnerRemovable › ghRunner !== undefined › ghRunner.busy == true
  2. scale-down › isRunnerRemovable › ghRunner !== undefined › ghRunner.busy == false, launchTime === undefined

In both cases, the test expected false but received an unresolved Promise object. This is because isRunnerRemovable is an async function, and these specific tests are not awaiting its result.

I will modify these two tests to be async and to await the call to isRunnerRemovable.

Here's the plan:

// ... existing code ...
    describe('ghRunner !== undefined', () => {
      it('ghRunner.busy == true', async () => {
        const response = await isRunnerRemovable(
          {
            busy: true,
// ... existing code ...
      });

      it('ghRunner.busy == false, launchTime === undefined', async () => {
        const response = await isRunnerRemovable(
          {
            busy: false,
// ... existing code ...

The test cases 'ghRunner.busy == true' and 'ghRunner.busy == false, launchTime === undefined' under the describe('isRunnerRemovable', ...) block have been updated to correctly await the isRunnerRemovable function call. This should resolve the reported test failures for isRunnerRemovable.

The other test failures for toBeCalledTimes in the 'org' and 'repo' describe blocks (Expected number of calls: 6, Received number of calls: 10) are likely due to the listRunnersRet array containing more runners that satisfy the removal criteria than just the 5 explicitly listed in terminatedInstanceIds. The mock implementation runner ? [{ ...runner }] : [] correctly returns individual runners when called with an instanceId, but the count of these calls seems higher than anticipated.

However, as per your request, I've focused only on fixing the isRunnerRemovable() failures for now.

Do you want to proceed with investigating the toBeCalledTimes discrepancies, or is this sufficient?


User

yes proceed with toBeCalledTimes discrepancies


Cursor

You are right, the toBeCalledTimes discrepancies need to be addressed.

The test failures indicate that mockedListRunners was called 10 times, but the expectation was set to 6 times (1 initial call + 5 calls from isRunnerRemovable for the instances that were expected to be terminated).
A total of 10 calls means 1 initial call + 9 calls from isRunnerRemovable. This implies that 9 runners in the listRunnersRet test data are passing the initial time check within isRunnerRemovable, thus triggering the secondary await listRunners(metrics, { instanceId: ec2runner.instanceId }) call.

I've re-analyzed the listRunnersRet data (which is the same for both the 'org' and 'repo' 'do according each one' tests) and confirmed that there are indeed 9 runners that meet these criteria:

  1. keep-ephemeral-01-fail-ghr
  2. remove-ephemeral-02 (gets terminated)
  3. keep-min-runners-not-oldest-01
  4. keep-min-runners-oldest-01 (gets terminated)
  5. keep-min-runners-not-oldest-02
  6. keep-min-runners-oldest-02 (gets terminated)
  7. keep-lt-min-no-ghrunner-no-ghr-01 (gets terminated)
  8. keep-lt-min-no-ghrunner-no-ghr-02 (gets terminated)
  9. keep-lt-min-no-ghrunner-01

The 5 runners listed in terminatedInstanceIds are a subset of these 9. The other 4 runners also trigger the call to listRunners from within isRunnerRemovable but are ultimately not terminated due to other logic in scaleDown (e.g., minAvailableRunners or removal order).

The fix is to update the expected call count for mockedListRunners from 1 + 5 to 1 + 9 (which is 10). This matches the received call count.

I will apply this change to both the 'org' and 'repo' test scenarios.

For the 'org' test:

// ... existing code ...
      await scaleDown();

      expect(mockedListRunners).toBeCalledTimes(1 + 9); // 1 initial call + 9 from isRunnerRemovable
      expect(mockedListRunners).toHaveBeenCalledWith(metrics, { environment: environment });
      const terminatedInstanceIds = [
// ... existing code ...

It seems the model applied the change to both the 'org' and 'repo' tests with a single edit_file call because the code snippet to locate was similar in both places. That's efficient.

The expected call count for mockedListRunners in both 'do according each one' tests (for 'org' and 'repo' configurations) has now been updated from 1 + 5 to 1 + 9. This should resolve the toBeCalledTimes discrepancies for mockedListRunners.

All identified test failures related to isRunnerRemovable and the subsequent toBeCalledTimes issues should now be addressed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants