
Conversation

@sbulage (Collaborator) commented Nov 25, 2025

In recent days, we have seen test Fleet-51 failing multiple times on CI.

Problem:

What should happen:

  • The Fleet-51 test case does the following:
    • create a new workspace
    • move one of the clusters to the newly created workspace
    • create a GitRepo in the newly created workspace and verify the resources
    • move the cluster back to its default workspace
    • delete the newly created workspace and the GitRepo

What is happening:

  • Currently the Fleet-51 test:

    • creates the new workspace
    • tries to move one of the clusters to the newly created workspace, but this takes longer than expected and the test fails with a timeout
    • leaves the moved cluster in the newly created workspace
  • Fleet-156 is currently also failing because:

    • the test does not wait long enough for the page content to load

Solution:

  • Increase the timeout.
  • Move the test case to the special tests specs.
  • Run these tests at the very end of all tests so they do not cause other test failures.
  • For test Fleet-156, add a wait so the page loads properly before further testing.

@sbulage sbulage self-assigned this Nov 25, 2025
@sbulage sbulage added the fleet-e2e-ci and automation labels Nov 25, 2025
@sbulage sbulage linked an issue Nov 25, 2025 that may be closed by this pull request

@sbulage (Collaborator Author) commented Nov 25, 2025

Second try with all tests (using all tags).

@sbulage sbulage requested a review from mmartin24 November 25, 2025 14:01
// Verify gitrepoJobsCleanup is enabled by default.
cy.accesMenuSelection('local', 'Workloads', 'CronJobs');
// Adding wait for CronJobs page to load correctly.
cy.wait(5000);
@mmartin24 (Collaborator)

Is 5 seconds really needed? Doesn't it display with only 2? Same for the examples below.

@sbulage (Collaborator Author)

I tried lowering the wait as much as I could, but 5 seconds turned out to be the minimum for the page to load correctly. There is still room for better logic, though, as cy.wait() is not a final solution.
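
One possible shape for that better logic, instead of a fixed wait, is to let Cypress retry an assertion on the page content until it renders. A minimal sketch, where the table selector is an assumption rather than the page's real markup:

// Assumed selector: retry until the CronJobs list has rendered at least one
// row, instead of pausing unconditionally for 5 seconds.
cy.accesMenuSelection('local', 'Workloads', 'CronJobs');
cy.get('table tbody tr', { timeout: 30000 }).should('be.visible');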

@mmartin24 (Collaborator)

I get that test 51 is being moved, but is it fixed?

@sbulage (Collaborator Author) commented Nov 25, 2025

Yes, you can see that the two retries of the 2.12-head CI run prove it. I have also described the whole test execution above and where it was failing.

@mmartin24 (Collaborator)

The test passes when tried solely on the special_tests set, but we have had this test pass in the past when p1_2 was run alone, so that is not proof in itself. Full runs are more accurate, as they carry more processes that are not yet settled while the full run executes, but this still does not explain what was wrong.

Feel free to merge and we will see, but I am not sure I understand what caused it. It is okay if we don't, and we simply try not to have other tests affected by encapsulating this one somewhere else. But my question remains: is it fixed? And if so, what caused it?

@sbulage (Collaborator Author) commented Nov 25, 2025

  • Yes, I totally get that a full run gives more confidence that the failed tests now pass. Link to the full test run: Updated timeouts and moved test Fleet-51 to special #422 (comment)

  • Regarding the cause of the Fleet-51 failure:

    • Moving the cluster (imported-0) to the newly created workspace takes around 30 seconds (before the timeout increase).

    • Once the 30-second timeout is over, the cluster is still not fully moved to the newly created workspace.

    • The next test step, GitRepo creation, is executed anyway; the UI shows the resource count (6/6), but the bundle count is not loaded.

    • The test then fails (see the screenshot below).

    • In the screenshot we can clearly see that the bundle count is not shown, while the resource count is.

      Screenshot showing resource count but not bundle count

      Test move cluster to newly created workspace and deploy application to it -- Fleet-51 (Qase ID 51) (failed)

  • What is fixed for test Fleet-51:

    • Increasing the timeout (from 30000 to 40000, and from 60000 to 70000) gives the cluster extra time to settle properly; a sketch of the kind of check this affects follows below.
    • When the next step, i.e. the creation of the GitRepo, is executed, it shows the proper bundle count and resource count on the screen.
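
For illustration only, the kind of assertion such a timeout feeds into might look roughly like the following. This is not the actual test code: the row selector and the Active-state check are assumptions; only the 70000 value comes from the change described above.

// Sketch (assumed selectors): wait for the moved cluster to settle in the
// new workspace before the GitRepo step, letting Cypress retry the assertion
// for up to 70 seconds instead of 30/60.
cy.contains('tr.main-row', 'imported-0', { timeout: 70000 })
  .should('contain', 'Active');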

Please let me know if you require any other details. Happy to explain 😄

@mmartin24 (Collaborator)

Feel free to merge it.
The 2.11 lane works well with 30000, and we are increasing it to 40000. In 2.12, 60000 was already a high value; increasing it to 70000 just keeps raising it, and I am not sure I get why this is.

@sbulage (Collaborator Author) commented Nov 25, 2025

  • Oops, I will revert the 40000 value to 30000; that change was not intentional.
  • Also, here is proof that the cluster is still not ready within the 60000 timeout + 1000 wait.
    • Screenshot taken from the CI's video:

      image

  • The screenshot above clearly shows that the cluster is not yet properly ready while the next step, GitRepo creation, is already executing.
  • Likely causes of the cluster not being fully ready:
    • Not all default resources in the cluster become ready in time.
    • The fleet-agent might be taking time to start or to sync with the fleet-controller.

These are the causes I can think of; they can be mitigated by increasing timeouts, and sometimes the cluster does come up within the timeout as well.
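
As a possible longer-term hardening on top of the timeouts, the test could also gate on the Fleet Cluster resource itself before creating the GitRepo, so a slow fleet-agent sync surfaces in a targeted wait rather than in the next UI step. This is only a sketch under assumptions: that kubectl is available to the test runner, that the Fleet Cluster CR exposes a Ready condition, and that new-workspace stands in for the actual workspace name.

// Hypothetical guard: poll the Fleet Cluster resource in the new workspace
// until it reports Ready. cy.exec fails the test if the command exits
// non-zero, e.g. when the kubectl wait times out.
cy.exec(
  'kubectl wait clusters.fleet.cattle.io --all -n new-workspace ' +
  '--for=condition=Ready --timeout=120s',
  { timeout: 130000 }
);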

@sbulage sbulage merged commit e85022d into main Nov 25, 2025
11 of 12 checks passed
@sbulage sbulage deleted the fix-test-51 branch November 25, 2025 19:58

Labels

automation (Add or update automation), fleet-e2e-ci (Improvements or additions to the CI framework)

Development

Successfully merging this pull request may close these issues.

Frequent Test Fleet-51 failure on CI

3 participants