fix!: improved exception handling when instance startup fails #189

Merged: moorec-aws merged 1 commit into aws-deadline:mainline from moorec-aws:moorec/instance-failure on May 9, 2025
@moorec-aws
Contributor

What was the problem/requirement? (What/Why)

When instance startup fails and cleanup runs, we attempt to clean up the worker. Because the worker does not exist when instance startup has failed, this produces many worker cleanup errors, which make the test logs difficult to parse for the root cause.

What was the solution? (How)

We no longer attempt to stop the worker agent if a worker ID does not exist. Added an InstanceStartupError exception class that prints diagnostics for the instance failure to help determine the cause. Added a method for collecting information about the instance.
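For readers who have not opened the diff, the shape of the change can be sketched roughly as follows. This is an illustration only, not the code merged in this PR: the `Worker` class, its fields, and the constructor signature of `InstanceStartupError` are assumptions made for the example. The log messages are taken from the test output in this description.

```python
from dataclasses import dataclass, field
from typing import List, Optional


class InstanceStartupError(Exception):
    """Raised when an instance fails to start; carries diagnostic lines.

    The (message, diagnostics) signature is an assumption for this sketch.
    """

    def __init__(self, message: str, diagnostics: Optional[List[str]] = None):
        super().__init__(message)
        self.message = message
        self.diagnostics = diagnostics or []

    def __str__(self) -> str:
        # Render the message followed by a banner-delimited diagnostics block.
        banner = "=" * 80
        return "\n".join(
            [self.message, banner, "DIAGNOSTICS", banner, *self.diagnostics, banner]
        )


@dataclass
class Worker:
    """Hypothetical stand-in for the test fixture's EC2 worker class."""

    instance_id: str
    worker_id: Optional[str] = None  # stays None if instance startup failed
    log: List[str] = field(default_factory=list)

    def cleanup(self) -> None:
        # The instance is always terminated, even when startup failed.
        self.log.append(f"Terminating EC2 instance {self.instance_id}")
        if self.worker_id is None:
            # No worker was ever registered, so there is nothing to stop.
            self.log.append("No worker_id available, skipping worker cleanup")
            return
        self.log.append(f"Stopping worker agent for {self.worker_id}")
```

The guard on `worker_id` is what removes the noisy cleanup errors: the instance is still terminated, but the worker-stop path is skipped when no worker was created.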

What is the impact of this change?

Removes cleanup errors and provides better diagnostics for instance startup failures.

How was this change tested?

hatch run test

I ran a developer version of the test fixtures package against the deadline-cloud-worker-agent e2e tests and confirmed tests passed.

To replicate a failure case, I reduced the instance startup waiter to WaiterConfig={"Delay": 2, "MaxAttempts": 5}. This reproduced the exception, and I confirmed that an error was produced, diagnostics were recorded, and the instance was terminated without attempting to stop the worker:

[2025-05-08 12:16:59,463] [test/e2e/test_job_submissions.py::TestJobSubmission::test_success] Launched EC2 instance i-0ea9cb5cc794090d3
[2025-05-08 12:16:59,463] [test/e2e/test_job_submissions.py::TestJobSubmission::test_success] Waiting for EC2 instance i-0ea9cb5cc794090d3 status to be OK
[2025-05-08 12:17:08,921] [test/e2e/test_job_submissions.py::TestJobSubmission::test_success] Failed to start worker: Failed to wait for instance status: Waiter InstanceStatusOk failed: Max attempts exceeded
================================================================================
DIAGNOSTICS
================================================================================
Collecting diagnostics for instance i-0ea9cb5cc794090d3
Instance state: running
Instance type: t3.large
Launch time: 2025-05-08 17:16:59+00:00
Availability zone: us-west-2a
System status: initializing
Instance status: initializing
System check reachability: initializing
Instance check reachability: initializing
================================================================================
[2025-05-08 12:17:08,923] [test/e2e/test_job_submissions.py::TestJobSubmission::test_success] Stopping worker because it failed to start
[2025-05-08 12:17:08,923] [test/e2e/test_job_submissions.py::TestJobSubmission::test_success] Terminating EC2 instance i-0ea9cb5cc794090d3
[2025-05-08 12:17:09,352] [test/e2e/test_job_submissions.py::TestJobSubmission::test_success] No worker_id available, skipping worker cleanup
ERROR
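The diagnostics block in the log above could be produced by something along these lines. This is a hedged sketch, not the method added in this PR: the function name `format_instance_diagnostics` is hypothetical, and it assumes the information comes from the EC2 `describe_instances` and `describe_instance_status` responses, which the description does not confirm. It takes the response dictionaries as arguments so it can be exercised without AWS credentials.

```python
from typing import Any, Dict, List


def format_instance_diagnostics(
    instance_id: str,
    describe_instances_response: Dict[str, Any],
    describe_status_response: Dict[str, Any],
) -> List[str]:
    """Turn EC2 describe responses into human-readable diagnostic lines."""
    lines = [f"Collecting diagnostics for instance {instance_id}"]

    # Basic instance attributes (assumed to come from describe_instances).
    reservations = describe_instances_response.get("Reservations", [])
    instances = reservations[0].get("Instances", []) if reservations else []
    if instances:
        inst = instances[0]
        lines.append(f"Instance state: {inst.get('State', {}).get('Name', 'unknown')}")
        lines.append(f"Instance type: {inst.get('InstanceType', 'unknown')}")
        lines.append(f"Launch time: {inst.get('LaunchTime', 'unknown')}")
        zone = inst.get("Placement", {}).get("AvailabilityZone", "unknown")
        lines.append(f"Availability zone: {zone}")

    # Status checks (assumed to come from describe_instance_status).
    statuses = describe_status_response.get("InstanceStatuses", [])
    if statuses:
        status = statuses[0]
        checks = (("System", "SystemStatus"), ("Instance", "InstanceStatus"))
        for label, key in checks:
            lines.append(f"{label} status: {status.get(key, {}).get('Status', 'unknown')}")
        for label, key in checks:
            for detail in status.get(key, {}).get("Details", []):
                lines.append(f"{label} check {detail.get('Name')}: {detail.get('Status')}")

    return lines
```

Keeping the formatting separate from the API calls means a missing or partial response only shortens the report and does not raise a second error during failure handling.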

Was this change documented?

No

Is this a breaking change?

Yes, we are making a new API call to collect instance information.


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@moorec-aws moorec-aws requested a review from a team as a code owner May 8, 2025 18:33
@moorec-aws moorec-aws force-pushed the moorec/instance-failure branch from 368851c to 465591e Compare May 8, 2025 18:52
jericht previously approved these changes May 8, 2025
Signed-off-by: Charles Moore <122481442+moorec-aws@users.noreply.github.com>
@moorec-aws moorec-aws force-pushed the moorec/instance-failure branch from cf95082 to 639e2d9 Compare May 9, 2025 20:34
sonarqubecloud bot commented May 9, 2025

@moorec-aws moorec-aws merged commit d2b9d4c into aws-deadline:mainline May 9, 2025
16 checks passed
@moorec-aws moorec-aws deleted the moorec/instance-failure branch May 9, 2025 21:37
3 participants