Skip to content

Conversation

@ajcasagrande
Copy link
Contributor

  • Brought back SystemState to track overall state of the system

  • If an error occurs during initialization, registration, profile_configure, or profile_start, the system will shutdown automatically instead of sort-of hanging indefinitely. or trying to run in a partial state.

    • Errors are determined based on timeout of command, or a command error response from a service.
  • Errors are displayed at the end for visibility

image

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements improved error handling for system startup by introducing graceful error management and automatic shutdown on initialization failures. The system now properly tracks its state and handles timeout or error responses during critical phases instead of hanging indefinitely.

Key changes:

  • Added SystemState tracking to monitor system progression through startup phases
  • Implemented automatic shutdown when errors occur during initialization, registration, configuration, or profiling start
  • Added utility functions to extract and display command errors with user-friendly formatting

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
aiperf/controller/system_utils.py New utility module for error extraction and display formatting
aiperf/controller/system_controller.py Updated to track system state and handle startup errors gracefully with automatic shutdown
aiperf/controller/init.py Exports new utility functions for error handling
aiperf/common/enums/system_enums.py Simplified SystemState enum with updated state names and removed verbose comments

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
You can also share your feedback on Copilot code review for a chance to win a $100 gift card. Take the survey.

)
)
# Remove the trailing newline from the last summary item
summary[-1]._text[-1] = summary[-1]._text[-1].rstrip("\n")
Copy link

Copilot AI Aug 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Direct access to private attributes (_text) breaks encapsulation and makes the code fragile to changes in the Rich library. Consider using the public API or building the text without trailing newlines instead of modifying internal state.

Suggested change
summary[-1]._text[-1] = summary[-1]._text[-1].rstrip("\n")
summary[-1] = Text(summary[-1].plain.rstrip("\n"), style=summary[-1].style)

Copilot uses AI. Check for mistakes.
async def _start_profiling_all_services(self) -> None:
errors = extract_errors(responses)
if errors:
display_command_errors("Failed to configure all services", errors)
Copy link

Copilot AI Aug 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Using asyncio.shield() prevents cancellation of the stop operation, but since this is in an error handling path, consider whether shielding is necessary. If the goal is to ensure cleanup always runs, this pattern is correct, but it should be documented why shielding is needed here.

Suggested change
display_command_errors("Failed to configure all services", errors)
display_command_errors("Failed to configure all services", errors)
# Shielding stop() to ensure cleanup always runs even if cancelled.

Copilot uses AI. Check for mistakes.
errors = extract_errors(responses)
if errors:
display_command_errors("Failed to start profiling all services", errors)
await asyncio.shield(self.stop())
Copy link

Copilot AI Aug 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Duplicate use of asyncio.shield() for stop operations. Consider extracting this pattern into a helper method to reduce code duplication and ensure consistent error handling behavior across the codebase.

Copilot uses AI. Check for mistakes.
@ajcasagrande ajcasagrande self-assigned this Aug 13, 2025
Copy link
Contributor

@the-david-oy the-david-oy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wonderful improvement! Left a couple of quick comments.

"""Configure all services to start profiling.
This is a blocking call that will wait for all services to be configured before returning. This way
we can ensure that all services are configured before we start profiling.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update the comments to make it clear what the return values signify?

@ajcasagrande ajcasagrande marked this pull request as draft September 3, 2025 22:17
@ajcasagrande
Copy link
Contributor Author

Unfortunately, this code does not catch errors before services are registered, meaning anything in the init functions of services are not caught and handled, instead you have to wait for the system registration to fail.

Also, due to the screen clearing when using the dashboard ui, the errors are not visible, so it needs to be re-implemented in a way that will print it out after the UI is stopped (similar to the console exporters)

More work needs to go into this PR before it can be ready to merge.

@codecov
Copy link

codecov bot commented Sep 20, 2025

Codecov Report

❌ Patch coverage is 33.33333% with 44 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
aiperf/controller/system_controller.py 20.51% 31 Missing ⚠️
aiperf/controller/system_utils.py 38.09% 13 Missing ⚠️

📢 Thoughts on this report? Let us know!

@ajcasagrande
Copy link
Contributor Author

Closing in favor of #285 when it is ready

@ajcasagrande ajcasagrande deleted the ajc/startup-errors2 branch October 18, 2025 19:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants