-
Notifications
You must be signed in to change notification settings - Fork 25
fix: handle startup errors more gracefully #207
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR implements improved error handling for system startup by introducing graceful error management and automatic shutdown on initialization failures. The system now properly tracks its state and handles timeout or error responses during critical phases instead of hanging indefinitely.
Key changes:
- Added
SystemStatetracking to monitor system progression through startup phases - Implemented automatic shutdown when errors occur during initialization, registration, configuration, or profiling start
- Added utility functions to extract and display command errors with user-friendly formatting
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| aiperf/controller/system_utils.py | New utility module for error extraction and display formatting |
| aiperf/controller/system_controller.py | Updated to track system state and handle startup errors gracefully with automatic shutdown |
| aiperf/controller/init.py | Exports new utility functions for error handling |
| aiperf/common/enums/system_enums.py | Simplified SystemState enum with updated state names and removed verbose comments |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
You can also share your feedback on Copilot code review for a chance to win a $100 gift card. Take the survey.
| ) | ||
| ) | ||
| # Remove the trailing newline from the last summary item | ||
| summary[-1]._text[-1] = summary[-1]._text[-1].rstrip("\n") |
Copilot
AI
Aug 12, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Direct access to private attributes (_text) breaks encapsulation and makes the code fragile to changes in the Rich library. Consider using the public API or building the text without trailing newlines instead of modifying internal state.
| summary[-1]._text[-1] = summary[-1]._text[-1].rstrip("\n") | |
| summary[-1] = Text(summary[-1].plain.rstrip("\n"), style=summary[-1].style) |
| async def _start_profiling_all_services(self) -> None: | ||
| errors = extract_errors(responses) | ||
| if errors: | ||
| display_command_errors("Failed to configure all services", errors) |
Copilot
AI
Aug 12, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Using asyncio.shield() prevents cancellation of the stop operation, but since this is in an error handling path, consider whether shielding is necessary. If the goal is to ensure cleanup always runs, this pattern is correct, but it should be documented why shielding is needed here.
| display_command_errors("Failed to configure all services", errors) | |
| display_command_errors("Failed to configure all services", errors) | |
| # Shielding stop() to ensure cleanup always runs even if cancelled. |
| errors = extract_errors(responses) | ||
| if errors: | ||
| display_command_errors("Failed to start profiling all services", errors) | ||
| await asyncio.shield(self.stop()) |
Copilot
AI
Aug 12, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Duplicate use of asyncio.shield() for stop operations. Consider extracting this pattern into a helper method to reduce code duplication and ensure consistent error handling behavior across the codebase.
the-david-oy
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wonderful improvement! Left a couple of quick comments.
| """Configure all services to start profiling. | ||
| This is a blocking call that will wait for all services to be configured before returning. This way | ||
| we can ensure that all services are configured before we start profiling. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you update the comments to make it clear what the return values signify?
|
Unfortunately, this code does not catch errors before services are registered, meaning anything in the init functions of services are not caught and handled, instead you have to wait for the system registration to fail. Also, due to the screen clearing when using the dashboard ui, the errors are not visible, so it needs to be re-implemented in a way that will print it out after the UI is stopped (similar to the console exporters) More work needs to go into this PR before it can be ready to merge. |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
|
Closing in favor of #285 when it is ready |
Brought back
SystemStateto track overall state of the systemIf an error occurs during initialization, registration, profile_configure, or profile_start, the system will shutdown automatically instead of sort-of hanging indefinitely. or trying to run in a partial state.
Errors are displayed at the end for visibility