fix: exit process on any errors during startup with clear info #285
Conversation
Force-pushed 52883c4 to e2d9f7a
Walkthrough

Adds structured lifecycle error types and tracking, formats and prints deduplicated exit errors, fails early on dead service processes during registration, moves dataset/tokenizer setup to a PROFILE_CONFIGURE command flow, and expands tests to cover these behaviors and error-display utilities.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant User
    participant SystemController
    participant ServiceManager
    participant Services
    participant UI
    participant Utils as print_exit_errors
    rect rgba(230,245,255,0.6)
        note over SystemController: Startup wrapped with try_operation_or_stop
        User->>SystemController: start()
        SystemController->>ServiceManager: initialize()
        alt initialize error
            ServiceManager-->>SystemController: Exception
            SystemController->>SystemController: raise LifecycleOperationError & record ExitErrorInfo
            SystemController->>SystemController: trigger stop
        end
        SystemController->>ServiceManager: start()
        ServiceManager->>SystemController: wait_for_all_services_registration()
        alt process dead/missing
            ServiceManager-->>SystemController: AIPerfError (process died)
            SystemController->>SystemController: record ExitErrorInfo("Register Services")
        end
    end
    rect rgba(240,255,240,0.6)
        note over SystemController,Services: Profiling configure/start and response parsing
        SystemController->>Services: PROFILE_CONFIGURE (all)
        Services-->>SystemController: Responses (ok / CommandErrorResponse)
        SystemController->>SystemController: _parse_responses_for_errors()
        alt errors found
            SystemController->>SystemController: raise LifecycleOperationError & record ExitErrorInfo
        end
        SystemController->>Services: PROFILE_START (all)
        Services-->>SystemController: Responses
        SystemController->>SystemController: _parse_responses_for_errors()
    end
    rect rgba(255,240,240,0.6)
        note over SystemController,UI: Shutdown and reporting
        SystemController->>UI: stop()
        SystemController->>SystemController: sleep(0.1)
        alt exit_errors present
            SystemController->>Utils: print_exit_errors(_exit_errors)
            SystemController->>User: os._exit(1)
        else
            SystemController->>User: os._exit(0)
        end
    end
```
```mermaid
sequenceDiagram
    autonumber
    participant Caller
    participant HooksMixin
    participant HookA
    participant HookB
    participant Exceptions as AIPerfMultiError
    Caller->>HooksMixin: run_hooks([...])
    HooksMixin->>HookA: invoke
    alt HookA raises
        HookA-->>HooksMixin: Exception e1
        HooksMixin->>HooksMixin: create HookError(class, func, e1)
    end
    HooksMixin->>HookB: invoke
    alt HookB raises
        HookB-->>HooksMixin: Exception e2
        HooksMixin->>HooksMixin: create HookError(class, func, e2)
    end
    alt any errors
        HooksMixin-->>Caller: raise AIPerfMultiError(None, [HookError...])
    else
        HooksMixin-->>Caller: return success
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Force-pushed 35a367f to f1c1b8c
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
.gitignore (1)
7-7: Fix typo: __pycache__ directory name. The entry is misspelled, so it won't match Python's cache directories.

Apply this diff:

```diff
-__pychache__/
+__pycache__/
```

aiperf/controller/multiprocess_service_manager.py (1)
207-221: Fix: join() won't raise asyncio.TimeoutError; may leave processes running. multiprocessing.Process.join(timeout=…) returns after the timeout without raising, so the except block won't execute. If the process stays alive, you won't kill it.

Apply this diff:

```diff
-            try:
-                info.process.terminate()
-                await asyncio.to_thread(
-                    info.process.join, timeout=TASK_CANCEL_TIMEOUT_SHORT
-                )
-                self.debug(
-                    f"Service {info.service_type} process stopped (pid: {info.process.pid})"
-                )
-            except asyncio.TimeoutError:
-                self.warning(
-                    f"Service {info.service_type} process (pid: {info.process.pid}) did not terminate gracefully, killing"
-                )
-                info.process.kill()
+            # Ask process to terminate and wait briefly; kill if still alive.
+            info.process.terminate()
+            await asyncio.to_thread(info.process.join, timeout=TASK_CANCEL_TIMEOUT_SHORT)
+            if info.process.is_alive():
+                self.warning(
+                    f"Service {info.service_type} process (pid: {info.process.pid}) did not terminate gracefully, killing"
+                )
+                info.process.kill()
+                # Best-effort final join
+                await asyncio.to_thread(info.process.join, timeout=TASK_CANCEL_TIMEOUT_SHORT)
+            else:
+                self.debug(
+                    f"Service {info.service_type} process stopped (pid: {info.process.pid})"
+                )
```

aiperf/dataset/dataset_manager.py (1)
93-107: Guard tokenizer fallback: avoid IndexError and provide a clear config error. If user_config.tokenizer.name is None and endpoint.model_names is missing/empty, this will raise an unhelpful IndexError. Raise a ConfigurationError instead.

```diff
+from aiperf.common.exceptions import ConfigurationError
@@
-        tokenizer_name = self.user_config.tokenizer.name
-        if tokenizer_name is None:
-            # TODO: What do we do if there are multiple models?
-            # How will we know which tokenizer to use?
-            tokenizer_name = self.user_config.endpoint.model_names[0]
+        tokenizer_name = self.user_config.tokenizer.name
+        if tokenizer_name is None:
+            # TODO: What do we do if there are multiple models?
+            # How will we know which tokenizer to use?
+            model_names = getattr(self.user_config.endpoint, "model_names", None) or []
+            if not model_names:
+                raise ConfigurationError(
+                    "Tokenizer name is not set and endpoint.model_names is empty"
+                )
+            tokenizer_name = model_names[0]
```

tests/mixins/test_aiperf_lifecycle_context_manager.py (1)
200-215: Patch async methods as AsyncMock and use LifecycleState instead of strings. Awaiting a non-awaitable mock will fail. Also assert with typed states.

```diff
-        with (
-            patch.object(lifecycle_component, "stop") as mock_stop,
-            patch.object(lifecycle_component, "_set_state") as mock_set_state,
-        ):
-            # Set a non-stopping state to ensure stop() gets called
-            await lifecycle_component._set_state("running")
+        with (
+            patch.object(lifecycle_component, "stop", new_callable=AsyncMock) as mock_stop,
+            patch.object(lifecycle_component, "_set_state", new_callable=AsyncMock) as mock_set_state,
+        ):
+            # Set a non-stopping state to ensure stop() gets called
+            await lifecycle_component._set_state(LifecycleState.RUNNING)
@@
-        # Verify that the state was set to FAILED
-        mock_set_state.assert_called_with("failed")
+        # Verify that the state was set to FAILED
+        mock_set_state.assert_called_with(LifecycleState.FAILED)
```
🧹 Nitpick comments (19)
Makefile (1)

115-115: Optional: show missing lines in terminal coverage. term-missing with skip-covered makes terminal output more actionable.

Apply this diff:

```diff
-	$(activate_venv) && pytest -n auto --cov=aiperf --cov-branch --cov-report=html --cov-report=xml --cov-report=term $(args)
+	$(activate_venv) && pytest -n auto --cov=aiperf --cov-branch --cov-report=html --cov-report=xml --cov-report=term-missing:skip-covered $(args)
```

aiperf/controller/multiprocess_service_manager.py (1)
174-179: Improve clarity and diagnostics in the process liveness check. Works as intended; suggest clearer naming and include exitcode for easier debugging. Also addresses Ruff TRY003 by keeping messages concise/specific.

Apply this diff:

```diff
-        for process in self.multi_process_info:
-            if not process.process or not process.process.is_alive():
-                raise AIPerfError(
-                    f"Service process {process.service_id} died before registering"
-                )
+        for info in self.multi_process_info:
+            p = info.process
+            if not p:
+                raise AIPerfError(
+                    f"Service {info.service_type} ({info.service_id}) missing process handle before registering"
+                )
+            if not p.is_alive():
+                raise AIPerfError(
+                    f"Service {info.service_type} ({info.service_id}) died before registering (exitcode={p.exitcode})"
+                )
```

aiperf/common/models/error_models.py (1)
54-57: Optionally enforce non-empty operation strings. Add a minimal constraint to avoid empty operation values slipping through.

```diff
-    operation: str = Field(
-        ...,
-        description="The operation that caused the error.",
-    )
+    operation: str = Field(
+        ...,
+        min_length=1,
+        description="The operation that caused the error.",
+    )
```

tests/controller/test_multiprocess_service_manager.py (3)
53-56: Prefer monkeypatch over reassigning asyncio.sleep. Directly reassigning asyncio.sleep leaks across tests; use pytest's monkeypatch for isolation.

```diff
-    async def test_process_dies_before_registration_raises_error(
-        self, service_manager: MultiProcessServiceManager, mock_dead_process: MagicMock
-    ):
+    async def test_process_dies_before_registration_raises_error(
+        self,
+        service_manager: MultiProcessServiceManager,
+        mock_dead_process: MagicMock,
+        monkeypatch: pytest.MonkeyPatch,
+    ):
@@
-        asyncio.sleep = real_sleep
+        monkeypatch.setattr(asyncio, "sleep", real_sleep)
```

Also applies to: 65-65
138-145: Use monkeypatch here as well to avoid global side effects. Keep the sleep speedup local to this test.

```diff
-        # Sleep for a fraction of the time for faster test execution
-        asyncio.sleep = lambda x: real_sleep(0.01 * x)
+        # Sleep for a fraction of the time for faster test execution
+        monkeypatch.setattr(asyncio, "sleep", lambda x: real_sleep(0.01 * x))
```
160-166: Store and await the created task. Avoids dangling tasks and satisfies Ruff RUF006.

```diff
-        asyncio.create_task(set_stop_event())
+        stop_task = asyncio.create_task(set_stop_event())
@@
-        await service_manager.wait_for_all_services_registration(
+        await service_manager.wait_for_all_services_registration(
             stop_event=stop_event, timeout_seconds=10
         )
+        await stop_task
```

tests/test_hooks.py (1)
298-299: Silence TRY003 lints in tests (optional). These raises intentionally use messages for assertions; add noqa to appease Ruff without altering behavior.

```diff
-        raise ValueError("Something went wrong in the hook")
+        raise ValueError("Something went wrong in the hook")  # noqa: TRY003
@@
-        raise ValueError("First error")
+        raise ValueError("First error")  # noqa: TRY003
@@
-        raise RuntimeError("Second error")
+        raise RuntimeError("Second error")  # noqa: TRY003
@@
-        raise RuntimeError("Hook failed")
+        raise RuntimeError("Hook failed")  # noqa: TRY003
```

Also applies to: 324-329, 417-417
aiperf/dataset/dataset_manager.py (1)
75-79: Silence Ruff ARG002: mark the unused command argument. The message parameter isn't used. Rename it with a leading underscore to satisfy ARG002 without behavior change.

```diff
-    async def _profile_configure_command(
-        self, message: ProfileConfigureCommand
-    ) -> None:
+    async def _profile_configure_command(
+        self, _message: ProfileConfigureCommand
+    ) -> None:
```

aiperf/controller/controller_utils.py (1)
85-93: Optional: guard flush for non-standard streams. Some file-like objects may not expose flush; it's harmless to skip if absent.

```diff
-    console.file.flush()
+    if hasattr(console.file, "flush"):
+        console.file.flush()
```

aiperf/common/mixins/aiperf_lifecycle_mixin.py (2)
202-215: Optional: use lazy logging for consistency and cheaper formatting. Elsewhere you use lazy lambdas; mirror that for error/debug to avoid formatting when disabled.

```diff
-        except Exception as e:
-            self.error(f"Failed to {operation.lower()}: {e}")
+        except Exception as e:
+            self.error(lambda: f"Failed to {operation.lower()}: {e}")
```
271-285: Optional: use lazy logging in _fail for consistency.

```diff
-        self.error(f"Failed for {self}: {e}")
+        self.error(lambda: f"Failed for {self}: {e}")
@@
-        if self.state != LifecycleState.STOPPING:
-            self.debug(f"Stopping {self} due to failure")
+        if self.state != LifecycleState.STOPPING:
+            self.debug(lambda: f"Stopping {self} due to failure")
```

aiperf/common/exceptions.py (2)
17-20: Docstring mismatch: update to reflect behavior. The docstring says "with the class name," but you return the base string only.

```diff
-    def __str__(self) -> str:
-        """Return the string representation of the exception with the class name."""
-        return super().__str__()
+    def __str__(self) -> str:
+        """Return the string representation of the exception (no class name prefix)."""
+        return super().__str__()
```
31-35: Improve readability of aggregated error messages. Separate errors with "; " instead of ",".

```diff
-        if message:
-            super().__init__(f"{message}: {','.join(err_strings)}")
-        else:
-            super().__init__(",".join(err_strings))
+        if message:
+            super().__init__(f"{message}: {'; '.join(err_strings)}")
+        else:
+            super().__init__('; '.join(err_strings))
```

tests/controller/test_system_controller.py (3)
76-89: Remove unused fixture argument mock_exception (Ruff ARG002). It's not used in this test. Drop it from the signature to satisfy linting.

```diff
-    async def test_system_controller_exits_on_profile_configure_error_response(
-        self,
-        system_controller: SystemController,
-        mock_exception: MockTestException,
-        error_response: CommandErrorResponse,
-    ):
+    async def test_system_controller_exits_on_profile_configure_error_response(
+        self,
+        system_controller: SystemController,
+        error_response: CommandErrorResponse,
+    ):
```
109-121: Remove unused fixture argument mock_exception (Ruff ARG002). Same here; it's unused.

```diff
-    async def test_system_controller_exits_on_profile_start_error_response(
-        self,
-        system_controller: SystemController,
-        mock_exception: MockTestException,
-        error_response: CommandErrorResponse,
-    ):
+    async def test_system_controller_exits_on_profile_start_error_response(
+        self,
+        system_controller: SystemController,
+        error_response: CommandErrorResponse,
+    ):
```
76-139: Consider parametrizing the two CommandErrorResponse tests. Both configure/start error tests share structure; parametrize command/operation to reduce duplication, as sketched below.
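A minimal sketch of that parametrization, reusing the method names and operation labels from the analysis elsewhere in this review (the wiring of error_response into the mocked service manager is elided, and the shared test body is an assumption):

```python
import pytest

from aiperf.common.exceptions import LifecycleOperationError


@pytest.mark.parametrize(
    "profile_method, operation",
    [
        ("_profile_configure_all_services", "Configure Profiling"),
        ("_start_profiling_all_services", "Start Profiling"),
    ],
)
async def test_system_controller_exits_on_error_response(
    system_controller, error_response, profile_method, operation
):
    # Shared skeleton of the two existing tests: run the profiling step
    # and expect the error response to be recorded and re-raised.
    with pytest.raises(LifecycleOperationError):
        await getattr(system_controller, profile_method)()
    assert any(e.operation == operation for e in system_controller._exit_errors)
```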
aiperf/controller/system_controller.py (3)
464-469: Reuse a single Console for the error panel and the log path line. Avoids mixed console instances and ensures consistent formatting; also reduces buffering surprises.

```diff
-        else:
-            print_exit_errors(self._exit_errors)
-            self._print_log_file_info(Console())
+        else:
+            console = Console()
+            print_exit_errors(self._exit_errors, console=console)
+            self._print_log_file_info(console)
```
216-241: Raise only when this call added new errors; avoid false positives from prior state. If _exit_errors had entries from earlier operations (it shouldn't happen normally, but defensive code helps), this would raise even when the current responses are all successes.

```diff
     def _parse_responses_for_errors(
         self, responses: list[CommandResponse | ErrorDetails], operation: str
     ) -> None:
         """Parse the responses for errors."""
-        for response in responses:
+        errors_before = len(self._exit_errors)
+        for response in responses:
             if isinstance(response, ErrorDetails):
                 self._exit_errors.append(
                     ExitErrorInfo(
                         error_details=response, operation=operation, service_id=None
                     )
                 )
             elif isinstance(response, CommandErrorResponse):
                 self._exit_errors.append(
                     ExitErrorInfo(
                         error_details=response.error,
                         operation=operation,
                         service_id=response.service_id,
                     )
                 )
-        if self._exit_errors:
+        if len(self._exit_errors) > errors_before:
             raise LifecycleOperationError(
                 operation=operation,
                 original_exception=None,
                 lifecycle_id=self.id,
             )
```
206-214: Timeout/attribution gaps: lost service_id on raw ErrorDetails (e.g., timeouts). send_command_and_wait_for_all_responses returns bare ErrorDetails on timeout, so _parse_responses_for_errors can't attribute failures to specific services (it sets service_id=None). Consider returning a structured result that includes the originating service_id for error cases.

Options:

- Change send_command_and_wait_for_all_responses to return tuples (service_id, response_or_error) and adjust parsing accordingly (see the sketch below).
- Or define a CommandTimeoutErrorResponse (with service_id) to standardize error attribution like CommandErrorResponse.
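For illustration, a sketch of the parsing side under the first option, assuming the transport is changed to yield (service_id, response_or_error) pairs; this is not the current signature, and the import paths follow the module files named in the code graph below:

```python
from aiperf.common.messages.command_messages import CommandErrorResponse, CommandResponse
from aiperf.common.models.error_models import ErrorDetails, ExitErrorInfo


def _parse_responses_for_errors(
    self,
    responses: list[tuple[str, CommandResponse | ErrorDetails]],
    operation: str,
) -> None:
    """Parse (service_id, response) pairs so even timeouts keep attribution."""
    for service_id, response in responses:
        if isinstance(response, ErrorDetails):
            # A bare ErrorDetails (e.g., a timeout) now carries its service_id.
            self._exit_errors.append(
                ExitErrorInfo(
                    error_details=response,
                    operation=operation,
                    service_id=service_id,
                )
            )
        elif isinstance(response, CommandErrorResponse):
            self._exit_errors.append(
                ExitErrorInfo(
                    error_details=response.error,
                    operation=operation,
                    service_id=response.service_id,
                )
            )
```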
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (22)

- .gitignore (1 hunks)
- Makefile (1 hunks)
- aiperf/common/exceptions.py (2 hunks)
- aiperf/common/mixins/aiperf_lifecycle_mixin.py (5 hunks)
- aiperf/common/mixins/hooks_mixin.py (2 hunks)
- aiperf/common/models/__init__.py (2 hunks)
- aiperf/common/models/error_models.py (2 hunks)
- aiperf/controller/__init__.py (2 hunks)
- aiperf/controller/controller_utils.py (1 hunks)
- aiperf/controller/multiprocess_service_manager.py (1 hunks)
- aiperf/controller/system_controller.py (10 hunks)
- aiperf/dataset/dataset_manager.py (2 hunks)
- tests/controller/__init__.py (1 hunks)
- tests/controller/conftest.py (1 hunks)
- tests/controller/test_controller_utils.py (1 hunks)
- tests/controller/test_multiprocess_service_manager.py (1 hunks)
- tests/controller/test_system_controller.py (1 hunks)
- tests/mixins/__init__.py (1 hunks)
- tests/mixins/test_aiperf_lifecycle_context_manager.py (1 hunks)
- tests/mixins/test_aiperf_lifecycle_mixin.py (1 hunks)
- tests/services/test_dataset_manager.py (4 hunks)
- tests/test_hooks.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (18)
aiperf/controller/__init__.py (1)
- aiperf/controller/controller_utils.py (1): print_exit_errors (24-93)

aiperf/controller/multiprocess_service_manager.py (1)
- aiperf/common/exceptions.py (1): AIPerfError (10-19)

aiperf/common/mixins/hooks_mixin.py (3)
- aiperf/common/exceptions.py (3): AIPerfMultiError (22-34), HookError (37-44), UnsupportedHookError (154-155)
- aiperf/common/hooks.py (1): func_name (89-90)
- aiperf/common/mixins/aiperf_logger_mixin.py (1): exception (104-107)

tests/test_hooks.py (3)
- aiperf/common/exceptions.py (3): AIPerfMultiError (22-34), HookError (37-44), UnsupportedHookError (154-155)
- aiperf/common/hooks.py (4): provides_hooks (205-227), AIPerfHook (41-56), on_init (230-247), on_start (250-267)
- aiperf/common/mixins/hooks_mixin.py (2): HooksMixin (17-192), attach_hook (96-114)

aiperf/common/models/error_models.py (2)
- aiperf/common/exceptions.py (1): LifecycleOperationError (63-79)
- aiperf/common/models/base_models.py (1): AIPerfBaseModel (25-53)

aiperf/dataset/dataset_manager.py (6)
- aiperf/common/hooks.py (2): on_command (442-463), on_request (414-439)
- aiperf/common/enums/command_enums.py (1): CommandType (7-16)
- aiperf/timing/timing_manager.py (1): _profile_configure_command (88-129)
- aiperf/records/record_processor_service.py (1): _profile_configure_command (99-103)
- aiperf/common/messages/command_messages.py (1): ProfileConfigureCommand (299-305)
- aiperf/common/protocols.py (1): info (69-69)

tests/controller/test_controller_utils.py (3)
- aiperf/common/models/error_models.py (2): ErrorDetails (11-47), ExitErrorInfo (50-71)
- aiperf/controller/controller_utils.py (2): _group_errors_by_details (14-21), print_exit_errors (24-93)
- tests/controller/conftest.py (1): error_details (68-70)

aiperf/common/models/__init__.py (1)
- aiperf/common/models/error_models.py (1): ExitErrorInfo (50-71)

tests/controller/test_multiprocess_service_manager.py (5)
- aiperf/common/enums/service_enums.py (1): ServiceType (32-49)
- aiperf/common/exceptions.py (1): AIPerfError (10-19)
- aiperf/controller/multiprocess_service_manager.py (3): MultiProcessRunInfo (27-40), MultiProcessServiceManager (45-230), wait_for_all_services_registration (139-200)
- tests/utils/time_traveler.py (2): real_sleep (85-86), sleep (37-43)
- tests/conftest.py (1): service_config (188-192)

aiperf/common/mixins/aiperf_lifecycle_mixin.py (4)
- aiperf/common/enums/service_enums.py (1): LifecycleState (19-29)
- aiperf/common/exceptions.py (2): InvalidStateError (122-123), LifecycleOperationError (63-79)
- aiperf/common/models/error_models.py (4): ErrorDetails (11-47), ExitErrorInfo (50-71), from_lifecycle_operation_error (64-71), from_exception (42-47)
- aiperf/common/mixins/aiperf_logger_mixin.py (2): error (99-102), debug (74-77)

tests/mixins/test_aiperf_lifecycle_context_manager.py (3)
- aiperf/common/exceptions.py (1): LifecycleOperationError (63-79)
- aiperf/common/mixins/aiperf_lifecycle_mixin.py (5): AIPerfLifecycleMixin (37-333), try_operation_or_stop (190-214), initialize_and_start (216-221), _set_state (70-79), _fail (267-285)
- aiperf/common/models/error_models.py (1): ExitErrorInfo (50-71)

tests/services/test_dataset_manager.py (4)
- aiperf/common/messages/command_messages.py (1): ProfileConfigureCommand (299-305)
- aiperf/dataset/dataset_manager.py (1): _profile_configure_command (76-91)
- tests/conftest.py (1): user_config (182-184)
- tests/composers/conftest.py (1): custom_config (163-172)

tests/mixins/test_aiperf_lifecycle_mixin.py (4)
- aiperf/common/enums/service_enums.py (1): LifecycleState (19-29)
- aiperf/common/exceptions.py (1): InvalidStateError (122-123)
- aiperf/common/mixins/aiperf_lifecycle_mixin.py (13): AIPerfLifecycleMixin (37-333), state (64-65), was_initialized (82-83), was_started (86-87), was_stopped (90-91), is_running (94-96), stop_requested (99-101), stop_requested (104-108), initialize (135-162), start (164-187), stop (223-242), _set_state (70-79), initialize_and_start (216-221)
- tests/mixins/test_aiperf_lifecycle_context_manager.py (1): lifecycle_component (18-22)

aiperf/common/exceptions.py (1)
- aiperf/common/protocols.py (1): exception (74-74)

aiperf/controller/controller_utils.py (2)
- aiperf/common/models/error_models.py (2): ErrorDetails (11-47), ExitErrorInfo (50-71)
- tests/controller/conftest.py (1): error_details (68-70)

tests/controller/conftest.py (5)
- aiperf/common/enums/command_enums.py (1): CommandType (7-16)
- aiperf/common/messages/command_messages.py (1): CommandErrorResponse (168-185)
- aiperf/common/models/error_models.py (2): ErrorDetails (11-47), from_exception (42-47)
- tests/conftest.py (1): service_config (188-192)
- aiperf/common/mixins/aiperf_lifecycle_mixin.py (1): stop (223-242)

aiperf/controller/system_controller.py (7)
- aiperf/common/exceptions.py (1): LifecycleOperationError (63-79)
- aiperf/common/models/error_models.py (2): ErrorDetails (11-47), ExitErrorInfo (50-71)
- aiperf/common/models/record_models.py (1): ProcessRecordsResult (92-103)
- aiperf/controller/controller_utils.py (1): print_exit_errors (24-93)
- aiperf/common/mixins/aiperf_lifecycle_mixin.py (1): try_operation_or_stop (190-214)
- aiperf/common/mixins/command_handler_mixin.py (1): send_command_and_wait_for_all_responses (188-222)
- aiperf/common/messages/command_messages.py (2): CommandResponse (100-165), CommandErrorResponse (168-185)

tests/controller/test_system_controller.py (7)
- aiperf/common/enums/command_enums.py (1): CommandType (7-16)
- aiperf/common/exceptions.py (1): LifecycleOperationError (63-79)
- aiperf/common/messages/command_messages.py (1): CommandErrorResponse (168-185)
- aiperf/common/models/error_models.py (3): ErrorDetails (11-47), ExitErrorInfo (50-71), from_exception (42-47)
- tests/controller/conftest.py (6): system_controller (31-58), MockTestException (18-19), error_details (68-70), mock_service_manager (23-27), mock_exception (62-64), error_response (74-81)
- aiperf/controller/system_controller.py (6): SystemController (66-551), initialize (136-143), _initialize_system_controller (146-155), _start_profiling_all_services (203-214), _profile_configure_all_services (183-201), _start_services (158-181)
- aiperf/common/mixins/command_handler_mixin.py (1): send_command_and_wait_for_all_responses (188-222)
🪛 Ruff (0.13.1)
aiperf/controller/multiprocess_service_manager.py
- 176-178: Avoid specifying long messages outside the exception class (TRY003)

tests/test_hooks.py
- 298-298: Avoid specifying long messages outside the exception class (TRY003)
- 324-324: Avoid specifying long messages outside the exception class (TRY003)
- 328-328: Avoid specifying long messages outside the exception class (TRY003)
- 417-417: Avoid specifying long messages outside the exception class (TRY003)

aiperf/dataset/dataset_manager.py
- 77-77: Unused method argument: message (ARG002)

tests/controller/test_multiprocess_service_manager.py
- 160-160: Store a reference to the return value of asyncio.create_task (RUF006)

tests/mixins/test_aiperf_lifecycle_context_manager.py
- 86-86: Create your own exception (TRY002)
- 86-86: Avoid specifying long messages outside the exception class (TRY003)
- 103-103: Create your own exception (TRY002)
- 103-103: Avoid specifying long messages outside the exception class (TRY003)
- 169-169: Avoid specifying long messages outside the exception class (TRY003)

tests/controller/test_system_controller.py
- 80-80: Unused method argument: mock_exception (ARG002)
- 112-112: Unused method argument: mock_exception (ARG002)
🔇 Additional comments (20)
tests/controller/__init__.py (1)
1-2: LGTM: SPDX headers look correct and consistent. No functional impact; compliant usage of SPDX tags.

Adding __init__.py makes tests/controller a package. Confirm this is intentional for pytest discovery/imports and doesn't alter test collection unexpectedly.
tests/mixins/__init__.py (1)
1-2: LGTM: license header only. No functional changes; looks good.
.gitignore (1)
24-24: LGTM: ignore coverage.xml. Keeping CI artifacts out of VCS is good practice.
aiperf/controller/__init__.py (1)
6-8: LGTM: re-exporting print_exit_errors. Public API addition is consistent and non-breaking.
Also applies to: 38-39
aiperf/common/models/__init__.py (1)
31-32: LGTM: ExitErrorInfo re-export. Public surface updated consistently with the error reporting utilities.
Also applies to: 84-85
aiperf/common/mixins/hooks_mixin.py (2)
9-9: LGTM: import HookError. Consistent with the new exception model.

187-193: LGTM: wrap hook exceptions and aggregate. HookError improves diagnostics; AIPerfMultiError aggregation is appropriate.
tests/services/test_dataset_manager.py (1)
11-11: LGTM: switch to ProfileConfigureCommand for dataset setup. Aligns tests with the command-driven configuration flow; tokenizer load is exercised via the same path.
Also applies to: 71-73, 137-139, 197-199
aiperf/common/models/error_models.py (2)
7-7: Import looks correct. Brings in LifecycleOperationError for the mapper; no issues.

50-72: ExitErrorInfo and mapper are well-structured. Fields and the from_lifecycle_operation_error mapping read cleanly and align with ErrorDetails hashing/dedup use.
tests/mixins/test_aiperf_lifecycle_mixin.py (1)
16-181: LGTM on lifecycle basics coverage. Covers state flags, transitions, idempotency, and invalid-state errors comprehensively.
tests/controller/test_controller_utils.py (1)
17-341: Solid, focused tests for grouping and rendering. Good coverage: empty input, dedup, service/operation formatting, wrapping, and the default Console path.
tests/test_hooks.py (1)
7-7: Import updates are appropriate. Brings in the new exception types used by hooks and aggregation.
tests/controller/test_multiprocess_service_manager.py (1)
16-16: No changes needed: the real_sleep import is valid. tests/conftest.py assigns real_sleep = asyncio.sleep, so importing it in the test is correct.

tests/controller/conftest.py (1)
38-42: Remove unnecessary CommunicationFactory patch. The conftest patches aiperf.common.factories.CommunicationFactory, but SystemController doesn't import or use it; remove that patch("aiperf.common.factories.CommunicationFactory") line.

Likely an incorrect or invalid review comment.
aiperf/common/exceptions.py (1)
63-80: LifecycleOperationError addition looks good. API, captured fields, and message behavior align with the new lifecycle flow.
tests/controller/test_system_controller.py (2)
40-71: Good coverage of the happy path. Verifies initialize/start flows and asserts no exit errors while checking core calls. Solid.

140-187: Lifecycle error wrapping behavior validated well. The tests ensure inner exceptions are wrapped with the higher-level operation and that exit errors record the outer operation while preserving the inner message. Nice attention to detail.
aiperf/controller/system_controller.py (2)
152-175: Strong: guarded lifecycle steps with try_operation_or_stop. Wrapping initialize/start/register with unified error capture and propagation aligns with the PR goal to exit on startup errors.
474-476: Hard exit via os._exit: confirm intent and logging guarantees. os._exit bypasses atexit handlers and may skip remaining flushes in third-party libs. Given you explicitly flush the console and stop services/UI, this is likely intentional. Verify logging handlers/queues are flushed before this point.

Suggested follow-up:
- Ensure the global logging queue/handlers flush before exit (e.g., call logging.shutdown() or equivalent in your logging setup); see the sketch below.
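For instance, a minimal sketch of a hardened exit helper; the wrapper name is illustrative, while logging.shutdown() is the stdlib call that flushes and closes all registered handlers:

```python
import logging
import os


def _hard_exit(exit_code: int) -> None:
    """Flush and close logging handlers, then exit without running atexit hooks."""
    logging.shutdown()  # flushes and closes all handlers registered with the logging module
    os._exit(exit_code)
```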
Very clean work! I appreciate the defensive programming and the comprehensive unit testing. 🚀
CodeRabbit caught some good edge cases. Once that feedback is addressed, I think this is good to merge.
Actionable comments posted: 2
🧹 Nitpick comments (6)
tests/mixins/test_aiperf_lifecycle_context_manager.py (3)
23-23: Initialize fixture state explicitly. Setting component._exit_errors = [] is fine, but consider constructing via a helper/factory to avoid reaching into internals in tests (see the sketch below). Not critical.
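A hypothetical factory fixture along those lines, which at least keeps the private-attribute access in one place (constructor arguments are elided; names are illustrative):

```python
import pytest

from aiperf.common.mixins.aiperf_lifecycle_mixin import AIPerfLifecycleMixin


@pytest.fixture
def make_lifecycle_component():
    """Build lifecycle components with a known-clean exit error list."""

    def _make(exit_errors=None):
        component = AIPerfLifecycleMixin()  # constructor args elided for brevity
        component._exit_errors = exit_errors if exit_errors is not None else []
        return component

    return _make
```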
170-185: Patch at the usage site to avoid aliasing pitfalls. Patching ExitErrorInfo.from_lifecycle_operation_error on aiperf.common.models works, but patching where it's used is more robust:

- Target: aiperf.common.mixins.aiperf_lifecycle_mixin.ExitErrorInfo.from_lifecycle_operation_error

This prevents issues if the class is re-exported or imported under a different alias. A sketch follows.
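What targeting the usage site looks like in a test (the patch target string is the one named above; the exercised call is elided):

```python
from unittest.mock import patch

with patch(
    "aiperf.common.mixins.aiperf_lifecycle_mixin.ExitErrorInfo.from_lifecycle_operation_error"
) as mock_from_error:
    ...  # exercise the failure path that records exit errors
mock_from_error.assert_called()
```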
187-228: Minor test intent nit: state won't change due to patched _set_state. You patch _set_state with AsyncMock, then await lifecycle_component._set_state(LifecycleState.RUNNING). That call won't update internal state. The test only needs a non-STOPPING state, so it still passes, but if you want the state to actually change, don't patch _set_state, or set lifecycle_component._state = LifecycleState.RUNNING directly, as sketched below.
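A tiny sketch of the latter; the _state attribute name is an assumption about the mixin's internals and should be checked against _set_state:

```python
# Drive the state directly instead of patching _set_state, so the
# component really is in RUNNING when the failure path is exercised.
lifecycle_component._state = LifecycleState.RUNNING  # assumed backing attribute
assert lifecycle_component.state == LifecycleState.RUNNING
```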
aiperf/controller/system_controller.py (3)

191-201: Wrap configure in try_operation_or_stop to improve error attribution. If _parse_responses_for_errors raises, the failure is logged under "Start" by the outer context. Wrap this call so errors attribute to "Configure Profiling".

Suggested change:

```diff
-        responses = await self.send_command_and_wait_for_all_responses(
+        responses = await self.send_command_and_wait_for_all_responses(
             ProfileConfigureCommand(
                 service_id=self.service_id,
                 config=self.user_config,
             ),
             list(self.service_manager.service_id_map.keys()),
             timeout=DEFAULT_PROFILE_CONFIGURE_TIMEOUT,
         )
-        duration = time.perf_counter() - begin
-        self._parse_responses_for_errors(responses, "Configure Profiling")
+        duration = time.perf_counter() - begin
+        # Attribute any errors to Configure Profiling (not Start)
+        try:
+            self._parse_responses_for_errors(responses, "Configure Profiling")
+        except LifecycleOperationError:
+            # Re-raise so initialize_and_start() can handle graceful shutdown
+            raise
         self.info(f"All services configured in {duration:.2f} seconds")
```
206-214: Also wrap start profiling in try_operation_or_stop for consistent attribution. Same rationale as configure.

```diff
-        responses = await self.send_command_and_wait_for_all_responses(
+        responses = await self.send_command_and_wait_for_all_responses(
             ProfileStartCommand(
                 service_id=self.service_id,
             ),
             list(self.service_manager.service_id_map.keys()),
             timeout=DEFAULT_PROFILE_START_TIMEOUT,
         )
-        self._parse_responses_for_errors(responses, "Start Profiling")
+        try:
+            self._parse_responses_for_errors(responses, "Start Profiling")
+        except LifecycleOperationError:
+            raise
         self.info("All services started profiling successfully")
```
462-475: Flushes are correctly placed before hard-exit; consider centralizing. You flush after the major print paths. Optionally add a flush inside _print_log_file_info so any future callers are covered without needing to remember to flush.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)

- aiperf/common/config/dev_config.py (1 hunks)
- aiperf/controller/controller_utils.py (1 hunks)
- aiperf/controller/system_controller.py (10 hunks)
- tests/mixins/test_aiperf_lifecycle_context_manager.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
aiperf/controller/system_controller.py (6)
- aiperf/common/exceptions.py (1): LifecycleOperationError (63-79)
- aiperf/common/models/error_models.py (2): ErrorDetails (11-47), ExitErrorInfo (50-71)
- aiperf/controller/controller_utils.py (1): print_exit_errors (24-95)
- aiperf/common/mixins/aiperf_lifecycle_mixin.py (1): try_operation_or_stop (190-214)
- aiperf/common/mixins/command_handler_mixin.py (1): send_command_and_wait_for_all_responses (188-222)
- aiperf/common/messages/command_messages.py (2): CommandResponse (100-165), CommandErrorResponse (168-185)

aiperf/controller/controller_utils.py (1)
- aiperf/common/models/error_models.py (2): ErrorDetails (11-47), ExitErrorInfo (50-71)

tests/mixins/test_aiperf_lifecycle_context_manager.py (4)
- aiperf/common/enums/service_enums.py (1): LifecycleState (19-29)
- aiperf/common/exceptions.py (1): LifecycleOperationError (63-79)
- aiperf/common/mixins/aiperf_lifecycle_mixin.py (4): try_operation_or_stop (190-214), initialize_and_start (216-221), _set_state (70-79), _fail (267-285)
- aiperf/common/models/error_models.py (1): ExitErrorInfo (50-71)
🪛 Ruff (0.13.1)
tests/mixins/test_aiperf_lifecycle_context_manager.py
- 87-87: Create your own exception (TRY002)
- 87-87: Avoid specifying long messages outside the exception class (TRY003)
- 104-104: Create your own exception (TRY002)
- 104-104: Avoid specifying long messages outside the exception class (TRY003)
- 178-178: Avoid specifying long messages outside the exception class (TRY003)
⏰ Context from checks skipped due to timeout of 90000ms (1). You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms).
- GitHub Check: build (ubuntu-latest, 3.10)
🔇 Additional comments (5)
aiperf/common/config/dev_config.py (1)
34-34: Good addition: explicit flush ensures the warning survives hard-exit. This prevents the panel output from being lost before os._exit. No further changes needed.
tests/mixins/test_aiperf_lifecycle_context_manager.py (1)
123-149: Async patches are correct; success/failure paths are awaitable. Switching to AsyncMock for initialize and start avoids non-awaitable mocks. Looks good.

aiperf/controller/system_controller.py (1)
514-520: Optional: flush after printing the log file path. Centralize the flush to avoid relying on callers.

```diff
     def _print_log_file_info(self, console: Console) -> None:
         """Print the log file info."""
         log_file = self.user_config.output.artifact_directory / "logs" / "aiperf.log"
         console.print(
             f"[bold green]Log File:[/bold green] [cyan]{log_file.resolve()}[/cyan]"
         )
+        console.file.flush()
```

aiperf/controller/controller_utils.py (2)
75-82: Width clamp and wrapping look good. Using max(console.size.width - 15, 20) avoids textwrap errors on narrow terminals. Nice.
87-95: Good: flush after printing the error panel. Ensures errors remain visible when followed by os._exit.
Actionable comments posted: 0
🧹 Nitpick comments (3)
tests/controller/test_controller_utils.py (3)
41-46: Also assert no Console is created for empty inputs. Prevents accidental side effects and unnecessary work when there's nothing to print.

Apply this diff to add a test:

```diff
     @pytest.mark.parametrize("errors", [None, []])
     def test_empty_input_handling(self, errors):
         """Test that print_exit_errors handles None and empty list gracefully."""
         # Should not raise any exception
         print_exit_errors(errors)
+
+    @pytest.mark.parametrize("errors", [None, []])
+    def test_empty_input_skips_console_creation(self, errors):
+        """Ensure we don't instantiate a Console when there's nothing to print."""
+        with patch("aiperf.controller.controller_utils.Console") as mock_console_class:
+            print_exit_errors(errors)
+            mock_console_class.assert_not_called()
```
83-99: Make the wrapping assertion less brittle. Counting total lines can be flaky across terminals. Assert the presence of a line break after "Reason:" instead.

Apply these diffs:

```diff
-import pytest
+import pytest
+import re
@@
         result = output.getvalue()
-        assert "Reason:" in result
-        assert "This is a very long error message" in result
-        assert len(result.split("\n")) > 8
+        assert "Reason:" in result
+        assert "This is a very long error message" in result
+        # Ensure wrapping occurred: at least one newline after "Reason:" within the message block
+        assert re.search(r"Reason:.*\n", result), "Expected wrapped message on multiple lines"
```
38-67: Add a test for the Unknown error type fallback. Covers the type=None case where "Unknown" should be displayed.

Apply this diff to add the test within TestPrintExitErrors:

```diff
 class TestPrintExitErrors:
     """Test the print_exit_errors function."""
@@
     def test_service_id_handling(self, service_id, expected_display):
         """Test service_id display for various values."""
         error = _create_basic_error(service_id=service_id)
         console, output = _create_test_console_output(80)
         print_exit_errors([error], console)
         result = output.getvalue()
         assert f"• Service: {expected_display}" in result
         assert "Operation: Test Operation" in result
         assert "Error: TestError" in result
         assert "Reason: Test message" in result
+
+    def test_unknown_error_type_fallback(self):
+        """Error type None should render as 'Unknown'."""
+        error = ExitErrorInfo(
+            error_details=ErrorDetails(type=None, message="Some message"),
+            operation="Op",
+            service_id="svc",
+        )
+        console, output = _create_test_console_output(80)
+        print_exit_errors([error], console)
+        assert "Error: Unknown" in output.getvalue()
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)

- tests/controller/test_controller_utils.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/controller/test_controller_utils.py (2)
- aiperf/common/models/error_models.py (2): ErrorDetails (11-47), ExitErrorInfo (50-71)
- aiperf/controller/controller_utils.py (2): _group_errors_by_details (14-21), print_exit_errors (24-95)
🔇 Additional comments (6)
tests/controller/test_controller_utils.py (6)
1-2: License headers: good to go. SPDX headers present and correct.

47-66: Service ID display coverage is solid. Covers None/empty and concrete IDs; aligns with the "N/A" fallback.

100-115: Default console creation behavior verified. Good verification of instantiation, the print call, and flush.

143-160: Grouping duplicates by ErrorDetails: tests look correct. Covers equality/hash semantics and multi-entry aggregation.

217-247: Dedup and grouping expectations are precise. A single block for identical errors and the aggregated service listing are validated well.

316-343: Service display formatting matrix is comprehensive. Covers 1, 2, 3, and >3 with the "etc." truncation.
the-david-oy left a comment:
Excellent!
The benchmark system will be shut down and an error will be printed to the console, along with the path to the log files, in any of the following situations:

- An error during initialize or start of the system controller
- An error response to the PROFILE_CONFIGURE command
- An error response to the PROFILE_START command

The code will wait until the UI dashboard has exited to print the errors, in order to ensure they are not cleared.
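A condensed sketch of that shutdown path (simplified, with illustrative names; the real flow lives in aiperf/controller/system_controller.py, and the self.ui handle is an assumption):

```python
import asyncio
import os

from rich.console import Console

from aiperf.controller import print_exit_errors


async def _shutdown_and_report(self) -> None:
    await self.ui.stop()      # let the dashboard exit so errors aren't cleared
    await asyncio.sleep(0.1)  # brief grace period, per the sequence diagram
    if self._exit_errors:
        console = Console()
        print_exit_errors(self._exit_errors, console=console)
        self._print_log_file_info(console)  # show the path to the log files
        os._exit(1)
    os._exit(0)
```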
I have moved the loading of the tokenizer for the dataset manager into profile_configure, where it should belong going forward. Before, if the tokenizer failed to load, the process would die before registering.
Added lots of new unit tests
Pics or it didn't happen:
Logs:
Summary by CodeRabbit
New Features
Bug Fixes
Chores
Tests