Description
With 0.4.0:
It starts displaying the UI metrics, and then, within 30 seconds to a minute, it fails.
Full log (from another run): I had to kill the run shown in the screenshot above, so its complete log is not available, but the issue seems consistent.
2026-01-16 10:49:41 - SyntheticDatasetComposer - INFO - Using default sampling strategy for synthetic dataset: shuffle
2026-01-16 10:49:41 - dataset_manager_ce93faa4 - INFO - Generating inputs.json file at /workdir/AIE_1.10.0/llama-3.1-70b-instruct/nim_1.10/new-comp2/aiperf/aiperf-artifact/ISL200_OSL200/CON512/inputs.json
2026-01-16 10:49:41 - dataset_manager_ce93faa4 - INFO - inputs.json file generated in 0.01 seconds
2026-01-16 10:49:41 - dataset_manager_ce93faa4 - INFO - Dataset configured in 0.33 seconds
2026-01-16 10:49:48 - InferenceResultParser_c0198a97 - INFO - Initialized tokenizers: {'meta/llama-3.1-70b-instruct': {'class': 'PreTrainedTokenizerFast', 'name_or_path': 'meta-llama/Llama-3.1-70B-Instruct'}} in 9.62 seconds
2026-01-16 10:49:48 - system_controller - INFO - All services configured in 9.62 seconds
2026-01-16 10:49:48 - system_controller - INFO - AIPerf System is CONFIGURED
2026-01-16 10:49:48 - timing_manager_7b415879 - INFO - Credit issuing strategy for Request_Rate started
2026-01-16 10:50:48 - system_controller - ERROR - Error running Hook(func=<bound method SystemController._start_services of <SystemController system_controller (state=starting)>>, params=None) hook for SystemController: Failed to perform operation 'Start Profiling'
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/aiperf/common/mixins/hooks_mixin.py", line 186, in run_hooks
await hook(**kwargs)
File "/usr/local/lib/python3.12/dist-packages/aiperf/common/hooks.py", line 115, in call
await self.func(**kwargs)
File "/usr/local/lib/python3.12/dist-packages/aiperf/controller/system_controller.py", line 212, in _start_services
await self._start_profiling_all_services()
File "/usr/local/lib/python3.12/dist-packages/aiperf/controller/system_controller.py", line 244, in _start_profiling_all_services
self._parse_responses_for_errors(responses, "Start Profiling")
File "/usr/local/lib/python3.12/dist-packages/aiperf/controller/system_controller.py", line 267, in _parse_responses_for_errors
raise LifecycleOperationError(
aiperf.common.exceptions.LifecycleOperationError: Failed to perform operation 'Start Profiling'
2026-01-16 10:50:48 - system_controller - ERROR - Failed for SystemController (id=system_controller): SystemController._start_services: Failed to perform operation 'Start Profiling'
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
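In this log, the 'Start Profiling' failure appears exactly 60 seconds after credit issuing starts (10:49:48 to 10:50:48). That might point to a fixed one-minute timeout rather than a load-dependent crash, but this is only a guess from one log. The small Python sketch below measures that gap so it can be compared against other runs' logs. It assumes every line starts with a `YYYY-MM-DD HH:MM:SS` timestamp, as in this log. The helper names (`parse_timestamp`, `seconds_until_failure`, `count_connection_closed`) are hypothetical and are not part of aiperf.

```python
from datetime import datetime

# Excerpt of the log above (lines copied verbatim). In practice, read the
# full log file instead, e.g. open(path).read().
LOG_EXCERPT = """\
2026-01-16 10:49:48 - timing_manager_7b415879 - INFO - Credit issuing strategy for Request_Rate started
2026-01-16 10:50:48 - system_controller - ERROR - Failed for SystemController (id=system_controller): SystemController._start_services: Failed to perform operation 'Start Profiling'
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
"""

# Assumed timestamp format: the first 19 characters of every log line.
TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_timestamp(line: str) -> datetime:
    """Parse the leading 'YYYY-MM-DD HH:MM:SS' timestamp of a log line."""
    return datetime.strptime(line[:19], TS_FORMAT)


def seconds_until_failure(log_text: str) -> float:
    """Seconds between credit issuing starting and the first 'Start Profiling' error."""
    start = failure = None
    for line in log_text.splitlines():
        if start is None and "Credit issuing strategy" in line:
            start = parse_timestamp(line)
        elif failure is None and "Failed to perform operation 'Start Profiling'" in line:
            failure = parse_timestamp(line)
    if start is None or failure is None:
        raise ValueError("log must contain both the start line and the failure line")
    return (failure - start).total_seconds()


def count_connection_closed(log_text: str) -> int:
    """Count the in-flight requests that were aborted with 'Connection closed.'."""
    return sum("RuntimeError('Connection closed.')" in line for line in log_text.splitlines())


print(seconds_until_failure(LOG_EXCERPT))  # 60.0
print(count_connection_closed(LOG_EXCERPT))  # 1
```

If several runs all show a gap of almost exactly 60 seconds, that would support the timeout hypothesis. Widely varying gaps would point elsewhere.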
Execution parameters:
NIM: nvcr.io/nim/meta/llama-3.1-70b-instruct-pb25h1:1.10 (2 replicas)
./aiperf-bench.sh --url http://x.x.x.x --model meta/llama-3.1-70b-instruct --tokenizer meta-llama/Llama-3.1-70B-Instruct --concurrency-values 1,2,4,8,16,32,64,128,256,512 --use-cases Search,Summarization,Translation --benchmark-duration 900 --benchmark-grace-period 0 --profile-export-file llama-31-70b-fp8-tp2-pp1-latency-2nim-aiperfv04-run4
With everything else unchanged, the run succeeds after I uninstall 0.4 and install 0.3:
