Critical (need help): benchmark fails with aiperf 0.4 but succeeds with aiperf 0.3 #578

@Maze999

Description

With 0.4.0:

It starts displaying the UI metrics, and then within 30 seconds to a minute it fails.

Image

Full log (from another run): I had to kill the run shown in the screenshot above, so its complete log is not available, but the issue seems consistent.

2026-01-16 10:49:41 - SyntheticDatasetComposer - INFO - Using default sampling strategy for synthetic dataset: shuffle
2026-01-16 10:49:41 - dataset_manager_ce93faa4 - INFO - Generating inputs.json file at /workdir/AIE_1.10.0/llama-3.1-70b-instruct/nim_1.10/new-comp2/aiperf/aiperf-artifact/ISL200_OSL200/CON512/inputs.json
2026-01-16 10:49:41 - dataset_manager_ce93faa4 - INFO - inputs.json file generated in 0.01 seconds
2026-01-16 10:49:41 - dataset_manager_ce93faa4 - INFO - Dataset configured in 0.33 seconds
2026-01-16 10:49:48 - InferenceResultParser_c0198a97 - INFO - Initialized tokenizers: {'meta/llama-3.1-70b-instruct': {'class': 'PreTrainedTokenizerFast', 'name_or_path': 'meta-llama/Llama-3.1-70B-Instruct'}} in 9.62 seconds
2026-01-16 10:49:48 - system_controller - INFO - All services configured in 9.62 seconds
2026-01-16 10:49:48 - system_controller - INFO - AIPerf System is CONFIGURED
2026-01-16 10:49:48 - timing_manager_7b415879 - INFO - Credit issuing strategy for Request_Rate started
2026-01-16 10:50:48 - system_controller - ERROR - Error running Hook(func=<bound method SystemController._start_services of <SystemController system_controller (state=starting)>>, params=None) hook for SystemController: Failed to perform operation 'Start Profiling'
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/aiperf/common/mixins/hooks_mixin.py", line 186, in run_hooks
    await hook(**kwargs)
  File "/usr/local/lib/python3.12/dist-packages/aiperf/common/hooks.py", line 115, in call
    await self.func(**kwargs)
  File "/usr/local/lib/python3.12/dist-packages/aiperf/controller/system_controller.py", line 212, in _start_services
    await self._start_profiling_all_services()
  File "/usr/local/lib/python3.12/dist-packages/aiperf/controller/system_controller.py", line 244, in _start_profiling_all_services
    self._parse_responses_for_errors(responses, "Start Profiling")
  File "/usr/local/lib/python3.12/dist-packages/aiperf/controller/system_controller.py", line 267, in _parse_responses_for_errors
    raise LifecycleOperationError(
aiperf.common.exceptions.LifecycleOperationError: Failed to perform operation 'Start Profiling'
2026-01-16 10:50:48 - system_controller - ERROR - Failed for SystemController (id=system_controller): SystemController._start_services: Failed to perform operation 'Start Profiling'
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
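
For anyone triaging a log like the one above, here is a rough sketch of a helper that reports how long the run stayed healthy before the first ERROR record and how many errors each logger emitted. The line format is inferred only from the excerpt above, and `summarize` and `LINE_RE` are hypothetical names for illustration, not part of aiperf.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed record layout, taken from the log excerpt:
# "<YYYY-MM-DD HH:MM:SS> - <logger> - <LEVEL> - <message>"
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - (\S+) - (\w+) - (.*)$"
)


def summarize(log_text):
    """Return (seconds between the last pre-error record and the first ERROR,
    Counter of ERROR records per logger)."""
    last_ok = None
    first_error = None
    errors = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # traceback lines carry no timestamp, so skip them
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        logger, level = match.group(2), match.group(3)
        if level == "ERROR":
            errors[logger] += 1
            if first_error is None:
                first_error = stamp
        elif first_error is None:
            last_ok = stamp
    gap = None
    if last_ok is not None and first_error is not None:
        gap = (first_error - last_ok).total_seconds()
    return gap, errors


# Three records copied from the log above, plus one traceback line.
sample = """\
2026-01-16 10:49:48 - timing_manager_7b415879 - INFO - Credit issuing strategy for Request_Rate started
2026-01-16 10:50:48 - system_controller - ERROR - Failed for SystemController (id=system_controller): SystemController._start_services: Failed to perform operation 'Start Profiling'
Traceback (most recent call last):
2026-01-16 10:50:48 - AioHttpClient - ERROR - Error in aiohttp request: RuntimeError('Connection closed.')
"""

gap, errors = summarize(sample)
print(gap, dict(errors))
```

On the excerpt above, the gap between the last INFO record and the first ERROR is exactly 60 seconds. That might be worth checking against any timeout that changed between 0.3 and 0.4, although the log alone does not confirm the cause.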

Execution parameters:

NIM: nvcr.io/nim/meta/llama-3.1-70b-instruct-pb25h1:1.10 (2 replicas)

Command:

./aiperf-bench.sh --url http://x.x.x.x --model meta/llama-3.1-70b-instruct --tokenizer meta-llama/Llama-3.1-70B-Instruct --concurrency-values 1,2,4,8,16,32,64,128,256,512 --use-cases Search,Summarization,Translation --benchmark-duration 900 --benchmark-grace-period 0 --profile-export-file llama-31-70b-fp8-tp2-pp1-latency-2nim-aiperfv04-run4

With everything else unchanged, the benchmark succeeds after I remove version 0.4 and install version 0.3:

Image