Description
I am running the MLPerf inference benchmark for the Llama2-70b-99 model on a cluster with six MI210 GPUs. Below is the command I am using with CM:

```
cm run script --tags=run-mlperf,inference,_find-performance,_full,_r5.0-dev --model=llama2-70b-99 --implementation=reference --framework=pytorch --category=datacenter --scenario=Offline --execution_mode=test --device=rocm --quiet --test_query_count=10 --env.LLAMA2_CHECKPOINT_PATH=/home/intern01/Llama-2-70b-chat-hf
```
When I try to run the script with the `--device=rocm` option, I get the error shown below. It seems that `rocm` is not recognized as a valid device option, as the script only accepts `cpu` or `cuda:0`. This is the full output:

```
CM script::benchmark-program/run.sh
Run Directory: /home/intern01/CM/repos/local/cache/12bee67ce1d840d4/inference/language/llama2-70b
CMD: /home/intern01/CM/repos/local/cache/def32291fe4247de/mlperf/bin/python3 main.py --scenario Offline --dataset-path /home/intern01/CM/repos/local/cache/b4603ed8799641d8/open_orca/open_orca_gpt4_tokenized_llama.sampled_24576.pkl.gz --device rocm --total-sample-count 10 --user-conf '/home/intern01/CM/repos/mlcommons@mlperf-automations/script/generate-mlperf-inference-user-conf/tmp/8b4fd7479b754685ab7d620e3a9af93e.conf' --output-log-dir /home/intern01/CM/repos/local/cache/0b04afd372744cef/test_results/gn005-reference-rocm-pytorch-v2.6.0.dev20241122-scc24-base/llama2-70b-99/offline/performance/run_1 --dtype float16 --model-path /home/intern01/Llama-2-70b-chat-hf 2>&1 | tee '/home/intern01/CM/repos/local/cache/0b04afd372744cef/test_results/gn005-reference-rocm-pytorch-v2.6.0.dev20241122-scc24-base/llama2-70b-99/offline/performance/run_1/console.out'; echo \${PIPESTATUS[0]} > exitstatus
INFO:root: ! cd /home/intern01/CM/repos/local/cache/dd75d90466a24ac1
INFO:root: ! call /home/intern01/CM/repos/mlcommons@mlperf-automations/script/benchmark-program/run.sh from tmp-run.sh
/home/intern01/CM/repos/local/cache/def32291fe4247de/mlperf/bin/python3 main.py --scenario Offline --dataset-path /home/intern01/CM/repos/local/cache/b4603ed8799641d8/open_orca/open_orca_gpt4_tokenized_llama.sampled_24576.pkl.gz --device rocm --total-sample-count 10 --user-conf '/home/intern01/CM/repos/mlcommons@mlperf-automations/script/generate-mlperf-inference-user-conf/tmp/8b4fd7479b754685ab7d620e3a9af93e.conf' --output-log-dir /home/intern01/CM/repos/local/cache/0b04afd372744cef/test_results/gn005-reference-rocm-pytorch-v2.6.0.dev20241122-scc24-base/llama2-70b-99/offline/performance/run_1 --dtype float16 --model-path /home/intern01/Llama-2-70b-chat-hf 2>&1 | tee '/home/intern01/CM/repos/local/cache/0b04afd372744cef/test_results/gn005-reference-rocm-pytorch-v2.6.0.dev20241122-scc24-base/llama2-70b-99/offline/performance/run_1/console.out'; echo ${PIPESTATUS[0]} > exitstatus
usage: main.py [-h] [--scenario {Offline,Server}] [--model-path MODEL_PATH]
               [--dataset-path DATASET_PATH] [--accuracy] [--dtype DTYPE]
               [--device {cpu,cuda:0}] [--audit-conf AUDIT_CONF]
               [--user-conf USER_CONF]
               [--total-sample-count TOTAL_SAMPLE_COUNT]
               [--batch-size BATCH_SIZE] [--output-log-dir OUTPUT_LOG_DIR]
               [--enable-log-trace] [--num-workers NUM_WORKERS] [--vllm]
               [--api-model-name API_MODEL_NAME] [--api-server API_SERVER]
main.py: error: argument --device: invalid choice: 'rocm' (choose from 'cpu', 'cuda:0')
CM error: Portable CM script failed (name = benchmark-program, return code = 512)
```
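From the usage text, the failure happens during argument parsing: `main.py` declares `--device` with an argparse `choices` list of `cpu` and `cuda:0`, so any other string is rejected before the benchmark starts. Also worth noting: ROCm builds of PyTorch expose GPUs through the `cuda` device string, so the value the script would ultimately need may just be `cuda:0`. Below is a minimal sketch of the restriction and a hypothetical way to widen it (`make_parser` and `extra_devices` are illustrative names, not from the MLPerf code):

```python
import argparse

def make_parser(extra_devices=()):
    """Minimal reproduction of main.py's --device restriction.

    argparse rejects any value outside `choices`, which is why
    --device rocm exits with "invalid choice" before the model loads.
    A hypothetical patch would simply extend the choices list.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--device",
        choices=["cpu", "cuda:0", *extra_devices],
        default="cpu",
    )
    return parser

# Stock parser: "rocm" is rejected (argparse calls sys.exit(2)).
# Patched parser: "rocm" is accepted.
args = make_parser(extra_devices=("rocm",)).parse_args(["--device", "rocm"])
print(args.device)  # rocm
```

Whether accepting the string is enough depends on what `main.py` then passes to `torch.device(...)`; on a ROCm PyTorch build, `cuda:0` is typically the working device string even on AMD hardware.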
Could you please advise on how to enable or fix ROCm support for this benchmark? Thanks!