Hi @arjunsuresh,
When I ran the command below to build the Docker image for GPT-J:
```
cm run script --tags=run-mlperf,inference,_find-performance,_full,_r5.0-dev \
    --model=gptj-99 \
    --implementation=nvidia \
    --framework=tensorrt \
    --category=edge \
    --scenario=Offline \
    --execution_mode=test \
    --device=cuda \
    --docker --quiet \
    --test_query_count=50
```
I got the failure below. I'm not sure whether it is related to the existing Docker image (built for ResNet50 several days earlier) or not:
```
Successfully installed tensorrt-llm
[notice] A new release of pip is available: 23.3.1 -> 24.3.1
[notice] To update, run: python3 -m pip install --upgrade pip
Initializing model from /mnt/models/GPTJ-6B/checkpoint-final
Loading checkpoint shards: 100%|██████████| 3/3 [00:16<00:00, 5.48s/it]
[TensorRT-LLM][WARNING] The manually set model data type is torch.float16, but the data type of the HuggingFace model is torch.float32.
Initializing tokenizer from /mnt/models/GPTJ-6B/checkpoint-final
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading calibration dataset
Traceback (most recent call last):
  File "/code/tensorrt_llm/examples/quantization/quantize.py", line 363, in <module>
    main(args)
  File "/code/tensorrt_llm/examples/quantization/quantize.py", line 255, in main
    calib_dataloader = get_calib_dataloader(
  File "/code/tensorrt_llm/examples/quantization/quantize.py", line 187, in get_calib_dataloader
    dataset = load_dataset("cnn_dailymail", name="3.0.0", split="train")
  File "/home/bob1/.local/lib/python3.10/site-packages/datasets/load.py", line 2129, in load_dataset
    builder_instance = load_dataset_builder(
  File "/home/bob1/.local/lib/python3.10/site-packages/datasets/load.py", line 1849, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "/home/bob1/.local/lib/python3.10/site-packages/datasets/load.py", line 1731, in dataset_module_factory
    raise e1 from None
  File "/home/bob1/.local/lib/python3.10/site-packages/datasets/load.py", line 1618, in dataset_module_factory
    raise ConnectionError(f"Couldn't reach '{path}' on the Hub ({e.__class__.__name__})") from e
ConnectionError: Couldn't reach 'cnn_dailymail' on the Hub (LocalEntryNotFoundError)
make: *** [Makefile:102: devel_run] Error 1
make: Leaving directory '/home/bob1/CM/repos/local/cache/2479e8f0ba164d4c/repo/docker'

CM error: Portable CM script failed (name = get-ml-model-gptj, return code = 256)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Note that it is often a portability issue of a third-party tool or a native
script wrapped and unified by this CM script (automation recipe).

Please re-run this script with --repro flag and report this issue with the
original command line, cm-repro directory and full log here:

https://github.com/mlcommons/cm4mlops/issues

The CM concept is to collaboratively fix such issues inside portable CM scripts
to make existing tools and native scripts more portable, interoperable
and deterministic. Thank you!
```
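The `ConnectionError ... (LocalEntryNotFoundError)` suggests that, inside the container, `datasets.load_dataset("cnn_dailymail", ...)` could not reach the Hugging Face Hub and found no local cache to fall back on. As a quick sanity check (a hypothetical diagnostic sketch, not part of the CM workflow; the host name and the suggested remedies are my assumptions), one could test whether the Hub endpoint is reachable from where the build runs:

```python
import socket

def hub_reachable(host="huggingface.co", port=443, timeout=5.0):
    """Return True if a TCP connection to the given host/port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    if hub_reachable():
        print("huggingface.co is reachable; the failure may be transient.")
    else:
        # If the container has no outbound network access, one workaround
        # (assumption, not verified against the CM scripts) is to download
        # cnn_dailymail on the host first and mount ~/.cache/huggingface
        # into the container so `datasets` can use the local cache.
        print("huggingface.co is NOT reachable from this environment.")
```

If the Hub is reachable from the host but not from the container, the issue is likely the Docker network configuration rather than the previously built ResNet50 image.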