
Failure to build the GPT-J Docker image after successful installation of tensorrt-llm #2022

Open
@Bob123Yang

Description

Hi @arjunsuresh

When I ran the command below to build the Docker image for GPT-J:

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r5.0-dev \
    --model=gptj-99 \
    --implementation=nvidia \
    --framework=tensorrt \
    --category=edge \
    --scenario=Offline \
    --execution_mode=test \
    --device=cuda \
    --docker --quiet \
    --test_query_count=50

I got the failure shown below. I'm not sure whether it is related to the existing Docker image (built for ResNet50 several days earlier) or not.

Successfully installed tensorrt-llm

[notice] A new release of pip is available: 23.3.1 -> 24.3.1
[notice] To update, run: python3 -m pip install --upgrade pip
Initializing model from /mnt/models/GPTJ-6B/checkpoint-final
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:16<00:00,  5.48s/it]
[TensorRT-LLM][WARNING] The manually set model data type is torch.float16, but the data type of the HuggingFace model is torch.float32.
Initializing tokenizer from /mnt/models/GPTJ-6B/checkpoint-final
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading calibration dataset
Traceback (most recent call last):
  File "/code/tensorrt_llm/examples/quantization/quantize.py", line 363, in <module>
    main(args)
  File "/code/tensorrt_llm/examples/quantization/quantize.py", line 255, in main
    calib_dataloader = get_calib_dataloader(
  File "/code/tensorrt_llm/examples/quantization/quantize.py", line 187, in get_calib_dataloader
    dataset = load_dataset("cnn_dailymail", name="3.0.0", split="train")
  File "/home/bob1/.local/lib/python3.10/site-packages/datasets/load.py", line 2129, in load_dataset
    builder_instance = load_dataset_builder(
  File "/home/bob1/.local/lib/python3.10/site-packages/datasets/load.py", line 1849, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "/home/bob1/.local/lib/python3.10/site-packages/datasets/load.py", line 1731, in dataset_module_factory
    raise e1 from None
  File "/home/bob1/.local/lib/python3.10/site-packages/datasets/load.py", line 1618, in dataset_module_factory
    raise ConnectionError(f"Couldn't reach '{path}' on the Hub ({e.__class__.__name__})") from e
ConnectionError: Couldn't reach 'cnn_dailymail' on the Hub (LocalEntryNotFoundError)
make: *** [Makefile:102: devel_run] Error 1
make: Leaving directory '/home/bob1/CM/repos/local/cache/2479e8f0ba164d4c/repo/docker'

CM error: Portable CM script failed (name = get-ml-model-gptj, return code = 256)


^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Note that it is often a portability issue of a third-party tool or a native script
wrapped and unified by this CM script (automation recipe). Please re-run
this script with --repro flag and report this issue with the original
command line, cm-repro directory and full log here:

https://github.com/mlcommons/cm4mlops/issues

The CM concept is to collaboratively fix such issues inside portable CM scripts
to make existing tools and native scripts more portable, interoperable
and deterministic. Thank you!
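
For reference, the ConnectionError (LocalEntryNotFoundError) appears to mean that the container could neither reach the Hugging Face Hub nor find a cached copy of the calibration dataset. A minimal sketch to check the failing call in isolation, assuming the datasets package is installed in the environment where it is run:

    # Reproduces the load_dataset call from quantize.py (line 187 in the traceback).
    from datasets import load_dataset

    dataset = load_dataset("cnn_dailymail", name="3.0.0", split="train")
    print(dataset)

If this works on the host but fails inside the container, one possible workaround is to pre-download the dataset on the host and mount the Hugging Face cache into the container (its location is controlled by the HF_DATASETS_CACHE or HF_HOME environment variables); I have not verified whether the CM docker flow already does this.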
