Save the docker with Resnet50 running successfully and load it on another system with the same config but fail to run Resnet50 #2023

Open
@Bob123Yang

Description

Hi @arjunsuresh, I have run into an issue when migrating a Docker image between systems.

On system A, I ran the command below to build the Docker image and then ran the ResNet50 inference inside the container; both succeeded. I then saved the image as docker-with-test-successfully-1.tar.

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
   --model=resnet50 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=edge \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda \
   --docker --quiet \
   --test_query_count=1000
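
The exact save and load commands are not shown in this report. For reference, a typical way to move an image between hosts is `docker commit`, `docker save` and `docker load`. The sketch below assumes the working container was first committed to an image; the container ID argument, the `run` helper and the `DRY_RUN` switch are hypothetical, and only the image name comes from this issue.

```shell
#!/bin/sh
# DRY_RUN=1 (the default in this sketch) only prints the docker commands;
# set DRY_RUN=0 to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

IMAGE="docker-with-test-successfully-1"   # image name used in this issue

# System A: snapshot the container ($1, a placeholder ID) and export it.
save_image() {
    run docker commit "$1" "${IMAGE}:latest"
    run docker save -o "${IMAGE}.tar" "${IMAGE}:latest"
}

# System B: import the tarball after copying it over.
load_image() {
    run docker load -i "${IMAGE}.tar"
}
```

Calling `save_image <container-id>` on system A and `load_image` on system B with the default `DRY_RUN=1` prints the commands that would be run.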

I then loaded the image on another system, B, which has almost the same configuration, and ran the ResNet50 inference again as shown below, but it failed with the following log. Are there any limitations on Docker migration that I should be aware of?

bob@Bob-Tomcat-Product:~$ docker images
REPOSITORY                        TAG       IMAGE ID       CREATED      SIZE
docker-with-test-successfully-1   latest    2b63d4ccc258   9 days ago   35.5GB
bob@Bob-Tomcat-Product:~$ docker run -it docker-with-test-successfully-1:latest /bin/bash
cmuser@d37b940a1f0a:~$ ls
CM  cm-run-script-versions.json  configs  hardware  version_info.json
cmuser@d37b940a1f0a:~$    cm run script --tags=run-mlperf,inference,_r4.1-dev \
>    --model=resnet50 \
>    --implementation=nvidia \
>    --framework=tensorrt \
>    --category=edge \
>    --scenario=Offline \
>    --execution_mode=valid \
>    --device=cuda \
>    --division=closed \
>    --rerun \
>    --quiet
INFO:root:* cm run script "run-mlperf inference _r4.1-dev"
INFO:root:  * cm run script "detect os"
INFO:root:         ! cd /home/cmuser
INFO:root:         ! call /home/cmuser/CM/repos/mlcommons@cm4mlops/script/detect-os/run.sh from tmp-run.sh
INFO:root:         ! call "postprocess" from /home/cmuser/CM/repos/mlcommons@cm4mlops/script/detect-os/customize.py
INFO:root:  * cm run script "detect cpu"
INFO:root:    * cm run script "detect os"
INFO:root:           ! cd /home/cmuser
INFO:root:           ! call /home/cmuser/CM/repos/mlcommons@cm4mlops/script/detect-os/run.sh from tmp-run.sh
INFO:root:           ! call "postprocess" from /home/cmuser/CM/repos/mlcommons@cm4mlops/script/detect-os/customize.py
INFO:root:         ! cd /home/cmuser
INFO:root:         ! call /home/cmuser/CM/repos/mlcommons@cm4mlops/script/detect-cpu/run.sh from tmp-run.sh
INFO:root:         ! call "postprocess" from /home/cmuser/CM/repos/mlcommons@cm4mlops/script/detect-cpu/customize.py
INFO:root:  * cm run script "get python3"
INFO:root:       ! load /home/cmuser/CM/repos/local/cache/bba8cf8097b64518/cm-cached-state.json
INFO:root:Path to Python: /usr/bin/python3
INFO:root:Python version: 3.8.10
INFO:root:  * cm run script "get mlcommons inference src"
INFO:root:       ! load /home/cmuser/CM/repos/local/cache/21f79a83541549b7/cm-cached-state.json
INFO:root:  * cm run script "get sut description"
INFO:root:    * cm run script "detect os"
INFO:root:           ! cd /home/cmuser
INFO:root:           ! call /home/cmuser/CM/repos/mlcommons@cm4mlops/script/detect-os/run.sh from tmp-run.sh
INFO:root:           ! call "postprocess" from /home/cmuser/CM/repos/mlcommons@cm4mlops/script/detect-os/customize.py
INFO:root:    * cm run script "detect cpu"
INFO:root:      * cm run script "detect os"
INFO:root:             ! cd /home/cmuser
INFO:root:             ! call /home/cmuser/CM/repos/mlcommons@cm4mlops/script/detect-os/run.sh from tmp-run.sh
INFO:root:             ! call "postprocess" from /home/cmuser/CM/repos/mlcommons@cm4mlops/script/detect-os/customize.py
INFO:root:           ! cd /home/cmuser
INFO:root:           ! call /home/cmuser/CM/repos/mlcommons@cm4mlops/script/detect-cpu/run.sh from tmp-run.sh
INFO:root:           ! call "postprocess" from /home/cmuser/CM/repos/mlcommons@cm4mlops/script/detect-cpu/customize.py
INFO:root:    * cm run script "get python3"
INFO:root:         ! load /home/cmuser/CM/repos/local/cache/bba8cf8097b64518/cm-cached-state.json
INFO:root:Path to Python: /usr/bin/python3
INFO:root:Python version: 3.8.10
INFO:root:    * cm run script "get compiler"
INFO:root:         ! load /home/cmuser/CM/repos/local/cache/6285b87ff0f74d8a/cm-cached-state.json
INFO:root:    * cm run script "get cuda-devices _with-pycuda"
INFO:root:      * cm run script "get cuda _toolkit"
INFO:root:           ! load /home/cmuser/CM/repos/local/cache/b5a3a8af88c14cc7/cm-cached-state.json
INFO:root:ENV[CM_CUDA_PATH_LIB_CUDNN_EXISTS]: no
INFO:root:ENV[CM_CUDA_VERSION]: 12.2
INFO:root:ENV[CM_CUDA_VERSION_STRING]: cu122
INFO:root:ENV[CM_NVCC_BIN_WITH_PATH]: /usr/local/cuda/bin/nvcc
INFO:root:ENV[CUDA_HOME]: /usr/local/cuda
INFO:root:      * cm run script "get python3"
INFO:root:           ! load /home/cmuser/CM/repos/local/cache/bba8cf8097b64518/cm-cached-state.json
INFO:root:Path to Python: /usr/bin/python3
INFO:root:Python version: 3.8.10
INFO:root:      * cm run script "get generic-python-lib _package.pycuda"
INFO:root:        * cm run script "get python3"
INFO:root:             ! load /home/cmuser/CM/repos/local/cache/bba8cf8097b64518/cm-cached-state.json
INFO:root:Path to Python: /usr/bin/python3
INFO:root:Python version: 3.8.10
INFO:root:             ! cd /home/cmuser
INFO:root:             ! call /home/cmuser/CM/repos/mlcommons@cm4mlops/script/get-generic-python-lib/validate_cache.sh from tmp-run.sh
INFO:root:             ! call "detect_version" from /home/cmuser/CM/repos/mlcommons@cm4mlops/script/get-generic-python-lib/customize.py
            Detected version: 2022.2.2
INFO:root:        * cm run script "get python3"
INFO:root:             ! load /home/cmuser/CM/repos/local/cache/bba8cf8097b64518/cm-cached-state.json
INFO:root:Path to Python: /usr/bin/python3
INFO:root:Python version: 3.8.10
INFO:root:           ! load /home/cmuser/CM/repos/local/cache/a29ea6efe3564a4b/cm-cached-state.json
INFO:root:      * cm run script "get generic-python-lib _package.numpy"
INFO:root:        * cm run script "get python3"
INFO:root:             ! load /home/cmuser/CM/repos/local/cache/bba8cf8097b64518/cm-cached-state.json
INFO:root:Path to Python: /usr/bin/python3
INFO:root:Python version: 3.8.10
INFO:root:             ! cd /home/cmuser
INFO:root:             ! call /home/cmuser/CM/repos/mlcommons@cm4mlops/script/get-generic-python-lib/validate_cache.sh from tmp-run.sh
INFO:root:             ! call "detect_version" from /home/cmuser/CM/repos/mlcommons@cm4mlops/script/get-generic-python-lib/customize.py
            Detected version: 1.23.5
INFO:root:        * cm run script "get python3"
INFO:root:             ! load /home/cmuser/CM/repos/local/cache/bba8cf8097b64518/cm-cached-state.json
INFO:root:Path to Python: /usr/bin/python3
INFO:root:Python version: 3.8.10
INFO:root:           ! load /home/cmuser/CM/repos/local/cache/19ca7b3b57a74cd2/cm-cached-state.json
INFO:root:           ! cd /home/cmuser
INFO:root:           ! call /home/cmuser/CM/repos/mlcommons@cm4mlops/script/get-cuda-devices/detect.sh from tmp-run.sh
Traceback (most recent call last):
  File "/home/cmuser/CM/repos/mlcommons@cm4mlops/script/get-cuda-devices/detect.py", line 1, in <module>
    import pycuda.driver as cuda
  File "/home/cmuser/.local/lib/python3.8/site-packages/pycuda/driver.py", line 66, in <module>
    from pycuda._driver import *  # noqa
ImportError: /lib/x86_64-linux-gnu/libcuda.so.1: file too short

CM error: Portable CM script failed (name = get-cuda-devices, return code = 256)


^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Note that it is often a portability issue of a third-party tool or a native script
wrapped and unified by this CM script (automation recipe). Please re-run
this script with --repro flag and report this issue with the original
command line, cm-repro directory and full log here:

https://github.com/mlcommons/cm4mlops/issues

The CM concept is to collaboratively fix such issues inside portable CM scripts
to make existing tools and native scripts more portable, interoperable
and deterministic. Thank you!
cmuser@d37b940a1f0a:~$ 
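
The dynamic loader reports "file too short" when a shared library file is smaller than a valid ELF header, for example a zero-byte file. A minimal sketch for checking this inside the container follows; the `check_lib` helper is hypothetical, and the default path is the one from the ImportError above.

```shell
#!/bin/sh
# Report whether a shared library file is missing, empty, or non-empty.
# With no argument, the libcuda path from the ImportError is checked.
check_lib() {
    lib="${1:-/lib/x86_64-linux-gnu/libcuda.so.1}"
    if [ ! -e "$lib" ]; then
        echo "missing"      # no such file, or a dangling symlink
    elif [ ! -s "$lib" ]; then
        echo "empty"        # zero bytes: the loader cannot read an ELF header
    else
        echo "ok"           # non-empty; the problem may lie elsewhere
    fi
}

check_lib
```

If this prints `empty` or `missing`, one thing to try is starting the container with GPU access, for example `docker run --gpus all -it docker-with-test-successfully-1:latest /bin/bash`. This is an assumption rather than a confirmed fix: the `docker run` command shown in the log does not request GPUs, and `--gpus` requires the NVIDIA Container Toolkit on the host.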
