Floating point exception on Blackwell GPU #9

@brianhuang-commits

Description

I tried to run this project on an RTX 5070 Ti under Ubuntu 22.04, and the self-play worker crashed with a floating point exception. I suspect the CUDA and PyTorch versions in use do not support the latest Blackwell GPU (sm_120).
I also tried modifying the Dockerfile to build the environment with newer package versions that support this GPU, but ran into other issues. For example, the build could not find the Atari ROMs that are included in the pre-built image (Step 7/11 : COPY atari57/ /opt/atari57 COPY failed: file not found in build context or excluded by .dockerignore: stat atari57/: file does not exist)
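One way to test the version-mismatch guess on the Python side is to compare the GPU's compute capability with the architectures the installed PyTorch build was compiled for. The sketch below is only a rough check under some assumptions: the helper names `parse_arch` and `build_supports_device` are made up for illustration, the compatibility rule is simplified (same-major `sm_` binaries or any older `compute_` PTX), and it says nothing about the libtorch that the C++ self-play binary links against.

```python
import re


def parse_arch(arch):
    """Split 'sm_120' or 'compute_90' into (kind, major, minor).

    Returns None for anything else; arch-specific variants such as
    'sm_90a' are deliberately skipped in this simplified check.
    """
    m = re.fullmatch(r"(sm|compute)_(\d+)(\d)", arch)
    if m is None:
        return None
    return m.group(1), int(m.group(2)), int(m.group(3))


def build_supports_device(capability, arch_list):
    """Rough check: can a build with `arch_list` run on `capability`?

    Assumed rule (simplified): an 'sm_XY' binary runs on a device with
    the same major version and an equal or higher minor version; a
    'compute_XY' PTX entry can be JIT-compiled for any device whose
    capability is at least X.Y.
    """
    major, minor = capability
    for arch in arch_list:
        parsed = parse_arch(arch)
        if parsed is None:
            continue
        kind, a_major, a_minor = parsed
        if kind == "sm" and a_major == major and a_minor <= minor:
            return True
        if kind == "compute" and (a_major, a_minor) <= (major, minor):
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        if not torch.cuda.is_available():
            print("CUDA is not available to this PyTorch build")
        else:
            cap = torch.cuda.get_device_capability(0)
            archs = torch.cuda.get_arch_list()
            print("torch", torch.__version__, "CUDA", torch.version.cuda)
            print("device capability:", cap)
            print("compiled arch list:", archs)
            print("supported:", build_supports_device(cap, archs))
```

Running this inside the container should show whether `sm_120` (or a compatible entry) appears in the compiled arch list. If it does not, that would be consistent with the mismatch guess, though it would not by itself prove that this is what triggers the floating point exception.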

Below is the error message of the floating point exception:

[2025/09/12_01:21:49.578] [Version] 04a589

[2025/09/12_01:21:49.578] Server initialize over.

[2025/09/12_01:21:49.578] [Iteration] =====1=====

[2025/09/12_01:21:49.578] [SelfPlay] Start 0

connect success

Info docker-desktop_0 sp

[2025/09/12_01:21:51.083] [Worker Connection] docker-desktop_0 sp

CUDA_VISIBLE_DEVICES=0 build/tictactoe/minizero_tictactoe -mode sp -conf_file tictactoe_az_1bx256_n50-04a589/tictactoe_az_1bx256_n50-04a589.cfg -conf_str "nn_file_name=tictactoe_az_1bx256_n50-04a589/model/weight_iter_0.pt:program_auto_seed=false:program_seed=581869302:zero_training_directory=tictactoe_az_1bx256_n50-04a589:zero_num_threads=4:zero_num_parallel_games=64:program_quiet=true"

[2025/09/12_01:21:51.087] [Log] docker-desktop_0 sp: CUDA_VISIBLE_DEVICES=0 build/tictactoe/minizero_tictactoe -mode sp -conf_file tictactoe_az_1bx256_n50-04a589/tictactoe_az_1bx256_n50-04a589.cfg -conf_str "nn_file_name=tictactoe_az_1bx256_n50-04a589/model/weight_iter_0.pt:program_auto_seed=false:program_seed=581869302:zero_training_directory=tictactoe_az_1bx256_n50-04a589:zero_num_threads=4:zero_num_parallel_games=64:program_quiet=true"

connect success

Info docker-desktop_0 op

[2025/09/12_01:21:52.071] [Worker Connection] docker-desktop_0 op

CUDA_VISIBLE_DEVICES=0 PYTHONPATH=. python minizero/learner/train.py tictactoe tictactoe_az_1bx256_n50-04a589 tictactoe_az_1bx256_n50-04a589/tictactoe_az_1bx256_n50-04a589.cfg

[2025/09/12_01:21:52.073] [Log] docker-desktop_0 op: CUDA_VISIBLE_DEVICES=0 PYTHONPATH=. python minizero/learner/train.py tictactoe tictactoe_az_1bx256_n50-04a589 tictactoe_az_1bx256_n50-04a589/tictactoe_az_1bx256_n50-04a589.cfg

scripts/zero-worker.sh: line 192: 995 Floating point exception(core dumped) CUDA_VISIBLE_DEVICES=${cuda_devices} ${sp_executable_file} -conf_file ${CONF_FILE} -conf_str "${CONF_STR}" -mode sp 0<&$broker_fd 1>&$broker_fd
