Segfault in new_episode() during multi-agent asynchronous training #693

@lmBored

Description

Hi, I often get segfaults (signal 11) during game.new_episode() when running multiple multi-agent environments asynchronously. The crash happens in the native ViZDoom process (not in Python); on the Python side it surfaces as ViZDoomErrorException: Unexpected ViZDoom instance crash.

Specs

  • ViZDoom: 1.3.0
  • Python: 3.11.14 (CPython)
  • OS: Rocky Linux 8.10 (Green Obsidian)
  • Kernel: 4.18.0-553.104.1.el8_10.x86_64
  • glibc: 2.28
  • Architecture: x86_64
  • GPU: NVIDIA (CUDA 11.8, not used by ViZDoom itself)
  • Display: Headless (no X server, window_visible = false)

Env details

  • I use Sample Factory
  • Each multiplayer "game" consists of 2 ViZDoom processes (host + join) communicating via UDP on localhost
  • 8 such games run in threads within each worker process
  • Each ViZDoom instance is a separate OS process spawned by game.init()
  • UDP ports are allocated sequentially: 41300, 41301, ..., 41307 for worker 0; 41400, ..., 41407 for worker 1
  • Scenario: custom multiplayer WAD
  • 32 ViZDoom processes in total

Config:
doom_scenario_path = asdf.wad
living_reward = 0
screen_resolution = RES_320X240
screen_format = CRCGCB
render_hud = true
render_crosshair = true
render_weapon = true
render_decals = false
render_particles = false
window_visible = false
episode_timeout = 3500
available_buttons = { MOVE_FORWARD MOVE_BACKWARD MOVE_RIGHT MOVE_LEFT TURN_LEFT TURN_RIGHT ATTACK SELECT_WEAPON1 SELECT_WEAPON2 }
available_game_variables = { AMMO1 AMMO2 HEALTH USER1 KILLCOUNT HITCOUNT WEAPON1 WEAPON2 POSITION_X POSITION_Y }
mode = PLAYER
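
For reference, the port layout and process count described under "Env details" can be sketched as below. This is only an illustration of the numbers in the report, not the actual Sample Factory code: the function names are hypothetical, the stride of 100 ports per worker is read off the 41300/41400 example, and the worker count of 2 is inferred from 32 processes = 2 workers × 8 games × 2 processes. The `udp_port_is_free` helper is an extra diagnostic idea (checking for a port collision before game.init()), not something the report says was done.

```python
import socket

BASE_PORT = 41300         # first port of worker 0 (from the report)
PORTS_PER_WORKER = 100    # assumed stride: worker 1 starts at 41400
GAMES_PER_WORKER = 8      # multiplayer games per worker (from the report)
PROCESSES_PER_GAME = 2    # host + join (from the report)


def game_port(worker_idx: int, game_idx: int) -> int:
    """Return the UDP port for one multiplayer game (hypothetical helper)."""
    if not 0 <= game_idx < GAMES_PER_WORKER:
        raise ValueError(f"game_idx must be in [0, {GAMES_PER_WORKER})")
    return BASE_PORT + worker_idx * PORTS_PER_WORKER + game_idx


def total_processes(num_workers: int) -> int:
    """Number of ViZDoom OS processes across all workers."""
    return num_workers * GAMES_PER_WORKER * PROCESSES_PER_GAME


def udp_port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Best-effort check that a UDP port can be bound on localhost."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True
```

With these assumptions, `game_port(1, 7)` gives 41407 and `total_processes(2)` gives 32, matching the numbers above.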

Crash details

I ran training three times and hit the same issue every time. The crash usually happens once terminated=True for all agents and game.new_episode() is called to reset the environment.
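
To narrow down which instance dies at reset, the new_episode() call could be wrapped so that a crash is re-raised with its context attached. The following is a minimal sketch, not code from the wrappers in the traceback: `ResetCrash`, `new_episode_with_context`, and all of its keyword arguments are hypothetical names. The exception type is passed in as a parameter so that the helper does not import vizdoom itself; in real use it would be `vizdoom.ViZDoomErrorException`.

```python
import time


class ResetCrash(RuntimeError):
    """Hypothetical error type that carries reset context for logging."""


def new_episode_with_context(game, crash_exc, *, worker_idx, game_idx, port, episode_idx):
    """Call game.new_episode() and re-raise a crash with identifying context.

    `game` is any object with a new_episode() method (e.g. a DoomGame);
    `crash_exc` is the exception class that signals an instance crash.
    """
    started = time.monotonic()
    try:
        game.new_episode()
    except crash_exc as exc:
        elapsed = time.monotonic() - started
        # Chain the original exception so the native error message is kept.
        raise ResetCrash(
            f"new_episode() crashed: worker={worker_idx} game={game_idx} "
            f"port={port} episode={episode_idx} after {elapsed:.3f}s"
        ) from exc
```

Logging the worker, game index, and port on each crash would show whether the failures cluster on particular ports or on the host or join side.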

Run 1

*** Fatal Error ***
Address not mapped to object (signal 11)
Address: 0x30

Run 2

*** Fatal Error ***
Segmentation fault (signal 11)
Address: (nil)

Run 3

*** Fatal Error ***
Segmentation fault (signal 11)
Address: (nil)

Python traceback

All three crashes have the same traceback:

Exception in thread Thread-N (start):
Traceback (most recent call last):
  File ".../threading.py", line 1045, in _bootstrap_inner
    self.run()
  File ".../threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File ".../doom_multiagent_wrapper.py", line 160, in start
    results = env.reset(**data) if data else env.reset()
                                             ^^^^^^^^^^^
  File ".../gymnasium/core.py", line 515, in reset
    obs, info = self.env.reset(seed=seed, options=options)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../wrappers/scenario_wrappers/armory_siege.py", line 114, in reset
    obs, info = self.env.reset(**kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../gymnasium/core.py", line 515, in reset
    obs, info = self.env.reset(seed=seed, options=options)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../env_wrappers.py", line 115, in reset
    return self.env.reset(**kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../env_wrappers.py", line 83, in reset
    obs, info = self.env.reset(**kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../gymnasium/core.py", line 467, in reset
    return self.env.reset(seed=seed, options=options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../doom_multiagent.py", line 140, in reset
    obs, info = super().reset(**kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^
  File ".../doom_gym.py", line 346, in reset
    self.game.new_episode()
vizdoom.vizdoom.ViZDoomErrorException: Unexpected ViZDoom instance crash.

vizdoom-crash.log

I think the crash likely originates at 0x0000000000491398, the first frame below the signal handler in the Thread 1 backtrace (frame #3):

*** Fatal Error ***
Segmentation fault (signal 11)
Address: (nil)

System: Linux asdf.cluster 4.18.0-553.104.1.el8_10.x86_64 #1 SMP Fri Feb 13 15:51:56 UTC 2026 x86_64 x86_64 x86_64 GNU/Linux

Executing: gdb --quiet --batch --command=gdb-respfile-CngZn6
[New LWP 2425036]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x0000155554dbf312 in waitpid () from /lib64/libpthread.so.0

* Loaded Libraries
From                To                  Syms Read   Shared Object Library
0x0000155554fe91d0  0x00001555550bec72  Yes (*)     /vast.mnt/home/20231193/ViZDoom/.venv311/lib/python3.11/site-packages/vizdoom/../vizdoom.libs/libSDL2-2-d214786e.0.so.0.10.0
0x0000155554db3840  0x0000155554dc1985  Yes (*)     /lib64/libpthread.so.0
0x0000155554ba7430  0x0000155554baa770  Yes (*)     /lib64/librt.so.1
0x000015555498f740  0x000015555499cb27  Yes (*)     /lib64/libz.so.1
0x000015555475f880  0x000015555476e667  Yes (*)     /vast.mnt/home/20231193/ViZDoom/.venv311/lib/python3.11/site-packages/vizdoom/../vizdoom.libs/libboost_thread-473b539f.so.1.66.0
0x0000155554547c50  0x0000155554548df8  Yes (*)     /vast.mnt/home/20231193/ViZDoom/.venv311/lib/python3.11/site-packages/vizdoom/../vizdoom.libs/libboost_system-69a6c43e.so.1.66.0
0x0000155554336fe0  0x000015555433bf92  Yes (*)     /vast.mnt/home/20231193/ViZDoom/.venv311/lib/python3.11/site-packages/vizdoom/../vizdoom.libs/libboost_date_time-bfc31cd3.so.1.66.0
0x0000155554124960  0x0000155554127700  Yes (*)     /vast.mnt/home/20231193/ViZDoom/.venv311/lib/python3.11/site-packages/vizdoom/../vizdoom.libs/libboost_chrono-67089fad.so.1.66.0
0x0000155553f1e760  0x0000155553f1e8b5  Yes (*)     /vast.mnt/home/20231193/ViZDoom/.venv311/lib/python3.11/site-packages/vizdoom/../vizdoom.libs/libboost_atomic-fb1368c6.so.1.66.0
0x0000155553d1ae70  0x0000155553d1ba82  Yes (*)     /lib64/libdl.so.2
0x0000155553a14b90  0x0000155553acaae2  Yes (*)     /lib64/libstdc++.so.6
0x000015555360f520  0x00001555536ae80a  Yes (*)     /lib64/libm.so.6
0x00001555533ede00  0x00001555533fea55  Yes (*)     /lib64/libgcc_s.so.1
0x0000155553036cc0  0x0000155553192f0d  Yes (*)     /lib64/libc.so.6
0x0000155555325080  0x0000155555349447  Yes         /lib64/ld-linux-x86-64.so.2
0x0000155552d85b30  0x0000155552de7844  Yes (*)     /lib64/libudev.so.1
0x0000155552b2cab0  0x0000155552b63012  Yes (*)     /lib64/libmount.so.1
0x00001555528d6740  0x00001555529080f2  Yes (*)     /lib64/libblkid.so.1
0x00001555526c4ac0  0x00001555526c87e1  Yes (*)     /lib64/libuuid.so.1
0x000015555249fa80  0x00001555524b783f  Yes (*)     /lib64/libselinux.so.1
0x0000155552216380  0x0000155552271973  Yes (*)     /lib64/libpcre2-8.so.0
(*): Shared library is missing debugging information.

* Threads
  Id   Target Id                                      Frame 
* 1    Thread 0x155555536740 (LWP 2425034) "vizdoom"  0x0000155554dbf312 in waitpid () from /lib64/libpthread.so.0
  2    Thread 0x155552213700 (LWP 2425036) "SDLTimer" 0x0000155554dbda46 in do_futex_wait.constprop () from /lib64/libpthread.so.0

* FPU Status
  R7: Empty   0x00000000000000000000
  R6: Empty   0x00000000000000000000
  R5: Empty   0x00000000000000000000
  R4: Empty   0x00000000000000000000
  R3: Empty   0x00000000000000000000
  R2: Empty   0x00000000000000000000
  R1: Empty   0x00000000000000000000
=>R0: Empty   0x00000000000000000000

Status Word:         0x0000                                            
                       TOP: 0
Control Word:        0x037f   IM DM ZM OM UM PM
                       PC: Extended Precision (64-bits)
                       RC: Round to nearest
Tag Word:            0xffff
Instruction Pointer: 0x00:0x00000000
Operand Pointer:     0x00:0x00000000
Opcode:              0x0000

* Registers
rax            0xfffffffffffffe00  -512
rbx            0x251413            2429971
rcx            0x155554dbf312      23456240104210
rdx            0x0                 0
rsi            0x966c74            9858164
rdi            0x251413            2429971
rbp            0x966c74            0x966c74
rsp            0x966c40            0x966c40
r8             0x0                 0
r9             0x5                 5
r10            0x0                 0
r11            0x246               582
r12            0x0                 0
r13            0x9648a0            9848992
r14            0xc                 12
r15            0x1090              4240
rip            0x155554dbf312      0x155554dbf312 <waitpid+82>
eflags         0x246               [ PF ZF IF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
k0             0x0                 0
k1             0x0                 0
k2             0x0                 0
k3             0x0                 0
k4             0x0                 0
k5             0x0                 0
k6             0x0                 0
k7             0x0                 0

* Backtrace

Thread 2 (Thread 0x155552213700 (LWP 2425036)):
#0  0x0000155554dbda46 in do_futex_wait.constprop () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x0000155554dbdb38 in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
No symbol table info available.
#2  0x00001555550b6f22 in SDL_SemWait_REAL () from /vast.mnt/home/20231193/ViZDoom/.venv311/lib/python3.11/site-packages/vizdoom/../vizdoom.libs/libSDL2-2-d214786e.0.so.0.10.0
No symbol table info available.
#3  0x00001555550b7065 in SDL_SemWaitTimeout_REAL () from /vast.mnt/home/20231193/ViZDoom/.venv311/lib/python3.11/site-packages/vizdoom/../vizdoom.libs/libSDL2-2-d214786e.0.so.0.10.0
No symbol table info available.
#4  0x000015555503be4a in SDL_TimerThread () from /vast.mnt/home/20231193/ViZDoom/.venv311/lib/python3.11/site-packages/vizdoom/../vizdoom.libs/libSDL2-2-d214786e.0.so.0.10.0
No symbol table info available.
#5  0x000015555503b8c0 in SDL_RunThread () from /vast.mnt/home/20231193/ViZDoom/.venv311/lib/python3.11/site-packages/vizdoom/../vizdoom.libs/libSDL2-2-d214786e.0.so.0.10.0
No symbol table info available.
#6  0x00001555550b6bed in RunThread () from /vast.mnt/home/20231193/ViZDoom/.venv311/lib/python3.11/site-packages/vizdoom/../vizdoom.libs/libSDL2-2-d214786e.0.so.0.10.0
No symbol table info available.
#7  0x0000155554db51ca in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#8  0x000015555304e953 in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 1 (Thread 0x155555536740 (LWP 2425034)):
#0  0x0000155554dbf312 in waitpid () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00000000004404fc in ?? ()
No symbol table info available.
#2  <signal handler called>
No symbol table info available.
#3  0x0000000000491398 in ?? ()
No symbol table info available.
#4  0x0000000000491de1 in ?? ()
No symbol table info available.
#5  0x0000000000494b45 in ?? ()
No symbol table info available.
#6  0x0000000000494c6e in ?? ()
No symbol table info available.
#7  0x0000000000569f79 in ?? ()
No symbol table info available.
#8  0x00000000004a4146 in ?? ()
No symbol table info available.
#9  0x000000000047f4fe in ?? ()
No symbol table info available.
#10 0x00000000004763af in ?? ()
No symbol table info available.
#11 0x0000000000478ca0 in ?? ()
No symbol table info available.
#12 0x00000000004204fc in ?? ()
No symbol table info available.
#13 0x000015555304f865 in __libc_start_main () from /lib64/libc.so.6
No symbol table info available.
#14 0x000000000043f76e in ?? ()
No symbol table info available.
[Inferior 1 (process 2425034) detached]
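
The faulting address mentioned above is simply the first numbered frame that follows `<signal handler called>` in the Thread 1 backtrace. When comparing several crash logs, that frame can be pulled out automatically. The sketch below assumes the gdb output format shown in this log; `faulting_frame` is a hypothetical helper name.

```python
import re

# Matches gdb frame lines such as "#3  0x0000000000491398 in ?? ()".
FRAME_RE = re.compile(r"^#(\d+)\s+(0x[0-9a-fA-F]+) in (\S+)")


def faulting_frame(backtrace_text):
    """Return (frame_no, address, symbol) of the first frame after the
    signal handler marker, or None if no such frame is found."""
    after_handler = False
    for raw_line in backtrace_text.splitlines():
        line = raw_line.strip()
        if "<signal handler called>" in line:
            after_handler = True
            continue
        if after_handler:
            match = FRAME_RE.match(line)
            if match:
                return int(match.group(1)), match.group(2), match.group(3)
    return None


sample = """\
#1  0x00000000004404fc in ?? ()
No symbol table info available.
#2  <signal handler called>
No symbol table info available.
#3  0x0000000000491398 in ?? ()
No symbol table info available.
#4  0x0000000000491de1 in ?? ()
"""
print(faulting_frame(sample))
```

Because the binary has no symbols, the address only becomes meaningful when compared across runs of the same build or resolved against a build with debug information.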
