The expected behavior is for the training to go error-free, just like in synchronous training
root@025575bc0d1646e18ff8810e88261f03000000:/# bpslaunch
BytePS launching server
Command: gdb -ex 'run' -ex 'bt' -batch --args python3.8 -c 'import byteps.server'
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[2021-01-18 03:00:55.980567: I byteps/common/compressor/compressor_registry.cc:28] dithering_compressor compressor is registered
[2021-01-18 03:00:55.980615: I byteps/common/compressor/compressor_registry.cc:28] onebit_compressor compressor is registered
[2021-01-18 03:00:55.980635: I byteps/common/compressor/compressor_registry.cc:28] randomk_compressor compressor is registered
[2021-01-18 03:00:55.980650: I byteps/common/compressor/compressor_registry.cc:28] topk_compressor compressor is registered
[2021-01-18 03:00:55.980663: I byteps/common/compressor/compressor_registry.cc:28] vanilla_ef compressor is registered
[03:00:55] byteps/server/server.cc:419: BytePS server is enabled asynchronous training
[03:00:55] byteps/server/server.cc:430: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[New Thread 0x7f0da1300700 (LWP 27)]
[03:00:55] src/./zmq_van.h:61: BYTEPS_ZMQ_MAX_SOCKET set to 1024
[03:00:55] src/./zmq_van.h:66: BYTEPS_ZMQ_NTHREADS set to 4
[New Thread 0x7f0da0aff700 (LWP 28)]
[New Thread 0x7f0da02fe700 (LWP 29)]
[New Thread 0x7f0d9fafd700 (LWP 30)]
[New Thread 0x7f0d9f2fc700 (LWP 31)]
[New Thread 0x7f0d9eafb700 (LWP 32)]
[New Thread 0x7f0d9e2fa700 (LWP 33)]
[[03:00:55] src/van.cc:389: Bind to role=server, ip=10.0.0.4, port=33261, is_recovery=0
03:00:55] src/./zmq_van.h:287: Start ZMQ recv thread
[New Thread 0x7f0d9daf9700 (LWP 34)]
[03:00:55] src/van.cc:478: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=server, ip=10.0.0.4, port=33261, is_recovery=0 } }. THIS IS NOT DATA MSG!
[03:01:26] src/van.cc:503: 1 => 2147483647. Meta: request=0, timestamp=4, control={ cmd=ADD_NODE, node={ role=worker, id=9, ip=10.0.0.8, port=60029, is_recovery=0 role=worker, id=11, ip=10.0.0.6, port=39183, is_recovery=0 role=worker, id=13, ip=10.0.0.5, port=60011, is_recovery=0 role=server, id=8, ip=10.0.0.4, port=33261, is_recovery=0 role=worker, id=15, ip=10.0.0.4, port=52559, is_recovery=0 role=scheduler, id=1, ip=10.0.0.4, port=12345, is_recovery=0 } }. THIS IS NOT DATA MSG!
[03:01:26] src/van.cc:338: S[8] is connected to others
[New Thread 0x7f0d9d2f8700 (LWP 35)]
[03:01:26] src/van.cc:478: ? => 1. Meta: request=1, timestamp=1, control={ cmd=BARRIER, barrier_group=7 }. THIS IS NOT DATA MSG!
[Thread 0x7f0d9d2f8700 (LWP 35) exited]
[03:01:27] src/van.cc:503: 1 => 8. Meta: request=0, timestamp=10, control={ cmd=BARRIER, barrier_group=0 }. THIS IS NOT DATA MSG!
[03:01:27] src/van.cc:478: ? => 1. Meta: request=1, timestamp=2, control={ cmd=BARRIER, barrier_group=7 }. THIS IS NOT DATA MSG!
[03:01:27] src/van.cc:503: 15 => 8. Meta: request=1, timestamp=0, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=140364908230080, data_type={ UINT64 OTHER INT32 } Body: data_size=8 data_size=40 data_size=4
[03:01:27] src/van.cc:503: 9 => 8. Meta: request=1, timestamp=0, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=139792355468736, data_type={ UINT64 OTHER INT32 } Body: data_size=8 data_size=40 data_size=4
[03:01:27] src/van.cc:503: 11 => 8. Meta: request=1, timestamp=0, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=140285773407680, data_type={ UINT64 OTHER INT32 } Body: data_size=8 data_size=40 data_size=4
[03:01:27] src/van.cc:503: 13 => 8. Meta: request=1, timestamp=0, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=139663003149760, data_type={ UINT64 OTHER INT32 } Body: data_size=8 data_size=40 data_size=4
[New Thread 0x7f0d9caf7700 (LWP 36)]
[New Thread 0x7f0d7ffff700 (LWP 37)]
[New Thread 0x7f0d7f7fe700 (LWP 38)]
[03:01:27] src/van.cc:478: ? => 15. Meta: request=0, timestamp=0, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=140364908230080
[03:01:27] src/van.cc:478: ? => 9. Meta: request=0, timestamp=0, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=139792355468736
[03:01:27] src/van.cc:478: ? => 11. Meta: request=0, timestamp=0, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=140285773407680
[03:01:27] src/van.cc:478: ? => 13. Meta: request=0, timestamp=0, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=139663003149760
[03:01:27] src/van.cc:503: 11 => 8. Meta: request=1, timestamp=1, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=0, data_type={ UINT64 OTHER INT32 } Body: data_size=8 data_size=40 data_size=4
[03:01:27] src/van.cc:503: 15 => 8. Meta: request=1, timestamp=1, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=0, data_type={ UINT64 OTHER INT32 } Body: data_size=8 data_size=40 data_size=4
[03:01:27] src/van.cc:478: ? => 11. Meta: request=0, timestamp=1, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=0
[03:01:27] src/van.cc:478: ? => 15. Meta: request=0, timestamp=1, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=0
[03:01:27] src/van.cc:503: 13 => 8. Meta: request=1, timestamp=1, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=0, data_type={ UINT64 OTHER INT32 } Body: data_size=8 data_size=40 data_size=4
[03:01:27] src/van.cc:503: 9 => 8. Meta: request=1, timestamp=1, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=0, data_type={ UINT64 OTHER INT32 } Body: data_size=8 data_size=40 data_size=4
[03:01:27] src/van.cc:478: ? => 13. Meta: request=0, timestamp=1, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=0
[03:01:27] src/van.cc:478: ? => 9. Meta: request=0, timestamp=1, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=0
[03:01:27] src/van.cc:503: 15 => 8. Meta: request=1, timestamp=2, app_id=0, customer_id=0, simple_app=0, push=0, head=0, key=0, data_type={ UINT64 OTHER } Body: data_size=8 data_size=0
[03:01:27] src/van.cc:503: 11 => 8. Meta: request=1, timestamp=2, app_id=0, customer_id=0, simple_app=0, push=0, head=0, key=0, data_type={ UINT64 OTHER } Body: data_size=8 data_size=0
[03:01:27] 3rdparty/ps-lite/include/dmlc/logging.h:276: [03:01:27] byteps/server/server.cc:52: Check failed: updates.merged.tensor init 524288 first
Stack trace returned 9 entries:
[bt] (0) /usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x229fc) [0x7f0da25bc9fc]
[bt] (1) /usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x22e3d) [0x7f0da25bce3d]
[bt] (2) /usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(byteps::server::SendPullResponse(byteps::server::DataHandleType, unsigned long, ps::KVMeta const&, ps::KVServer<char>*)+0x2b2) [0x7f0da25b6472]
[bt] (3) /usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(byteps::server::BytePSHandler(ps::KVMeta const&, ps::KVPairs<char> const&, ps::KVServer<char>*)+0x8f2) [0x7f0da25b8562]
[bt] (4) /usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x24b8e) [0x7f0da25beb8e]
[bt] (5) /usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x41d89) [0x7f0da25dbd89]
[bt] (6) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df) [0x7f0da1c9a6df]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f0da39046db]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f0da3c3da3f]
[terminate called after throwing an instance of '03:01:27] src/van.cc:dmlc::Error'
503: what(): 13 => 8. Meta: request=1, timestamp=2, app_id=0, customer_id=0, simple_app=0, push=0, head=0, key=0, data_type={ UINT64 OTHER } Body: data_size=8 data_size=0
[03:01:27] byteps/server/server.cc:52: Check failed: updates.merged.tensor init 524288 first
Stack trace returned 9 entries:
[bt] (0) /usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x229fc) [0x7f0da25bc9fc]
[bt] (1) /usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x22e3d) [0x7f0da25bce3d]
[bt] (2) /usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(byteps::server::SendPullResponse(byteps::server::DataHandleType, unsigned long, ps::KVMeta const&, ps::KVServer<char>*)+0x2b2) [0x7f0da25b6472]
[bt] (3) /usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(byteps::server::BytePSHandler(ps::KVMeta const&, ps::KVPairs<char> const&, ps::KVServer<char>*)+0x8f2) [0x7f0da25b8562]
[bt] (4) /usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x24b8e) [0x7f0da25beb8e]
[bt] (5) /usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x41d89) [0x7f0da25dbd89]
[bt] (6) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df) [0x7f0da1c9a6df]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f0da39046db]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f0da3c3da3f]
[03:01:27] src/van.cc:503: 9 => 8. Meta: request=1, timestamp=2, app_id=0, customer_id=0, simple_app=0, push=0, head=0, key=0, data_type={ UINT64 OTHER } Body: data_size=8 data_size=0
Thread 2 "python3.8" received signal SIGABRT, Aborted.
[Switching to Thread 0x7f0da1300700 (LWP 27)]
0x00007f0da3b5af47 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0 0x00007f0da3b5af47 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f0da3b5c8b1 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f0da1c69957 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3 0x00007f0da1c6fae6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x00007f0da1c6fb21 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5 0x00007f0da1c6fd54 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x00007f0da25bcf27 in dmlc::LogMessageFatal::~LogMessageFatal (this=0x7f0da12ff310, __in_chrg=<optimized out>) at 3rdparty/ps-lite/include/dmlc/logging.h:277
#7 0x00007f0da25b6472 in byteps::server::SendPullResponse (type=..., type@entry=..., key=<optimized out>, req_meta=..., server=server@entry=0x2018b60) at byteps/server/server.cc:52
#8 0x00007f0da25b8562 in byteps::server::BytePSHandler (req_meta=..., req_data=..., server=0x2018b60) at byteps/server/server.cc:376
#9 0x00007f0da25beb8e in std::function<void (ps::KVMeta const&, ps::KVPairs<char> const&, ps::KVServer<char>*)>::operator()(ps::KVMeta const&, ps::KVPairs<char> const&, ps::KVServer<char>*) const (__args#2=<optimized out>, __args#1=..., __args#0=..., this=0x2018dc8) at /usr/include/c++/7/bits/std_function.h:706
#10 ps::KVServer<char>::Process (this=0x2018b60, msg=...) at 3rdparty/ps-lite/include/ps/kv_app.h:516
#11 0x00007f0da25dbd89 in std::function<void (ps::Message const&)>::operator()(ps::Message const&) const (__args#0=..., this=0x2018e88) at /usr/include/c++/7/bits/std_function.h:706
#12 ps::Customer::Receiving (this=0x2018e80) at src/customer.cc:64
#13 0x00007f0da1c9a6df in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#14 0x00007f0da39046db in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#15 0x00007f0da3c3da3f in clone () from /lib/x86_64-linux-gnu/libc.so.6
root@025575bc0d1646e18ff8810e88261f03000000:/#
Describe the bug
Turning on asynchronous training (
export BYTEPS_ENABLE_ASYNC=1) crashes the bps server (during SendPullResponse in byteps/server/server.cc)Expected behavior
The expected behavior is for the training to go error-free, just like in synchronous training
Stack trace from the crashed server
These are produced by turning on
BYTEPS_ENABLE_GDB, settingBYTEPS_LOG_LEVELto INFO andPS_VERBOSEto 2To Reproduce
Steps to reproduce the behavior:
building the docker image
byteps set up
Environment (please complete the following information):
A few other things