Hi,
I am trying to run the GAN tutorial on MNIST (I made some minor modifications for my system):
import argparse
import lbann
import lbann.launcher
from gan_model import build_model
from mnist_dataset import make_data_reader
mini_batch_size = 128
num_epochs = 100
job_name = "gan"
trainer = lbann.Trainer(mini_batch_size)
model = build_model(num_epochs)
data_reader = make_data_reader()
opt = lbann.Adam(learn_rate=1e-4, beta1=0., beta2=0.99, eps=1e-8)
kwargs = {
    "nodes": 1,
    "scheduler": "openmpi",
    "setup_only": True,
    "time_limit": 30,
}
lbann.run(trainer, model, data_reader, opt,
          job_name=job_name,
          **kwargs)
which gives the batch script:
export IBV_FORK_SAFE=1
echo "Started at $(date)"
mpiexec -n 1 --map-by ppr:1:node -wdir /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/project/tutorials_lbann/gan/mnist/20231117_145903_gan_n1_ppn1 /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/lbann-latest/build_newompi3/install/bin/lbann --prototext=/lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/project/tutorials_lbann/gan/mnist/20231117_145903_gan_n1_ppn1/experiment.prototext
status=$?
echo "Finished at $(date)"
exit ${status}
I get the error below (I had already added export IBV_FORK_SAFE=1 to the generated batch.sh script):
--------------------------------------------------------------------------
WARNING: There are more than one active ports on host 'sqg2b16', but the
default subnet GID prefix was detected on more than one of these
ports. If these ports are connected to different physical IB
networks, this configuration will fail in Open MPI. This version of
Open MPI requires that every physically separate IB subnet that is
used between connected MPI processes must have different subnet ID
values.
Please see this FAQ entry for more details:
http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid
NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_default_gid_prefix to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A process has executed an operation involving a call to the
"fork()" system call to create a child process. Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your job may hang, crash, or produce silent
data corruption. The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.
The process that invoked fork was:
Local host: [[6305,1],0] (PID 56764)
If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
****************************************************************
Caught signal 11 (SIGSEGV - invalid memory reference) on rank 0
Stack trace:
0: lbann::stack_trace::get[abi:cxx11]()
1: lbann::exception::exception(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)
2: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/lbann-latest/build_newompi3/install/lib64/liblbann.so.0.104.0(+0xc470a71) [0x2ad53e4f4a71] (could not find stack frame symbol)
3: /usr/lib64/libpthread.so.0(+0xf5d0) [0x2ad58bdc35d0] (could not find stack frame symbol)
4: std::_Hashtable<std::string, std::string, std::allocator<std::string>, std::__detail::_Identity, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, true, true> >::clear()
5: google::protobuf::DescriptorPool::FindFileByName(std::string const&) const
6: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/python3.7/site-packages/google/protobuf/pyext/_message.cpython-37m-x86_64-linux-gnu.so(+0xb8e7a) [0x2ad6193a9e7a] (could not find stack frame symbol)
7: _PyMethodDef_RawFastCallKeywords (demangling failed)
8: _PyMethodDescr_FastCallKeywords (demangling failed)
9: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6dbb5) [0x2ad587abcbb5] (could not find stack frame symbol)
10: _PyEval_EvalFrameDefault (demangling failed)
11: _PyEval_EvalCodeWithName (demangling failed)
12: PyEval_EvalCodeEx (demangling failed)
13: PyEval_EvalCode (demangling failed)
14: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x155c7e) [0x2ad587ba4c7e] (could not find stack frame symbol)
15: _PyMethodDef_RawFastCallDict (demangling failed)
16: _PyCFunction_FastCallDict (demangling failed)
17: _PyEval_EvalFrameDefault (demangling failed)
18: _PyEval_EvalCodeWithName (demangling failed)
19: _PyFunction_FastCallKeywords (demangling failed)
20: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol)
21: _PyEval_EvalFrameDefault (demangling failed)
22: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol)
23: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol)
24: _PyEval_EvalFrameDefault (demangling failed)
25: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol)
26: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol)
27: _PyEval_EvalFrameDefault (demangling failed)
28: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol)
29: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol)
30: _PyEval_EvalFrameDefault (demangling failed)
31: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol)
32: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x91fc9) [0x2ad587ae0fc9] (could not find stack frame symbol)
33: _PyObject_CallMethodIdObjArgs (demangling failed)
34: PyImport_ImportModuleLevelObject (demangling failed)
35: _PyEval_EvalFrameDefault (demangling failed)
36: _PyEval_EvalCodeWithName (demangling failed)
37: PyEval_EvalCodeEx (demangling failed)
38: PyEval_EvalCode (demangling failed)
39: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x155c7e) [0x2ad587ba4c7e] (could not find stack frame symbol)
40: _PyMethodDef_RawFastCallDict (demangling failed)
41: _PyCFunction_FastCallDict (demangling failed)
42: _PyEval_EvalFrameDefault (demangling failed)
43: _PyEval_EvalCodeWithName (demangling failed)
44: _PyFunction_FastCallKeywords (demangling failed)
45: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol)
46: _PyEval_EvalFrameDefault (demangling failed)
47: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol)
48: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol)
49: _PyEval_EvalFrameDefault (demangling failed)
50: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol)
51: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol)
52: _PyEval_EvalFrameDefault (demangling failed)
53: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol)
54: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol)
55: _PyEval_EvalFrameDefault (demangling failed)
56: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol)
57: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x91fc9) [0x2ad587ae0fc9] (could not find stack frame symbol)
58: _PyObject_CallMethodIdObjArgs (demangling failed)
59: PyImport_ImportModuleLevelObject (demangling failed)
60: _PyEval_EvalFrameDefault (demangling failed)
61: _PyEval_EvalCodeWithName (demangling failed)
62: PyEval_EvalCodeEx (demangling failed)
63: PyEval_EvalCode (demangling failed)
64: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x155c7e) [0x2ad587ba4c7e] (could not find stack frame symbol)
65: _PyMethodDef_RawFastCallDict (demangling failed)
66: _PyCFunction_FastCallDict (demangling failed)
67: _PyEval_EvalFrameDefault (demangling failed)
68: _PyEval_EvalCodeWithName (demangling failed)
69: _PyFunction_FastCallKeywords (demangling failed)
70: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol)
71: _PyEval_EvalFrameDefault (demangling failed)
72: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol)
73: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol)
74: _PyEval_EvalFrameDefault (demangling failed)
75: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol)
76: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol)
77: _PyEval_EvalFrameDefault (demangling failed)
78: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol)
79: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol)
80: _PyEval_EvalFrameDefault (demangling failed)
81: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol)
82: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x91fc9) [0x2ad587ae0fc9] (could not find stack frame symbol)
83: _PyObject_CallMethodIdObjArgs (demangling failed)
84: PyImport_ImportModuleLevelObject (demangling failed)
85: _PyEval_EvalFrameDefault (demangling failed)
86: _PyEval_EvalCodeWithName (demangling failed)
87: PyEval_EvalCodeEx (demangling failed)
88: PyEval_EvalCode (demangling failed)
89: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x155c7e) [0x2ad587ba4c7e] (could not find stack frame symbol)
90: _PyMethodDef_RawFastCallDict (demangling failed)
91: _PyCFunction_FastCallDict (demangling failed)
92: _PyEval_EvalFrameDefault (demangling failed)
93: _PyEval_EvalCodeWithName (demangling failed)
94: _PyFunction_FastCallKeywords (demangling failed)
95: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol)
96: _PyEval_EvalFrameDefault (demangling failed)
97: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol)
98: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol)
99: _PyEval_EvalFrameDefault (demangling failed)
100: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol)
101: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol)
102: _PyEval_EvalFrameDefault (demangling failed)
103: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol)
104: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol)
105: _PyEval_EvalFrameDefault (demangling failed)
106: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol)
107: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x91fc9) [0x2ad587ae0fc9] (could not find stack frame symbol)
108: _PyObject_CallMethodIdObjArgs (demangling failed)
109: PyImport_ImportModuleLevelObject (demangling failed)
110: _PyEval_EvalFrameDefault (demangling failed)
111: _PyEval_EvalCodeWithName (demangling failed)
112: PyEval_EvalCodeEx (demangling failed)
113: PyEval_EvalCode (demangling failed)
114: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x155c7e) [0x2ad587ba4c7e] (could not find stack frame symbol)
115: _PyMethodDef_RawFastCallDict (demangling failed)
116: _PyCFunction_FastCallDict (demangling failed)
117: _PyEval_EvalFrameDefault (demangling failed)
118: _PyEval_EvalCodeWithName (demangling failed)
119: _PyFunction_FastCallKeywords (demangling failed)
120: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol)
121: _PyEval_EvalFrameDefault (demangling failed)
122: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol)
123: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol)
124: _PyEval_EvalFrameDefault (demangling failed)
125: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x64432) [0x2ad587ab3432] (could not find stack frame symbol)
126: /lustre/scafellpike/local/HT04543/jxc06/jxw92-jxc06/miniconda3/envs/py37/lib/libpython3.7m.so.1.0(+0x6d936) [0x2ad587abc936] (could not find stack frame symbol)
127: _PyEval_EvalFrameDefault (demangling failed)
****************************************************************
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
FYI, I built LBANN with CMake (using Open MPI version 3.1.6). I am also using Python 3.7.
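In case it helps with triage, here is a minimal sketch of how the Python side of the environment could be reported. It assumes the protobuf Python package may be relevant, which is only a guess based on frames 5 and 6 of the stack trace above. `collect_env_info` is a hypothetical helper written for this report, not part of LBANN.

```python
import importlib
import platform


def collect_env_info():
    """Return a dict describing the Python interpreter and the protobuf package."""
    info = {"python": platform.python_version()}
    try:
        protobuf = importlib.import_module("google.protobuf")
        info["protobuf"] = getattr(protobuf, "__version__", "unknown")
        # Reports which protobuf backend is active (for example "cpp" or "python").
        api = importlib.import_module("google.protobuf.internal.api_implementation")
        info["protobuf_backend"] = api.Type()
    except ImportError:
        # protobuf is not importable from this interpreter.
        info["protobuf"] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in collect_env_info().items():
        print(f"{key}: {value}")
```

Running this with the same interpreter that appears in the stack trace (the py37 conda environment) should show whether the compiled protobuf extension is the one being loaded.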
Any help to resolve this error would be greatly appreciated.