
Shared memory error #16

Open
@itamblyn

Description

Hello, I am trying to run your code, but I am getting errors about shared memory that don't make sense to me.

Output is below, along with some info about the machine (~50 GB RAM, a V100, and an 8-core Xeon).

python3 train.py --train_path omniglot/python/images_background \
                 --test_path omniglot/python/images_evaluation \
                 --gpu_ids 0 \
                 --model_path models

use gpu: 0 to train.
begin loading training dataset to memory
finish loading training dataset to memory
begin loading test dataset to memory
finish loading test dataset to memory
/usr/local/lib/python3.7/dist-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='mean' instead.
warnings.warn(warning.format(ret))
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 990, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/usr/lib/python3.7/multiprocessing/queues.py", line 113, in get
return _ForkingPickler.loads(res)
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/reductions.py", line 289, in rebuild_storage_fd
fd = df.detach()
File "/usr/lib/python3.7/multiprocessing/resource_sharer.py", line 57, in detach
with _resource_sharer.get_connection(self._id) as conn:
File "/usr/lib/python3.7/multiprocessing/resource_sharer.py", line 87, in get_connection
c = Client(address, authkey=process.current_process().authkey)
File "/usr/lib/python3.7/multiprocessing/connection.py", line 498, in Client
answer_challenge(c, authkey)
File "/usr/lib/python3.7/multiprocessing/connection.py", line 741, in answer_challenge
message = connection.recv_bytes(256) # reject large message
File "/usr/lib/python3.7/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/usr/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/usr/lib/python3.7/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "train.py", line 79, in
for batch_id, (img1, img2, label) in enumerate(trainLoader, 1):
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 521, in next
data = self._next_data()
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 1186, in _next_data
idx, data = self._get_data()
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 1152, in _get_data
success, data = self._try_get_data()
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 1014, in _try_get_data
fs = [tempfile.NamedTemporaryFile() for i in range(fds_limit_margin)]
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 1014, in
fs = [tempfile.NamedTemporaryFile() for i in range(fds_limit_margin)]
File "/usr/lib/python3.7/tempfile.py", line 677, in NamedTemporaryFile
prefix, suffix, dir, output_type = _sanitize_params(prefix, suffix, dir)
File "/usr/lib/python3.7/tempfile.py", line 265, in _sanitize_params
dir = gettempdir()
File "/usr/lib/python3.7/tempfile.py", line 433, in gettempdir
tempdir = _get_default_tempdir()
File "/usr/lib/python3.7/tempfile.py", line 349, in _get_default_tempdir
fp.write(b'blat')
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 488) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
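
For reference, here is a minimal sketch of the two workarounds I'm aware of for this kind of DataLoader shm error (the dataset below is a random stand-in, not the repo's actual Omniglot loader): disabling worker processes entirely, or switching torch.multiprocessing to the file_system sharing strategy.

```python
# Sketch only: a placeholder dataset, not the repo's Omniglot loader.
import torch
import torch.multiprocessing
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(256, 1, 105, 105), torch.zeros(256))

# Option 1: no worker processes, so nothing goes through /dev/shm.
loader = DataLoader(dataset, batch_size=128, num_workers=0)

# Option 2: keep workers, but share tensors via temp files instead of
# shared-memory file descriptors (slower, but tolerates a tiny /dev/shm).
torch.multiprocessing.set_sharing_strategy('file_system')
loader = DataLoader(dataset, batch_size=128, num_workers=4)

for imgs, labels in loader:
    pass  # a training step would go here
```

Neither of these is a real fix if /dev/shm itself is too small, but they might help confirm where the failure is.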

itamblyn@host free -m
              total        used        free      shared  buff/cache   available
Mem:          52329         716       41895           8        9717       51060
Swap:             0           0           0

itamblyn@host ipcs -lm

------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 18014398509465599
max total shared memory (kbytes) = 18014398509481980
min seg size (bytes) = 1
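
One thing I'm not sure about: ipcs only reports the SysV shared-memory limits, and as far as I understand the DataLoader workers actually back shared tensors with files under the /dev/shm tmpfs, whose size is a separate setting (containers often default to 64 MB). A quick check from Python:

```python
# Quick check of the /dev/shm tmpfs size, which is what the workers use;
# the ipcs limits above don't apply to it. Just a diagnostic, not repo code.
import shutil

usage = shutil.disk_usage("/dev/shm")
print(f"/dev/shm total: {usage.total / 2**20:.0f} MiB, "
      f"free: {usage.free / 2**20:.0f} MiB")
```

If that turns out to be small, raising it (e.g. Docker's --shm-size flag, or remounting the tmpfs with a larger size) would presumably be the real fix.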

nvidia-smi
Fri Aug 20 01:01:45 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P0    38W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
