
Machine freezes because of running out of memory. #8

@titansmc

Hi,
I am testing the software but I am stuck at the training phase: the machine hangs completely once all of the memory has been allocated (128G of RAM, 32 cores and 1 M60).
After restricting the memory to 90G in the docker-compose file it seems to work, but when it crashes it throws a warning like:

Warning: unable to close filehandle properly: Cannot allocate memory during global destruction.

And after a while this:

cdeep3m_1  | ERROR: caffe had a non zero exit code: 134
cdeep3m_1  | /home/cdeep3m/caffetrain.sh: line 166:   100 Aborted                 (core dumped) GLOG_log_dir=$log_dir caffe.bin train --solver=$model_dir/solver.prototxt --gpu $gpu $snapshot_opts > "${model_dir}/log/out.log" 2>&1
cdeep3m_1  | ERROR: caffe had a non zero exit code: 137
cdeep3m_1  | /home/cdeep3m/caffetrain.sh: line 166:   127 Killed                  GLOG_log_dir=$log_dir caffe.bin train --solver=$model_dir/solver.prototxt --gpu $gpu $snapshot_opts > "${model_dir}/log/out.log" 2>&1
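For readers decoding the two exit codes above: shells conventionally report a process that died from signal N as exit status 128 + N. The sketch below is not part of CDeep3M; `describe_exit_code` is a hypothetical helper name, and it assumes a Linux host where Python's `signal` module knows the signal numbers. Under that convention, 134 corresponds to an abort and 137 to a kill, which matches the "Aborted" and "Killed" messages in the log.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Describe a shell exit status, assuming the 128 + N signal convention."""
    if code > 128:
        signum = code - 128
        try:
            # signal.Signals maps a signal number to its symbolic name.
            name = signal.Signals(signum).name
        except ValueError:
            return f"exit status {code} (no known signal numbered {signum})"
        return f"terminated by {name} (signal {signum})"
    return f"exited with status {code}"


# The two codes reported by caffetrain.sh in the log above.
for code in (134, 137):
    print(code, "->", describe_exit_code(code))
```

A SIGKILL (137) inside a memory-limited container is commonly the kernel OOM killer at work, which is consistent with the kernel messages further down.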

GPU looks like:

nvidia-smi
Mon Apr  8 14:01:47 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.107      Driver Version: 410.107      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           On   | 00000000:06:00.0 Off |                  Off |
| 32%   36C    P0    36W / 120W |    262MiB /  8129MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           On   | 00000000:07:00.0 Off |                  Off |
| 32%   27C    P8    14W / 120W |     11MiB /  8129MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     30767      C   caffe.bin                                    109MiB |
|    0     30793      C   caffe.bin                                    109MiB |
+-----------------------------------------------------------------------------+

Apr  8 14:01:06 opskvm01 kernel: Memory cgroup stats for /docker/6e765d2d36b931a1188c2c1f93552068f2d68d46e0060e11986265dd5fa83e0d: cache:93406836KB rss:1472KB rss_huge:0KB mapped_file:88703160KB swap:393296KB inactive_anon:4703640KB active_anon:88704632KB inactive_file:0KB active_file:0KB unevictable:0KB
Apr  8 14:01:06 opskvm01 kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Apr  8 14:01:06 opskvm01 kernel: [29973] 23446 29973     4545      322      14       99             0 runtraining.sh
Apr  8 14:01:06 opskvm01 kernel: [30186] 23446 30186     4516      324      14       71             0 trainworker.sh
Apr  8 14:01:06 opskvm01 kernel: [30203] 23446 30203    11475      703      27     3271             0 perl
Apr  8 14:01:06 opskvm01 kernel: [30256] 23446 30256     4546      325      14      101             0 caffetrain.sh
Apr  8 14:01:06 opskvm01 kernel: [30281] 23446 30281 40152923 10540653   23224    47801             0 caffe.bin
Apr  8 14:01:06 opskvm01 kernel: [30292] 23446 30292     4546      325      14      101             0 caffetrain.sh
Apr  8 14:01:06 opskvm01 kernel: [30314] 23446 30314 40153011 11681879   23151    46889             0 caffe.bin
Apr  8 14:01:06 opskvm01 kernel: [30697] 23446 30697     4570      498      14        0             0 bash
Apr  8 14:01:06 opskvm01 kernel: Memory cgroup out of memory: Kill process 30319 (caffe.bin) score 478 or sacrifice child
Apr  8 14:01:06 opskvm01 kernel: Killed process 30314 (caffe.bin) total-vm:160612044kB, anon-rss:0kB, file-rss:93580kB, shmem-rss:46633936kB
Apr  8 14:01:16 opskvm01 kernel: ___slab_alloc: 42 callbacks suppressed
Apr  8 14:01:16 opskvm01 kernel: SLUB: Unable to allocate memory on node -1 (gfp=0x80d0)
Apr  8 14:01:16 opskvm01 kernel:  cache: taskstats(4:6e765d2d36b931a1188c2c1f93552068f2d68d46e0060e11986265dd5fa83e0d), object size: 328, buffer size: 328, default order: 2, min order: 0
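To put the OOM table above in perspective, here is a rough conversion of the `rss` column for the two `caffe.bin` processes. This is only a sketch: it assumes the column is counted in pages and that the page size is 4 KiB (the usual x86-64 default, not confirmed for this host). `pages_to_gib` is a hypothetical helper name; the numbers are copied from the log.

```python
PAGE_SIZE_BYTES = 4096  # assumption: 4 KiB pages, the common x86-64 default

# pid -> rss in pages, copied from the caffe.bin rows of the OOM table above
CAFFE_RSS_PAGES = {30281: 10540653, 30314: 11681879}


def pages_to_gib(pages: int, page_size: int = PAGE_SIZE_BYTES) -> float:
    """Convert a page count to GiB for the given page size."""
    return pages * page_size / 2**30


total_gib = 0.0
for pid, pages in CAFFE_RSS_PAGES.items():
    gib = pages_to_gib(pages)
    total_gib += gib
    print(f"caffe.bin pid {pid}: {gib:.1f} GiB resident")

print(f"both caffe.bin processes: {total_gib:.1f} GiB resident")
```

Under these assumptions the two training processes together sit close to the 90G container limit, and the "Killed process" line shows most of the killed process's memory as `shmem-rss`, that is, shared memory rather than ordinary anonymous memory.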

Is anyone else having issues similar to this?

Cheers.
