Hi,
I am testing the software but got stuck at the training phase: the machine hangs completely after allocating all of the memory (128G of RAM, 32 cores and 1 M60).
After restricting the container memory in docker-compose to 90G it seems to work (see the compose sketch below the error output), but when it does crash it first throws a warning like:
Warning: unable to close filehandle properly: Cannot allocate memory during global destruction.
And after a while this:
cdeep3m_1 | ERROR: caffe had a non zero exit code: 134
cdeep3m_1 | /home/cdeep3m/caffetrain.sh: line 166: 100 Aborted (core dumped) GLOG_log_dir=$log_dir caffe.bin train --solver=$model_dir/solver.prototxt --gpu $gpu $snapshot_opts > "${model_dir}/log/out.log" 2>&1
cdeep3m_1 | ERROR: caffe had a non zero exit code: 137
cdeep3m_1 | /home/cdeep3m/caffetrain.sh: line 166: 127 Killed GLOG_log_dir=$log_dir caffe.bin train --solver=$model_dir/solver.prototxt --gpu $gpu $snapshot_opts > "${model_dir}/log/out.log" 2>&1
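In case it helps, this is roughly how I capped the container memory. It is a minimal sketch rather than my exact file: the service name, image tag, shm_size and volume path are placeholders, not the real values from the CDeep3M compose file.

version: "2.3"
services:
  cdeep3m:
    image: ncmir/cdeep3m        # placeholder tag, not necessarily the real one
    runtime: nvidia             # nvidia-docker2 runtime so caffe.bin can see the M60
    mem_limit: 90g              # hard cap; the cgroup OOM killer kills caffe.bin once this is exceeded
    shm_size: 8g                # placeholder; caffe keeps a lot in shared memory (see shmem-rss in the kernel log below)
    volumes:
      - ./data:/data            # placeholder path

(mem_limit is a compose 2.x key; on the version 3 file format the limit would live under deploy.resources.limits.memory instead.)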
The GPU status looks like this:
nvidia-smi
Mon Apr 8 14:01:47 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.107      Driver Version: 410.107      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           On   | 00000000:06:00.0 Off |                  Off |
| 32%   36C    P0    36W / 120W |    262MiB /  8129MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           On   | 00000000:07:00.0 Off |                  Off |
| 32%   27C    P8    14W / 120W |     11MiB /  8129MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     30767      C   caffe.bin                                    109MiB |
|    0     30793      C   caffe.bin                                    109MiB |
+-----------------------------------------------------------------------------+
The kernel log shows the cgroup OOM killer firing inside the container:
Apr 8 14:01:06 opskvm01 kernel: Memory cgroup stats for /docker/6e765d2d36b931a1188c2c1f93552068f2d68d46e0060e11986265dd5fa83e0d: cache:93406836KB rss:1472KB rss_huge:0KB mapped_file:88703160KB swap:393296KB inactive_anon:4703640KB active_anon:88704632KB inactive_file:0KB active_file:0KB unevictable:0KB
Apr 8 14:01:06 opskvm01 kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Apr 8 14:01:06 opskvm01 kernel: [29973] 23446 29973 4545 322 14 99 0 runtraining.sh
Apr 8 14:01:06 opskvm01 kernel: [30186] 23446 30186 4516 324 14 71 0 trainworker.sh
Apr 8 14:01:06 opskvm01 kernel: [30203] 23446 30203 11475 703 27 3271 0 perl
Apr 8 14:01:06 opskvm01 kernel: [30256] 23446 30256 4546 325 14 101 0 caffetrain.sh
Apr 8 14:01:06 opskvm01 kernel: [30281] 23446 30281 40152923 10540653 23224 47801 0 caffe.bin
Apr 8 14:01:06 opskvm01 kernel: [30292] 23446 30292 4546 325 14 101 0 caffetrain.sh
Apr 8 14:01:06 opskvm01 kernel: [30314] 23446 30314 40153011 11681879 23151 46889 0 caffe.bin
Apr 8 14:01:06 opskvm01 kernel: [30697] 23446 30697 4570 498 14 0 0 bash
Apr 8 14:01:06 opskvm01 kernel: Memory cgroup out of memory: Kill process 30319 (caffe.bin) score 478 or sacrifice child
Apr 8 14:01:06 opskvm01 kernel: Killed process 30314 (caffe.bin) total-vm:160612044kB, anon-rss:0kB, file-rss:93580kB, shmem-rss:46633936kB
Apr 8 14:01:16 opskvm01 kernel: ___slab_alloc: 42 callbacks suppressed
Apr 8 14:01:16 opskvm01 kernel: SLUB: Unable to allocate memory on node -1 (gfp=0x80d0)
Apr 8 14:01:16 opskvm01 kernel: cache: taskstats(4:6e765d2d36b931a1188c2c1f93552068f2d68d46e0060e11986265dd5fa83e0d), object size: 328, buffer size: 328, default order: 2, min order: 0
Is anyone else having issues similar to this?
Cheers.