TF2 Albert pretraining crashes intermittently #41

Open
@rondogency

Description

TF2 ALBERT pretraining crashes intermittently, in roughly 1 out of every 3 runs, when training with the latest Horovod on 8 nodes. The crash happens at around 3000 steps (step 3420 in the log below).

Error message:

Loss: 6.436, MLM: 6.029, SOP: 0.407, MLM_acc: 0.173, SOP_acc: 0.825:   3%|▎         | 3420/125000 [36:14<20:52:24,  1.62it/s]
[1,26]<stderr>:2020-09-16 08:18:47.774600: F tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:419] ptxas returned an error during compilation of ptx to sass: 'Internal: ptxas exited with non-zero error code 256, output: '  If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided.
[1,26]<stderr>:[ip-192-168-73-239:93007] *** Process received signal ***
[1,26]<stderr>:[ip-192-168-73-239:93007] Signal: Aborted (6)
[1,26]<stderr>:[ip-192-168-73-239:93007] Signal code:  (-6)
[1,26]<stderr>:[ip-192-168-73-239:93007] [ 0] [1,26]<stderr>:/lib/x86_64-linux-gnu/libpthread.so.0(+0x128a0)[0x7f2733f788a0]
[1,26]<stderr>:[ip-192-168-73-239:93007] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f2733bb3f47]
[1,26]<stderr>:[ip-192-168-73-239:93007] [ 2] [1,26]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f2733bb58b1]
[1,26]<stderr>:[ip-192-168-73-239:93007] [ 3] [1,26]<stderr>:/shared/tensorflow2_env/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0xdb80954)[0x7f26a7a3b954]
[1,26]<stderr>:[ip-192-168-73-239:93007] [ 4] [1,26]<stderr>:/shared/tensorflow2_env/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN3xla3gpu13NVPTXCompiler30CompileGpuAsmOrGetCachedResultEPN15stream_executor14StreamExecutorERKSsiiRKNS_15HloModuleConfigE+0xd28)[0x7f269f313cd8]
[1,26]<stderr>:[ip-192-168-73-239:93007] [ 5] [1,26]<stderr>:/shared/tensorflow2_env/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN3xla3gpu13NVPTXCompiler19CompileTargetBinaryEPKNS_9HloModuleEPN4llvm6ModuleEN4absl14lts_2020_02_257variantIJSt4pairIiiEiEEEPN15stream_executor14StreamExecutorE+0x562)[0x7f269f314352]
[1,26]<stderr>:[ip-192-168-73-239:93007] [ 6] [1,26]<stderr>:/shared/tensorflow2_env/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN3xla3gpu11GpuCompiler10RunBackendESt10unique_ptrINS_9HloModuleESt14default_deleteIS3_EEPN15stream_executor14StreamExecutorEPNS7_21DeviceMemoryAllocatorE+0xad8)[0x7f269f33c668]
[1,26]<stderr>:[ip-192-168-73-239:93007] [ 7] [1,26]<stderr>:/shared/tensorflow2_env/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN3xla7Service15BuildExecutableERKNS_14HloModuleProtoESt10unique_ptrINS_15HloModuleConfigESt14default_deleteIS5_EEPNS_7BackendEPN15stream_executor14StreamExecutorEPNSB_21DeviceMemoryAllocatorE+0x165)[0x7f269f2edfa5]
[1,26]<stderr>:[ip-192-168-73-239:93007] [ 8] [1,26]<stderr>:/shared/tensorflow2_env/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN3xla12LocalService18CompileExecutablesERKNS_14XlaComputationEN4absl14lts_2020_02_254SpanIKPKNS_5ShapeEEERKNS_22ExecutableBuildOptionsE+0x229d)[0x7f269f2e136d]
[1,26]<stderr>:[ip-192-168-73-239:93007] [ 9] [1,26]<stderr>:/shared/tensorflow2_env/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN3xla11LocalClient7CompileERKNS_14XlaComputationEN4absl14lts_2020_02_254SpanIKPKNS_5ShapeEEERKNS_22ExecutableBuildOptionsE+0x24e)[0x7f269f2c939e]
[1,26]<stderr>:[ip-192-168-73-239:93007] [10] [1,26]<stderr>:/shared/tensorflow2_env/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow19XlaCompilationCache15BuildExecutableERKNS_11XlaCompiler7OptionsERKNS1_17CompilationResultEPSt10unique_ptrIN3xla15LocalExecutableESt14default_deleteISA_EE+0x234)[0x7f269e4d5094]
[1,26]<stderr>:[ip-192-168-73-239:93007] [11] [1,26]<stderr>:/shared/tensorflow2_env/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow19XlaCompilationCache11CompileImplERKNS_11XlaCompiler7OptionsERKNS_12NameAttrListEN4absl14lts_2020_02_254SpanIKNS1_8ArgumentEEERKSt8functionIFNS_6StatusEPS1_PNS1_17CompilationResultEEENS9_8optionalIxEEPPKSH_PPN3xla15LocalExecutableE+0xa25)[0x7f269e4d8bb5]
[1,26]<stderr>:[ip-192-168-73-239:93007] [12] [1,26]<stderr>:/shared/tensorflow2_env/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow19XlaCompilationCache7CompileERKNS_11XlaCompiler7OptionsERKNS_12NameAttrListEN4absl14lts_2020_02_254SpanIKNS1_8ArgumentEEERKNS1_14CompileOptionsENS0_11CompileModeEPPKNS1_17CompilationResultEPPN3xla15LocalExecutableE+0xd6)[0x7f269e4daad6]
[1,26]<stderr>:[ip-192-168-73-239:93007] [13] [1,26]<stderr>:/shared/tensorflow2_env/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0x45f77ad)[0x7f269e4b27ad]
[1,26]<stderr>:[ip-192-168-73-239:93007] [14] [1,26]<stderr>:/shared/tensorflow2_env/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow12XlaCompileOp7ComputeEPNS_15OpKernelContextE+0xc6c)[0x7f269e4b3a2c]
[1,26]<stderr>:[ip-192-168-73-239:93007] [15] [1,26]<stderr>:/shared/tensorflow2_env/lib/python3.8/site-packages/tensorflow/python/../libtensorflow_framework.so.2(_ZN10tensorflow13BaseGPUDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x1c3)[0x7f26990059
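
The ptxas message above suggests checking filesystem space, so one cheap thing to rule out on each node is a full disk. The following is a minimal sketch, not part of the training code. It assumes the XLA/ptxas temporary files go to the system temp directory (whatever `tempfile.gettempdir()` resolves to, which follows `TMPDIR`); that has not been confirmed for this setup. `/shared` and `/scratch` are taken from the launch command, and the 1 GiB threshold and the `free_space_report` helper are hypothetical choices for illustration.

```python
import os
import shutil
import tempfile


def free_space_report(paths, min_free_gib=1.0):
    """Return {path: (free_gib, ok)} for every path that exists as a directory.

    min_free_gib is an arbitrary illustrative threshold, not a known
    requirement of ptxas or XLA.
    """
    report = {}
    for path in paths:
        if not os.path.isdir(path):
            continue  # skip paths that do not exist on this node
        free_gib = shutil.disk_usage(path).free / 2**30
        report[path] = (free_gib, free_gib >= min_free_gib)
    return report


if __name__ == "__main__":
    # Assumption: temp files land in the system temp dir; /shared and
    # /scratch are the directories used by the launch command.
    candidates = [tempfile.gettempdir(), "/shared", "/scratch"]
    for path, (free_gib, ok) in free_space_report(candidates).items():
        flag = "" if ok else "  <-- low"
        print(f"{path}: {free_gib:.1f} GiB free{flag}")
```

Running this on every node just before launch (and again after a crash) would show whether any of these filesystems is close to full. If all of them have plenty of space, the cause is probably elsewhere.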

Versions used:
Horovod: 0.20.0
TF: 2.3

Launch command used (adapted from the config in the ALBERT README):

/opt/amazon/openmpi/bin/mpirun --hostfile /shared/hosts -N 8 --allow-run-as-root --mca plm_rsh_no_tree_spawn 1 --mca btl_tcp_if_include ens5 \
  --tag-output --oversubscribe -x RDMAV_FORK_SAFE=1 -x LD_LIBRARY_PATH=/opt/amazon/openmpi/lib:$LD_LIBRARY_PATH -x PATH=/opt/amazon/openmpi/bin:$PATH \
  -x PYTHONPATH=$PYTHONPATH:/shared/deep-learning-models/models/nlp -x NCCL_SOCKET_IFNAME=ens5 -x NCCL_DEBUG=INFO \
  /shared/tensorflow2_env/bin/python /shared/deep-learning-models/models/nlp/albert/run_pretraining.py \
  --train_dir=/scratch/data/albert/train --val_dir=/scratch/data/albert/validation --log_dir=/shared --checkpoint_dir=/shared/checkpoints \
  --load_from=scratch --model_type=albert --model_size=base --per_gpu_batch_size=32 --gradient_accumulation_steps=2 --warmup_steps=3125 --total_steps=125000 --learning_rate=0.00176 --optimizer=lamb
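
For anyone reproducing this, the effective scale implied by the command above can be worked out as follows. This is a rough sketch under stated assumptions: `-N 8` starts 8 processes on each of the 8 nodes, each process drives one GPU, and gradient accumulation multiplies the effective batch size. The helper `global_batch_size` is hypothetical and not part of `run_pretraining.py`.

```python
def global_batch_size(nodes, procs_per_node, per_gpu_batch_size, grad_accum_steps):
    """Effective examples per optimizer step, assuming one process per GPU."""
    return nodes * procs_per_node * per_gpu_batch_size * grad_accum_steps


# Values from the issue: 8 nodes, -N 8, --per_gpu_batch_size=32,
# --gradient_accumulation_steps=2.
workers = 8 * 8
batch = global_batch_size(8, 8, 32, 2)
print(f"workers={workers}, global batch size={batch}")  # workers=64, global batch size=4096

# The logged crash step (3420) falls after --warmup_steps=3125.
# Whether that is related to the crash is unknown.
print(3420 > 3125)  # True
```

The rank tag `[1,26]` in the log is consistent with a 64-process job.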
