Commit d0fbac3

Merge branch 'release-v0.100.0'
============================== Release Notes: v0.100 ==============================

Support for new network structures:
 - 3D molecular generation models for Metal Organic Frameworks from the CoRE MOF Database.
 - 3D CosmoFlow Model
 - DenseNet
 - ATOM LSTM model
 - RAS state classifier
 - node2vec
 - Transformer and other attention-based models
 - ExaGAN (formerly CosmoGAN)
 - MaCC ICF surrogate model

Applications:
 - Created a directory of example applications, deprecating the "model zoo" directory

Support for new layers:
 - Embedding layer
 - Distributed embedding layer
 - Channel-wise scale/bias layer
 - Entry-wise scale/bias layer
 - Gated-Recurrent Units (GRU)
 - Entry-wise batchnorm
 - Argmax, Argmin, and one-hot layers
 - Layer norm
 - Deconvolution layer (transposed convolution)
 - Layers for channel-wise operations (channel-wise fully-connected, channel-wise softmax, channel-wise scale/bias, instance norm)
 - Matrix multiply layer

Python front-end:
 - Can now configure contrib launcher with environment variables
 - Added NERSC compute center
 - Per-layer specification of compute device (CPU or GPU)
 - Option to write custom batch scripts with Python front-end

Performance optimizations:
 - Parallelized Python data reader with "multiprocessing" module
 - Fuse batchnorm stats allreduces in FP/BP.
 - Tuned concatenate and slice layer
 - Dynamically allocate and free memory for layer error signals (halves LBANN's memory footprint)

Model portability & usability:
 - Bamboo tests for individual layers

Internal features:
 - Added support for DistConv features (distributed, generalized, parallel convolution)
 - Added support for NVSHMEM 1.0 API (used in distributed embedding layer and DistConv halo exchange)
 - Support for multiple data types per model (per-layer)
 - Support for per-layer mixed-precision weight training and inference, includes per-weight object and objective function mixed-precision.
 - Improved how and when the RNGs are initialized
 - Callback to dump images to TensorBoard
 - Callback to save model weights (useful to export to PyTorch)
 - Callback to save top K models (LTFB)
 - Improved run-to-run reproducibility by initializing weights in alphabetical order
 - Moved models from model_zoo directory to applications directory
 - Cleanup and refactoring of callbacks and layer instantiation
 - Grouped batchnorm statistics
 - Callback to print model description
 - Refactored trainer and training-state out of the model class
 - Support for transposing data in matrix multiply layers
 - Added DiHydrogen tensor and DistConv library
 - Added parallel strategy to layer class to support DistConv
 - LBANN inference mode supports loading models from multiple directories
 - Cleanup of checkpoint and restart logic

I/O & data readers:
 - Added in-memory data store that caches samples in CPU memory. It can be loaded during the first epoch or preloaded
 - Added new "transform" data preprocessing ingestion pipeline
 - Added sample list format for specifying data sets
 - Introduced data coordinator that manages data readers and extracts them from the input layers
 - Data store is able to checkpoint / spill its contents to local disk
 - Data reader for SMILES strings

Build system:
 - Hydrogen 1.3.4
 - Aluminum 0.3.3
 - Improved documentation on Read the Docs (RTD)
 - Robust support for using Spack as a build system around CMake
 - Identified compute centers for specifying build and run dependencies
 - Added Catch2-based tests

Bug fixes:
 - Fixed path resolution for dump weights, save model, and checkpoint callbacks
 - Added mutexes for preloading the data store
 - Fixed the LTFB exchange to include all ADAM optimizer state
 - Fixed the mapping of I/O RNGs to I/O processing threads to ensure consistent and correct multi-threaded performance

Retired features:
 - Moving MNIST data reader is replaced by the Python data reader
 - ASCII data reader is deprecated
2 parents 018018b + e13d34c commit d0fbac3

1,308 files changed, +106623 -78559 lines changed


.gitignore

Lines changed: 5 additions & 0 deletions
@@ -13,3 +13,8 @@ data.prototext*
 # Can also ignore all directories and files in a directory.
 # tmp/**/*
 build
+spack_environments/users/
+
+
+# we don't want to collect slurm output
+**/slurm-*.out

.gitmodules

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+[submodule "applications/graph/snap"]
+	path = applications/graph/snap
+	url = https://github.com/snap-stanford/snap
+	ignore = dirty
+[submodule "applications/graph/largescale_node2vec"]
+	path = applications/graph/largescale_node2vec
+	url = https://lc.llnl.gov/bitbucket/scm/havoq/largescale_node2vec.git
+	ignore = dirty
+[submodule "applications/ATOM/moses"]
+	path = applications/ATOM/moses
+	url = [email protected]:samadejacobs/moses.git

.readthedocs.yml

Lines changed: 17 additions & 0 deletions
@@ -1,7 +1,24 @@
 # .readthedocs.yml
+# Config file for Read the Docs
+# https://docs.readthedocs.io/en/stable/config-file/v2.html
+
+version: 2
+
+sphinx:
+  builder: html
+  configuration: docs/conf.py
+
+formats: []

 build:
   image: latest

 python:
   version: 3.7
+  install:
+    - requirements: docs/sphinx_requirements.txt
+
+submodules:
+  include: []
+
+

CMakeLists.txt

Lines changed: 149 additions & 24 deletions
@@ -1,4 +1,4 @@
-cmake_minimum_required(VERSION 3.12)
+cmake_minimum_required(VERSION 3.13)

 project(LBANN CXX)

@@ -48,7 +48,7 @@ endif ()
 #

 set(LBANN_VERSION_MAJOR 0)
-set(LBANN_VERSION_MINOR 99)
+set(LBANN_VERSION_MINOR 100)
 set(LBANN_VERSION_PATCH 0)

 set(LBANN_VERSION "${LBANN_VERSION_MAJOR}.${LBANN_VERSION_MINOR}.${LBANN_VERSION_PATCH}")
@@ -104,6 +104,20 @@ option(LBANN_WITH_CONDUIT "Enable Conduit library" ON)

 option(LBANN_WITH_CUDNN "Include Nvidia cuDNN" ON)

+option(LBANN_WITH_DIHYDROGEN "Build with DiHydrogen support" OFF)
+if (LBANN_WITH_DIHYDROGEN)
+  message(WARNING "DiHydrogen support is currently experimental. "
+    "There is no stable interface. "
+    "Use caution before using any features.")
+endif (LBANN_WITH_DIHYDROGEN)
+
+option(LBANN_WITH_DISTCONV "Enable DiHydrogen's Distconv" OFF)
+if (LBANN_WITH_DISTCONV)
+  message(WARNING "Distconv support is currently experimental. "
+    "There is no stable interface. "
+    "Use caution before using any features.")
+endif (LBANN_WITH_DISTCONV)
+
 option(LBANN_WITH_HWLOC
   "Enable topology-aware optimizations" ON)

@@ -121,13 +135,10 @@ option(LBANN_WITH_VTUNE
 option(LBANN_WITH_UNIT_TESTING
   "Enable the unit testing framework (requires Catch2)" OFF)

-# Enable parallel random matrix generation, if possible
+# Use deterministic GPU algorithms and layer operations
 option(LBANN_DETERMINISTIC
   "Use deterministic algorithms as much as possible." OFF)

-option(LBANN_SEQUENTIAL_INITIALIZATION
-  "Sequentially consistent initialization" OFF)
-
 option(LBANN_DEBUG_PRINT_SUBTARGETS
   "Turn on debugging output of internal target properties." OFF)
 mark_as_advanced(LBANN_DEBUG_PRINT_SUBTARGETS)
@@ -161,6 +172,11 @@ include(SetupCXX)
 ################################################################

 # Required dependencies
+find_package(Threads REQUIRED)
+
+# Argument parsing backend
+find_package(Clara REQUIRED)
+
 find_package(CEREAL NO_MODULE
   HINTS ${CEREAL_DIR} $ENV{CEREAL_DIR}
   PATH_SUFFIXES share/cmake/cereal
@@ -172,16 +188,50 @@ set(LBANN_HAS_CEREAL ${CEREAL_FOUND})
 # The imported target is just called "cereal". Super.

 # Setup the linear algebra library
-find_package(Hydrogen 1.2.0 NO_MODULE QUIET
+find_package(Hydrogen 1.3.3 NO_MODULE QUIET
   HINTS ${Hydrogen_DIR} ${HYDROGEN_DIR} $ENV{Hydrogen_DIR} $ENV{HYDROGEN_DIR}
   PATH_SUFFIXES lib/cmake/hydrogen
   NO_DEFAULT_PATH)
 if (NOT Hydrogen_FOUND)
-  find_package(Hydrogen 1.2.0 NO_MODULE QUIET REQUIRED)
+  find_package(Hydrogen 1.3.3 NO_MODULE QUIET REQUIRED)
 endif ()
 message(STATUS "Found Hydrogen: ${Hydrogen_DIR}")
 set(LBANN_HAS_HYDROGEN ${Hydrogen_FOUND})

+# DiHydrogen and Distconv
+if (LBANN_WITH_DISTCONV AND NOT LBANN_WITH_DIHYDROGEN)
+  message(FATAL_ERROR "Distconv requires DiHydrogen. Enable DiHydrogen to use Distconv.")
+endif ()
+
+if (LBANN_WITH_DIHYDROGEN)
+  if (LBANN_WITH_DISTCONV)
+    find_package(DiHydrogen CONFIG COMPONENTS Meta Patterns DistConv
+      HINTS ${DIHYDROGEN_DIR} $ENV{DIHYDROGEN_DIR}
+      ${H2_DIR} $ENV{H2_DIR}
+      PATH_SUFFIXES install/lib64/cmake install/lib/cmake
+      NO_DEFAULT_PATH)
+    find_package(DiHydrogen CONFIG REQUIRED COMPONENTS Meta Patterns DistConv)
+    set(LBANN_HAS_DISTCONV TRUE)
+  else ()
+    find_package(DiHydrogen CONFIG COMPONENTS Meta Patterns
+      HINTS ${DIHYDROGEN_DIR} $ENV{DIHYDROGEN_DIR}
+      ${H2_DIR} $ENV{H2_DIR}
+      PATH_SUFFIXES install/lib64/cmake install/lib/cmake
+      NO_DEFAULT_PATH)
+    find_package(DiHydrogen CONFIG REQUIRED COMPONENTS Meta Patterns)
+  endif ()
+  set(LBANN_HAS_DIHYDROGEN TRUE)
+endif ()
+
+# Inherit half-precision stuff from Hydrogen
+set(LBANN_HAS_HALF ${HYDROGEN_HAVE_HALF}) # This is CPU-only
+
+# Not the ideal fix, but should be fine for now.
+if (Aluminum_FOUND)
+  message(STATUS "Aluminum found in Hydrogen. Using Aluminum.")
+  set(LBANN_WITH_ALUMINUM ON CACHE BOOL "Use aluminum." FORCE)
+endif ()
+
 include(SetupOpenMP)
 include(SetupMPI)
 include(SetupProtobuf)
@@ -201,6 +251,11 @@ set(LBANN_HAS_OPENCV ${OpenCV_FOUND})
 set(LBANN_HAS_CUDA ${_HYDROGEN_HAVE_CUDA})
 set(LBANN_WITH_CUDA ${LBANN_HAS_CUDA})

+# Only used if have GPU and have CPU half.
+if (LBANN_HAS_CUDA AND LBANN_HAS_HALF)
+  set(LBANN_HAS_GPU_FP16 ${HYDROGEN_GPU_USE_FP16})
+endif ()
+
 if (LBANN_HAS_CUDA)
   enable_language(CUDA)

@@ -214,13 +269,15 @@ endif ()
 if (LBANN_WITH_ALUMINUM)
   # Aluminum may have already been found by Hydrogen
   if (NOT Aluminum_FOUND)
-    find_package(Aluminum 0.2.0 NO_MODULE QUIET
+    message(WARNING
+      "Using Aluminum without Hydrogen support may not be well-supported.")
+    find_package(Aluminum 0.3.0 NO_MODULE QUIET
       HINTS ${Aluminum_DIR} ${ALUMINUM_DIR} ${AL_DIR}
       $ENV{Aluminum_DIR} $ENV{ALUMINUM_DIR} $ENV{AL_DIR}
       PATH_SUFFIXES lib64/cmake/aluminum lib/cmake/aluminum
       NO_DEFAULT_PATH)
     if (NOT Aluminum_FOUND)
-      find_package(Aluminum 0.2.0 NO_MODULE QUIET)
+      find_package(Aluminum 0.3.0 NO_MODULE QUIET)
     endif ()
   endif ()
   set(LBANN_HAS_ALUMINUM ${Aluminum_FOUND})
@@ -264,13 +321,28 @@ if (LBANN_HAS_CUDA)

   include(SetupCUDAToolkit)

+  if (LBANN_HAS_GPU_FP16)
+    set_property(TARGET cuda::toolkit PROPERTY
+      INTERFACE_COMPILE_OPTIONS $<$<COMPILE_LANGUAGE:CUDA>:-arch=sm_60>)
+  endif (LBANN_HAS_GPU_FP16)
+
   set(LBANN_HAS_CUDNN ${CUDNN_FOUND})

   if (LBANN_HAS_ALUMINUM AND AL_HAS_NCCL)
     set(LBANN_HAS_NCCL2 TRUE)
   else ()
     set(LBANN_HAS_NCCL2 FALSE)
   endif ()
+
+  if (LBANN_WITH_NVSHMEM)
+    find_package(NVSHMEM REQUIRED)
+    set_property(TARGET cuda::toolkit PROPERTY
+      INTERFACE_COMPILE_OPTIONS $<$<COMPILE_LANGUAGE:CUDA>:-arch=sm_70>)
+    # Build LBANN as a static library to get around a bug in NVSHMEM
+    set(BUILD_SHARED_LIBS OFF)
+  endif ()
+  set(LBANN_HAS_NVSHMEM "${NVSHMEM_FOUND}")
+
 endif (LBANN_HAS_CUDA)

 # This shouldn't be here, but is ok for now. This will occasionally be
@@ -415,22 +487,28 @@ if (LBANN_WITH_CONDUIT)
     endif ()
   endforeach ()

+  get_filename_component(_conduit_include_dirs
+    "${CONDUIT_INCLUDE_DIRS}" DIRECTORY)
+
   if (HDF5_FOUND_WITH_MODULE)
     list(APPEND _conduit_interface_link_libs
       ${HDF5_LIBRARIES})

-    set_target_properties(conduit::conduit
-      PROPERTIES
-      INTERFACE_INCLUDE_DIRECTORIES "${HDF5_INCLUDE_DIRS}")
+    list(APPEND _conduit_include_dirs
+      "${HDF5_INCLUDE_DIRS}")
   endif ()

+  set_property(TARGET conduit::conduit
+    PROPERTY
+    INTERFACE_INCLUDE_DIRECTORIES
+    "${_conduit_include_dirs}")
+
   set_target_properties(conduit::conduit
     PROPERTIES
     INTERFACE_LINK_LIBRARIES
     "${_conduit_interface_link_libs}")

   set(CONDUIT_LIBRARIES conduit::conduit)
-  set(LBANN_HAS_CONDUIT ${Conduit_FOUND})
 endif (LBANN_WITH_CONDUIT)

 if (LBANN_WITH_UNIT_TESTING)
@@ -446,7 +524,11 @@ if (LBANN_WITH_UNIT_TESTING)
   # Now that Catch2 has been found, start adding the unit tests
   include(CTest)
   include(Catch)
+  add_subdirectory(src/proto/unit_test)
   add_subdirectory(src/utils/unit_test)
+  add_subdirectory(src/weights/unit_test)
+  add_subdirectory(src/transforms/unit_test)
+  add_subdirectory(src/transforms/vision/unit_test)

   # Add this one last
   add_subdirectory(unit_test)
@@ -459,16 +541,16 @@ add_subdirectory(docs)
 # Build LBANN
 ################################################################

+# Add LBANN source files
+add_subdirectory(include)
+add_subdirectory(src)
+
 # Write the configure file
 configure_file(
   "${CMAKE_SOURCE_DIR}/cmake/configure_files/lbann_config.hpp.in"
   "${CMAKE_BINARY_DIR}/lbann_config.hpp"
   @ONLY)

-# Add LBANN source files
-add_subdirectory(include)
-add_subdirectory(src)
-
 # Create the LBANN library
 add_library(lbann ${LBANN_SOURCES} ${LBANN_HEADERS} ${LBANN_CUDA_SOURCES})

@@ -477,12 +559,10 @@ target_include_directories(lbann PUBLIC
   $<BUILD_INTERFACE:${CMAKE_SOURCE_DIR}/include>
   $<INSTALL_INTERFACE:${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_INCLUDEDIR}>)

-if (LBANN_HAS_PYTHON)
-  target_include_directories(lbann PUBLIC ${Python_INCLUDE_DIRS})
-endif ()
-
 # Use the IMPORTED targets when possible.
 target_link_libraries(lbann PUBLIC LbannProto)
+target_link_libraries(lbann PUBLIC Threads::Threads)
+target_link_libraries(lbann PUBLIC clara::clara)
 target_link_libraries(lbann PUBLIC cereal)
 target_link_libraries(lbann PUBLIC OpenMP::OpenMP_CXX)
 target_link_libraries(lbann PUBLIC MPI::MPI_CXX)
@@ -491,6 +571,15 @@ target_link_libraries(lbann PUBLIC ${HYDROGEN_LIBRARIES})
 target_link_libraries(lbann PUBLIC ${OpenCV_LIBRARIES})
 target_link_libraries(lbann PUBLIC ${CONDUIT_LIBRARIES})

+target_link_libraries(lbann PUBLIC
+  $<TARGET_NAME_IF_EXISTS:H2::H2Meta>
+  $<TARGET_NAME_IF_EXISTS:H2::H2Patterns>
+  )
+
+if (LBANN_WITH_DISTCONV)
+  target_link_libraries(lbann PUBLIC H2::H2DistConv)
+endif ()
+
 if (LBANN_HAS_TBINF)
   target_link_libraries(lbann PUBLIC TBinf)
 endif ()
@@ -512,7 +601,12 @@ if (LBANN_HAS_VTUNE)
 endif ()

 if (LBANN_HAS_PYTHON)
-  target_link_libraries(lbann PUBLIC ${Python_LIBRARIES})
+  target_link_libraries(lbann PUBLIC Python::Python)
+endif ()
+
+if (LBANN_HAS_NVSHMEM)
+  set_property(TARGET lbann PROPERTY CUDA_SEPARABLE_COMPILATION ON)
+  target_link_libraries(lbann PUBLIC NVSHMEM::NVSHMEM)
 endif ()

 if (TARGET LBANN_CXX_FLAGS_werror)
@@ -521,6 +615,27 @@ endif ()

 target_link_libraries(lbann PUBLIC ${DL_LIBRARY})

+# Fix the -g issue with Clang on OSX
+if (APPLE)
+  # Remove -g from the options
+  string(REPLACE "-g" "" CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS}")
+  string(REPLACE "-g" "" CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG}")
+
+  # Get all the sources and add "-g" to all of them.
+  get_target_property(_LBANN_SRCS lbann SOURCES)
+  set_source_files_properties(${_LBANN_SRCS}
+    PROPERTIES COMPILE_OPTIONS "-g")
+
+  # Cleanup source files
+  foreach (bad_file IN LISTS _LBANN_SRCS)
+    get_source_file_property(
+      _SRC_COMPILE_OPTS "${bad_file}" COMPILE_OPTIONS)
+    string(REPLACE "-g" "" _SRC_COMPILE_OPTS "${COMPILE_OPTIONS}")
+    set_source_files_properties(
+      "${bad_file}" PROPERTIES COMPILE_OPTIONS "${_SRC_COMPILE_OPTS}")
+  endforeach ()
+endif ()
+
 # Clean things up
 include(LBANNDebugUtilities)
 lbann_remove_default_include_paths_from_all_subtargets(lbann)
@@ -539,6 +654,8 @@ endif ()
 add_subdirectory(model_zoo)
 add_subdirectory(model_zoo/tests)
 add_subdirectory(model_zoo/jag_utils)
+add_subdirectory(applications/CANDLE/pilot2/tools)
+add_subdirectory(applications/ATOM/utils)
 add_subdirectory(tests)
 add_subdirectory(scripts)

@@ -733,6 +850,8 @@ string(APPEND _str "\n")
 #Print the true/false guys
 append_str_tf(_str
   LBANN_GNU_LINUX
+  LBANN_HAS_DIHYDROGEN
+  LBANN_HAS_DISTCONV
   LBANN_HAS_HYDROGEN
   LBANN_HAS_OPENCV
   LBANN_HAS_CEREAL
@@ -747,7 +866,6 @@ append_str_tf(_str
   LBANN_HAS_DOXYGEN
   LBANN_HAS_LBANN_PROTO
   LBANN_HAS_ALUMINUM
-  LBANN_HAS_CONDUIT
   LBANN_HAS_PYTHON)
 string(APPEND _str
   "\n== End LBANN Configuration Summary ==\n")
@@ -774,6 +892,13 @@ configure_file(
   "${CMAKE_SOURCE_DIR}/cmake/configure_files/lbann_module.lua.in"
   "${CMAKE_BINARY_DIR}/lbann_module.lua.install"
   @ONLY)
+configure_file(
+  "${CMAKE_SOURCE_DIR}/cmake/configure_files/lbann_module.tcl.in"
+  "${CMAKE_BINARY_DIR}/lbann_module.tcl.install")
+
 install(FILES "${CMAKE_BINARY_DIR}/lbann_module.lua.install"
   RENAME "${LBANN_MODULEFILE_NAME}"
   DESTINATION "${CMAKE_INSTALL_SYSCONFDIR}/modulefiles")
+install(FILES "${CMAKE_BINARY_DIR}/lbann_module.tcl.install"
+  RENAME "${LBANN_VERSION}"
+  DESTINATION "${CMAKE_INSTALL_SYSCONFDIR}/modulefiles/lbann")