Commit 13b5167

Merge branch 'release-v0.101'
============================== Release Notes: v0.101 ==============================

Support for new training algorithms:

Support for new network structures:
 - ATOM VAE model
 - Graph neural networks
 - Graph Convolutional Networks (GCN)
 - 3D U-Net Model

Support for new layers:
 - Implemented optimized GRU layer using cuDNN kernel
 - Graph Layers: GCN, GIN, Graph, GatedGraph

Python front-end:
 - Support for Graph and Graph Convolutional Networks
 - Added support for OCLF data center (Summit)

Performance optimizations:
 - Optimize CUDA kernel for tensor reordering in GRU layer
 - Enabled TensorCore optimization for GRU layer
 - GCN and Graph layers also have a faster Dense variant which only utilizes Matrix Multiplication

Model portability & usability:
 - Added Users Quickstart section to documentation including PyTorch
   to LBANN mini-tutorial
 - Added section on callbacks with detailed instructions on summarize
   images callback

Internal features:
 - Support for double data type in distributed embedding layer
 - Support for large number of channels in GPU batchnorm layer
 - Modified LTFB so that NaNs lose tournaments
 - Improved numerical stability of reconstruction loss in ATOM VAE
   model
 - Skip bad gradients in Adam

I/O & data readers:
 - Added support for ImageNet data reader to use sample lists
 - Refactored sample list code to be more flexible and generalize
   beyond JAG data reader
 - Added support for slab-based I/O in HDF5 data reader required by
   DistConv implementations of CosmoFlow 3D volumes
 - Extended slab-based HDF5 data reader to support labels and
   reconstruction modes for use with U-Net architecture

Datasets:
 - Added two graph datasets (MNIST, and PROTEINS)

Build system and Dependent Libraries:
 - Hydrogen 1.4.0
 - Aluminum 0.4.0
 - Spack v0.15.4+ (Requires new format for environments)
 - cuDNN 8.0.2
 - Require C++14
 - Added Spack build support for OCLF data center (Summit)

Bug fixes:
 - Properly reset data coordinator after each LTFB round
 - Fixed bug in weights proxy when weights buffer is reallocated
 - Bugfix for smiles data reader bound checking and simple LTFB data
   distribution
 - Eliminated a race condition observed in VAE ATOM model with SMILES
   data reader. Added a barrier after each data store mini-batch
   exchange -- avoid race between non-blocking sends and receives and
   later GPU kernel communication.

Retired features:

2 parents d0fbac3 + 6a0f8bf commit 13b5167

211 files changed, 9989 additions and 1268 deletions


CMakeLists.txt

Lines changed: 12 additions & 8 deletions
@@ -26,9 +26,9 @@ if (NOT DEFINED BUILD_SHARED_LIBS)
   set(BUILD_SHARED_LIBS ON)
 endif ()

-# Build with at least C++11 standard; allow newer standards.
+# Build with at least C++14 standard; allow newer standards.
 if (NOT CMAKE_CXX_STANDARD OR CMAKE_CXX_STANDARD EQUAL 98)
-  set(CMAKE_CXX_STANDARD 11)
+  set(CMAKE_CXX_STANDARD 14)
   set(CMAKE_CXX_STANDARD_REQUIRED TRUE)
 endif ()

@@ -48,7 +48,7 @@ endif ()
 #

 set(LBANN_VERSION_MAJOR 0)
-set(LBANN_VERSION_MINOR 100)
+set(LBANN_VERSION_MINOR 101)
 set(LBANN_VERSION_PATCH 0)

 set(LBANN_VERSION "${LBANN_VERSION_MAJOR}.${LBANN_VERSION_MINOR}.${LBANN_VERSION_PATCH}")
@@ -188,16 +188,20 @@ set(LBANN_HAS_CEREAL ${CEREAL_FOUND})
 # The imported target is just called "cereal". Super.

 # Setup the linear algebra library
-find_package(Hydrogen 1.3.3 NO_MODULE QUIET
+find_package(Hydrogen 1.4.0 NO_MODULE QUIET
   HINTS ${Hydrogen_DIR} ${HYDROGEN_DIR} $ENV{Hydrogen_DIR} $ENV{HYDROGEN_DIR}
   PATH_SUFFIXES lib/cmake/hydrogen
   NO_DEFAULT_PATH)
 if (NOT Hydrogen_FOUND)
-  find_package(Hydrogen 1.3.3 NO_MODULE QUIET REQUIRED)
+  find_package(Hydrogen 1.4.0 NO_MODULE QUIET REQUIRED)
 endif ()
 message(STATUS "Found Hydrogen: ${Hydrogen_DIR}")
 set(LBANN_HAS_HYDROGEN ${Hydrogen_FOUND})

+if (_HYDROGEN_HAVE_ROCM)
+  message(FATAL_ERROR "ROCm not yet supported in LBANN.")
+endif ()
+
 # DiHydrogen and Distconv
 if (LBANN_WITH_DISTCONV AND NOT LBANN_WITH_DIHYDROGEN)
   message(FATAL_ERROR "Distconv requires DiHydrogen. Enable DiHydrogen to use Distconv.")
@@ -260,7 +264,7 @@ if (LBANN_HAS_CUDA)
   enable_language(CUDA)

   if (NOT CMAKE_CUDA_STANDARD OR CMAKE_CUDA_STANDARD EQUAL 98)
-    set(CMAKE_CUDA_STANDARD 11)
+    set(CMAKE_CUDA_STANDARD 14)
   endif ()

   set(CMAKE_CUDA_STANDARD_REQUIRED TRUE)
@@ -271,13 +275,13 @@ if (LBANN_WITH_ALUMINUM)
   if (NOT Aluminum_FOUND)
     message(WARNING
       "Using Aluminum without Hydrogen support may not be well-supported.")
-    find_package(Aluminum 0.3.0 NO_MODULE QUIET
+    find_package(Aluminum 0.4.0 NO_MODULE QUIET
       HINTS ${Aluminum_DIR} ${ALUMINUM_DIR} ${AL_DIR}
       $ENV{Aluminum_DIR} $ENV{ALUMINUM_DIR} $ENV{AL_DIR}
       PATH_SUFFIXES lib64/cmake/aluminum lib/cmake/aluminum
       NO_DEFAULT_PATH)
     if (NOT Aluminum_FOUND)
-      find_package(Aluminum 0.3.0 NO_MODULE QUIET)
+      find_package(Aluminum 0.4.0 NO_MODULE QUIET)
     endif ()
   endif ()
   set(LBANN_HAS_ALUMINUM ${Aluminum_FOUND})

ReleaseNotes.txt

Lines changed: 69 additions & 0 deletions
@@ -21,6 +21,75 @@ Bug fixes:

 Retired features:

+============================== Release Notes: v0.101 ==============================
+
+Support for new training algorithms:
+
+Support for new network structures:
+ - ATOM VAE model
+ - Graph neural networks
+ - Graph Convolutional Networks (GCN)
+ - 3D U-Net Model
+
+Support for new layers:
+ - Implemented optimized GRU layer using cuDNN kernel
+ - Graph Layers: GCN, GIN, Graph, GatedGraph
+
+Python front-end:
+ - Support for Graph and Graph Convolutional Networks
+ - Added support for OCLF data center (Summit)
+
+Performance optimizations:
+ - Optimize CUDA kernel for tensor reordering in GRU layer
+ - Enabled TensorCore optimization for GRU layer
+ - GCN and Graph layers also have a faster Dense variant which only utilizes Matrix Multiplication
+
+Model portability & usability:
+ - Added Users Quickstart section to documentation including PyTorch
+   to LBANN mini-tutorial
+ - Added section on callbacks with detailed instructions on summarize
+   images callback
+
+Internal features:
+ - Support for double data type in distributed embedding layer
+ - Support for large number of channels in GPU batchnorm layer
+ - Modified LTFB so that NaNs lose tournaments
+ - Improved numerical stability of reconstruction loss in ATOM VAE
+   model
+ - Skip bad gradients in Adam
+
+I/O & data readers:
+ - Added support for ImageNet data reader to use sample lists
+ - Refactored sample list code to be more flexible and generalize
+   beyond JAG data reader
+ - Added support for slab-based I/O in HDF5 data reader required by
+   DistConv implementations of CosmoFlow 3D volumes
+ - Extended slab-based HDF5 data reader to support labels and
+   reconstruction modes for use with U-Net architecture
+
+Datasets:
+ - Added two graph datasets (MNIST, and PROTEINS)
+
+Build system and Dependent Libraries:
+ - Hydrogen 1.4.0
+ - Aluminum 0.4.0
+ - Spack v0.15.4+ (Requires new format for environments)
+ - cuDNN 8.0.2
+ - Require C++14
+ - Added Spack build support for OCLF data center (Summit)
+
+Bug fixes:
+ - Properly reset data coordinator after each LTFB round
+ - Fixed bug in weights proxy when weights buffer is reallocated
+ - Bugfix for smiles data reader bound checking and simple LTFB data
+   distribution
+ - Eliminated a race condition observed in VAE ATOM model with SMILES
+   data reader. Added a barrier after each data store mini-batch
+   exchange -- avoid race between non-blocking sends and receives and
+   later GPU kernel communication.
+
+Retired features:
+
 ============================== Release Notes: v0.100 ==============================
 Support for new network structures:
  - 3D molecular generation models for Metal Organic Frameworks from the CoRE MOF Database.
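
Note on "Skip bad gradients in Adam": the optimizer change itself is not part of the diff shown here, so the following is only a minimal sketch of one plausible reading, assuming "bad" means a gradient containing NaN or Inf and "skip" means leaving the weights, both moment estimates, and the step counter untouched. The function name `adam_step` and its signature are hypothetical and are not LBANN's API.

```python
import math


def adam_step(weights, grads, m, v, t,
              lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on plain Python lists (hypothetical helper, not LBANN code).

    Returns (weights, m, v, t, applied). If any gradient entry is NaN or
    Inf, the step is skipped and all state is returned unchanged.
    """
    # Assumption: a single non-finite entry invalidates the whole step.
    if not all(math.isfinite(g) for g in grads):
        return weights, m, v, t, False

    t += 1
    # Exponential moving averages of the gradient and its square.
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grads)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grads)]
    # Bias-corrected moment estimates.
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    weights = [w - lr * mh / (math.sqrt(vh) + eps)
               for w, mh, vh in zip(weights, m_hat, v_hat)]
    return weights, m, v, t, True
```

Skipping the moment updates as well as the weight update matters in this sketch: a NaN folded into the running averages would otherwise contaminate every later step.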
