@@ -4,10 +4,11 @@ C++ API:
 Support for new training algorithms:
 
 Support for new network structures:
- - Added GPT-3 transformers and training recipes
+ - Added GPT-3 transformers and training recipes
 
 Support for new layers:
- - Select operator (set tensor value based on predicate)
+ - Select operator (set tensor value based on predicate)
+ - Model parallelism for channel-wise fully-connected layers
 
 Python front-end:
  - Support for PyTorch Module conversion to LBANN graphs (requires PyTorch 2.0
@@ -20,24 +21,42 @@ Performance optimizations:
    layer does not need its activations in the backward pass. This optimization
    can be disabled by setting the environment variable
    DISTCONV_DISABLE_MEM_OPT=1.
- - Allow weights to be distributed across ranks by sharding them. Enable by
-   setting sharded=True in any weights object.
+ - Added support for selective weight sharding (also known as
+   Fully-Sharded Data Parallelism, or FSDP). To enable, set sharded=True
+   on weight objects (see the sketch below).
  - Allow distconv to be disabled at runtime with LBANN_DISABLE_DISTCONV=1.
  - Activations are now deallocated when no longer needed via a reference counter;
    disable with LBANN_DISABLE_ACT_GC=1.
  - Added option for LBANN to set the number of OMP threads to a modest
    default (4) if the environment doesn't specify anything.
+ - Save memory during backpropagation by not replicating gradients between
+   GradientManager and data_type_optimizer
+ - Save more memory in FSDP by synchronizing outstanding async communication
+   calls and freeing up local gradient contributions
+ - FSDP: release full weight views after backprop
+ - Batching heads in multi-head attention into single operations
+   instead of on a per-head basis
+ - Stacking the weights and biases for queries/keys/values in
+   self-attention
 
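A minimal sketch of the sharding feature above, assuming the Python front-end
forwards the new sharded flag directly on lbann.Weights; the layer, initializer,
and sizes here are placeholders, not taken from the notes:

    import lbann

    x = lbann.Input(data_field='samples')

    # FSDP-style sharding: each rank stores only its shard of these weights and
    # the full tensor is reassembled on demand for the forward/backward passes.
    fc_weights = lbann.Weights(
        initializer=lbann.HeNormalInitializer(),
        name='fc_weights',
        sharded=True,  # per the notes: set sharded=True on weight objects
    )
    y = lbann.FullyConnected(x, num_neurons=1024, weights=[fc_weights])
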
 Model portability & usability:
+ - Added support for profiling with Caliper
 
 Experiments & Applications:
+ - Updated the CosmoFlow model to automatically scale the model
+   architecture and parallelism with the input size.
+ - Added a PyTorch reference implementation of CosmoFlow.
 
 Internal features:
- - Fixed a bug where in-place layers sometimes attached a locked view
-   of a matrix to a mutable view.
  - Removed the mini_batch_size parameter from the following functions in the
    layer class hierarchy (fp_setup_inputs, fp_setup_outputs, bp_setup_gradient_wrt_inputs)
    and in the distconv_adapter class (fp_setup, bp_setup)
+ - Support global and local gradient norm clipping with the clip_gradient_norm
+   callback (see the sketch below)
+ - Interactive progress bar with the progress_bar callback
+ - Evaluate progress callback allows for periodic monitoring during
+   training with an independent data set (intra-epoch evaluation)
+ - Detailed memory usage profiling with the memory_profiler callback
+ - Refactored subgraph parallelism
 
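A hedged sketch of attaching the new callbacks from the Python front-end. The
lbann.Callback* class names and arguments below follow LBANN's usual mapping from
callback messages to generated Python classes and are assumptions, as is the toy
model around them:

    import lbann

    x = lbann.Input(data_field='samples')
    labels = lbann.Input(data_field='labels')
    probs = lbann.Softmax(lbann.FullyConnected(x, num_neurons=10))
    loss = lbann.CrossEntropy(probs, labels)

    callbacks = [
        lbann.CallbackProgressBar(),       # interactive progress bar
        lbann.CallbackMemoryProfiler(),    # detailed memory usage profiling
        lbann.CallbackClipGradientNorm(    # global/local gradient norm clipping
            global_norm=True, value=1.0),  # argument names are assumptions
    ]

    model = lbann.Model(
        10,  # epochs
        layers=lbann.traverse_layer_graph([x, labels]),
        objective_function=lbann.ObjectiveFunction([loss]),
        callbacks=callbacks,
    )
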
 I/O & data readers:
  - Renamed percent_of_data_to_use to the more accurate fraction_of_data_to_use
    (see the sketch below).
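A sketch of the renamed field in use, assuming the usual pattern of building the
data-reader message from the Python front-end via lbann.reader_pb2; the module
path and surrounding field names are assumptions, only the rename itself comes
from the notes:

    import lbann

    message = lbann.reader_pb2.DataReader()
    reader = message.reader.add()
    reader.name = 'mnist'
    reader.role = 'train'
    reader.shuffle = True
    reader.fraction_of_data_to_use = 0.5  # formerly percent_of_data_to_use
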
@@ -63,6 +82,11 @@ Build system:
  - Set a default time limit for CI tests to avoid unnecessary stalls
 
 Bug fixes:
+ - Fixed a bug where in-place layers sometimes attached a locked view
+   of a matrix to a mutable view.
+ - Fixed a bug when trying to use the legacy HDF5 data reader without the data store.
+ - Fixed concurrency bugs in the data store
+ - Fixed a DistConv memory optimization bug
 
 Retired features:
  - Support for the autoencoder strategy in the summarize images callback was removed