Commit acf1dac

============================== Release Notes: v0.104 ==============================
C++ API:

Support for new training algorithms:

Support for new network structures:
- Added GPT-3 transformers and training recipes

Support for new layers:
- Select operator (set tensor value based on predicate)
- Model parallelism for channel-wise fully-connected layers

Python front-end:
- Support for PyTorch Module conversion to LBANN graphs (requires PyTorch 2.0 or newer, compiled with PyTorch Dynamo)

Performance optimizations:
- Support in-place computations for capable layers as a memory optimization
- Allow distconv-enabled convolution and batchnorm layers to reuse their input activations as error signals as a memory optimization if the parent layer does not need its activations in the backward pass. This optimization can be disabled by setting the environment variable DISTCONV_DISABLE_MEM_OPT=1 (see the runtime-toggle sketch at the end of these notes).
- Added support for selective weight sharding (also known as Fully-Sharded Data Parallelism, or FSDP). To enable, set sharded=true on weight objects (see the Python front-end sketch below).
- Allow distconv to be disabled at runtime with LBANN_DISABLE_DISTCONV=1.
- Activations are now deallocated when no longer needed via a reference counter; disable with LBANN_DISABLE_ACT_GC=1.
- Added an option for LBANN to set the number of OMP threads to a modest default (4) if the environment doesn't specify anything.
- Save memory on backpropagation by not replicating gradients between GradientManager and data_type_optimizer
- Save more memory in FSDP by synchronizing previously outstanding async communication calls and freeing up local gradient contributions
- FSDP: release full weight views after backprop
- Batching heads in multi-head attention into single operations instead of operating on a per-head basis
- Stacking the weights and biases for queries/keys/values in self-attention

Model portability & usability:
- Added support for profiling with Caliper

Experiments & Applications:
- Updated the CosmoFlow model to automatically scale the model architecture and parallelism with input size.
- Added a PyTorch reference implementation of CosmoFlow.

Internal features:
- Removed the mini_batch_size parameter from the following functions in the layer class hierarchy: fp_setup_inputs, fp_setup_outputs, bp_setup_gradient_wrt_inputs, and from the distconv_adapter class: fp_setup, bp_setup
- Support global and local gradient norm clipping with the clip_gradient_norm callback
- Interactive progress bar with the progress_bar callback
- The evaluate progress callback allows for periodic monitoring during training with an independent data set (intra-epoch evaluation)
- Detailed memory usage profiling with the memory_profiler callback
- Refactored subgraph parallelism

I/O & data readers:
- Renamed percent_of_data_to_use more accurately to fraction_of_data_to_use.
- DataReaderMetaData, training_dr_linearized_data_size, and num_parallel_readers were removed from the model and layer API and instead reside in the data ingestion pipeline.
- Fixed the implementation of background I/O to achieve better decoupling of background data fetch. Can be enabled/disabled with a runtime flag.
- Set the default number of I/O threads to 4
- Changed the I/O and transform pipeline to use a bank of RNGs that is now indexed by the sample ID in the load sequence, rather than the I/O thread ID. This eliminates variability when using different numbers of I/O threads.
- Moved the state tracking the current position in a data set from the data reader to the dataset class.
- Split the I/O RNGs into two banks: one for training and one for all other execution modes.
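The weight sharding and the new callbacks above are driven from the Python front-end. The following is a minimal, illustrative sketch only: the sharded keyword on lbann.Weights and the callback class and argument names (CallbackProgressBar, CallbackMemoryProfiler, CallbackClipGradientNorm with global_norm/value) are assumptions inferred from the notes, not verbatim API.

    import lbann

    # Small illustrative layer graph.
    images = lbann.Input(data_field='samples')

    # Selective weight sharding (FSDP): the notes say "set sharded=true on
    # weight objects"; the keyword argument below is an assumption.
    fc_weights = lbann.Weights(initializer=lbann.HeNormalInitializer(),
                               sharded=True)
    hidden = lbann.Relu(lbann.FullyConnected(images, num_neurons=1024,
                                             weights=[fc_weights]))
    preds = lbann.FullyConnected(hidden, num_neurons=10)

    # Callbacks introduced in this release; class and argument names follow
    # LBANN's usual Callback<Name> convention but are not verified here.
    callbacks = [
        lbann.CallbackProgressBar(),       # interactive progress bar
        lbann.CallbackMemoryProfiler(),    # detailed memory usage profiling
        lbann.CallbackClipGradientNorm(global_norm=True, value=1.0),
    ]

    model = lbann.Model(epochs=10,
                        layers=lbann.traverse_layer_graph(preds),
                        callbacks=callbacks)

As with any front-end script, this only builds the model description; sharding and the callbacks take effect when the experiment is launched as usual.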
Build system:
- Updated the build script to use CachedCMakeProject mode, which should simplify the overall workflow
- Set a default time limit for CI tests to avoid unnecessary stalls

Bug fixes:
- Fixed a bug where in-place layers sometimes attached a locked view of a matrix to a mutable view.
- Fixed a bug when trying to use the legacy HDF5 data reader without the data store.
- Fixed concurrency bugs in the data store
- Fixed a DistConv memory optimization bug

Retired features:
- Support for the autoencoder strategy in the summarize images callback was removed
- Removed deprecated Layer protobuf fields: weight_data, num_neurons_from_data_reader
- Removed support for calculating a global mini-batch across multiple models using the imcomm callback or multiple trainers. The mini-batch is now strictly contained to a single model in a single trainer. This deprecates an unused (and old) multi-model execution mode using the imcomm callback that predated LTFB.
- Removed the notion of effective mini-batch size versus current mini-batch size.
- Removed the world-master mini-batch adjustment.
- Removed the model offset field; it is no longer necessary since data sets do not span models.
- Removed the cached value of the current mini-batch size from the SGD execution context. It is now only cached in the model.
- Removed the imcomm "inter-model" callback
- Removed the num-parallel-readers parameter of the I/O subsystem. This eliminates an older version of I/O parallelism that relied on a non-data-parallel I/O buffer and had different ranks fetching entire mini-batches. It is superseded by standard data-parallel I/O.
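Several of the memory optimizations in these notes are on by default and controlled purely through the environment. A sketch of turning them off for a debugging run, assuming the variables are set before LBANN starts (they would normally be exported in the job script rather than set from Python):

    import os

    # Runtime toggles named in these release notes; the value "1" disables
    # the corresponding feature or optimization.
    os.environ["DISTCONV_DISABLE_MEM_OPT"] = "1"  # distconv conv/batchnorm keep separate error-signal buffers
    os.environ["LBANN_DISABLE_DISTCONV"] = "1"    # disable distconv entirely at runtime
    os.environ["LBANN_DISABLE_ACT_GC"] = "1"      # keep activations instead of reference-counted deallocation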
1 parent 80eef8b commit acf1dac

File tree: 1 file changed (+27 -0 lines changed)

ReleaseNotes.txt

Lines changed: 27 additions & 0 deletions
@@ -3,6 +3,33 @@ C++ API:
 
 Support for new training algorithms:
 
+Support for new network structures:
+
+Support for new layers:
+
+Python front-end:
+
+Performance optimizations:
+
+Model portability & usability:
+
+Experiments & Applications:
+
+Internal features:
+
+I/O & data ingestion:
+
+Build system:
+
+Bug fixes:
+
+Retired features:
+
+============================== Release Notes: v0.104 ==============================
+C++ API:
+
+Support for new training algorithms:
+
 Support for new network structures:
 - Added GPT-3 transformers and training recipes
 