============================== Release Notes: v0.104 ==============================
C++ API:
Support for new training algorithms:
Support for new network structures:
- Added GPT-3 transformers and training recipes
Support for new layers:
- Select operator (sets tensor values based on a predicate; see the sketch
  after this list)
- Model parallelism for channel-wise fully-connected layers
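
  The select operator's predicate-based semantics are roughly analogous to an
  elementwise where/select. The standalone sketch below is illustrative only
  and does not show LBANN's actual operator signature:

      import torch

      # Where the predicate holds, take the value from `a`; otherwise from `b`.
      predicate = torch.tensor([True, False, True, False])
      a = torch.tensor([1.0, 2.0, 3.0, 4.0])
      b = torch.tensor([-1.0, -2.0, -3.0, -4.0])
      out = torch.where(predicate, a, b)  # tensor([ 1., -2.,  3., -4.])
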
Python front-end:
- Support for PyTorch Module conversion to LBANN graphs (requires PyTorch 2.0
or newer, compiled with PyTorch Dynamo)
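
  The conversion path builds on PyTorch 2.0's Dynamo tracing. The sketch below
  shows only the generic Dynamo mechanism that such a converter plugs into (a
  backend callback that receives an FX graph); the backend body is illustrative
  and is not LBANN's converter API:

      import torch

      def inspect_backend(gm: torch.fx.GraphModule, example_inputs):
          # A Dynamo backend receives the captured FX graph; a converter such
          # as LBANN's front-end can walk these nodes and emit its own graph.
          gm.graph.print_tabular()
          return gm.forward  # fall back to eager execution in this sketch

      model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU())
      compiled = torch.compile(model, backend=inspect_backend)
      compiled(torch.randn(8, 16))  # triggers tracing and the backend callback
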
Performance optimizations:
- Support in-place computations for capable layers as a memory optimization
- Allow distconv-enabled convolution and batchnorm layers to reuse their
input activations as error signals as a memory optimization if the parent
layer does not need its activations in the backward pass. This optimization
can be disabled by setting the environment variable
DISTCONV_DISABLE_MEM_OPT=1.
- Added support for selective weight sharding (also known as
  Fully-Sharded Data Parallelism, or FSDP). To enable, set sharded=true
  on weight objects (see the sketch after this list).
- Allow distconv to be disabled at runtime with LBANN_DISABLE_DISTCONV=1.
- Activations are now deallocated via a reference counter when they are no
  longer needed; this can be disabled with LBANN_DISABLE_ACT_GC=1.
- Added an option for LBANN to set the number of OMP threads to a modest
  default (4) if the environment does not specify anything.
- Save memory on backpropagation by not replicating gradients between
GradientManager and data_type_optimizer
- Save more memory in FSDP by synchronizing previous outstanding
async communication calls and freeing up local gradient contributions
- FSDP: release full weight views after backprop
- Batching heads in multi-head attention into single operations
instead of on a per-head basis
- Stacking the weights and biases for queries/keys/values in
self-attention
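
  The attention optimizations above (batched heads and stacked
  queries/keys/values) can be illustrated with a plain PyTorch sketch, not
  LBANN code: one fused projection replaces three separate GEMMs, and all
  heads go through a single batched matmul instead of a per-head loop:

      import torch

      batch, seq, embed, heads = 2, 16, 64, 8
      head_dim = embed // heads
      x = torch.randn(batch, seq, embed)

      # One fused projection producing Q, K, and V at once, instead of three
      # separate weight matrices applied in three GEMMs.
      qkv_proj = torch.nn.Linear(embed, 3 * embed)
      q, k, v = qkv_proj(x).chunk(3, dim=-1)

      # All heads handled in one batched matmul rather than a per-head loop.
      def split_heads(t):
          return t.view(batch, seq, heads, head_dim).transpose(1, 2)

      q, k, v = (split_heads(t) for t in (q, k, v))
      attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim**0.5, dim=-1)
      out = (attn @ v).transpose(1, 2).reshape(batch, seq, embed)
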
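
  For the selective weight sharding (FSDP) item above, a minimal Python
  front-end sketch follows. The release note only states that sharded=true is
  set on weight objects, so the exact keyword placement on lbann.Weights below
  is an assumption; check the front-end documentation:

      import lbann

      # Assumption: "sharded=true on weight objects" maps to a `sharded`
      # keyword argument on lbann.Weights.
      fc_weights = lbann.Weights(
          initializer=lbann.GlorotNormalInitializer(),
          name="fc_weights",
          sharded=True,  # shard this weight object across ranks (FSDP-style)
      )
      x = lbann.Input(data_field="samples")
      y = lbann.FullyConnected(x, num_neurons=1024, has_bias=False,
                               weights=fc_weights)
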
Model portability & usability:
- Added support for profiling with Caliper
Experiments & Applications:
- Updated CosmoFlow model to automatically scale the model
architecture and parallelism with input size.
- Added a PyTorch reference implementation of CosmoFlow.
Internal features:
- Removed the mini_batch_size parameter from the following functions in
  the layer class hierarchy: fp_setup_inputs, fp_setup_outputs, and
  bp_setup_gradient_wrt_inputs; and from the distconv_adapter class:
  fp_setup and bp_setup
- Support global and local gradient norm clipping with the
  clip_gradient_norm callback (see the callback sketch after this list)
- Interactive progress bar with the progress_bar callback
- Evaluate progress callback allows periodic monitoring during
  training with an independent data set (intra-epoch evaluation)
- Detailed memory usage profiling with the memory_profiler callback
- Refactored subgraph parallelism
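
  As an example of attaching the new callbacks from the Python front-end, a
  hedged sketch follows. The class and argument names are inferred from
  LBANN's usual Callback<Name> naming convention and the callback names in
  these notes, so treat them as assumptions:

      import lbann

      # Minimal layer graph just to have a model to attach callbacks to.
      images = lbann.Input(data_field="samples")
      labels = lbann.Input(data_field="labels")
      probs = lbann.Softmax(lbann.FullyConnected(images, num_neurons=10))
      loss = lbann.CrossEntropy(probs, labels)

      # Assumed class/argument names for the callbacks described above.
      callbacks = [
          lbann.CallbackClipGradientNorm(global_norm=True, value=1.0),
          lbann.CallbackProgressBar(),
          lbann.CallbackMemoryProfiler(),
      ]

      model = lbann.Model(epochs=5,
                          layers=lbann.traverse_layer_graph(loss),
                          objective_function=loss,
                          callbacks=callbacks)
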
I/O & data readers:
- Renamed percent_of_data_to_use to the more accurate fraction_of_data_to_use.
- DataReaderMetaData, training_dr_linearized_data_size, and num_parallel_readers
were removed from the model and layer API, and instead reside in the data
ingestion pipeline.
- Fixed the implementation of background I/O to achieve better decoupling
  of background data fetch. It can be enabled or disabled with a runtime
  flag.
- Set the default number of I/O threads to 4
- Changed the I/O and transform pipeline to use a bank of RNGs indexed by
  the sample ID in the load sequence, rather than by the I/O thread ID.
  This eliminates variability when using different numbers of I/O threads
  (see the sketch after this list).
- Moved the state that tracks the current position in a data set from the
  data reader to the dataset class.
- Split the I/O RNGs into two banks: one for training and one for all
  other execution modes.
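
  The sample-ID-indexed RNG bank mentioned above can be illustrated with a
  small standalone sketch (not LBANN's implementation): each sample maps to a
  generator slot determined only by its position in the load sequence, so
  transform randomness does not depend on which I/O thread fetched the sample
  or on how many I/O threads exist:

      import numpy as np

      NUM_RNG_BANKS = 64
      BASE_SEED = 20240101

      # One generator per bank slot; a sample always maps to the same slot,
      # no matter which I/O thread loads it.
      rng_bank = [np.random.default_rng(BASE_SEED + i)
                  for i in range(NUM_RNG_BANKS)]

      def transform(sample_id, sample):
          rng = rng_bank[sample_id % NUM_RNG_BANKS]
          noise = rng.normal(scale=0.01, size=sample.shape)  # e.g. jitter
          return sample + noise
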
Build system:
- Updated the build script to use CachedCMakeProject mode, which should
  simplify the overall workflow
- Set a default time limit for CI tests to avoid unnecessary stalls
Bug fixes:
- Fixed a bug where in-place layers sometimes attached a locked view
of a matrix to a mutable view.
- Fixed a bug when trying to use the legacy HDF5 data reader without data store.
- Fixed concurrency bugs in the data store
- Fixed DistConv memory optimization bug
Retired features:
- Support for autoencoder strategy in the summarize images callback was removed
- Removed deprecated Layer protobuf fields: weight_data,
num_neurons_from_data_reader
- Removed support for calculating a global mini-batch across multiple
  models using the imcomm callback or multiple trainers. The
  mini-batch is now strictly contained to a single model in a single
  trainer. This deprecates an unused (and old) multi-model
  execution mode using the imcomm callback that predated LTFB.
- Removed the notion of effective mini-batch size versus current mini-batch size.
- Removed the world master mini-batch adjustment.
- Removed the model offset field. It is no longer necessary since data sets
  do not span models.
- Removed the cached value of the current mini-batch size from the SGD
  execution context. It is now only cached in the model.
- Removed the imcomm "inter-model" callback
- Removed the num-parallel-readers parameter to the I/O subsystem.
This eliminates an older version of I/O parallelism that relied on
a non-data-parallel I/O buffer and had different ranks fetching
entire mini-batches. It is superseded by standard data-parallel I/O.