Releases: hpcaitech/ColossalAI
Version v0.1.7 Released Today
Highlights
- Began torch.fx support for auto-parallel training (see the sketch after this list)
- Updated the ZeRO mechanism with ColoTensor
- Fixed various bugs
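The auto-parallel work builds on torch.fx graph capture. Below is a minimal sketch of that tracing step using only vanilla PyTorch APIs; the toy model is illustrative, and ColossalAI's own ColoProxy and auto-parallel passes are not shown here.

```python
# Minimal sketch of torch.fx symbolic tracing, the graph-capture mechanism the
# new auto-parallel passes operate on. Only vanilla PyTorch APIs are used.
import torch
import torch.nn as nn
from torch.fx import symbolic_trace

class TinyMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(16, 32)
        self.fc2 = nn.Linear(32, 8)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

traced = symbolic_trace(TinyMLP())  # capture the forward pass as an fx.GraphModule
print(traced.graph)                 # node-level graph that parallelization passes can rewrite
```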
What's Changed
Hotfix
- [hotfix] prevent nested ZeRO (#1140) by ver217
- [hotfix]fix bugs caused by refactored pipeline (#1133) by YuliangLiu0306
- [hotfix] fix param op hook (#1131) by ver217
- [hotfix] fix zero init ctx numel (#1128) by ver217
- [hotfix]change to fit latest p2p (#1100) by YuliangLiu0306
- [hotfix] fix chunk comm src rank (#1072) by ver217
Zero
- [zero] avoid zero hook spam by changing log to debug level (#1137) by Frank Lee
- [zero] added error message to handle on-the-fly import of torch Module class (#1135) by Frank Lee
- [zero] fixed api consistency (#1098) by Frank Lee
- [zero] zero optim copy chunk rather than copy tensor (#1070) by ver217
DDP
- [ddp] add save/load state dict for ColoDDP (#1127) by ver217
- [ddp] add set_params_to_ignore for ColoDDP (#1122) by ver217
- [ddp] supported customized torch ddp configuration (#1123) by Frank Lee
Pipeline
- [pipeline]support List of Dict data (#1125) by YuliangLiu0306
- [pipeline] supported more flexible dataflow control for pipeline parallel training (#1108) by Frank Lee
- [pipeline] refactor the pipeline module (#1087) by Frank Lee
FX
- [fx]add autoparallel passes (#1121) by YuliangLiu0306
- [fx] added unit test for coloproxy (#1119) by Frank Lee
- [fx] added coloproxy (#1115) by Frank Lee
Gemini
- [gemini] gemini mgr supports "cpu" placement policy (#1118) by ver217
- [gemini] zero supports gemini (#1093) by ver217
Test
- [test] fixed hybrid parallel test case on 8 GPUs (#1106) by Frank Lee
- [test] skip tests when not enough GPUs are detected (#1090) by Frank Lee
- [test] ignore 8 gpu test (#1080) by Frank Lee
Tensor
- [tensor] refactor param op hook (#1097) by ver217
- [tensor] refactor chunk mgr and impl MemStatsCollectorV2 (#1077) by ver217
- [Tensor] fix equal assert (#1091) by Ziyue Jiang
- [Tensor] 1d row embedding (#1075) by Ziyue Jiang
- [tensor] chunk manager monitor mem usage (#1076) by ver217
- [Tensor] fix optimizer for CPU parallel (#1069) by Ziyue Jiang
- [Tensor] add hybrid device demo and fix bugs (#1059) by Ziyue Jiang
Workflow
- [workflow] fixed 8-gpu test workflow (#1101) by Frank Lee
- [workflow] added regular 8 GPU testing (#1099) by Frank Lee
- [workflow] disable p2p via shared memory on non-nvlink machine (#1086) by Frank Lee
Context
- [context] support lazy init of module (#1088) by Frank Lee
- [context] maintain the context object in with statement (#1073) by Frank Lee
Refactory
- [refactory] add nn.parallel module (#1068) by Jiarui Fang
Full Changelog: v0.1.6...v0.1.7
v0.1.6 Released!
Main Features
- ColoTensor supports hybrid parallel (tensor parallel and data parallel)
- ColoTensor supports ZeRO (with chunk)
- Config tensor parallel by module via ColoTensor
- ZeroInitContext and ShardedModelV2 support loading checkpoints and Hugging Face from_pretrained() (see the sketch after this list)
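A minimal sketch of the checkpoint-loading workflow described above: Hugging Face from_pretrained() is called inside a ZeRO init context so parameters are sharded as they are created. The colossalai import paths and the ZeroInitContext arguments are assumptions for illustration; check this version's API reference for the exact signature.

```python
# Hedged sketch: loading a Hugging Face checkpoint under ZeroInitContext.
# The colossalai import paths and constructor arguments below are assumptions,
# not the verified v0.1.6 API.
import torch
from transformers import BertModel
from colossalai.zero.init_ctx import ZeroInitContext          # assumed import path
from colossalai.zero.shard_utils import TensorShardStrategy   # assumed import path

with ZeroInitContext(target_device=torch.device('cuda'),
                     shard_strategy=TensorShardStrategy(),
                     shard_param=True):
    # from_pretrained() runs inside the context, so the full set of weights
    # never needs to reside on a single device at once.
    model = BertModel.from_pretrained('bert-base-uncased')
```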
What's Changed
ColoTensor
- [tensor] refactor colo-tensor by @ver217 in #992
- [tensor] refactor parallel action by @ver217 in #1007
- [tensor] impl ColoDDP for ColoTensor by @ver217 in #1009
- [Tensor] add module handler for linear by @Wesley-Jzy in #1021
- [Tensor] add module check and bert test by @Wesley-Jzy in #1031
- [Tensor] add Parameter inheritance for ColoParameter by @Wesley-Jzy in #1041
- [tensor] ColoTensor supports ZeRO by @ver217 in #1015
- [zero] add chunk size search for chunk manager by @ver217 in #1052
Zero
- [zero] add load_state_dict for sharded model by @ver217 in #894
- [zero] add zero optimizer for ColoTensor by @ver217 in #1046
Hotfix
- [hotfix] fix colo init context by @ver217 in #1026
- [hotfix] fix some bugs caused by size mismatch. by @YuliangLiu0306 in #1011
- [kernel] fixed the include bug in dropout kernel by @FrankLeeeee in #999
- fix typo in constants by @ryanrussell in #1027
- [engine] fixed bug in gradient accumulation dataloader to keep the last step by @FrankLeeeee in #1030
- [hotfix] fix dist spec mgr by @ver217 in #1045
- [hotfix] fix import error in sharded model v2 by @ver217 in #1053
CI
- [ci] update the docker image name by @FrankLeeeee in #1017
- [ci] added nightly build (#1018) by @FrankLeeeee in #1019
- [ci] fixed nightly build workflow by @FrankLeeeee in #1022
- [ci] fixed nightly build workflow by @FrankLeeeee in #1029
- [ci] fixed nightly build workflow by @FrankLeeeee in #1040
CLI
- [cli] remove unused imports by @FrankLeeeee in #1001
Documentation
- Hotfix/format by @binmakeswell in #987
- [doc] update docker instruction by @FrankLeeeee in #1020
Misc
- [NFC] Hotfix/format by @binmakeswell in #984
- Revert "[NFC] Hotfix/format" by @ver217 in #986
- remove useless import in tensor dir by @feifeibear in #997
- [NFC] fix download link by @binmakeswell in #998
- [Bot] Synchronize Submodule References by @github-actions in #1003
- [NFC] polish colossalai/kernel/cuda_native/csrc/colossal_C_frontend.c… by @zhengzangw in #1010
- [NFC] fix paper link by @binmakeswell in #1012
- [p2p]add object list send/recv by @YuliangLiu0306 in #1024
- [Bot] Synchronize Submodule References by @github-actions in #1034
- [NFC] add inference by @binmakeswell in #1044
- [titans]remove model zoo by @YuliangLiu0306 in #1042
- [NFC] add inference submodule in path by @binmakeswell in #1047
- [release] update version.txt by @FrankLeeeee in #1048
- [Bot] Synchronize Submodule References by @github-actions in #1049
- updated collective ops api by @kurisusnowdeng in #1054
- [pipeline]refactor ppschedule to support tensor list by @YuliangLiu0306 in #1050
New Contributors
- @ryanrussell made their first contribution in #1027
Full Changelog: v0.1.5...v0.1.6
v0.1.5 Released!
Main Features
- Enhanced ColoTensor and built a demo that trains BERT (from Hugging Face) with tensor parallelism without modifying the model.
What's Changed
ColoTensor
- [Tensor] add ColoTensor TP1Dcol Embedding by @Wesley-Jzy in #899
- [Tensor] add embedding tp1d row by @Wesley-Jzy in #904
- [Tensor] update pytest.mark.parametrize in tensor tests by @Wesley-Jzy in #913
- [Tensor] init ColoParameter by @feifeibear in #914
- [Tensor] add a basic bert. by @Wesley-Jzy in #911
- [Tensor] polish model test by @feifeibear in #915
- [Tensor] fix test_model by @Wesley-Jzy in #916
- [Tensor] add 1d vocab loss by @Wesley-Jzy in #918
- [Graph] building computing graph with ColoTensor, Linear only by @feifeibear in #917
- [Tensor] add from_pretrained support and bert pretrained test by @Wesley-Jzy in #921
- [Tensor] test pretrain loading on multi-process by @feifeibear in #922
- [tensor] hijack addmm for colo tensor by @ver217 in #923
- [tensor] colo tensor overrides mul by @ver217 in #927
- [Tensor] simplify named param by @Wesley-Jzy in #928
- [Tensor] fix init context by @Wesley-Jzy in #931
- [Tensor] add optimizer to bert test by @Wesley-Jzy in #933
- [tensor] design DistSpec and DistSpecManager for ColoTensor by @ver217 in #934
- [Tensor] add DistSpec for loss and test_model by @Wesley-Jzy in #947
- [tensor] derive compute pattern from dist spec by @ver217 in #971
Pipeline Parallelism
- [pipelinable]use pipelinable to support GPT model. by @YuliangLiu0306 in #903
CI
- [CI] add CI for releasing bdist wheel by @ver217 in #901
- [CI] fix release bdist CI by @ver217 in #902
- [ci] added wheel build scripts by @FrankLeeeee in #910
Misc
- [Bot] Synchronize Submodule References by @github-actions in #907
- [Bot] Synchronize Submodule References by @github-actions in #912
- [setup] update cuda ext cc flags by @ver217 in #919
- [setup] support more cuda architectures by @ver217 in #920
- [NFC] update results on a single GPU, highlight quick view by @binmakeswell in #981
Full Changelog: v0.1.4...v0.1.5
v0.1.4 Released!
Main Features
Here are the main improvements of this release:
- ColoTensor: A data structure that unifies the Tensor representation of different parallel methods.
- Gemini: A more efficient Gemini implementation that reduces the overhead of model data statistics collection.
- CLI: a command-line tool that helps users launch distributed training tasks more easily (a script-side sketch follows this list).
- Pipeline Parallelism (PP): a more user-friendly API for PP.
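For the CLI item above, the launcher ultimately hands control to a Python training script that sets up the distributed environment. The snippet below is a hedged sketch of that script-side pattern; the config contents are placeholders, and the exact CLI subcommand and initialize() return values are assumptions rather than a verified v0.1.4 recipe.

```python
# Hedged sketch of a training script driven by the distributed launcher,
# e.g. something like `colossalai run --nproc_per_node 4 train.py` (the exact
# subcommand is an assumption). Config keys and the model are placeholders.
import torch
import torch.nn as nn
import colossalai

def main():
    # Reads rank/world-size information from the environment prepared by the launcher.
    colossalai.launch_from_torch(config={})

    model = nn.Linear(1024, 1024)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.MSELoss()

    # Wrap the components into an engine that applies the configured parallelism.
    engine, *_ = colossalai.initialize(model=model,
                                       optimizer=optimizer,
                                       criterion=criterion)
    engine.train()

if __name__ == '__main__':
    main()
```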
What's Changed
ColoTensor
- [tensor]fix colo_tensor torch_function by @Wesley-Jzy in #825
- [tensor]fix test_linear by @Wesley-Jzy in #826
- [tensor] ZeRO use ColoTensor as the base class. by @feifeibear in #828
- [tensor] revert zero tensors back by @feifeibear in #829
- [Tensor] overriding parameters() for Module using ColoTensor by @feifeibear in #889
- [tensor] refine linear and add gather for layernorm by @Wesley-Jzy in #893
- [Tensor] test parameters() as member function by @feifeibear in #896
- [Tensor] activation is an attr of ColoTensor by @feifeibear in #897
- [Tensor] initialize the ColoOptimizer by @feifeibear in #898
- [tensor] reorganize files by @feifeibear in #820
- [Tensor] apply ColoTensor on Torch functions by @feifeibear in #821
- [Tensor] update ColoTensor torch_function by @feifeibear in #822
- [tensor] lazy init by @feifeibear in #823
- [WIP] Applying ColoTensor on TP-1D-row Linear. by @feifeibear in #831
- Init Context supports lazily allocating model memory by @feifeibear in #842
- [Tensor] TP Linear 1D row by @Wesley-Jzy in #843
- [Tensor] add assert for colo_tensor 1Drow by @Wesley-Jzy in #846
- [Tensor] init a simple network training with ColoTensor by @feifeibear in #849
- [Tensor ] Add 1Drow weight reshard by spec by @Wesley-Jzy in #854
- [Tensor] add layer norm Op by @feifeibear in #852
- [tensor] an initial idea of tensor spec by @feifeibear in #865
- [Tensor] colo init context add device attr. by @feifeibear in #866
- [tensor] add cross_entropy_loss by @feifeibear in #868
- [Tensor] Add function to spec and update linear 1Drow and unit tests by @Wesley-Jzy in #869
- [tensor] customized op returns ColoTensor by @feifeibear in #875
- [Tensor] get named parameters for model using ColoTensors by @feifeibear in #874
- [Tensor] Add some attributes to ColoTensor by @feifeibear in #877
- [Tensor] make a simple net works with 1D row TP by @feifeibear in #879
- [tensor] wrap function in the torch_tensor to ColoTensor by @Wesley-Jzy in #881
- [Tensor] make ColoTensor more robust for getattr by @feifeibear in #886
- [Tensor] test model check results for a simple net by @feifeibear in #887
- [tensor] add ColoTensor 1Dcol by @Wesley-Jzy in #888
Gemini + ZeRO
- [zero] add zero tensor shard strategy by @1SAA in #793
- Revert "[zero] add zero tensor shard strategy" by @feifeibear in #806
- [gemini] a new tensor structure by @feifeibear in #818
- [gemini] APIs to set cpu memory capacity by @feifeibear in #809
- [DO NOT MERGE] [zero] init fp16 params directly in ZeroInitContext by @ver217 in #808
- [gemini] collect cpu-gpu moving volume in each iteration by @feifeibear in #813
- [gemini] add GeminiMemoryManger by @1SAA in #832
- [zero] use GeminiMemoryManager when sampling model data by @ver217 in #850
- [gemini] polish code by @1SAA in #855
- [gemini] add stateful tensor container by @1SAA in #867
- [gemini] polish stateful_tensor_mgr by @1SAA in #876
- [gemini] accelerate adjust_layout() by @ver217 in #878
CLI
- [cli] added distributed launcher command by @YuliangLiu0306 in #791
- [cli] added micro benchmarking for tp by @YuliangLiu0306 in #789
- [cli] add missing requirement by @FrankLeeeee in #805
- [cli] fixed a bug in user args and refactored the module structure by @FrankLeeeee in #807
- [cli] fixed single-node process launching by @FrankLeeeee in #812
- [cli] added check installation cli by @FrankLeeeee in #815
- [CLI] refactored the launch CLI and fixed bugs in multi-node launching by @FrankLeeeee in #844
- [cli] refactored micro-benchmarking cli and added more metrics by @FrankLeeeee in #858
Pipeline Parallelism
- [pipelinable]use pipelinable context to initialize non-pipeline model by @YuliangLiu0306 in #816
- [pipelinable]use ColoTensor to replace dummy tensor. by @YuliangLiu0306 in #853
Misc
- [hotfix] fix auto tensor placement policy by @ver217 in #775
- [hotfix] change the check assert in split batch 2d by @Wesley-Jzy in #772
- [hotfix] fix bugs in zero by @1SAA in #781
- [hotfix] fix grad offload when enabling reuse_fp16_shard by @ver217 in #784
- [refactor] moving memtracer to gemini by @feifeibear in #801
- [log] display tflops if available by @feifeibear in #802
- [refactor] moving grad acc logic to engine by @feifeibear in #804
- [log] local throughput metrics by @feifeibear in #811
- [Bot] Synchronize Submodule References by @github-actions in #810
- [Bot] Synchronize Submodule References by @github-actions in #819
- [refactor] moving InsertPostInitMethodToModuleSubClasses to utils. by @feifeibear in #824
- [setup] allow installation with python 3.6 by @FrankLeeeee in #834
- Revert "[WIP] Applying ColoTensor on TP-1D-row Linear." by @feifeibear in #835
- [dependency] removed torchvision by @FrankLeeeee in #833
- [Bot] Synchronize Submodule References by @github-actions in #827
- [unittest] refactored unit tests for change in dependency by @FrankLeeeee in #838
- [setup] use env var instead of option for cuda ext by @FrankLeeeee in #839
- [hotfix] ColoTensor pin_memory by @feifeibear in #840
- modified the pp build for ckpt adaptation by @Gy-Lu in #803
- [hotfix] the bug of numel() in ColoTensor by @feifeibear in #845
- [hotfix] fix _post_init_method of zero init ctx by @ver217 in #847
- [hotfix] add deconstructor for stateful tensor by @ver217 in #848
- [utils] refactor profiler by @ver217 in #837
- [ci] cache cuda extension by @FrankLeeeee in #860
- hotfix tensor unittest bugs by @feifeibear in #862
- [usability] added assertion message in registry by @FrankLeeeee in #864
- [doc] improved docstring in the communication module by @FrankLeeeee in #863
- [doc] improved docstring in the logging module by @FrankLeeeee in #861
- [doc] improved docstring in the amp module by @FrankLeeeee in #857
- [usability] improved error messages in the context modu...
V0.1.3 Released!
Overview
Here are the main improvements of this release:
- Gemini: Heterogeneous memory space manager
- Refactor the API of pipeline parallelism
What's Changed
Features
- [zero] initialize a stateful tensor manager by @feifeibear in #614
- [pipeline] refactor pipeline by @YuliangLiu0306 in #679
- [zero] stateful tensor manager by @ver217 in #687
- [zero] adapt zero hooks for unsharded module by @1SAA in #699
- [zero] refactor memstats collector by @ver217 in #706
- [zero] improve adaptability for not-shard parameters by @1SAA in #708
- [zero] check whether gradients have inf and nan in gpu by @1SAA in #712
- [refactor] refactor the memory utils by @feifeibear in #715
- [util] support detection of number of processes on current node by @FrankLeeeee in #723
- [utils] add synchronized cuda memory monitor by @1SAA in #740
- [zero] refactor ShardedParamV2 by @1SAA in #742
- [zero] add tensor placement policies by @ver217 in #743
- [zero] use factory pattern for tensor_placement_policy by @feifeibear in #752
- [zero] refactor memstats_collector by @1SAA in #746
- [gemini] init gemini individual directory by @feifeibear in #754
- refactor shard and gather operation by @1SAA in #773
Bug Fix
- [zero] fix init bugs in zero context by @1SAA in #686
- [hotfix] update requirements-test by @ver217 in #701
- [hotfix] fix a bug in 3d vocab parallel embedding by @kurisusnowdeng in #707
- [compatibility] fixed tensor parallel compatibility with torch 1.9 by @FrankLeeeee in #700
- [hotfix]fixed bugs of assigning grad states to non leaf nodes by @Gy-Lu in #711
- [hotfix] fix stateful tensor manager's cuda model data size by @ver217 in #710
- [bug] fixed broken test_found_inf by @FrankLeeeee in #725
- [util] fixed activation checkpointing on torch 1.9 by @FrankLeeeee in #719
- [util] fixed communication API with PyTorch 1.9 by @FrankLeeeee in #721
- [bug] removed zero installation requirements by @FrankLeeeee in #731
- [hotfix] remove duplicated param register to stateful tensor manager by @feifeibear in #728
- [utils] correct cpu memory used and capacity in the context of multi-process by @feifeibear in #726
- [bug] fixed grad scaler compatibility with torch 1.8 by @FrankLeeeee in #735
- [bug] fixed DDP compatibility with torch 1.8 by @FrankLeeeee in #739
- [hotfix] fix memory leak in backward of sharded model by @ver217 in #741
- [hotfix] fix initialize about zero by @ver217 in #748
- [hotfix] fix prepare grads in sharded optim by @ver217 in #749
- [hotfix] layernorm by @kurisusnowdeng in #750
- [hotfix] fix auto tensor placement policy by @ver217 in #753
- [hotfix] fix reuse_fp16_shard of sharded model by @ver217 in #756
- [hotfix] fix test_stateful_tensor_mgr by @ver217 in #762
- [compatibility] used backward-compatible API for global process group by @FrankLeeeee in #758
- [hotfix] fix the ckpt hook bugs when using DDP by @Gy-Lu in #769
- [hotfix] polish sharded optim docstr and warning by @ver217 in #770
Unit Testing
- [ci] replace the ngc docker image with self-built pytorch image by @FrankLeeeee in #672
- [ci] fixed compatibility workflow by @FrankLeeeee in #678
- [ci] update workflow trigger condition and support options by @FrankLeeeee in #691
- [ci] added missing field in workflow by @FrankLeeeee in #692
- [ci] remove ipc config for rootless docker by @FrankLeeeee in #694
- [test] added missing decorators to model checkpointing tests by @FrankLeeeee in #727
- [unitest] add checkpoint for moe zero test by @1SAA in #729
- [test] added a decorator for address already in use error with backward compatibility by @FrankLeeeee in #760
- [test] refactored with the new rerun decorator by @FrankLeeeee in #763
Documentation
- add PaLM link by @binmakeswell in #704
- [doc] removed outdated installation command by @FrankLeeeee in #730
- add video by @binmakeswell in #732
- [readme] polish readme by @feifeibear in #764
- [readme] sync CN readme by @binmakeswell in #766
Miscellaneous
- [Bot] Synchronize Submodule References by @github-actions in #556
- [Bot] Synchronize Submodule References by @github-actions in #695
- [refactor] zero directory by @feifeibear in #724
- [Bot] Synchronize Submodule References by @github-actions in #751
Full Changelog: v0.1.2...v0.1.3
V0.1.2 Released!
Overview
Here are the main improvements of this release:
- MoE and BERT models can now be trained with ZeRO.
- Provided a uniform checkpoint format for all kinds of parallelism.
- Optimized ZeRO-offload and improved model scaling.
- Designed a uniform model memory tracer.
- Implemented an efficient hybrid Adam with CPU and CUDA kernels (see the sketch after this list).
- Improved activation offloading.
- Released a beta version of the profiler TensorBoard plugin.
- Refactored the pipeline module for closer integration with the engine.
- Added Chinese tutorials as well as WeChat and Slack user groups.
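A small sketch of the hybrid Adam mentioned above: a single optimizer that updates GPU-resident parameters with a fused CUDA kernel and CPU-resident parameters (for example, those offloaded by ZeRO) with a CPU kernel. The import path is taken from later ColossalAI releases and is an assumption here.

```python
# Hedged sketch of HybridAdam usage; the import path is an assumption and the
# model/data are toys.
import torch
import torch.nn as nn
from colossalai.nn.optimizer import HybridAdam  # assumed import path

model = nn.Linear(512, 512).cuda()
optimizer = HybridAdam(model.parameters(), lr=1e-3)

loss = model(torch.randn(16, 512, device='cuda')).sum()
loss.backward()
# GPU parameters are updated by the fused CUDA kernel; parameters kept in host
# memory would fall back to the CPU kernel.
optimizer.step()
optimizer.zero_grad()
```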
What's Changed
Features
- [zero] get memory usage for sharded param by @feifeibear in #536
- [zero] improve the accuracy of get_memory_usage of sharded param by @feifeibear in #538
- [zero] refactor model data tracing by @feifeibear in #537
- [zero] get memory usage of sharded optim v2. by @feifeibear in #542
- [zero] polish ZeroInitContext by @ver217 in #540
- [zero] optimize grad offload by @ver217 in #539
- [zero] non model data tracing by @feifeibear in #545
- [zero] add zero config to neutralize zero context init by @1SAA in #546
- [zero] dump memory stats for sharded model by @feifeibear in #548
- [zero] add stateful tensor by @feifeibear in #549
- [zero] label state for param fp16 and grad by @feifeibear in #551
- [zero] hijack p.grad in sharded model by @ver217 in #554
- [utils] update colo tensor moving APIs by @feifeibear in #553
- [polish] rename col_attr -> colo_attr by @feifeibear in #558
- [zero] trace states of fp16/32 grad and fp32 param by @ver217 in #571
- [zero] adapt zero for unsharded parameters by @1SAA in #561
- [refactor] memory utils by @feifeibear in #577
- Feature/checkpoint gloo by @kurisusnowdeng in #589
- [zero] add sampling time for memstats collector by @Gy-Lu in #610
- [model checkpoint] checkpoint utils by @kurisusnowdeng in #592
- [model checkpoint][hotfix] unified layers for save&load by @kurisusnowdeng in #593
- Feature/checkpoint 2D by @kurisusnowdeng in #595
- Feature/checkpoint 1D by @kurisusnowdeng in #594
- [model checkpoint] CPU communication ops by @kurisusnowdeng in #590
- Feature/checkpoint 2.5D by @kurisusnowdeng in #596
- Feature/Checkpoint 3D by @kurisusnowdeng in #597
- [model checkpoint] checkpoint hook by @kurisusnowdeng in #598
- Feature/Checkpoint tests by @kurisusnowdeng in #599
- [zero] adapt zero for unsharded parameters (Optimizer part) by @1SAA in #601
- [zero] polish init context by @feifeibear in #645
- refactor pipeline---put runtime schedule into engine. by @YuliangLiu0306 in #627
Bug Fix
- [Zero] process no-leaf-module in Zero by @1SAA in #535
- Add gather_out arg to Linear by @Wesley-Jzy in #541
- [hotfix] fix parallel_input flag for Linear1D_Col gather_output by @Wesley-Jzy in #579
- [hotfix] add hybrid adam to init by @ver217 in #584
- Hotfix/path check util by @kurisusnowdeng in #591
- [hotfix] fix sharded optim zero grad by @ver217 in #604
- Add tensor parallel input check by @Wesley-Jzy in #621
- [hotfix] Raise messages for indivisible batch sizes with tensor parallelism by @number1roy in #622
- [zero] fixed the activation offload by @Gy-Lu in #647
- fixed bugs in CPU adam by @1SAA in #633
- Revert "[zero] polish init context" by @feifeibear in #657
- [hotfix] fix a bug in model data stats tracing by @feifeibear in #655
- fix bugs for unsharded parameters when restore data by @1SAA in #664
Unit Testing
- [zero] test zero tensor utils by @FredHuang99 in #609
- remove hybrid adam in test_moe_zero_optim by @1SAA in #659
Documentation
- Refactored docstring to google style by @number1roy in #532
- [docs] updated docs of hybrid adam and cpu adam by @Gy-Lu in #552
- html refactor by @number1roy in #555
- [doc] polish docstring of zero by @ver217 in #612
- [doc] update rst by @ver217 in #615
- [doc] polish amp docstring by @ver217 in #616
- [doc] polish moe docsrting by @ver217 in #618
- [doc] polish optimizer docstring by @ver217 in #619
- [doc] polish utils docstring by @ver217 in #620
- [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/cuda_util.cu … by @GaryGky in #625
- [doc] polish checkpoint docstring by @ver217 in #637
- update GPT-2 experiment result by @Sze-qq in #666
- [NFC] polish code by @binmakeswell in #646
Miscellaneous
- [logging] polish logger format by @feifeibear in #543
- [profiler] add MemProfiler by @raejaf in #356
- [Bot] Synchronize Submodule References by @github-actions in #501
- [tool] create .clang-format for pre-commit by @BoxiangW in #578
- [GitHub] Add prefix and label in issue template by @binmakeswell in #652
Full Changelog: v0.1.1...v0.1.2
V0.1.1 Released Today!
What's Changed
Features
- [MOE] changed parallelmode to dist process group by @1SAA in #460
- [MOE] redirect moe_env from global_variables to core by @1SAA in #467
- [zero] zero init ctx receives a dp process group by @ver217 in #471
- [zero] ZeRO supports pipeline parallel by @ver217 in #477
- add LinearGate for MOE in NaiveAMP context by @1SAA in #480
- [zero] polish sharded param name by @feifeibear in #484
- [zero] sharded optim support hybrid cpu adam by @ver217 in #486
- [zero] polish sharded optimizer v2 by @ver217 in #490
- [MOE] support PR-MOE by @1SAA in #488
- [zero] sharded model manages ophooks individually by @ver217 in #492
- [MOE] remove old MoE legacy by @1SAA in #493
- [zero] sharded model support the reuse of fp16 shard by @ver217 in #495
- [polish] polish singleton and global context by @feifeibear in #500
- [memory] add model data tensor moving api by @feifeibear in #503
- [memory] set cuda mem frac by @feifeibear in #506
- [zero] use colo model data api in sharded optimv2 by @feifeibear in #511
- [MOE] add MOEGPT model by @1SAA in #510
- [zero] zero init ctx enable rm_torch_payload_on_the_fly by @ver217 in #512
- [zero] show model data cuda memory usage after zero context init. by @feifeibear in #515
- [log] polish disable_existing_loggers by @ver217 in #519
- [zero] add model data tensor inline moving API by @feifeibear in #521
- [cuda] modify the fused adam, support hybrid of fp16 and fp32 by @Gy-Lu in #497
- [zero] refactor model data tracing by @feifeibear in #522
- [zero] added hybrid adam, removed loss scale in adam by @Gy-Lu in #527
Bug Fix
- fix discussion button in issue template by @binmakeswell in #504
- [zero] fix grad offload by @feifeibear in #528
Unit Testing
- [MOE] add unitest for MOE experts layout, gradient handler and kernel by @1SAA in #469
- [test] added rerun on exception for testing by @FrankLeeeee in #475
- [zero] fix init device bug in zero init context unittest by @feifeibear in #516
- [test] fixed rerun_on_exception and adapted test cases by @FrankLeeeee in #487
CI/CD
- [devops] remove tsinghua source for pip by @FrankLeeeee in #505
- [devops] remove tsinghua source for pip by @FrankLeeeee in #507
- [devops] recover tsinghua pip source due to proxy issue by @FrankLeeeee in #509
Documentation
- [doc] update rst by @ver217 in #470
- Update Experiment result about Colossal-AI with ZeRO by @Sze-qq in #479
- [doc] docs get correct release version by @ver217 in #489
- Update README.md by @fastalgo in #514
- [doc] update apidoc by @ver217 in #530
Model Zoo
- [model zoo] fix attn mask shape of gpt by @ver217 in #472
- [model zoo] gpt embedding remove attn mask by @ver217 in #474
Miscellaneous
- [install] run with out rich by @feifeibear in #513
- [refactor] remove old zero code by @feifeibear in #517
- [format] polish name format for MOE by @feifeibear in #481
Full Changelog: v0.1.0...v0.1.1
V0.1.0 Released Today!
Overview
We are happy to release version v0.1.0 today. Compared to the previous version, we have a brand new ZeRO module and have updated many aspects of the system for better performance and usability. The latest version can now be installed with pip install colossalai. We will update our examples and documentation accordingly over the next few days.
Highlights:
Note:
a. Only the major base commits are shown; successive commits that enhance or update a base commit are not listed separately.
b. Some commits do not have an associated pull request ID for unknown reasons.
c. The list is ordered by time.
Features
- add moe context, moe utilities and refactor gradient handler (#455 ) By @1SAA
- [zero] Update initialize for ZeRO (#458 ) By @ver217
- [zero] hybrid cpu adam (#445 ) By @feifeibear
- added Multiply Jitter and capacity factor eval for MOE (#434 ) By @1SAA
- [fp16] refactored fp16 optimizer (#392 ) By @FrankLeeeee
- [zero] memtracer to record cuda memory usage of model data and overall system (#395 ) By @feifeibear
- Added tensor detector (#393 ) By @Gy-Lu
- Added activation offload (#331 ) By @Gy-Lu
- [zero] zero init context collect numel of model (#375 ) By @feifeibear
- Added PCIE profiler to detect data transmission (#373 ) By @1SAA
- Added Profiler Context to manage all profilers (#340 ) By @1SAA
- set criterion as optional in colossalai initialize (#336 ) By @FrankLeeeee
- [zero] Update sharded model v2 using sharded param v2 (#323 ) By @ver217
- [zero] zero init context (#321 ) By @feifeibear
- Added profiler communication operations By @1SAA
- added buffer sync to naive amp model wrapper (#291 ) By @FrankLeeeee
- [zero] cpu adam kernel (#288 ) By @Gy-Lu
- Feature/zero (#279 ) By @feifeibear @FrankLeeeee @ver217
- impl shard optim v2 and add unit test By @ver217
- [profiler] primary memory tracer By @raejaf
- add sharded adam By @ver217
Unit Testing
- [test] fixed amp convergence comparison test (#454 ) By @FrankLeeeee
- [test] optimized zero data parallel test (#452 ) By @FrankLeeeee
- [test] make zero engine test really work (#447 ) By @feifeibear
- optimized context test time consumption (#446 ) By @FrankLeeeee
- [unitest] polish zero config in unittest (#438 ) By @feifeibear
- added testing module (#435 ) By @FrankLeeeee
- [zero] polish ShardedOptimV2 unittest (#385 ) By @feifeibear
- [unit test] Refactored test cases with component func (#339 ) By @FrankLeeeee
Documentation
- [doc] Update docstring for ZeRO (#459 ) By @ver217
- update README and images path (#384 ) By @binmakeswell
- add badge and contributor list By @FrankLeeeee
- add community group and update issue template (#271 ) By @binmakeswell
- update experimental visualization (#253 ) By @Sze-qq
- add Chinese README By @binmakeswell
CI/CD
- update github CI with the current workflow (#441 ) By @FrankLeeeee
- update unit testing CI rules By @FrankLeeeee
- added compatibility CI and options for release ci By @FrankLeeeee
- added pypi publication CI and remove formatting CI By @FrankLeeeee
Bug Fix
- fix gpt attention mask (#461 ) By @ver217
- [bug] Fixed device placement bug in memory monitor thread (#433 ) By @FrankLeeeee
- fixed fp16 optimizer none grad bug (#432 ) By @FrankLeeeee
- fixed gpt attention mask in pipeline (#430 ) By @FrankLeeeee
- [hotfix] fixed bugs in ShardStrategy and PcieProfiler (#394 ) By @1SAA
- fixed bug in activation checkpointing test (#387 ) By @FrankLeeeee
- [profiler] Fixed bugs in CommProfiler and PcieProfiler (#377 ) By @1SAA
- fixed CI dataset directory; fixed import error of 2.5d accuracy (#255 ) By @kurisusnowdeng
- fixed padding index issue for vocab parallel embedding layers; updated 3D linear to be compatible with examples in the tutorial By @kurisusnowdeng
Miscellaneous
- [log] better logging display with rich (#426 ) By @feifeibear
V0.0.2 Released Today!
Change Log
Added
- Unified distributed layers
- MoE support
- DevOps tools such as GitHub Actions, code review automation, etc.
- New project official website
Changed
- Refactored the APIs for usability, flexibility and modularity
- Adapted PyTorch AMP for tensor parallelism
- Refactored utilities for tensor parallelism and pipeline parallelism
- Separated benchmarks and examples into independent repositories
- Updated pipeline parallelism to support both non-interleaved and interleaved schedules
- Refactored installation scripts for convenience
Fixed
- ZeRO level 3 runtime error
- Incorrect calculation in gradient clipping
v0.0.1 Colossal-AI Beta Release
Features
- Data Parallelism
- Pipeline Parallelism (experimental)
- 1D, 2D, 2.5D, 3D and sequence tensor parallelism
- Easy-to-use trainer and engine
- Extensibility for user-defined parallelism
- Mixed Precision Training
- Zero Redundancy Optimizer (ZeRO)