Releases: hpcaitech/ColossalAI
Version v0.1.7 Released Today
Highlights
- Began torch.fx support for auto-parallel training (see the sketch after this list)
- Updated the ZeRO mechanism with ColoTensor
- Fixed various bugs
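The auto-parallel work builds on torch.fx graph capture. Below is a minimal sketch of that tracing step using only vanilla PyTorch APIs; the toy model is illustrative, and ColossalAI's own ColoProxy and auto-parallel passes are not shown here.

```python
# Minimal sketch of torch.fx symbolic tracing, the graph-capture mechanism the
# new auto-parallel passes operate on. Only vanilla PyTorch APIs are used.
import torch
import torch.nn as nn
from torch.fx import symbolic_trace

class TinyMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(16, 32)
        self.fc2 = nn.Linear(32, 8)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

traced = symbolic_trace(TinyMLP())  # capture the forward pass as an fx.GraphModule
print(traced.graph)                 # node-level graph that parallelization passes can rewrite
```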
What's Changed
Hotfix
- [hotfix] prevent nested ZeRO (#1140) by ver217
- [hotfix]fix bugs caused by refactored pipeline (#1133) by YuliangLiu0306
- [hotfix] fix param op hook (#1131) by ver217
- [hotfix] fix zero init ctx numel (#1128) by ver217
- [hotfix]change to fit latest p2p (#1100) by YuliangLiu0306
- [hotfix] fix chunk comm src rank (#1072) by ver217
Zero
- [zero] avoid zero hook spam by changing log to debug level (#1137) by Frank Lee
- [zero] added error message to handle on-the-fly import of torch Module class (#1135) by Frank Lee
- [zero] fixed api consistency (#1098) by Frank Lee
- [zero] zero optim copy chunk rather than copy tensor (#1070) by ver217
DDP
- [ddp] add save/load state dict for ColoDDP (#1127) by ver217
- [ddp] add set_params_to_ignore for ColoDDP (#1122) by ver217
- [ddp] supported customized torch ddp configuration (#1123) by Frank Lee
Pipeline
- [pipeline]support List of Dict data (#1125) by YuliangLiu0306
- [pipeline] supported more flexible dataflow control for pipeline parallel training (#1108) by Frank Lee
- [pipeline] refactor the pipeline module (#1087) by Frank Lee
FX
- [fx]add autoparallel passes (#1121) by YuliangLiu0306
- [fx] added unit test for coloproxy (#1119) by Frank Lee
- [fx] added coloproxy (#1115) by Frank Lee
Gemini
- [gemini] gemini mgr supports "cpu" placement policy (#1118) by ver217
- [gemini] zero supports gemini (#1093) by ver217
Test
- [test] fixed hybrid parallel test case on 8 GPUs (#1106) by Frank Lee
- [test] skip tests when not enough GPUs are detected (#1090) by Frank Lee
- [test] ignore 8 gpu test (#1080) by Frank Lee
Tensor
- [tensor] refactor param op hook (#1097) by ver217
- [tensor] refactor chunk mgr and impl MemStatsCollectorV2 (#1077) by ver217
- [Tensor] fix equal assert (#1091) by Ziyue Jiang
- [Tensor] 1d row embedding (#1075) by Ziyue Jiang
- [tensor] chunk manager monitor mem usage (#1076) by ver217
- [Tensor] fix optimizer for CPU parallel (#1069) by Ziyue Jiang
- [Tensor] add hybrid device demo and fix bugs (#1059) by Ziyue Jiang
Workflow
- [workflow] fixed 8-gpu test workflow (#1101) by Frank Lee
- [workflow] added regular 8 GPU testing (#1099) by Frank Lee
- [workflow] disable p2p via shared memory on non-nvlink machine (#1086) by Frank Lee
Context
- [context] support lazy init of module (#1088) by Frank Lee
- [context] maintain the context object in with statement (#1073) by Frank Lee
Refactory
- [refactory] add nn.parallel module (#1068) by Jiarui Fang
Full Changelog: v0.1.6...v0.1.7
v0.1.6 Released!
Main Features
- ColoTensor supports hybrid parallel (tensor parallel and data parallel)
- ColoTensor supports ZeRO (with chunk)
- Config tensor parallel by module via ColoTensor
- ZeroInitContext and ShardedModelV2 support loading checkpoints and Hugging Face from_pretrained() (see the sketch after this list)
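A minimal sketch of the checkpoint-loading workflow described above: Hugging Face from_pretrained() is called inside a ZeRO init context so parameters are sharded as they are created. The colossalai import paths and the ZeroInitContext arguments are assumptions for illustration; check this version's API reference for the exact signature.

```python
# Hedged sketch: loading a Hugging Face checkpoint under ZeroInitContext.
# The colossalai import paths and constructor arguments below are assumptions,
# not the verified v0.1.6 API.
import torch
from transformers import BertModel
from colossalai.zero.init_ctx import ZeroInitContext          # assumed import path
from colossalai.zero.shard_utils import TensorShardStrategy   # assumed import path

with ZeroInitContext(target_device=torch.device('cuda'),
                     shard_strategy=TensorShardStrategy(),
                     shard_param=True):
    # from_pretrained() runs inside the context, so the full set of weights
    # never needs to reside on a single device at once.
    model = BertModel.from_pretrained('bert-base-uncased')
```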
What's Changed
ColoTensor
- [tensor] refactor colo-tensor by @ver217 in #992
- [tensor] refactor parallel action by @ver217 in #1007
- [tensor] impl ColoDDP for ColoTensor by @ver217 in #1009
- [Tensor] add module handler for linear by @Wesley-Jzy in #1021
- [Tensor] add module check and bert test by @Wesley-Jzy in #1031
- [Tensor] add Parameter inheritance for ColoParameter by @Wesley-Jzy in #1041
- [tensor] ColoTensor supports ZeRO by @ver217 in #1015
- [zero] add chunk size search for chunk manager by @ver217 in #1052
Zero
- [zero] add load_state_dict for sharded model by @ver217 in #894
- [zero] add zero optimizer for ColoTensor by @ver217 in #1046
Hotfix
- [hotfix] fix colo init context by @ver217 in #1026
- [hotfix] fix some bugs caused by size mismatch. by @YuliangLiu0306 in #1011
- [kernel] fixed the include bug in dropout kernel by @FrankLeeeee in #999
- fix typo in constants by @ryanrussell in #1027
- [engine] fixed bug in gradient accumulation dataloader to keep the last step by @FrankLeeeee in #1030
- [hotfix] fix dist spec mgr by @ver217 in #1045
- [hotfix] fix import error in sharded model v2 by @ver217 in #1053
CI
- [ci] update the docker image name by @FrankLeeeee in #1017
- [ci] added nightly build (#1018) by @FrankLeeeee in #1019
- [ci] fixed nightly build workflow by @FrankLeeeee in #1022
- [ci] fixed nightly build workflow by @FrankLeeeee in #1029
- [ci] fixed nightly build workflow by @FrankLeeeee in #1040
CLI
- [cli] remove unused imports by @FrankLeeeee in #1001
Documentation
- Hotfix/format by @binmakeswell in #987
- [doc] update docker instruction by @FrankLeeeee in #1020
Misc
- [NFC] Hotfix/format by @binmakeswell in #984
- Revert "[NFC] Hotfix/format" by @ver217 in #986
- remove useless import in tensor dir by @feifeibear in #997
- [NFC] fix download link by @binmakeswell in #998
- [Bot] Synchronize Submodule References by @github-actions in #1003
- [NFC] polish colossalai/kernel/cuda_native/csrc/colossal_C_frontend.c… by @zhengzangw in #1010
- [NFC] fix paper link by @binmakeswell in #1012
- [p2p]add object list send/recv by @YuliangLiu0306 in #1024
- [Bot] Synchronize Submodule References by @github-actions in #1034
- [NFC] add inference by @binmakeswell in #1044
- [titans]remove model zoo by @YuliangLiu0306 in #1042
- [NFC] add inference submodule in path by @binmakeswell in #1047
- [release] update version.txt by @FrankLeeeee in #1048
- [Bot] Synchronize Submodule References by @github-actions in #1049
- updated collective ops api by @kurisusnowdeng in #1054
- [pipeline]refactor ppschedule to support tensor list by @YuliangLiu0306 in #1050
New Contributors
- @ryanrussell made their first contribution in #1027
Full Changelog: v0.1.5...v0.1.6
v0.1.5 Released!
Main Features
- Enhanced ColoTensor and built a demo that trains BERT (from Hugging Face) with tensor parallelism without modifying the model.
What's Changed
ColoTensor
- [Tensor] add ColoTensor TP1Dcol Embedding by @Wesley-Jzy in #899
- [Tensor] add embedding tp1d row by @Wesley-Jzy in #904
- [Tensor] update pytest.mark.parametrize in tensor tests by @Wesley-Jzy in #913
- [Tensor] init ColoParameter by @feifeibear in #914
- [Tensor] add a basic bert. by @Wesley-Jzy in #911
- [Tensor] polish model test by @feifeibear in #915
- [Tensor] fix test_model by @Wesley-Jzy in #916
- [Tensor] add 1d vocab loss by @Wesley-Jzy in #918
- [Graph] building computing graph with ColoTensor, Linear only by @feifeibear in #917
- [Tensor] add from_pretrained support and bert pretrained test by @Wesley-Jzy in #921
- [Tensor] test pretrain loading on multi-process by @feifeibear in #922
- [tensor] hijack addmm for colo tensor by @ver217 in #923
- [tensor] colo tensor overrides mul by @ver217 in #927
- [Tensor] simplify named param by @Wesley-Jzy in #928
- [Tensor] fix init context by @Wesley-Jzy in #931
- [Tensor] add optimizer to bert test by @Wesley-Jzy in #933
- [tensor] design DistSpec and DistSpecManager for ColoTensor by @ver217 in #934
- [Tensor] add DistSpec for loss and test_model by @Wesley-Jzy in #947
- [tensor] derive compute pattern from dist spec by @ver217 in #971
Pipeline Parallelism
- [pipelinable]use pipelinable to support GPT model. by @YuliangLiu0306 in #903
CI
- [CI] add CI for releasing bdist wheel by @ver217 in #901
- [CI] fix release bdist CI by @ver217 in #902
- [ci] added wheel build scripts by @FrankLeeeee in #910
Misc
- [Bot] Synchronize Submodule References by @github-actions in #907
- [Bot] Synchronize Submodule References by @github-actions in #912
- [setup] update cuda ext cc flags by @ver217 in #919
- [setup] support more cuda architectures by @ver217 in #920
- [NFC] update results on a single GPU, highlight quick view by @binmakeswell in #981
Full Changelog: v0.1.4...v0.1.5
v0.1.4 Released!
Main Features
Here are the main improvements of this release:
- ColoTensor: A data structure that unifies the Tensor representation of different parallel methods.
- Gemini: A more efficient Gemini implementation that reduces the overhead of model data statistics collection.
- CLI: a command-line tool that helps users launch distributed training tasks more easily (a script-side sketch follows this list).
- Pipeline Parallelism (PP): a more user-friendly API for PP.
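For the CLI item above, the launcher ultimately hands control to a Python training script that sets up the distributed environment. The snippet below is a hedged sketch of that script-side pattern; the config contents are placeholders, and the exact CLI subcommand and initialize() return values are assumptions rather than a verified v0.1.4 recipe.

```python
# Hedged sketch of a training script driven by the distributed launcher,
# e.g. something like `colossalai run --nproc_per_node 4 train.py` (the exact
# subcommand is an assumption). Config keys and the model are placeholders.
import torch
import torch.nn as nn
import colossalai

def main():
    # Reads rank/world-size information from the environment prepared by the launcher.
    colossalai.launch_from_torch(config={})

    model = nn.Linear(1024, 1024)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.MSELoss()

    # Wrap the components into an engine that applies the configured parallelism.
    engine, *_ = colossalai.initialize(model=model,
                                       optimizer=optimizer,
                                       criterion=criterion)
    engine.train()

if __name__ == '__main__':
    main()
```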
What's Changed
ColoTensor
- [tensor]fix colo_tensor torch_function by @Wesley-Jzy in #825
- [tensor]fix test_linear by @Wesley-Jzy in #826
- [tensor] ZeRO use ColoTensor as the base class. by @feifeibear in #828
- [tensor] revert zero tensors back by @feifeibear in #829
- [Tensor] overriding parameters() for Module using ColoTensor by @feifeibear in #889
- [tensor] refine linear and add gather for layernorm by @Wesley-Jzy in #893
- [Tensor] test parameters() as member function by @feifeibear in #896
- [Tensor] activation is an attr of ColoTensor by @feifeibear in #897
- [Tensor] initialize the ColoOptimizer by @feifeibear in #898
- [tensor] reorganize files by @feifeibear in #820
- [Tensor] apply ColoTensor on Torch functions by @feifeibear in #821
- [Tensor] update ColoTensor torch_function by @feifeibear in #822
- [tensor] lazy init by @feifeibear in #823
- [WIP] Applying ColoTensor on TP-1D-row Linear. by @feifeibear in #831
- Init Context supports lazily allocating model memory by @feifeibear in #842
- [Tensor] TP Linear 1D row by @Wesley-Jzy in #843
- [Tensor] add assert for colo_tensor 1Drow by @Wesley-Jzy in #846
- [Tensor] init a simple network training with ColoTensor by @feifeibear in #849
- [Tensor ] Add 1Drow weight reshard by spec by @Wesley-Jzy in #854
- [Tensor] add layer norm Op by @feifeibear in #852
- [tensor] an initial idea of tensor spec by @feifeibear in #865
- [Tensor] colo init context add device attr. by @feifeibear in #866
- [tensor] add cross_entropy_loss by @feifeibear in #868
- [Tensor] Add function to spec and update linear 1Drow and unit tests by @Wesley-Jzy in #869
- [tensor] customized op returns ColoTensor by @feifeibear in #875
- [Tensor] get named parameters for model using ColoTensors by @feifeibear in #874
- [Tensor] Add some attributes to ColoTensor by @feifeibear in #877
- [Tensor] make a simple net works with 1D row TP by @feifeibear in #879
- [tensor] wrap function in the torch_tensor to ColoTensor by @Wesley-Jzy in #881
- [Tensor] make ColoTensor more robust for getattr by @feifeibear in #886
- [Tensor] test model check results for a simple net by @feifeibear in #887
- [tensor] add ColoTensor 1Dcol by @Wesley-Jzy in #888
Gemini + ZeRO
- [zero] add zero tensor shard strategy by @1SAA in #793
- Revert "[zero] add zero tensor shard strategy" by @feifeibear in #806
- [gemini] a new tensor structure by @feifeibear in #818
- [gemini] APIs to set cpu memory capacity by @feifeibear in #809
- [DO NOT MERGE] [zero] init fp16 params directly in ZeroInitContext by @ver217 in #808
- [gemini] collect cpu-gpu moving volume in each iteration by @feifeibear in #813
- [gemini] add GeminiMemoryManger by @1SAA in #832
- [zero] use GeminiMemoryManager when sampling model data by @ver217 in #850
- [gemini] polish code by @1SAA in #855
- [gemini] add stateful tensor container by @1SAA in #867
- [gemini] polish stateful_tensor_mgr by @1SAA in #876
- [gemini] accelerate adjust_layout() by @ver217 in #878
CLI
- [cli] added distributed launcher command by @YuliangLiu0306 in #791
- [cli] added micro benchmarking for tp by @YuliangLiu0306 in #789
- [cli] add missing requirement by @FrankLeeeee in #805
- [cli] fixed a bug in user args and refactored the module structure by @FrankLeeeee in #807
- [cli] fixed single-node process launching by @FrankLeeeee in #812
- [cli] added check installation cli by @FrankLeeeee in #815
- [CLI] refactored the launch CLI and fixed bugs in multi-node launching by @FrankLeeeee in #844
- [cli] refactored micro-benchmarking cli and added more metrics by @FrankLeeeee in #858
Pipeline Parallelism
- [pipelinable]use pipelinable context to initialize non-pipeline model by @YuliangLiu0306 in #816
- [pipelinable]use ColoTensor to replace dummy tensor. by @YuliangLiu0306 in #853
Misc
- [hotfix] fix auto tensor placement policy by @ver217 in #775
- [hotfix] change the check assert in split batch 2d by @Wesley-Jzy in #772
- [hotfix] fix bugs in zero by @1SAA in #781
- [hotfix] fix grad offload when enabling reuse_fp16_shard by @ver217 in #784
- [refactor] moving memtracer to gemini by @feifeibear in #801
- [log] display tflops if available by @feifeibear in #802
- [refactor] moving grad acc logic to engine by @feifeibear in #804
- [log] local throughput metrics by @feifeibear in #811
- [Bot] Synchronize Submodule References by @github-actions in #810
- [Bot] Synchronize Submodule References by @github-actions in #819
- [refactor] moving InsertPostInitMethodToModuleSubClasses to utils. by @feifeibear in #824
- [setup] allow installation with python 3.6 by @FrankLeeeee in #834
- Revert "[WIP] Applying ColoTensor on TP-1D-row Linear." by @feifeibear in #835
- [dependency] removed torchvision by @FrankLeeeee in #833
- [Bot] Synchronize Submodule References by @github-actions in #827
- [unittest] refactored unit tests for change in dependency by @FrankLeeeee in #838
- [setup] use env var instead of option for cuda ext by @FrankLeeeee in #839
- [hotfix] ColoTensor pin_memory by @feifeibear in #840
- modified the pp build for ckpt adaptation by @Gy-Lu in #803
- [hotfix] the bug of numel() in ColoTensor by @feifeibear in #845
- [hotfix] fix _post_init_method of zero init ctx by @ver217 in #847
- [hotfix] add deconstructor for stateful tensor by @ver217 in #848
- [utils] refactor profiler by @ver217 in #837
- [ci] cache cuda extension by @FrankLeeeee in #860
- hotfix tensor unittest bugs by @feifeibear in #862
- [usability] added assertion message in registry by @FrankLeeeee in #864
- [doc] improved docstring in the communication module by @FrankLeeeee in #863
- [doc] improved docstring in the logging module by @FrankLeeeee in #861
- [doc] improved docstring in the amp module by @FrankLeeeee in #857
- [usability] improved error messages in the context modu...
V0.1.3 Released!
Overview
Here are the main improvements of this release:
- Gemini: Heterogeneous memory space manager
- Refactor the API of pipeline parallelism
What's Changed
Features
- [zero] initialize a stateful tensor manager by @feifeibear in #614
- [pipeline] refactor pipeline by @YuliangLiu0306 in #679
- [zero] stateful tensor manager by @ver217 in #687
- [zero] adapt zero hooks for unsharded module by @1SAA in #699
- [zero] refactor memstats collector by @ver217 in #706
- [zero] improve adaptability for not-shard parameters by @1SAA in #708
- [zero] check whether gradients have inf and nan in gpu by @1SAA in #712
- [refactor] refactor the memory utils by @feifeibear in #715
- [util] support detection of number of processes on current node by @FrankLeeeee in #723
- [utils] add synchronized cuda memory monitor by @1SAA in #740
- [zero] refactor ShardedParamV2 by @1SAA in #742
- [zero] add tensor placement policies by @ver217 in #743
- [zero] use factory pattern for tensor_placement_policy by @feifeibear in #752
- [zero] refactor memstats_collector by @1SAA in #746
- [gemini] init gemini individual directory by @feifeibear in #754
- refactor shard and gather operation by @1SAA in #773
Bug Fix
- [zero] fix init bugs in zero context by @1SAA in #686
- [hotfix] update requirements-test by @ver217 in #701
- [hotfix] fix a bug in 3d vocab parallel embedding by @kurisusnowdeng in #707
- [compatibility] fixed tensor parallel compatibility with torch 1.9 by @FrankLeeeee in #700
- [hotfix]fixed bugs of assigning grad states to non leaf nodes by @Gy-Lu in #711
- [hotfix] fix stateful tensor manager's cuda model data size by @ver217 in #710
- [bug] fixed broken test_found_inf by @FrankLeeeee in #725
- [util] fixed activation checkpointing on torch 1.9 by @FrankLeeeee in #719
- [util] fixed communication API with PyTorch 1.9 by @FrankLeeeee in #721
- [bug] removed zero installation requirements by @FrankLeeeee in #731
- [hotfix] remove duplicated param register to stateful tensor manager by @feifeibear in #728
- [utils] correct cpu memory used and capacity in the context of multi-process by @feifeibear in #726
- [bug] fixed grad scaler compatibility with torch 1.8 by @FrankLeeeee in #735
- [bug] fixed DDP compatibility with torch 1.8 by @FrankLeeeee in #739
- [hotfix] fix memory leak in backward of sharded model by @ver217 in #741
- [hotfix] fix initialize about zero by @ver217 in #748
- [hotfix] fix prepare grads in sharded optim by @ver217 in #749
- [hotfix] layernorm by @kurisusnowdeng in #750
- [hotfix] fix auto tensor placement policy by @ver217 in #753
- [hotfix] fix reuse_fp16_shard of sharded model by @ver217 in #756
- [hotfix] fix test_stateful_tensor_mgr by @ver217 in #762
- [compatibility] used backward-compatible API for global process group by @FrankLeeeee in #758
- [hotfix] fix the ckpt hook bugs when using DDP by @Gy-Lu in #769
- [hotfix] polish sharded optim docstr and warning by @ver217 in #770
Unit Testing
- [ci] replace the ngc docker image with self-built pytorch image by @FrankLeeeee in #672
- [ci] fixed compatibility workflow by @FrankLeeeee in #678
- [ci] update workflow trigger condition and support options by @FrankLeeeee in #691
- [ci] added missing field in workflow by @FrankLeeeee in #692
- [ci] remove ipc config for rootless docker by @FrankLeeeee in #694
- [test] added missing decorators to model checkpointing tests by @FrankLeeeee in #727
- [unitest] add checkpoint for moe zero test by @1SAA in #729
- [test] added a decorator for address already in use error with backward compatibility by @FrankLeeeee in #760
- [test] refactored with the new rerun decorator by @FrankLeeeee in #763
Documentation
- add PaLM link by @binmakeswell in #704
- [doc] removed outdated installation command by @FrankLeeeee in #730
- add video by @binmakeswell in #732
- [readme] polish readme by @feifeibear in #764
- [readme] sync CN readme by @binmakeswell in #766
Miscellaneous
- [Bot] Synchronize Submodule References by @github-actions in #556
- [Bot] Synchronize Submodule References by @github-actions in #695
- [refactor] zero directory by @feifeibear in #724
- [Bot] Synchronize Submodule References by @github-actions in #751
Full Changelog: v0.1.2...v0.1.3
V0.1.2 Released!
Overview
Here are the main improvements of this release:
- MoE and BERT models can now be trained with ZeRO.
- Provided a uniform checkpoint format for all kinds of parallelism.
- Optimized ZeRO-offload and improved model scaling.
- Designed a uniform model memory tracer.
- Implemented an efficient hybrid Adam with CPU and CUDA kernels (see the sketch after this list).
- Improved activation offloading.
- Released a beta version of the profiler TensorBoard plugin.
- Refactored the pipeline module for closer integration with the engine.
- Added Chinese tutorials as well as WeChat and Slack user groups.
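A small sketch of the hybrid Adam mentioned above: a single optimizer that updates GPU-resident parameters with a fused CUDA kernel and CPU-resident parameters (for example, those offloaded by ZeRO) with a CPU kernel. The import path is taken from later ColossalAI releases and is an assumption here.

```python
# Hedged sketch of HybridAdam usage; the import path is an assumption and the
# model/data are toys.
import torch
import torch.nn as nn
from colossalai.nn.optimizer import HybridAdam  # assumed import path

model = nn.Linear(512, 512).cuda()
optimizer = HybridAdam(model.parameters(), lr=1e-3)

loss = model(torch.randn(16, 512, device='cuda')).sum()
loss.backward()
# GPU parameters are updated by the fused CUDA kernel; parameters kept in host
# memory would fall back to the CPU kernel.
optimizer.step()
optimizer.zero_grad()
```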
What's Changed
Features
- [zero] get memory usage for sharded param by @feifeibear in #536
- [zero] improve the accuracy of get_memory_usage of sharded param by @feifeibear in #538
- [zero] refactor model data tracing by @feifeibear in #537
- [zero] get memory usage of sharded optim v2. by @feifeibear in #542
- [zero] polish ZeroInitContext by @ver217 in #540
- [zero] optimize grad offload by @ver217 in #539
- [zero] non model data tracing by @feifeibear in #545
- [zero] add zero config to neutralize zero context init by @1SAA in #546
- [zero] dump memory stats for sharded model by @feifeibear in #548
- [zero] add stateful tensor by @feifeibear in #549
- [zero] label state for param fp16 and grad by @feifeibear in #551
- [zero] hijack p.grad in sharded model by @ver217 in #554
- [utils] update colo tensor moving APIs by @feifeibear in #553
- [polish] rename col_attr -> colo_attr by @feifeibear in #558
- [zero] trace states of fp16/32 grad and fp32 param by @ver217 in #571
- [zero] adapt zero for unsharded parameters by @1SAA in #561
- [refactor] memory utils by @feifeibear in #577
- Feature/checkpoint gloo by @kurisusnowdeng in #589
- [zero] add sampling time for memstats collector by @Gy-Lu in #610
- [model checkpoint] checkpoint utils by @kurisusnowdeng in #592
- [model checkpoint][hotfix] unified layers for save&load by @kurisusnowdeng in #593
- Feature/checkpoint 2D by @kurisusnowdeng in #595
- Feature/checkpoint 1D by @kurisusnowdeng in #594
- [model checkpoint] CPU communication ops by @kurisusnowdeng in #590
- Feature/checkpoint 2.5D by @kurisusnowdeng in #596
- Feature/Checkpoint 3D by @kurisusnowdeng in #597
- [model checkpoint] checkpoint hook by @kurisusnowdeng in #598
- Feature/Checkpoint tests by @kurisusnowdeng in #599
- [zero] adapt zero for unsharded parameters (Optimizer part) by @1SAA in #601
- [zero] polish init context by @feifeibear in #645
- refactor pipeline---put runtime schedule into engine. by @YuliangLiu0306 in #627
Bug Fix
- [Zero] process no-leaf-module in Zero by @1SAA in #535
- Add gather_out arg to Linear by @Wesley-Jzy in #541
- [hotfix] fix parallel_input flag for Linear1D_Col gather_output by @Wesley-Jzy in #579
- [hotfix] add hybrid adam to init by @ver217 in #584
- Hotfix/path check util by @kurisusnowdeng in #591
- [hotfix] fix sharded optim zero grad by @ver217 in #604
- Add tensor parallel input check by @Wesley-Jzy in #621
- [hotfix] Raise messages for indivisible batch sizes with tensor parallelism by @number1roy in #622
- [zero] fixed the activation offload by @Gy-Lu in #647
- fixed bugs in CPU adam by @1SAA in #633
- Revert "[zero] polish init context" by @feifeibear in #657
- [hotfix] fix a bug in model data stats tracing by @feifeibear in #655
- fix bugs for unsharded parameters when restore data by @1SAA in #664
Unit Testing
- [zero] test zero tensor utils by @FredHuang99 in #609
- remove hybrid adam in test_moe_zero_optim by @1SAA in #659
Documentation
- Refactored docstring to google style by @number1roy in #532
- [docs] updated docs of hybrid adam and cpu adam by @Gy-Lu in #552
- html refactor by @number1roy in #555
- [doc] polish docstring of zero by @ver217 in #612
- [doc] update rst by @ver217 in #615
- [doc] polish amp docstring by @ver217 in #616
- [doc] polish moe docsrting by @ver217 in #618
- [doc] polish optimizer docstring by @ver217 in #619
- [doc] polish utils docstring by @ver217 in #620
- [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/cuda_util.cu … by @GaryGky in #625
- [doc] polish checkpoint docstring by @ver217 in #637
- update GPT-2 experiment result by @Sze-qq in #666
- [NFC] polish code by @binmakeswell in #646
Miscellaneous
- [logging] polish logger format by @feifeibear in #543
- [profiler] add MemProfiler by @raejaf in #356
- [Bot] Synchronize Submodule References by @github-actions in #501
- [tool] create .clang-format for pre-commit by @BoxiangW in #578
- [GitHub] Add prefix and label in issue template by @binmakeswell in #652
Full Changelog: v0.1.1...v0.1.2
V0.1.1 Released Today!
What's Changed
Features
- [MOE] changed parallelmode to dist process group by @1SAA in #460
- [MOE] redirect moe_env from global_variables to core by @1SAA in #467
- [zero] zero init ctx receives a dp process group by @ver217 in #471
- [zero] ZeRO supports pipeline parallel by @ver217 in #477
- add LinearGate for MOE in NaiveAMP context by @1SAA in #480
- [zero] polish sharded param name by @feifeibear in #484
- [zero] sharded optim support hybrid cpu adam by @ver217 in #486
- [zero] polish sharded optimizer v2 by @ver217 in #490
- [MOE] support PR-MOE by @1SAA in #488
- [zero] sharded model manages ophooks individually by @ver217 in #492
- [MOE] remove old MoE legacy by @1SAA in #493
- [zero] sharded model support the reuse of fp16 shard by @ver217 in #495
- [polish] polish singleton and global context by @feifeibear in #500
- [memory] add model data tensor moving api by @feifeibear in #503
- [memory] set cuda mem frac by @feifeibear in #506
- [zero] use colo model data api in sharded optimv2 by @feifeibear in #511
- [MOE] add MOEGPT model by @1SAA in #510
- [zero] zero init ctx enable rm_torch_payload_on_the_fly by @ver217 in #512
- [zero] show model data cuda memory usage after zero context init. by @feifeibear in #515
- [log] polish disable_existing_loggers by @ver217 in #519
- [zero] add model data tensor inline moving API by @feifeibear in #521
- [cuda] modify the fused adam, support hybrid of fp16 and fp32 by @Gy-Lu in #497
- [zero] refactor model data tracing by @feifeibear in #522
- [zero] added hybrid adam, removed loss scale in adam by @Gy-Lu in #527
Bug Fix
- fix discussion button in issue template by @binmakeswell in #504
- [zero] fix grad offload by @feifeibear in #528
Unit Testing
- [MOE] add unitest for MOE experts layout, gradient handler and kernel by @1SAA in #469
- [test] added rerun on exception for testing by @FrankLeeeee in #475
- [zero] fix init device bug in zero init context unittest by @feifeibear in #516
- [test] fixed rerun_on_exception and adapted test cases by @FrankLeeeee in #487
CI/CD
- [devops] remove tsinghua source for pip by @FrankLeeeee in #505
- [devops] remove tsinghua source for pip by @FrankLeeeee in #507
- [devops] recover tsinghua pip source due to proxy issue by @FrankLeeeee in #509
Documentation
- [doc] update rst by @ver217 in #470
- Update Experiment result about Colossal-AI with ZeRO by @Sze-qq in #479
- [doc] docs get correct release version by @ver217 in #489
- Update README.md by @fastalgo in #514
- [doc] update apidoc by @ver217 in #530
Model Zoo
- [model zoo] fix attn mask shape of gpt by @ver217 in #472
- [model zoo] gpt embedding remove attn mask by @ver217 in #474
Miscellaneous
- [install] run with out rich by @feifeibear in #513
- [refactor] remove old zero code by @feifeibear in #517
- [format] polish name format for MOE by @feifeibear in #481
Full Changelog: v0.1.0...v0.1.1
V0.1.0 Released Today!
Overview
We are happy to release version v0.1.0 today. Compared to the previous version, we have a brand new ZeRO module and have updated many aspects of the system for better performance and usability. The latest version can now be installed with pip install colossalai. We will update our examples and documentation accordingly over the next few days.
Highlights:
Note:
a. Only the major base commits are shown; successive commits that enhance or update a base commit are not listed separately.
b. Some commits do not have an associated pull request ID for unknown reasons.
c. The list is ordered by time.
Features
- add moe context, moe utilities and refactor gradient handler (#455 ) By @1SAA
- [zero] Update initialize for ZeRO (#458 ) By @ver217
- [zero] hybrid cpu adam (#445 ) By @feifeibear
- added Multiply Jitter and capacity factor eval for MOE (#434 ) By @1SAA
- [fp16] refactored fp16 optimizer (#392 ) By @FrankLeeeee
- [zero] memtracer to record cuda memory usage of model data and overall system (#395 ) By @feifeibear
- Added tensor detector (#393 ) By @Gy-Lu
- Added activation offload (#331 ) By @Gy-Lu
- [zero] zero init context collect numel of model (#375 ) By @feifeibear
- Added PCIE profiler to detect data transmission (#373 ) By @1SAA
- Added Profiler Context to manage all profilers (#340 ) By @1SAA
- set criterion as optional in colossalai initialize (#336 ) By @FrankLeeeee
- [zero] Update sharded model v2 using sharded param v2 (#323 ) By @ver217
- [zero] zero init context (#321 ) By @feifeibear
- Added profiler communication operations By @1SAA
- added buffer sync to naive amp model wrapper (#291 ) By @FrankLeeeee
- [zero] cpu adam kernel (#288 ) By @Gy-Lu
- Feature/zero (#279 ) By @feifeibear @FrankLeeeee @ver217
- impl shard optim v2 and add unit test By @ver217
- [profiler] primary memory tracer By @raejaf
- add sharded adam By @ver217
Unit Testing
- [test] fixed amp convergence comparison test (#454 ) By @FrankLeeeee
- [test] optimized zero data parallel test (#452 ) By @FrankLeeeee
- [test] make zero engine test really work (#447 ) By @feifeibear
- optimized context test time consumption (#446 ) By @FrankLeeeee
- [unitest] polish zero config in unittest (#438 ) By @feifeibear
- added testing module (#435 ) By @FrankLeeeee
- [zero] polish ShardedOptimV2 unittest (#385 ) By @feifeibear
- [unit test] Refactored test cases with component func (#339 ) By @FrankLeeeee
Documentation
- [doc] Update docstring for ZeRO (#459 ) By @ver217
- update README and images path (#384 ) By @binmakeswell
- add badge and contributor list By @FrankLeeeee
- add community group and update issue template (#271 ) By @binmakeswell
- update experimental visualization (#253 ) By @Sze-qq
- add Chinese README By @binmakeswell
CI/CD
- update github CI with the current workflow (#441 ) By @FrankLeeeee
- update unit testing CI rules By @FrankLeeeee
- added compatibility CI and options for release ci By @FrankLeeeee
- added pypi publication CI and remove formatting CI By @FrankLeeeee
Bug Fix
- fix gpt attention mask (#461 ) By @ver217
- [bug] Fixed device placement bug in memory monitor thread (#433 ) By @FrankLeeeee
- fixed fp16 optimizer none grad bug (#432 ) By @FrankLeeeee
- fixed gpt attention mask in pipeline (#430 ) By @FrankLeeeee
- [hotfix] fixed bugs in ShardStrategy and PcieProfiler (#394 ) By @1SAA
- fixed bug in activation checkpointing test (#387 ) By @FrankLeeeee
- [profiler] Fixed bugs in CommProfiler and PcieProfiler (#377 ) By @1SAA
- fixed CI dataset directory; fixed import error of 2.5d accuracy (#255 ) By @kurisusnowdeng
- fixed padding index issue for vocab parallel embedding layers; updated 3D linear to be compatible with examples in the tutorial By @kurisusnowdeng
Miscellaneous
- [log] better logging display with rich (#426 ) By @feifeibear
V0.0.2 Released Today!
Change Log
Added
- Unified distributed layers
- MoE support
- DevOps tools such as GitHub Actions, code review automation, etc.
- New project official website
Changed
- Refactored the APIs for usability, flexibility and modularity
- Adapted PyTorch AMP for tensor parallelism
- Refactored utilities for tensor parallelism and pipeline parallelism
- Separated benchmarks and examples into independent repositories
- Updated pipeline parallelism to support both non-interleaved and interleaved schedules
- Refactored installation scripts for convenience
Fixed
- ZeRO level 3 runtime error
- Incorrect calculation in gradient clipping
v0.0.1 Colossal-AI Beta Release
Features
- Data Parallelism
- Pipeline Parallelism (experimental)
- 1D, 2D, 2.5D, 3D and sequence tensor parallelism
- Easy-to-use trainer and engine
- Extensibility for user-defined parallelism
- Mixed Precision Training
- Zero Redundancy Optimizer (ZeRO)