Releases · hpcaitech/ColossalAI
Version v0.1.13 Release Today!
What's Changed
Version
- [version] 0.1.13 (#2152) by Jiarui Fang
- Revert "[version] version to v0.1.13 (#2139)" (#2153) by Jiarui Fang
- [version] version to v0.1.13 (#2139) by Jiarui Fang
Gemini
- [Gemini] GeminiDDP convert to PyTorch Module. (#2151) by Jiarui Fang
- [Gemini] Update coloinit_ctx to support meta_tensor (#2147) by BlueRum
- [Gemini] revert ZeROInitCtx related tracer (#2138) by Jiarui Fang
- [Gemini] update API of the chunkmemstatscollector. (#2129) by Jiarui Fang
- [Gemini] update the non model data record method in runtime memory tracer (#2128) by Jiarui Fang
- [Gemini] test step-tensor mapping using repeated_computed_layers.py (#2127) by Jiarui Fang
- [Gemini] update non model data calculation method (#2126) by Jiarui Fang
- [Gemini] hotfix the unittest bugs (#2125) by Jiarui Fang
- [Gemini] mapping of preop timestep and param (#2124) by Jiarui Fang
- [Gemini] chunk init using runtime visited param order (#2115) by Jiarui Fang
- [Gemini] chunk init use OrderedParamGenerator (#2110) by Jiarui Fang
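The Gemini items above (#2151, #2147, #2115) all touch the path where a model is built under a Colossal-AI init context and then handed to the Gemini wrapper, which after #2151 behaves like an ordinary torch.nn.Module. A minimal sketch of that flow, assuming 0.1.13-era import paths and constructor arguments that may differ in the release you install:

```python
import torch
import colossalai
from colossalai.utils.model.colo_init_context import ColoInitContext  # assumed path
from colossalai.nn.parallel import GeminiDDP                          # assumed path

colossalai.launch_from_torch(config={})  # expects a torchrun-style distributed launch

# Build the model under ColoInitContext so its parameters are created as
# ColoParameters that Gemini's chunk manager can take over (#2147, #2115).
with ColoInitContext(device=torch.device("cuda")):
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096),
        torch.nn.GELU(),
        torch.nn.Linear(4096, 1024),
    )

# After #2151 the Gemini wrapper behaves like a regular torch.nn.Module.
model = GeminiDDP(model, device=torch.device("cuda"), placement_policy="auto")
out = model(torch.randn(8, 1024, device="cuda"))
# In this era the backward pass is usually driven through the Gemini/ZeRO
# optimizer (e.g. optimizer.backward(loss)) rather than a bare loss.backward().
```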
Nfc
- [NFC] remove useless graph node code (#2150) by Jiarui Fang
- [NFC] update chunk manager API (#2119) by Jiarui Fang
- [NFC] polish comments for Chunk class (#2116) by Jiarui Fang
Autoparallel
- [autoparallel] process size nodes in runtime pass (#2130) by YuliangLiu0306
- [autoparallel] implement softmax handler (#2132) by YuliangLiu0306
- [autoparallel] gpt2lp runtime test (#2113) by YuliangLiu0306
Example
- Merge pull request #2120 from Fazziekey/example/stablediffusion-v2 by Fazzie-Maqianli
Optimizer
Pp middleware
- [PP Middleware] Add bwd and step for PP middleware (#2111) by Ziyue Jiang
Full Changelog: v0.1.13...v0.1.12
Version v0.1.12 Release Today!
What's Changed
Zero
Gemini
- [gemini] get the param visited order during runtime (#2108) by Jiarui Fang
- [Gemini] NFC, polish search_chunk_configuration (#2107) by Jiarui Fang
- [Gemini] gemini use the runtime memory tracer (RMT) (#2099) by Jiarui Fang
- [Gemini] make RuntimeMemTracer work correctly (#2096) by Jiarui Fang
- [Gemini] remove eval in gemini unittests! (#2092) by Jiarui Fang
- [Gemini] remove GLOBAL_MODEL_DATA_TRACER (#2091) by Jiarui Fang
- [Gemini] remove GLOBAL_CUDA_MEM_INFO (#2090) by Jiarui Fang
- [Gemini] use MemStats in Runtime Memory tracer (#2088) by Jiarui Fang
- [Gemini] use MemStats to store the tracing data. Separate it from Collector. (#2084) by Jiarui Fang
- [Gemini] remove static tracer (#2083) by Jiarui Fang
- [Gemini] ParamOpHook -> ColoParamOpHook (#2080) by Jiarui Fang
- [Gemini] polish runtime tracer tests (#2077) by Jiarui Fang
- [Gemini] rename hooks related to runtime mem tracer (#2076) by Jiarui Fang
- [Gemini] add albert in test models. (#2075) by Jiarui Fang
- [Gemini] rename ParamTracerWrapper -> RuntimeMemTracer (#2073) by Jiarui Fang
- [Gemini] remove not used MemtracerWrapper (#2072) by Jiarui Fang
- [Gemini] fix grad unreleased issue and param recovery issue (#2052) by Zihao
Hotfix
- [hotfix] fix a type in ColoInitContext (#2106) by Jiarui Fang
- [hotfix] update test for latest version (#2060) by YuliangLiu0306
- [hotfix] skip gpt tracing test (#2064) by YuliangLiu0306
Colotensor
- [ColoTensor] throw error when ColoInitContext meets meta parameter. (#2105) by Jiarui Fang
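#2105 and the ColoInitContext hotfix above (#2106) concern how parameters get materialized inside the init context. A rough illustration, with the import path being an assumption for the 0.1.12-era layout:

```python
import torch
from colossalai.utils.model.colo_init_context import ColoInitContext  # assumed path

# Parameters created inside the context are materialized as ColoParameters on
# the requested device.
with ColoInitContext(device=torch.device("cpu")):
    model = torch.nn.Linear(512, 512)

print(type(model.weight), model.weight.device)

# After #2105 the context is expected to raise a clear error if it meets a
# parameter on the meta device instead of silently producing unusable tensors.
```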
Autoparallel
- [autoparallel] support linear function bias addition (#2104) by YuliangLiu0306
- [autoparallel] support addbmm computation (#2102) by YuliangLiu0306
- [autoparallel] add sum handler (#2101) by YuliangLiu0306
- [autoparallel] add bias addition function class (#2098) by YuliangLiu0306
- [autoparallel] complete gpt related module search (#2097) by YuliangLiu0306
- [autoparallel]add embedding handler (#2089) by YuliangLiu0306
- [autoparallel] add tensor constructor handler (#2082) by YuliangLiu0306
- [autoparallel] add non_split linear strategy (#2078) by YuliangLiu0306
- [autoparallel] Add F.conv metainfo (#2069) by Boyuan Yao
- [autoparallel] complete gpt block searching (#2065) by YuliangLiu0306
- [autoparallel] add binary elementwise metainfo for auto parallel (#2058) by Boyuan Yao
- [autoparallel] fix forward memory calculation (#2062) by Boyuan Yao
- [autoparallel] adapt solver with self attention (#2037) by YuliangLiu0306
Version
- [version] 0.1.11rc5 -> 0.1.12 (#2103) by Jiarui Fang
Pipeline middleware
- [Pipeline Middleware] fix data race in Pipeline Scheduler for DAG (#2087) by Ziyue Jiang
- [Pipeline Middleware] Adapt scheduler for Topo (#2066) by Ziyue Jiang
Fx
- [fx] An experimental version of ColoTracer. (#2002) by Super Daniel
Example
Device
- [device] update flatten device mesh usage (#2079) by YuliangLiu0306
Test
- [test] bert test in non-distributed way (#2074) by Jiarui Fang
Pipeline
- [Pipeline] Add Topo Class (#2059) by Ziyue Jiang
Examples
- [examples] update autoparallel demo (#2061) by YuliangLiu0306
Full Changelog: v0.1.12...v0.1.11rc5
Version v0.1.11rc5 Release Today!
What's Changed
Release
Cli
Gemini
- [gemini] fix init bugs for modules (#2047) by HELSON
- [gemini] add arguments (#2046) by HELSON
- [Gemini] free and allocate cuda memory by tensor.storage, add grad hook (#2040) by Zihao
- [Gemini] more tests for Gemini (#2038) by Jiarui Fang
- [Gemini] more rigorous unit tests for run_fwd_bwd (#2034) by Jiarui Fang
- [Gemini] paramWrapper paramTracerHook unitest (#2030) by Zihao
- [Gemini] patch for supporting torch.add_ function for ColoTensor (#2003) by Jiarui Fang
- [gemini] param_trace_hook (#2020) by Zihao
- [Gemini] add unitests to check gemini correctness (#2015) by Jiarui Fang
- [Gemini] ParamMemHook (#2008) by Zihao
- [Gemini] param_tracer_wrapper and test case (#2009) by Zihao
Setup
Test
- [test] align model name with the file name. (#2045) by Jiarui Fang
Hotfix
- [hotfix] hotfix Gemini for no leaf modules bug (#2043) by Jiarui Fang
- [hotfix] add bert test for gemini fwd bwd (#2035) by Jiarui Fang
- [hotfix] revert bug PRs (#2016) by Jiarui Fang
Zero
- [zero] fix testing parameters (#2042) by HELSON
- [zero] fix unit-tests (#2039) by HELSON
- [zero] test gradient accumulation (#1964) by HELSON
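The behaviour exercised by the gradient-accumulation test in #1964 is the usual accumulate-then-step pattern; in plain PyTorch, independent of ZeRO, the pattern being verified looks roughly like this:

```python
import torch

model = torch.nn.Linear(64, 64)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
accum_steps = 4  # micro-batches accumulated per optimizer step

for step, batch in enumerate(torch.randn(16, 8, 64).unbind(0)):
    loss = model(batch).pow(2).mean()
    # Scale so the accumulated gradient matches one large batch.
    (loss / accum_steps).backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```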
Testing
Rpc
- [rpc] split with dag (#2028) by Ziyue Jiang
Autoparallel
- [autoparallel] add split handler (#2032) by YuliangLiu0306
- [autoparallel] add experimental permute handler (#2029) by YuliangLiu0306
- [autoparallel] add runtime pass and numerical test for view handler (#2018) by YuliangLiu0306
- [autoparallel] add experimental view handler (#2011) by YuliangLiu0306
- [autoparallel] mix gather (#1977) by Genghan Zhang
Fx
- [fx]Split partition with DAG information (#2025) by Ziyue Jiang
Github
- [GitHub] update issue template (#2023) by binmakeswell
Workflow
Full Changelog: v0.1.11rc5...v0.1.11rc4
Version v0.1.11rc4 Release Today!
What's Changed
Workflow
- [workflow] fixed the python and cpu arch mismatch (#2010) by Frank Lee
- [workflow] fixed the typo in condarc (#2006) by Frank Lee
- [workflow] added conda cache and fixed no-compilation bug in release (#2005) by Frank Lee
Gemini
- [Gemini] add an inline_op_module to common test models and polish unitests. (#2004) by Jiarui Fang
- [Gemini] open grad checkpoint when model building (#1984) by Jiarui Fang
- [Gemini] add bert for MemtracerWrapper unittests (#1982) by Jiarui Fang
- [Gemini] MemtracerWrapper unittests (#1981) by Jiarui Fang
- [Gemini] memory trace hook (#1978) by Jiarui Fang
- [Gemini] independent runtime tracer (#1974) by Jiarui Fang
- [Gemini] ZeROHookV2 -> GeminiZeROHook (#1972) by Jiarui Fang
- [Gemini] clean no used MemTraceOp (#1970) by Jiarui Fang
- [Gemini] polish memstats collector (#1962) by Jiarui Fang
- [Gemini] add GeminiAdamOptimizer (#1960) by Jiarui Fang
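#1960 adds GeminiAdamOptimizer as a one-stop optimizer for Gemini-wrapped models. A hedged sketch; the import path and keyword arguments are assumptions and may differ across 0.1.x releases:

```python
import torch
from colossalai.nn.optimizer.gemini_optimizer import GeminiAdamOptimizer  # assumed path


def build_optimizer(gemini_model: torch.nn.Module):
    # Bundles HybridAdam with the ZeRO optimizer wrapper, so the training loop
    # only deals with a single optimizer object.
    return GeminiAdamOptimizer(gemini_model, lr=1e-3)

# Note: in several 0.1.x versions the backward pass for a Gemini model goes
# through optimizer.backward(loss) rather than loss.backward().
```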
Autoparallel
- [autoparallel] Add metainfo support for F.linear (#1987) by Boyuan Yao
- [autoparallel] use pytree map style to process data (#1989) by YuliangLiu0306
- [autoparallel] adapt handlers with attention block (#1990) by YuliangLiu0306
- [autoparallel] support more flexible data type (#1967) by YuliangLiu0306
- [autoparallel] add pooling metainfo (#1968) by Boyuan Yao
- [autoparallel] support distributed dataloader option (#1906) by YuliangLiu0306
- [autoparallel] Add alpha beta (#1973) by Genghan Zhang
- [autoparallel] add torch.nn.ReLU metainfo (#1868) by Boyuan Yao
- [autoparallel] support addmm in tracer and solver (#1961) by YuliangLiu0306
- [autoparallel] remove redundant comm node (#1893) by YuliangLiu0306
Fx
- [fx] add more meta_registry for MetaTensor execution. (#2000) by Super Daniel
Hotfix
- [hotfix] make Gemini work for conv DNN (#1998) by Jiarui Fang
Example
- [example] add diffusion inference (#1986) by Fazzie-Maqianli
- [example] enhance GPT demo (#1959) by Jiarui Fang
- [example] add vit (#1942) by Jiarui Fang
Kernel
Polish
- [polish] remove useless file _mem_tracer_hook.py (#1963) by Jiarui Fang
Zero
Colotensor
- [ColoTensor] reconfig ColoInitContext, decouple default_pg and default_dist_spec. (#1953) by Jiarui Fang
- [ColoTensor] ColoInitContext initialize parameters in shard mode. (#1937) by Jiarui Fang
Tutorial
- [tutorial] polish all README (#1946) by binmakeswell
- [tutorial] added missing dummy dataloader (#1944) by Frank Lee
- [tutorial] fixed pipeline bug for sequence parallel (#1943) by Frank Lee
Tensorparallel
Sc demo
- [sc demo] add requirements to spmd README (#1941) by YuliangLiu0306
Sc
- [SC] remove redundant hands on (#1939) by Boyuan Yao
Full Changelog: v0.1.11rc4...v0.1.11rc3
Version v0.1.11rc3 Release Today!
What's Changed
Release
Tutorial
- [tutorial] polish README and OPT files (#1930) by binmakeswell
- [tutorial] add synthetic dataset for opt (#1924) by ver217
- [tutorial] updated hybrid parallel readme (#1928) by Frank Lee
- [tutorial] added synthetic data for sequence parallel (#1927) by Frank Lee
- [tutorial] removed huggingface model warning (#1925) by Frank Lee
- Hotfix/tutorial readme index (#1922) by Frank Lee
- [tutorial] modify hands-on of auto activation checkpoint (#1920) by Boyuan Yao
- [tutorial] added synthetic data for hybrid parallel (#1921) by Frank Lee
- [tutorial] added synthetic data for hybrid parallel (#1919) by Frank Lee
- [tutorial] added synthetic dataset for auto parallel demo (#1918) by Frank Lee
- [tutorial] updated auto parallel demo with latest data path (#1917) by Frank Lee
- [tutorial] added data script and updated readme (#1916) by Frank Lee
- [tutorial] add cifar10 for diffusion (#1907) by binmakeswell
- [tutorial] removed duplicated tutorials (#1904) by Frank Lee
- [tutorial] edited hands-on practices (#1899) by BoxiangW
Example
- [example] update auto_parallel img path (#1910) by binmakeswell
- [example] add cifar10 dataset for diffusion (#1902) by Fazzie-Maqianli
- [example] migrate diffusion and auto_parallel hands-on (#1871) by binmakeswell
- [example] initialize tutorial (#1865) by binmakeswell
- Merge pull request #1842 from feifeibear/jiarui/polish by Fazzie-Maqianli
- [example] polish diffusion readme by jiaruifang
Sc
- [SC] add GPT example for auto checkpoint (#1889) by Boyuan Yao
- [sc] add examples for auto checkpoint. (#1880) by Super Daniel
Nfc
- [NFC] polish colossalai/amp/naive_amp/init.py code style (#1905) by Junming Wu
- [NFC] remove redundant dependency (#1869) by binmakeswell
- [NFC] polish .github/workflows/scripts/build_colossalai_wheel.py code style (#1856) by yuxuan-lou
- [NFC] polish .github/workflows/scripts/generate_release_draft.py code style (#1855) by Ofey Chan
- [NFC] polish workflows code style (#1854) by Kai Wang (Victor Kai)
- [NFC] polish colossalai/amp/apex_amp/init.py code style (#1853) by LuGY
- [NFC] polish .readthedocs.yaml code style (#1852) by nuszzh
- [NFC] polish <.github/workflows/release_nightly.yml> code style (#1851) by RichardoLuo
- [NFC] polish amp.naive_amp.grad_scaler code style by zbian
- [NFC] polish colossalai/auto_parallel/tensor_shard/deprecated/op_handler/operator_handler.py code style (#1845) by HELSON
- [NFC] polish ./colossalai/amp/torch_amp/init.py code style (#1836) by Genghan Zhang
- [NFC] polish .github/workflows/build.yml code style (#1837) by xyupeng
- [NFC] polish colossalai/auto_parallel/tensor_shard/deprecated/op_handler/conv_handler.py code style (#1829) by Sze-qq
- [NFC] polish colossalai/amp/torch_amp/_grad_scaler.py code style (#1823) by Ziyue Jiang
- [NFC] polish .github/workflows/release_docker.yml code style by Maruyama_Aya
- [NFC] polish .github/workflows/submodule.yml code style (#1822) by shenggan
- [NFC] polish .github/workflows/draft_github_release_post.yml code style (#1820) by Arsmart1
- [NFC] polish colossalai/amp/naive_amp/_fp16_optimizer.py code style (#1819) by Fazzie-Maqianli
- [NFC] polish colossalai/amp/naive_amp/_utils.py code style (#1816) by CsRic
- [NFC] polish .github/workflows/build_gpu_8.yml code style (#1813) by Zangwei Zheng
- [NFC] polish MANIFEST.in code style (#1814) by Zirui Zhu
- [NFC] polish strategies_constructor.py code style (#1806) by binmakeswell
Doc
- [doc] add news (#1901) by binmakeswell
Zero
Autoparallel
- [autoparallel] user-friendly API for CheckpointSolver. (#1879) by Super Daniel
- [autoparallel] fix linear logical convert issue (#1857) by YuliangLiu0306
Fx
- [fx] metainfo_trace as an API. (#1873) by Super Daniel
Hotfix
- [hotfix] pass test_complete_workflow (#1877) by Jiarui Fang
Inference
- [inference] overlap comm and compute in Linear1D_Row when stream_chunk_num > 1 (#1876) by Jiarui Fang
- [inference] streaming Linear 1D Row inference (#1874) by Jiarui Fang
Amp
Diffusion
Utils
- [utils] fixed lazy init context (#1867) by Frank Lee
- [utils] remove lazy_memory_allocate from ColoInitContext (#1844) by Jiarui Fang
Full Changelog: v0.1.11rc3...v0.1.11rc2
Version v0.1.11rc2 Release Today!
What's Changed
Autoparallel
- [autoparallel] fix bugs caused by negative dim key (#1808) by YuliangLiu0306
- [autoparallel] fix bias addition module (#1800) by YuliangLiu0306
- [autoparallel] add batch norm metainfo (#1815) by Boyuan Yao
- [autoparallel] add conv metainfo class for auto parallel (#1796) by Boyuan Yao
- [autoparallel] add essential CommActions for broadcast operands (#1793) by YuliangLiu0306
- [autoparallel] refactor and add rotorc. (#1789) by Super Daniel
- [autoparallel] add getattr handler (#1767) by YuliangLiu0306
- [autoparallel] added matmul handler (#1763) by Frank Lee
- [autoparallel] fix conv handler numerical test (#1771) by YuliangLiu0306
- [autoparallel] move ckpt solvers to autoparallel folder / refactor code (#1764) by Super Daniel
- [autoparallel] add numerical test for handlers (#1769) by YuliangLiu0306
- [autoparallel] update CommSpec to CommActions (#1768) by YuliangLiu0306
- [autoparallel] add numerical test for node strategies (#1760) by YuliangLiu0306
- [autoparallel] refactor the runtime apply pass and add docstring to passes (#1757) by YuliangLiu0306
- [autoparallel] added binary elementwise node handler (#1758) by Frank Lee
- [autoparallel] fix param hook issue in transform pass (#1755) by YuliangLiu0306
- [autoparallel] added addbmm handler (#1751) by Frank Lee
- [autoparallel] shard param and buffer as expected (#1753) by YuliangLiu0306
- [autoparallel] add sequential order to communication actions (#1735) by YuliangLiu0306
- [autoparallel] recovered skipped test cases (#1748) by Frank Lee
- [autoparallel] fixed wrong sharding strategy in conv handler (#1747) by Frank Lee
- [autoparallel] fixed wrong generated strategy for dot op (#1746) by Frank Lee
- [autoparallel] handled illegal sharding strategy in shape consistency (#1744) by Frank Lee
- [autoparallel] handled illegal strategy in node handler (#1743) by Frank Lee
- [autoparallel] handled illegal sharding strategy (#1728) by Frank Lee
Kernel
- [kernel] added jit warmup (#1792) by アマデウス
- [kernel] more flexible flashatt interface (#1804) by oahzxl
- [kernel] skip tests of flash_attn and triton when they are not available (#1798) by Jiarui Fang
Gemini
- [Gemini] make gemini usage simple (#1821) by Jiarui Fang
Checkpointio
Doc
- [doc] polish diffusion README (#1840) by binmakeswell
- [doc] remove obsolete API demo (#1833) by binmakeswell
- [doc] add diffusion (#1827) by binmakeswell
- [doc] add FastFold (#1766) by binmakeswell
Example
- [example] remove useless readme in diffusion (#1831) by Jiarui Fang
- [example] add TP to GPT example (#1828) by Jiarui Fang
- [example] add stable diffuser (#1825) by Fazzie-Maqianli
- [example] simplify the GPT2 huggingface example (#1826) by Jiarui Fang
- [example] opt does not depend on Titans (#1811) by Jiarui Fang
- [example] add GPT by Jiarui Fang
- [example] add opt model in language (#1809) by Jiarui Fang
- [example] add diffusion to example (#1805) by Jiarui Fang
Nfc
- [NFC] update gitignore remove DS_Store (#1830) by Jiarui Fang
- [NFC] polish type hint for shape consistency (#1801) by Jiarui Fang
- [NFC] polish tests/test_layers/test_3d/test_3d.py code style (#1740) by Ziheng Qin
- [NFC] polish tests/test_layers/test_3d/checks_3d/common.py code style (#1733) by lucasliunju
- [NFC] polish colossalai/nn/metric/_utils.py code style (#1727) by Sze-qq
- [NFC] polish tests/test_layers/test_3d/checks_3d/check_layer_3d.py code style (#1731) by Xue Fuzhao
- [NFC] polish tests/test_layers/test_sequence/checks_seq/check_layer_seq.py code style (#1723) by xyupeng
- [NFC] polish accuracy_2d.py code style (#1719) by Ofey Chan
- [NFC] polish .github/workflows/scripts/build_colossalai_wheel.py code style (#1721) by Arsmart1
- [NFC] polish _checkpoint_hook.py code style (#1722) by LuGY
- [NFC] polish test_2p5d/checks_2p5d/check_operation_2p5d.py code style (#1718) by Kai Wang (Victor Kai)
- [NFC] polish colossalai/zero/sharded_param/init.py code style (#1717) by CsRic
- [NFC] polish colossalai/nn/lr_scheduler/linear.py code style (#1716) by yuxuan-lou
- [NFC] polish tests/test_layers/test_2d/checks_2d/check_operation_2d.py code style (#1715) by binmakeswell
- [NFC] polish colossalai/nn/metric/accuracy_2p5d.py code style (#1714) by shenggan
Fx
- [fx] add a symbolic_trace api. (#1812) by Super Daniel
- [fx] skip diffusers unitest if it is not installed (#1799) by Jiarui Fang
- [fx] Add linear metainfo class for auto parallel (#1783) by Boyuan Yao
- [fx] support module with bias addition (#1780) by YuliangLiu0306
- [fx] refactor memory utils and extend shard utils. (#1754) by Super Daniel
- [fx] test tracer on diffuser modules. (#1750) by Super Daniel
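#1812 exposes a symbolic_trace entry point on top of the ColoTracer work listed above. A minimal sketch, assuming it is exported as colossalai.fx.symbolic_trace and accepts a meta_args mapping (both assumptions):

```python
import torch
from colossalai.fx import symbolic_trace  # assumed export

model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU())

# meta_args lets tracing run shape propagation on meta tensors instead of real
# data, which is how shape-dependent modules become traceable.
gm = symbolic_trace(model, meta_args={"input": torch.rand(4, 16, device="meta")})
print(gm.graph)
```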
Hotfix
- [hotfix] fix build error when torch version >= 1.13 (#1803) by xcnick
- [hotfix] polish flash attention (#1802) by oahzxl
- [hotfix] fix zero's incompatibility with checkpoint in torch-1.12 (#1786) by HELSON
- [hotfix] polish chunk import (#1787) by Jiarui Fang
- [hotfix] autoparallel unit test (#1752) by YuliangLiu0306
Pipeline
- [Pipeline]Adapt to Pipelinable OPT (#1782) by Ziyue Jiang
Ci
- [CI] downgrade fbgemm. (#1778) by Super Daniel
Compatibility
- [compatibility] ChunkMgr import error (#1772) by Jiarui Fang
Feat
Fx/profiler
- [fx/profiler] debug the fx.profiler / add an example test script for fx.profiler (#1730) by Super Daniel
Workflow
Full Changelog: v0.1.11rc2...v0.1.11rc1
Version v0.1.11rc1 Release Today!
What's Changed
Hotfix
- [hotfix] resharding cost issue (#1742) by YuliangLiu0306
- [hotfix] solver bug caused by dict type comm cost (#1686) by YuliangLiu0306
- [hotfix] fix wrong type name in profiler (#1678) by Boyuan Yao
- [hotfix]unit test (#1670) by YuliangLiu0306
- [hotfix] add recompile after graph manipulation (#1621) by YuliangLiu0306
- [hotfix] got sliced types (#1614) by YuliangLiu0306
Release
Doc
- [doc] update recommendation system catalogue (#1732) by binmakeswell
- [doc] update recommendation system urls (#1725) by Jiarui Fang
Zero
- [zero] add chunk init function for users (#1729) by HELSON
- [zero] add constant placement policy (#1705) by HELSON
Pre-commit
Autoparallel
- [autoparallel] runtime_backward_apply (#1720) by YuliangLiu0306
- [autoparallel] moved tests to test_tensor_shard (#1713) by Frank Lee
- [autoparallel] resnet block runtime apply (#1709) by YuliangLiu0306
- [autoparallel] fixed broken node handler tests (#1708) by Frank Lee
- [autoparallel] refactored the autoparallel module for organization (#1706) by Frank Lee
- [autoparallel] adapt runtime passes (#1703) by YuliangLiu0306
- [autoparallel] collated all deprecated files (#1700) by Frank Lee
- [autoparallel] init new folder structure (#1696) by Frank Lee
- [autoparallel] adapt solver and CostGraph with new handler (#1695) by YuliangLiu0306
- [autoparallel] add output handler and placeholder handler (#1694) by YuliangLiu0306
- [autoparallel] add pooling handler (#1690) by YuliangLiu0306
- [autoparallel] where_handler_v2 (#1688) by YuliangLiu0306
- [autoparallel] fix C version rotor inconsistency (#1691) by Boyuan Yao
- [autoparallel] added sharding spec conversion for linear handler (#1687) by Frank Lee
- [autoparallel] add reshape handler v2 and fix some previous bug (#1683) by YuliangLiu0306
- [autoparallel] add unary element wise handler v2 (#1674) by YuliangLiu0306
- [autoparallel] add following node generator (#1673) by YuliangLiu0306
- [autoparallel] add layer norm handler v2 (#1671) by YuliangLiu0306
- [autoparallel] fix insecure subprocess (#1680) by Boyuan Yao
- [autoparallel] add rotor C version (#1658) by Boyuan Yao
- [autoparallel] added utils for broadcast operation (#1665) by Frank Lee
- [autoparallel] update CommSpec (#1667) by YuliangLiu0306
- [autoparallel] added bias comm spec to matmul strategy (#1664) by Frank Lee
- [autoparallel] add batch norm handler v2 (#1666) by YuliangLiu0306
- [autoparallel] remove no strategy nodes (#1652) by YuliangLiu0306
- [autoparallel] added compute resharding costs for node handler (#1662) by Frank Lee
- [autoparallel] added new strategy constructor template (#1661) by Frank Lee
- [autoparallel] added node handler for bmm (#1655) by Frank Lee
- [autoparallel] add conv handler v2 (#1663) by YuliangLiu0306
- [autoparallel] adapt solver with gpt (#1653) by YuliangLiu0306
- [autoparallel] implemented all matmul strategy generator (#1650) by Frank Lee
- [autoparallel] change the following nodes strategies generation logic (#1636) by YuliangLiu0306
- [autoparallel] where handler (#1651) by YuliangLiu0306
- [autoparallel] implemented linear projection strategy generator (#1639) by Frank Lee
- [autoparallel] adapt solver with mlp (#1638) by YuliangLiu0306
- [autoparallel] Add pofo sequence annotation (#1637) by Boyuan Yao
- [autoparallel] add elementwise handler (#1622) by YuliangLiu0306
- [autoparallel] add embedding handler (#1620) by YuliangLiu0306
- [autoparallel] protect bcast handler from invalid strategies (#1631) by YuliangLiu0306
- [autoparallel] add layernorm handler (#1629) by YuliangLiu0306
- [autoparallel] recover the merged node strategy index (#1613) by YuliangLiu0306
- [autoparallel] added new linear module handler (#1616) by Frank Lee
- [autoparallel] added new node handler (#1612) by Frank Lee
- [autoparallel]add bcast matmul strategies (#1605) by YuliangLiu0306
- [autoparallel] refactored the data structure for sharding strategy (#1610) by Frank Lee
- [autoparallel] add bcast op handler (#1600) by YuliangLiu0306
- [autoparallel] added all non-bcast matmul strategies (#1603) by Frank Lee
- [autoparallel] added strategy generator and bmm strategies (#1602) by Frank Lee
- [autoparallel] add reshape handler (#1594) by YuliangLiu0306
- [autoparallel] refactored shape consistency to remove redundancy (#1591) by Frank Lee
- [autoparallel] add resnet autoparallel unit test and add backward weight communication cost (#1589) by YuliangLiu0306
- [autoparallel] added generate_sharding_spec to utils (#1590) by Frank Lee
- [autoparallel] added solver option dataclass (#1588) by Frank Lee
- [autoparallel] adapt solver with resnet (#1583) by YuliangLiu0306
Fx/meta/rpc
- [fx/meta/rpc] move _meta_registration.py to fx folder / register fx functions with compatibility checks / remove color debug (#1710) by Super Daniel
Embeddings
- [embeddings] add doc in readme (#1711) by Jiarui Fang
- [embeddings] more detailed timer (#1692) by Jiarui Fang
- [embeddings] cache option (#1635) by Jiarui Fang
- [embeddings] use cache_ratio instead of cuda_row_num (#1611) by Jiarui Fang
- [embeddings] add already_split_along_rank flag for tablewise mode (#1584) by CsRic
Unittest
- [unittest] added doc for the pytest wrapper (#1704) by Frank Lee
- [unittest] supported conditional testing based on env var (#1701) by Frank Lee
Embedding
- [embedding] rename FreqAwareEmbedding -> CachedEmbedding (#1699) by Jiarui Fang
- [embedding] polish async copy (#1657) by Jiarui Fang
- [embedding] add more detail profiling (#1656) by Jiarui Fang
- [embedding] print profiling results (#1654) by Jiarui Fang
- [embedding] non-blocking cpu-gpu copy (#1647) by Jiarui Fang
- [embedding] isolate cache_op from forward (#1645) by CsRic
- [embedding] rollback for better FAW performance (#1625) by Jiarui Fang
- [embedding] updates some default parameters by Jiarui Fang
Fx/profiler
- [fx/profiler] assigned UUID to each unrecorded tensor/ improved performance on GPT-2 (#1679) by Super Daniel
- [fx/profiler] provide a table of sum...
Version v0.1.10 Release Today!
What's Changed
Embedding
- [embedding] cache_embedding small improvement (#1564) by CsRic
- [embedding] polish parallel embedding tablewise (#1545) by Jiarui Fang
- [embedding] freq_aware_embedding: add small functions for caller application (#1537) by CsRic
- [embedding] fix a bug in table wise sharding (#1538) by Jiarui Fang
- [embedding] tablewise sharding polish (#1535) by Jiarui Fang
- [embedding] add tablewise sharding for FAW (#1526) by CsRic
Nfc
- [NFC] polish test component gpt code style (#1567) by アマデウス
- [NFC] polish doc style for ColoTensor (#1457) by Jiarui Fang
- [NFC] global vars should be upper case (#1456) by Jiarui Fang
Pipeline/tuning
- [pipeline/tuning] improve dispatch performance both time and space cost (#1544) by Kirigaya Kazuto
Fx
- [fx] provide a stable but not accurate enough version of profiler. (#1547) by Super Daniel
- [fx] Add common node in model linearize (#1542) by Boyuan Yao
- [fx] support meta tracing for aten level computation graphs like functorch. (#1536) by Super Daniel
- [fx] Modify solver linearize and add corresponding test (#1531) by Boyuan Yao
- [fx] add test for meta tensor. (#1527) by Super Daniel
- [fx]patch nn.functional convolution (#1528) by YuliangLiu0306
- [fx] Fix wrong index in annotation and minimal flops in ckpt solver (#1521) by Boyuan Yao
- [fx] hack torch_dispatch for meta tensor and autograd. (#1515) by Super Daniel
- [fx] Fix activation codegen dealing with checkpointing first op (#1510) by Boyuan Yao
- [fx] fix the discretize bug (#1506) by Boyuan Yao
- [fx] fix wrong variable name in solver rotor (#1502) by Boyuan Yao
- [fx] Add activation checkpoint solver rotor (#1496) by Boyuan Yao
- [fx] add more op patches for profiler and error message for unsupported ops. (#1495) by Super Daniel
- [fx] fixed adaptive pooling size concatenation error (#1489) by Frank Lee
- [fx] add profiler for fx nodes. (#1480) by Super Daniel
- [fx] Fix ckpt functions' definitions in forward (#1476) by Boyuan Yao
- [fx] fix MetaInfoProp for incorrect calculations and add detections for inplace op. (#1466) by Super Daniel
- [fx] add rules to linearize computation graphs for searching. (#1461) by Super Daniel
- [fx] Add use_reentrant=False to checkpoint in codegen (#1463) by Boyuan Yao
- [fx] fix test and algorithm bugs in activation checkpointing. (#1451) by Super Daniel
- [fx] Use colossalai checkpoint and add offload recognition in codegen (#1439) by Boyuan Yao
- [fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174. (#1446) by Super Daniel
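Several checkpointing items above (#1463, #1439, #1510) revolve around emitting torch.utils.checkpoint calls in generated code. The non-reentrant variant referenced in #1463 is plain PyTorch and can be tried in isolation:

```python
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(torch.nn.Linear(128, 128), torch.nn.GELU())
x = torch.randn(4, 128, requires_grad=True)

# use_reentrant=False picks the non-reentrant implementation, which recomputes
# activations via saved-tensor hooks and composes better with autograd features
# such as torch.autograd.grad and keyword arguments.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```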
Autoparallel
- [autoparallel]add backward cost info into strategies (#1524) by YuliangLiu0306
- [autoparallel] support function in operator handler (#1529) by YuliangLiu0306
- [autoparallel] change the merge node logic (#1533) by YuliangLiu0306
- [autoparallel] added liveness analysis (#1516) by Frank Lee
- [autoparallel] add more sharding strategies to conv (#1487) by YuliangLiu0306
- [autoparallel] add cost graph class (#1481) by YuliangLiu0306
- [autoparallel] added namespace constraints (#1490) by Frank Lee
- [autoparallel] integrate auto parallel with torch fx (#1479) by Frank Lee
- [autoparallel] added dot handler (#1475) by Frank Lee
- [autoparallel] introduced baseclass for op handler and reduced code redundancy (#1471) by Frank Lee
- [autoparallel] standardize the code structure (#1469) by Frank Lee
- [autoparallel] Add conv handler to generate strategies and costs info for conv (#1467) by YuliangLiu0306
Utils
- [utils] refactor parallel layers checkpoint and bcast model on loading checkpoint (#1548) by ver217
- [utils] optimize partition_tensor_parallel_state_dict (#1546) by ver217
- [utils] Add use_reentrant=False in utils.activation_checkpoint (#1460) by Boyuan Yao
- [utils] Impl clip_grad_norm for ColoTensor and ZeroOptimizer (#1442) by ver217
Hotfix
- [hotfix] change namespace for meta_trace. (#1541) by Super Daniel
- [hotfix] fix init context (#1543) by ver217
- [hotfix] avoid conflict of meta registry with torch 1.13.0. (#1530) by Super Daniel
- [hotfix] fix coloproxy typos. (#1519) by Super Daniel
Pipeline/pipeline_process_group
- [pipeline/pipeline_process_group] finish PipelineProcessGroup to manage local and global rank in TP, DP and PP (#1508) by Kirigaya Kazuto
Doc
- [doc] docstring for FreqAwareEmbeddingBag (#1525) by Jiarui Fang
- [doc] update readme with the new xTrimoMultimer project (#1477) by Sze-qq
- [doc] update docstring in ProcessGroup (#1468) by Jiarui Fang
- [Doc] add more doc for ColoTensor. (#1458) by Jiarui Fang
Autoparallel
- [autoparallel] add strategies constructor (#1505) by YuliangLiu0306
Faw
- [FAW] cpu caching operations (#1520) by Jiarui Fang
- [FAW] refactor reorder() for CachedParamMgr (#1514) by Jiarui Fang
- [FAW] LFU initialize with dataset freq (#1513) by Jiarui Fang
- [FAW] shrink freq_cnter size (#1509) by CsRic
- [FAW] remove code related to chunk (#1501) by Jiarui Fang
- [FAW] add more docs and fix a warning (#1500) by Jiarui Fang
- [FAW] FAW embedding use LRU as eviction strategy initialized with dataset stats (#1494) by CsRic
- [FAW] LFU cache for the FAW by CsRic
- [FAW] init an LFU implementation for FAW (#1488) by Jiarui Fang
- [FAW] reorganize the inheritance struct of FreqCacheEmbedding (#1448) by Geng Zhang
Pipeline/rpc
- [pipeline/rpc] update outstanding mechanism | optimize dispatching strategy (#1497) by Kirigaya Kazuto
- [pipeline/rpc] implement distributed optimizer | test with assert_close (#1486) by Kirigaya Kazuto
- [pipeline/rpc] support interleaving | fix checkpoint bug | change logic when dispatch data in work_list to ensure steady 1F1B (#1483) by Kirigaya Kazuto
- [pipeline/rpc] implement a demo for PP with cuda rpc framework (#1470) by Kirigaya Kazuto
Tensor
- [tensor]add 1D device mesh (#1492) by YuliangLiu0306
- [tensor] support runtime ShardingSpec apply (#1453) by YuliangLiu0306
- [tensor] shape consistency generate transform path and communication cost (#1435) by YuliangLiu0306
- [tensor] added linear implementation for the new sharding spec (#1416) by Frank Lee
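The tensor items above (#1492, #1453, #1435, #1416) build on a logical device mesh plus a sharding spec that records how each tensor dimension is partitioned over mesh axes. A very rough sketch; the class locations and constructor signatures are assumptions for the 0.1.10-era code and may not match exactly:

```python
import torch
from colossalai.device.device_mesh import DeviceMesh      # assumed path
from colossalai.tensor.sharding_spec import ShardingSpec  # assumed path

# Four physical devices arranged as a 2x2 logical mesh (#1492 additionally
# allows a 1D mesh).
physical_mesh_id = torch.arange(4)
device_mesh = DeviceMesh(physical_mesh_id, mesh_shape=(2, 2))

# Shard dim 0 of an (8, 16) tensor along mesh axis 0; dim 1 stays replicated.
spec = ShardingSpec(device_mesh, entire_shape=(8, 16), dim_partition_dict={0: [0]})
print(spec.sharding_sequence)  # assumed attribute, e.g. [S0, R]
```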
Fce
- [FCE] update interface for frequency statistics in FreqCacheEmbedding (#1462) by Geng Zhang
Workflow
Test
Engine/schedule
- [engine/schedule] use p2p_v2 to ...
Version v0.1.9 Release Today!
What's Changed
Zero
- [zero] add chunk_managerV2 for all-gather chunk (#1441) by HELSON
- [zero] add chunk size searching algorithm for parameters in different groups (#1436) by HELSON
- [zero] add has_inf_or_nan in AgChunk; enhance the unit test of AgChunk (#1426) by HELSON
- [zero] add unit test for AgChunk's append, close, access (#1423) by HELSON
- [zero] add AgChunk (#1417) by HELSON
- [zero] ZeroDDP supports controlling outputs' dtype (#1399) by ver217
- [zero] alleviate memory usage in ZeRODDP state_dict (#1398) by HELSON
- [zero] chunk manager allows filtering ex-large params (#1393) by ver217
- [zero] zero optim state_dict takes only_rank_0 (#1384) by ver217
Fx
- [fx] add vanilla activation checkpoint search with test on resnet and densenet (#1433) by Super Daniel
- [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages (#1425) by Super Daniel
- [fx] fixed torchaudio conformer tracing (#1392) by Frank Lee
- [fx] patched torch.max and data movement operator (#1391) by Frank Lee
- [fx] fixed indentation error in checkpointing codegen (#1385) by Frank Lee
- [fx] patched torch.full for huggingface opt (#1386) by Frank Lee
- [fx] update split module pass and add customized policy (#1373) by YuliangLiu0306
- [fx] add torchaudio test (#1369) by Super Daniel
- [fx] Add colotracer compatibility test on torchrec (#1370) by Boyuan Yao
- [fx]add gpt2 passes for pipeline performance test (#1366) by YuliangLiu0306
- [fx] added activation checkpoint codegen support for torch < 1.12 (#1359) by Frank Lee
- [fx] added activation checkpoint codegen (#1355) by Frank Lee
- [fx] fixed apex normalization patch exception (#1352) by Frank Lee
- [fx] added activation checkpointing annotation (#1349) by Frank Lee
- [fx] update MetaInfoProp pass to process more complex node.meta (#1344) by YuliangLiu0306
- [fx] refactor tracer to trace complete graph (#1342) by YuliangLiu0306
- [fx] tested the complete workflow for auto-parallel (#1336) by Frank Lee
- [fx]refactor tracer (#1335) by YuliangLiu0306
- [fx] recovered skipped pipeline tests (#1338) by Frank Lee
- [fx] fixed compatibility issue with torch 1.10 (#1331) by Frank Lee
- [fx] fixed unit tests for torch 1.12 (#1327) by Frank Lee
- [fx] add balanced policy v2 (#1251) by YuliangLiu0306
- [fx] Add unit test and fix bugs for transform_mlp_pass (#1299) by XYE
- [fx] added apex normalization to patched modules (#1300) by Frank Lee
Recommendation System
- [FAW] export FAW in _ops (#1438) by Jiarui Fang
- [FAW] move coloparam setting in test code. (#1429) by Jiarui Fang
- [FAW] parallel FreqAwareEmbedding (#1424) by Jiarui Fang
- [FAW] add cache manager for the cached embedding (#1419) by Jiarui Fang
Global Tensor
- [tensor] add shape consistency feature to support auto spec transform (#1418) by YuliangLiu0306
- [tensor]build sharding spec to replace distspec in future. (#1405) by YuliangLiu0306
Hotfix
- [hotfix] zero optim prevents calling inner optim.zero_grad (#1422) by ver217
- [hotfix] fix CPUAdam kernel nullptr (#1410) by ver217
- [hotfix] adapt ProcessGroup and Optimizer to ColoTensor (#1388) by HELSON
- [hotfix] fix a running error in test_colo_checkpoint.py (#1387) by HELSON
- [hotfix] fix some bugs during gpt2 testing (#1379) by YuliangLiu0306
- [hotfix] fix zero optim save/load state dict (#1381) by ver217
- [hotfix] fix zero ddp buffer cast (#1376) by ver217
- [hotfix] fix no optimizer in save/load (#1363) by HELSON
- [hotfix] fix megatron_init in test_gpt2.py (#1357) by HELSON
- [hotfix] ZeroDDP use new process group (#1333) by ver217
- [hotfix] shared model returns cpu state_dict (#1328) by ver217
- [hotfix] fix ddp for unit test test_gpt2 (#1326) by HELSON
- [hotfix] fix unit test test_module_spec (#1321) by HELSON
- [hotfix] fix PipelineSharedModuleGradientHandler (#1314) by ver217
- [hotfix] fix ColoTensor GPT2 unitest (#1309) by HELSON
- [hotfix] add missing file (#1308) by Jiarui Fang
- [hotfix] remove potential circular import (#1307) by Jiarui Fang
- [hotfix] skip some unittest due to CI environment. (#1301) by YuliangLiu0306
- [hotfix] fix shape error in backward when using ColoTensor (#1298) by HELSON
- [hotfix] Dist Mgr gather torch version (#1284) by Jiarui Fang
Communication
- [communication] add p2p_v2.py to support communication with List[Any] (#1407) by Kirigaya Kazuto
Device
- [device] add DeviceMesh class to support logical device layout (#1394) by YuliangLiu0306
Chunk
- [chunk] add PG check for tensor appending (#1383) by Jiarui Fang
DDP
Checkpoint
- [checkpoint] add kwargs for load_state_dict (#1374) by HELSON
- [checkpoint] use args, kwargs in save_checkpoint, load_checkpoint (#1368) by HELSON
- [checkpoint] sharded optim save/load grad scaler (#1350) by ver217
- [checkpoint] use gather_tensor in checkpoint and update its unit test (#1339) by HELSON
- [checkpoint] add ColoOptimizer checkpointing (#1316) by Jiarui Fang
- [checkpoint] add test for bert and hotfix save bugs (#1297) by Jiarui Fang
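#1368 and #1374 extend the save_checkpoint / load_checkpoint helpers so extra args and kwargs are forwarded down to state_dict / load_state_dict. A hedged usage sketch; the exact signatures and export location are assumptions:

```python
import torch
from colossalai.utils import save_checkpoint, load_checkpoint  # assumed exports

model = torch.nn.Linear(32, 32)
optimizer = torch.optim.Adam(model.parameters())

# Extra positional/keyword arguments are passed through to state_dict /
# load_state_dict after #1368/#1374 (e.g. strict=False on load).
save_checkpoint("ckpt.pt", 0, model, optimizer=optimizer)
load_checkpoint("ckpt.pt", model, optimizer=optimizer, strict=False)
```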
Util
Nvme
Colotensor
- [colotensor] use cpu memory to store state_dict (#1367) by HELSON
- [colotensor] add Tensor.view op and its unit test (#1343) by HELSON
Unit test
Docker
Doc
Refactor
Workflow
- [workflow] update docker build workflow to use proxy (#1334) by Frank Lee
- [workflow] update 8-gpu test to use torch 1.11 (#1332) by Frank Lee
- [workflow] roll back to use torch 1.11 for unit testing (#1325) by Frank Lee
- [workflow] fixed trigger condition for 8-gpu unit test (#1323) by Frank Lee
- [workflow] updated release bdist workflow (#1318) by Frank Lee
- [workflow] disable SHM for compatibility CI on rtx3080 (#1315) by Frank Lee
- [workflow] updated pytorch compatibility test (#1311) by Frank Lee
Test
- [test] removed outdated unit test for meta context (#1329) by Frank Lee
Version v0.1.8 Release Today!
What's Changed
Hotfix
- [hotfix] torchvision fx unittests miss import pytest (#1277) by Jiarui Fang
- [hotfix] fix an assertion bug in base schedule. (#1250) by YuliangLiu0306
- [hotfix] fix sharded optim step and clip_grad_norm (#1226) by ver217
- [hotfix] fx get comm size bugs (#1233) by Jiarui Fang
- [hotfix] fx shard 1d pass bug fixing (#1220) by Jiarui Fang
- [hotfix]fixed p2p process send stuck (#1181) by YuliangLiu0306
- [hotfix]different overflow status lead to communication stuck. (#1175) by YuliangLiu0306
- [hotfix]fix some bugs caused by refactored schedule. (#1148) by YuliangLiu0306
Tensor
- [tensor] distributed checkpointing for parameters (#1240) by Jiarui Fang
- [tensor] redistribute among different process groups (#1247) by Jiarui Fang
- [tensor] a shorter shard and replicate spec (#1245) by Jiarui Fang
- [tensor] redirect .data.get to a tensor instance (#1239) by HELSON
- [tensor] add zero_like colo op, important for Optimizer (#1236) by Jiarui Fang
- [tensor] fix some unittests (#1234) by Jiarui Fang
- [tensor] fix an assertion in colo_tensor cross_entropy (#1232) by HELSON
- [tensor] add unitest for colo_tensor 1DTP cross_entropy (#1230) by HELSON
- [tensor] torch function return colotensor (#1229) by Jiarui Fang
- [tensor] improve robustness of class 'ProcessGroup' (#1223) by HELSON
- [tensor] sharded global process group (#1219) by Jiarui Fang
- [Tensor] add cpu group to ddp (#1200) by Jiarui Fang
- [tensor] remove gpc in tensor tests (#1186) by Jiarui Fang
- [tensor] revert local view back (#1178) by Jiarui Fang
- [Tensor] rename some APIs in TensorSpec and Polish view unittest (#1176) by Jiarui Fang
- [Tensor] rename parallel_action (#1174) by Ziyue Jiang
- [Tensor] distributed view supports inter-process hybrid parallel (#1169) by Jiarui Fang
- [Tensor] remove ParallelAction, use ComputeSpec instead (#1166) by Jiarui Fang
- [tensor] add embedding bag op (#1156) by ver217
- [tensor] add more element-wise ops (#1155) by ver217
- [tensor] fixed non-serializable colo parameter during model checkpointing (#1153) by Frank Lee
- [tensor] dist spec s2s uses all-to-all (#1136) by ver217
- [tensor] added repr to spec (#1147) by Frank Lee
Fx
- [fx] added ndim property to proxy (#1253) by Frank Lee
- [fx] fixed tracing with apex-based T5 model (#1252) by Frank Lee
- [fx] refactored the file structure of patched function and module (#1238) by Frank Lee
- [fx] methods to get fx graph property. (#1246) by YuliangLiu0306
- [fx]add split module pass and unit test from pipeline passes (#1242) by YuliangLiu0306
- [fx] fixed huggingface OPT and T5 results misalignment (#1227) by Frank Lee
- [fx]get communication size between partitions (#1224) by YuliangLiu0306
- [fx] added patches for tracing swin transformer (#1228) by Frank Lee
- [fx] fixed timm tracing result misalignment (#1225) by Frank Lee
- [fx] added timm model tracing testing (#1221) by Frank Lee
- [fx] added torchvision model tracing testing (#1216) by Frank Lee
- [fx] temporarily used (#1215) by XYE
- [fx] added testing for all albert variants (#1211) by Frank Lee
- [fx] added testing for all gpt variants (#1210) by Frank Lee
- [fx]add uniform policy (#1208) by YuliangLiu0306
- [fx] added testing for all bert variants (#1207) by Frank Lee
- [fx] supported model tracing for huggingface bert (#1201) by Frank Lee
- [fx] added module patch for pooling layers (#1197) by Frank Lee
- [fx] patched conv and normalization (#1188) by Frank Lee
- [fx] supported data-dependent control flow in model tracing (#1185) by Frank Lee
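The fx items above culminate in tracing real HuggingFace models (#1201, #1185) despite data-dependent control flow. A hedged sketch of what that looks like with ColoTracer; the import path and the meta_args keyword are assumptions for this early version:

```python
import torch
from transformers import BertConfig, BertModel
from colossalai.fx import ColoTracer  # assumed export

model = BertModel(BertConfig(num_hidden_layers=2))
tracer = ColoTracer()
# meta_args (an assumption here) supplies fake, meta-device inputs so the
# tracer can resolve shape- and data-dependent branches without real data.
graph = tracer.trace(
    model,
    meta_args={"input_ids": torch.zeros(1, 128, dtype=torch.long, device="meta")},
)
gm = torch.fx.GraphModule(model, graph)
print(gm.code[:400])
```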
Rename
- [rename] convert_to_dist -> redistribute (#1243) by Jiarui Fang
Checkpoint
- [checkpoint] save sharded optimizer states (#1237) by Jiarui Fang
- [checkpoint]support generalized scheduler (#1222) by Yi Zhao
- [checkpoint] make unitest faster (#1217) by Jiarui Fang
- [checkpoint] checkpoint for ColoTensor Model (#1196) by Jiarui Fang
Polish
Refactor
- [refactor] move process group from _DistSpec to ColoTensor. (#1203) by Jiarui Fang
- [refactor] remove gpc dependency in colotensor's _ops (#1189) by Jiarui Fang
- [refactor] move chunk and chunkmgr to directory gemini (#1182) by Jiarui Fang
Context
- [context] support arbitrary module materialization. (#1193) by YuliangLiu0306
- [context]use meta tensor to init model lazily. (#1187) by YuliangLiu0306
Ddp
- [ddp] ColoDDP uses bucket all-reduce (#1177) by ver217
- [ddp] refactor ColoDDP and ZeroDDP (#1146) by ver217
Colotensor
- [ColoTensor] add independent process group (#1179) by Jiarui Fang
- [ColoTensor] rename APIs and add output_replicate to ComputeSpec (#1168) by Jiarui Fang
- [ColoTensor] improves init functions. (#1150) by Jiarui Fang
Zero
- [zero] sharded optim supports loading local state dict (#1170) by ver217
- [zero] zero optim supports loading local state dict (#1171) by ver217
Workflow
- [workflow] polish readme and dockerfile (#1165) by Frank Lee
- [workflow] auto-publish docker image upon release (#1164) by Frank Lee
- [workflow] fixed release post workflow (#1154) by Frank Lee
- [workflow] fixed format error in yaml file (#1145) by Frank Lee
- [workflow] added workflow to auto draft the release post (#1144) by Frank Lee
Gemini
Pipeline
- [pipeline]add customized policy (#1139) by YuliangLiu0306
- [pipeline]support more flexible pipeline (#1138) by YuliangLiu0306
Ci
Full Changelog: v0.1.8...v0.1.7