Closed
Path to v1.2.0
- Ascend NPU Support: Context parallelism, FLUX/Qwen-Image @DefTruth @gameofdimension feat: support ascend npu #651, feat: add abstract platform #653
- Introduce accumulated_rel_l1_diff to reduce accumulated cache error, as used in the official TeaCache and EasyCache. feat: support step compute mask #444
- Introduce LeMiCa/EasyCache style custom step compute mask, e.g., "111110100100000100000010000001", where 1 means full compute and 0 means dynamic/static cache (hybrid with an autotune function) @DefTruth feat: support step compute mask #444
- Context Parallelism for any tokens (any resolution, any prompt tokens) @DefTruth UAA: ulysses anything attn w/ zero overhead #462
- Support All Gather for any tokens (any resolution, any prompt tokens), for UAA @DefTruth feat: support unshard anything for UAA #465
- Optimize the performance of UAA when using torch.compile (due to the graph break introduced by the if branch) feat: allow UAA in compiled graph #474
- Parallelize VAE @DefTruth @tingkuanpei feat: support 🔥vae parallelism #645
- Parallelize Text Encoder @gameofdimension @DefTruth feat: support TP for many text encoder #569
- Manual compute and communication overlap (attention level or model level) for Ulysses and UAA, e.g., AsyncUlyssesQKVProj @tingkuanpei @DefTruth
- Cache and Parallelism support for HunyuanVideo-1.5, FLUX.2, Z-Image @DefTruth @gameofdimension
- Fused Per Tensor FP8 All2All via Triton/CUDA kernel @DefTruth @triple-mu feat: support per_token_quant_fp8 triton kernel #524
- Any Head num support for Ulysses, e.g., Z-Image @DefTruth
- More CIs @DefTruth
- official readthedocs.io
- Performance benchmark, NVIDIA A800, L20, NPU, etc. @DefTruth docs: update nvidia gpu benchmark #684
- GPU CIs: model tests ci: add basic gpu ci tests #688
- mkdocs CIs: check mkdocs build --strict @DefTruth CI: add check-mkdocs ci #680
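
For readers unfamiliar with the step compute mask and accumulated_rel_l1_diff items above, here is a minimal sketch of how the two ideas could combine. It is not this project's API: `parse_step_mask`, `rel_l1_diff`, `plan_steps`, and the `threshold` parameter are hypothetical names, and the assumption is that a `1` forces a full compute while a `0` defers to a TeaCache-style rule (reuse the cache while the accumulated relative L1 difference stays below a threshold, then recompute and reset the accumulator).

```python
from typing import List, Sequence


def parse_step_mask(mask: str) -> List[bool]:
    """Turn a mask string such as "1101" into per-step flags (True = full compute)."""
    if not mask or set(mask) - {"0", "1"}:
        raise ValueError("mask must be a non-empty string of '0' and '1'")
    return [ch == "1" for ch in mask]


def rel_l1_diff(prev: Sequence[float], curr: Sequence[float], eps: float = 1e-8) -> float:
    """Relative L1 distance between two feature vectors: mean|curr - prev| / mean|prev|."""
    num = sum(abs(c - p) for p, c in zip(prev, curr)) / len(prev)
    den = sum(abs(p) for p in prev) / len(prev)
    return num / (den + eps)


def plan_steps(mask: str, features: Sequence[Sequence[float]], threshold: float) -> List[str]:
    """Decide "compute" or "cache" for each step (hypothetical policy, see lead-in).

    mask bit 1 -> always compute.
    mask bit 0 -> accumulate the relative L1 difference between consecutive
                  steps; reuse the cache while the sum stays below `threshold`,
                  otherwise compute and reset the accumulator.
    """
    flags = parse_step_mask(mask)
    decisions: List[str] = []
    accumulated = 0.0
    prev = None
    for flag, feat in zip(flags, features):
        if flag or prev is None:
            decisions.append("compute")
            accumulated = 0.0
        else:
            accumulated += rel_l1_diff(prev, feat)
            if accumulated < threshold:
                decisions.append("cache")
            else:
                decisions.append("compute")
                accumulated = 0.0
        prev = feat
    return decisions


if __name__ == "__main__":
    # Toy per-step features standing in for real model inputs.
    feats = [[1.0, 1.0], [1.1, 1.1], [1.2, 1.2], [2.0, 2.0]]
    print(plan_steps("1001", feats, threshold=0.15))
```

Accumulating the difference across skipped steps, rather than comparing only adjacent steps, is what keeps a long run of small changes from silently drifting away from the last fully computed result.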