deepspeedai / DeepSpeed Public

Issues: deepspeedai/DeepSpeed

[Roadmap] DeepSpeed Roadmap Q1 2025

#6946 opened Jan 13, 2025 by loadams

Open

279 Open 547 Closed

Issues list

[BUG] mpi based training error bug

Something isn't working

training

#6997 opened Feb 4, 2025 by cyr0930

[BUG] loading model error bug

training

#6994 opened Feb 3, 2025 by tengwang0318

model.parameters() return [Parameter containing: tensor([], device='cuda:0', dtype=torch.bfloat16, requires_grad=True)] when using zero3 bug

training

#6987 opened Jan 31, 2025 by fanfanffff1

[BUG] Invalidate trace cache warning bug

training

#6985 opened Jan 30, 2025 by leachim

[BUG] pdsh runner doesn't work with tqdm bar bug

training

#6978 opened Jan 29, 2025 by Superskyyy

[BUG] libaio on amd node bug

training

#6972 opened Jan 25, 2025 by GuanhuaWang

[BUG] the input variables may be changed to scalars when use activation checkpoint bug

training

#6969 opened Jan 23, 2025 by zhangvia

[BUG] z3+compile+gradient checkpoint uses more memory bug

training

#6966 opened Jan 22, 2025 by oraluben

[BUG] model(**input) cannot use under zero stage 3. bug

training

#6949 opened Jan 14, 2025 by MarkDeng1

[BUG] deepspeed.initialize changes the output of Llama model bug

training

#6929 opened Jan 7, 2025 by Ktakuya332C

[BUG]Zero++ training failed bug

training

#6926 opened Jan 6, 2025 by HelloWorld506

[BUG] Cannot access local variable 'locations' where it is not associated with a value bug

training

#6913 opened Dec 25, 2024 by Guodanding

[BUG]Convergence Issue: Training BERT for Embedding with Zero2 and 3 as compared to Torchrun bug

training

#6911 opened Dec 24, 2024 by dawnik17

Using zero3 on multiple nodes is slow bug

training

#6889 opened Dec 18, 2024 by HelloWorld506

[BUG] Cannot use --hostfile to start multi-node training in Docker. bug

training

#6875 opened Dec 16, 2024 by Ind1x1

[BUG] Invalidate trace cache @ step 10: expected module 11, but got module 19 bug

training

#6870 opened Dec 14, 2024 by yafuly

[BUG] Mismatch of model parameters when using Sequence Parallel bug

training

#6868 opened Dec 13, 2024 by chetwin-character

[BUG]When fine-tuning an LLM, the following error occurs after training for some time: self.optimizer.param_groups[param_group_id]['params'] = [] IndexError: list index out of range bug

training

#6857 opened Dec 12, 2024 by tdtgi

DeepSpeed with trl bug

training

#6852 opened Dec 11, 2024 by sagie-dekel

[BUG] DeepSpeed accuracy issue for torch.compile if activation checkpoint function not compiler disabled bug

training

#6811 opened Dec 1, 2024 by NirSonnenschein

[BUG] Getting "SymIntArrayRef expected to contain only concrete integers" error when > 1 GPU bug

training

#6806 opened Nov 28, 2024 by rileyhun

[BUG] clip_grad_norm for zero_optimization mode is not working bug

training

#6767 opened Nov 20, 2024 by chengmengli06

[BUG]NCCL operation timeout when training with deepspeed_zero3_offload or deepspeed_zero3 on RTX4090 bug

training

#6756 opened Nov 18, 2024 by MLS2021

[BUG] max_grad_norm not effect bug

training

#6743 opened Nov 12, 2024 by yiyepiaoling0715

[BUG] Zero3 for torch.compile with compiled_autograd when running LayerNorm bug

training

#6719 opened Nov 6, 2024 by yitingw1
