Issues: deepspeedai/DeepSpeed
[BUG] mpi based training error
bug
Something isn't working
training
#6997
opened Feb 4, 2025 by
cyr0930
[BUG] loading model error
bug
Something isn't working
training
#6994
opened Feb 3, 2025 by
tengwang0318
model.parameters() return [Parameter containing: tensor([], device='cuda:0', dtype=torch.bfloat16, requires_grad=True)] when using zero3
bug
Something isn't working
training
#6987
opened Jan 31, 2025 by
fanfanffff1
[BUG] Invalidate trace cache warning
bug
Something isn't working
training
#6985
opened Jan 30, 2025 by
leachim
[BUG] pdsh runner doesn't work with tqdm bar
bug
Something isn't working
training
#6978
opened Jan 29, 2025 by
Superskyyy
[BUG] libaio on amd node
bug
Something isn't working
training
#6972
opened Jan 25, 2025 by
GuanhuaWang
[BUG] the input variables may be changed to scalars when use activation checkpoint
bug
Something isn't working
training
#6969
opened Jan 23, 2025 by
zhangvia
[BUG] z3+compile+gradient checkpoint uses more memory
bug
Something isn't working
training
#6966
opened Jan 22, 2025 by
oraluben
[BUG] model(**input) cannot use under zero stage 3.
bug
Something isn't working
training
#6949
opened Jan 14, 2025 by
MarkDeng1
[BUG] deepspeed.initialize changes the output of Llama model
bug
Something isn't working
training
#6929
opened Jan 7, 2025 by
Ktakuya332C
[BUG]Zero++ training failed
bug
Something isn't working
training
#6926
opened Jan 6, 2025 by
HelloWorld506
[BUG] Cannot access local variable 'locations' where it is not associated with a value
bug
Something isn't working
training
#6913
opened Dec 25, 2024 by
Guodanding
[BUG]Convergence Issue: Training BERT for Embedding with Zero2 and 3 as compared to Torchrun
bug
Something isn't working
training
#6911
opened Dec 24, 2024 by
dawnik17
Using zero3 on multiple nodes is slow
bug
Something isn't working
training
#6889
opened Dec 18, 2024 by
HelloWorld506
[BUG] Cannot use --hostfile to start multi-node training in Docker.
bug
Something isn't working
training
#6875
opened Dec 16, 2024 by
Ind1x1
[BUG] Invalidate trace cache @ step 10: expected module 11, but got module 19
bug
Something isn't working
training
#6870
opened Dec 14, 2024 by
yafuly
[BUG] Mismatch of model parameters when using Sequence Parallel
bug
Something isn't working
training
#6868
opened Dec 13, 2024 by
chetwin-character
[BUG]When fine-tuning an LLM, the following error occurs after training for some time: self.optimizer.param_groups[param_group_id]['params'] = [] IndexError: list index out of range
bug
Something isn't working
training
#6857
opened Dec 12, 2024 by
tdtgi
[BUG] DeepSpeed accuracy issue for torch.compile if activation checkpoint function not compiler disabled
bug
Something isn't working
training
#6811
opened Dec 1, 2024 by
NirSonnenschein
[BUG] Getting "SymIntArrayRef expected to contain only concrete integers" error when > 1 GPU
bug
Something isn't working
training
#6806
opened Nov 28, 2024 by
rileyhun
[BUG] clip_grad_norm for zero_optimization mode is not working
bug
Something isn't working
training
#6767
opened Nov 20, 2024 by
chengmengli06
[BUG]NCCL operation timeout when training with deepspeed_zero3_offload or deepspeed_zero3 on RTX4090
bug
Something isn't working
training
#6756
opened Nov 18, 2024 by
MLS2021
[BUG] max_grad_norm not effect
bug
Something isn't working
training
#6743
opened Nov 12, 2024 by
yiyepiaoling0715
[BUG] Zero3 for torch.compile with compiled_autograd when running LayerNorm
bug
Something isn't working
training
#6719
opened Nov 6, 2024 by
yitingw1