
My model performs badly... Is GPU memory too small? #307

Open
@Trace2333


Hi! I trained the model following your instructions, but the generated text is very poor: it cannot even produce a complete sentence. In addition, when I train step 3, the reward score is nan. What went wrong during training?
Please help. Thanks very much!
The training log looks like this:

|E2E latency=3.25s |Gather latency=0.00s (0.00%) |Generate time=2.35s (72.45%) |Training time=0.77s (23.71%) |Others=0.12 (3.84%)|CurSamplesPerSec=1.23 |AvgSamplesPerSec=1.00
[2023-04-15 05:51:04,058] [INFO] [fused_optimizer.py:362:_update_scale]
Grad overflow on iteration 1860
[2023-04-15 05:51:04,058] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 1 to 1
[2023-04-15 05:51:04,058] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 1, reducing to 1
[2023-04-15 05:51:04,174] [INFO] [fused_optimizer.py:362:_update_scale]
Grad overflow on iteration 1860
[2023-04-15 05:51:04,174] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 1 to 1
[2023-04-15 05:51:04,174] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 1, reducing to 1
epoch: 0|step: 3721|ppo_ep: 1|act_loss: nan|cri_loss: nan|unsuper_loss: 0.0
average reward score: nan
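
To narrow down when the losses first turn into nan, the log could be scanned with something like the sketch below. It assumes the per-step summary lines and the "Overflow detected" lines always have the format shown above; the function names `first_nan_step` and `count_overflows` are made up for illustration and are not part of DeepSpeed.

```python
import re

# Matches the per-step summary line format seen in the log above, e.g.
# "epoch: 0|step: 3721|ppo_ep: 1|act_loss: nan|cri_loss: nan|unsuper_loss: 0.0"
STEP_RE = re.compile(
    r"epoch: (\d+)\|step: (\d+)\|ppo_ep: (\d+)"
    r"\|act_loss: ([^|]+)\|cri_loss: ([^|]+)\|"
)


def first_nan_step(lines):
    """Return (epoch, step) of the first summary line with a nan loss, or None."""
    for line in lines:
        match = STEP_RE.search(line)
        if match is None:
            continue  # not a per-step summary line
        epoch, step, _ppo_ep, act_loss, cri_loss = match.groups()
        if "nan" in (act_loss.strip().lower(), cri_loss.strip().lower()):
            return int(epoch), int(step)
    return None


def count_overflows(lines):
    """Count optimizer steps that were skipped because of gradient overflow."""
    return sum("Overflow detected. Skipping step." in line for line in lines)


# Lines copied from the log excerpt above.
sample = [
    "[2023-04-15 05:51:04,058] [INFO] [logging.py:96:log_dist] [Rank 0] "
    "Overflow detected. Skipping step. Attempted loss scale: 1, reducing to 1",
    "[2023-04-15 05:51:04,174] [INFO] [logging.py:96:log_dist] [Rank 0] "
    "Overflow detected. Skipping step. Attempted loss scale: 1, reducing to 1",
    "epoch: 0|step: 3721|ppo_ep: 1|act_loss: nan|cri_loss: nan|unsuper_loss: 0.0",
]

print(first_nan_step(sample))
print(count_overflows(sample))
```

Running this over the full training log, rather than the three sample lines, would show whether the nan values appear from the very first step or only after training has run for a while.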

I noticed that:
[fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 1 to 1
Is the GPU memory too small?
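
My rough understanding of dynamic loss scaling is sketched below. This is a simplified toy model, not DeepSpeed's actual code: the class `ToyLossScaler` and its parameters (`init_scale`, `scale_factor`, `min_scale`) are assumptions for illustration, and the logic that grows the scale again after stable steps is left out. If it is roughly right, "from 1 to 1" would mean the scale is already at its lower bound and the gradients are still non-finite, which may point to a numerical problem rather than to memory size.

```python
import math


class ToyLossScaler:
    """Simplified dynamic loss scaler (illustrative only, not DeepSpeed's code)."""

    def __init__(self, init_scale=2.0 ** 16, scale_factor=2.0, min_scale=1.0):
        self.scale = init_scale
        self.scale_factor = scale_factor
        self.min_scale = min_scale

    def update(self, grads):
        """Return True if the step can be applied, False if it must be skipped."""
        overflow = any(not math.isfinite(g) for g in grads)
        if overflow:
            old_scale = self.scale
            # Shrink the scale, but never below the configured minimum.
            self.scale = max(self.scale / self.scale_factor, self.min_scale)
            print(f"Reducing loss scale from {old_scale:g} to {self.scale:g}")
            return False
        return True


scaler = ToyLossScaler()
# If every step produces a nan gradient, the scale halves until it reaches the
# minimum and then stays there, and every step keeps being skipped.
for _ in range(18):
    scaler.update([float("nan")])
```

In this toy run the last two messages read "from 1 to 1", matching the pattern in my log.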

The chat example:

------------------------------ Round 1 ------------------------------
 Human: Hello!
 Assistant:  I’m sorry, I’m not sure
Enter input (type 'quit' to exit, 'clear' to clean memory): What is your name?
------------------------------ Round 2 ------------------------------
 Human: Hello!
 Assistant:  I’m sorry, I’m not sure

 Human: What is your name?
 Assistant:  I
Enter input (type 'quit' to exit, 'clear' to clean memory): Can you speak?
------------------------------ Round 3 ------------------------------
 Human: Hello!
 Assistant:  I’m sorry, I’m not sure

 Human: What is your name?
 Assistant:  I

 Human: Can you speak?
 Assistant:  I
Enter input (type 'quit' to exit, 'clear' to clean memory): I think you are saying I?What happened to you?
------------------------------ Round 4 ------------------------------
 Human: Hello!
 Assistant:  I’m sorry, I’m not sure

 Human: What is your name?
 Assistant:  I

 Human: Can you speak?
 Assistant:  I

 Human: I think you are saying I?What happened to you?
 Assistant:  I
Enter input (type 'quit' to exit, 'clear' to clean memory):

My Device:

single-GPU 1x3090 24GB
batch_size 4 for training and eval.

Environment:

python                    3.8.0
deepspeed                 0.9.0
huggingface-hub           0.5.1
pytorch                   1.12.1          py3.8_cuda11.3_cudnn8.3.2_0
transformers              4.20.0

Thanks for your help!

Labels

deespeed chat (DeepSpeed Chat), modeling (Related to modeling questions.)