My model performs badly... Is GPU memory too small? #307

Hi! I trained the model following your instructions, but its generations are very poor: it cannot even produce a complete sentence. Also, when I run step 3 training, the reward score is NaN. What went wrong during training?
Please help. Thanks very much!
The step 3 log looks like this:
```
|E2E latency=3.25s |Gather latency=0.00s (0.00%) |Generate time=2.35s (72.45%) |Training time=0.77s (23.71%) |Others=0.12 (3.84%)|CurSamplesPerSec=1.23 |AvgSamplesPerSec=1.00
[2023-04-15 05:51:04,058] [INFO] [fused_optimizer.py:362:_update_scale]
Grad overflow on iteration 1860
[2023-04-15 05:51:04,058] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 1 to 1
[2023-04-15 05:51:04,058] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 1, reducing to 1
[2023-04-15 05:51:04,174] [INFO] [fused_optimizer.py:362:_update_scale]
Grad overflow on iteration 1860
[2023-04-15 05:51:04,174] [INFO] [fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 1 to 1
[2023-04-15 05:51:04,174] [INFO] [logging.py:96:log_dist] [Rank 0] Overflow detected. Skipping step. Attempted loss scale: 1, reducing to 1
epoch: 0|step: 3721|ppo_ep: 1|act_loss: nan|cri_loss: nan|unsuper_loss: 0.0
average reward score: nan
```

I noticed this line:
```
[fused_optimizer.py:363:_update_scale] Reducing dynamic loss scale from 1 to 1
```
**Is the GPU memory too small?**
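For reference, here is how I currently read that log line, as a minimal sketch. This is not DeepSpeed's actual code: `update_scale`, `scale_factor`, and `min_scale` are hypothetical names I made up for illustration. If this reading is right, the loss scale is already at its floor, so repeated overflows can no longer lower it, which might point to fp16 numerical overflow rather than to memory size. I am not sure about that.

```python
# Hypothetical sketch of dynamic loss scaling; NOT DeepSpeed's real implementation.
def update_scale(cur_scale: float, overflow: bool,
                 scale_factor: float = 2.0, min_scale: float = 1.0) -> float:
    """Return the loss scale to use after one optimizer step."""
    if overflow:
        # Shrink the scale on overflow, but never below the floor.
        return max(cur_scale / scale_factor, min_scale)
    return cur_scale


scale = 4.0
history = []
for _ in range(4):  # pretend every step overflows
    new_scale = update_scale(scale, overflow=True)
    history.append((scale, new_scale))
    scale = new_scale

# Each pair is (scale before, scale after); the last pairs stay at (1.0, 1.0).
print(history)
```

Once the scale reaches the floor, every further overflow gives "from 1 to 1", which matches what I see in my log.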

The chat example:
```
------------------------------ Round 1 ------------------------------
 Human: Hello!
 Assistant:  I’m sorry, I’m not sure
Enter input (type 'quit' to exit, 'clear' to clean memory): What is your name?
------------------------------ Round 2 ------------------------------
 Human: Hello!
 Assistant:  I’m sorry, I’m not sure

 Human: What is your name?
 Assistant:  I
Enter input (type 'quit' to exit, 'clear' to clean memory): Can you speak?
------------------------------ Round 3 ------------------------------
 Human: Hello!
 Assistant:  I’m sorry, I’m not sure

 Human: What is your name?
 Assistant:  I

 Human: Can you speak?
 Assistant:  I
Enter input (type 'quit' to exit, 'clear' to clean memory): I think you are saying I?What happened to you?
------------------------------ Round 4 ------------------------------
 Human: Hello!
 Assistant:  I’m sorry, I’m not sure

 Human: What is your name?
 Assistant:  I

 Human: Can you speak?
 Assistant:  I

 Human: I think you are saying I?What happened to you?
 Assistant:  I
Enter input (type 'quit' to exit, 'clear' to clean memory):
```

My Device:
```
single-GPU 1x3090 24GB
batch_size 4 for training and eval.
```

Environment:
```
python                    3.8.0
deepspeed                 0.9.0
huggingface-hub           0.5.1
pytorch                   1.12.1          py3.8_cuda11.3_cudnn8.3.2_0
transformers              4.20.0
```
Thanks for your help!

