Change in checkpoint sizes?

For about a week, I have noticed that the size of my checkpoints differs significantly from previous runs. I am training the same model on the same data, only on a different GPU (A100 80GB vs A100 40GB). Previously, the optimizer states were much larger than the model states; now it is the other way around, and the inference results are nonsense, in contrast to the results of previous runs. I have no clue what might cause this. Is this a common problem?

`ds_report` output on the new GPU (identical to the output on the old GPU):

```
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
```

Furthermore, the output folders no longer contain a `config.json`; it seems to have been renamed to `adapter_config.json`.
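
To make the size comparison concrete, here is a minimal sketch I could run against an old and a new checkpoint folder. It is only an illustration: the helper names `summarize_checkpoint` and `config_files` are made up, and it assumes that optimizer and model files can be told apart by `optim_states` and `model_states` in their file names, which may not match every setup.

```python
import os
import sys
from collections import defaultdict


def summarize_checkpoint(root):
    """Sum file sizes (in bytes) under root, grouped by a guessed category.

    Assumption: files containing "optim_states" hold optimizer state and
    files containing "model_states" hold model state. Everything else is
    counted as "other".
    """
    totals = defaultdict(int)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            size = os.path.getsize(os.path.join(dirpath, name))
            if "optim_states" in name:
                totals["optimizer"] += size
            elif "model_states" in name:
                totals["model"] += size
            else:
                totals["other"] += size
    return dict(totals)


def config_files(root):
    """Return which of config.json / adapter_config.json exist under root."""
    wanted = {"config.json", "adapter_config.json"}
    found = set()
    for _dirpath, _dirnames, filenames in os.walk(root):
        found.update(wanted.intersection(filenames))
    return sorted(found)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python compare_ckpt.py <old_checkpoint_dir> <new_checkpoint_dir>
    for folder in sys.argv[1:]:
        print(folder)
        for category, size in sorted(summarize_checkpoint(folder).items()):
            print(f"  {category:>9}: {size / 1e6:.1f} MB")
        print(f"  config files: {config_files(folder)}")
```

Running it on one folder from an earlier run and one from a current run should show whether the optimizer/model size ratio really flipped, and which of the two config files each folder contains.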
Change in checkpoint sizes? #5365