
CUDA bfloat16 problem #122

Open
@A-2-H

Description

2023-10-23 10:49:30,409 WARNING: logs/HiFiSVC doesn't exist yet!
Global seed set to 594461
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Missing logger folder: logs/HiFiSVC
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
  | Name          | Type                     | Params
-----------------------------------------------------------
0 | generator     | HiFiSinger               | 14.9 M
1 | mpd           | MultiPeriodDiscriminator | 57.5 M
2 | msd           | MultiScaleDiscriminator  | 29.6 M
3 | mel_transform | MelSpectrogram           | 0     
-----------------------------------------------------------
102 M     Trainable params
0         Non-trainable params
102 M     Total params
408.124   Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s]/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:442: PossibleUserWarning: The dataloader, val_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 8 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Sanity Checking DataLoader 0:   0% 0/2 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/content/fish-diffusion/tools/hifisinger/train.py", line 83, in <module>
    trainer.fit(model, train_loader, valid_loader, ckpt_path=args.resume)
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 532, in fit
    call._call_and_handle_interrupt(
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 571, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 980, in _run
    results = self._run_stage()
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1021, in _run_stage
    self._run_sanity_check()
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1050, in _run_sanity_check
    val_loop.run()
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/loops/utilities.py", line 181, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 115, in run
    self._evaluation_step(batch, batch_idx, dataloader_idx)
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 376, in _evaluation_step
    output = call._call_strategy_hook(trainer, hook_name, *step_kwargs.values())
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 294, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 391, in validation_step
    with self.precision_plugin.val_step_context():
  File "/content/env/envs/fish_diffusion/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 170, in val_step_context
    with self.forward_context():
  File "/content/env/envs/fish_diffusion/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/amp.py", line 118, in forward_context
    with self.autocast_context_manager():
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/amp.py", line 113, in autocast_context_manager
    return torch.autocast(self.device, dtype=torch.bfloat16 if self.precision == "bf16-mixed" else torch.half)
  File "/content/env/envs/fish_diffusion/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 234, in __init__
    raise RuntimeError('Current CUDA Device does not support bfloat16. Please switch dtype to float16.')
RuntimeError: Current CUDA Device does not support bfloat16. Please switch dtype to float16.

Google Colab
I'm using a T4 GPU (and I've tried other GPUs as well); I still get the same error whenever I try to train my model.
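
For context: the T4 is a Turing GPU (compute capability 7.5), and bfloat16 autocast on CUDA requires an Ampere-class GPU (compute capability 8.0 or newer, e.g. an A100), so this error is expected on Colab's standard tier. A minimal sketch of a runtime check that falls back to fp16 mixed precision, as the error message suggests; note the `pl.Trainer(...)` call here is only illustrative, since in fish-diffusion the precision is presumably set through the project's HiFiSVC training config rather than constructed directly like this:

```python
import torch
import pytorch_lightning as pl

# bf16 autocast needs compute capability >= 8.0 (Ampere or newer).
# On a T4 this check is False, so we fall back to fp16 mixed precision.
precision = (
    "bf16-mixed"
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported()
    else "16-mixed"
)

# Illustrative only: apply the equivalent precision setting wherever the
# fish-diffusion config builds its Trainer.
trainer = pl.Trainer(accelerator="gpu", devices=1, precision=precision)
print(f"Using precision: {precision}")
```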
