Conversation
A command like `python train.py task=Ant headless=True sim_device=cpu rl_device=cpu` does not work correctly.
The reason is that `rlg_config_dict` does not include the `rl_device` setting.
In `a2c_common.py` of `rl_games`, there is this line of code: `self.ppo_device = config.get('device', 'cuda:0')`.
Because the `device` key is never set in the config that gets passed in, the RL algorithm always runs on `cuda:0`.
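A minimal sketch of why this happens, using plain Python to mirror the `config.get('device', 'cuda:0')` lookup (the helper function name and the example config keys are mine, for illustration only):

```python
# rl_games reads the device with config.get('device', 'cuda:0'), so if the
# config dict handed to it never contains a 'device' key, the fallback wins
# no matter what rl_device was passed on the command line.

def resolve_ppo_device(config):
    # Mirrors the line in rl_games' a2c_common.py:
    #   self.ppo_device = config.get('device', 'cuda:0')
    return config.get('device', 'cuda:0')

# A config dict built without forwarding rl_device:
rlg_config_dict = {'learning_rate': 3e-4}
print(resolve_ppo_device(rlg_config_dict))  # cuda:0, regardless of rl_device=cpu

# Forwarding the command-line rl_device into the dict fixes the lookup:
rlg_config_dict['device'] = 'cpu'
print(resolve_ppo_device(rlg_config_dict))  # cpu
```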
---
I encountered the same issue! This fix should work, but I think a cleaner solution would be to avoid making a change in `rl_games` itself: we should add a `device` entry under `params.config` in the training config. This is similar to other config values that are resolved from the top-level config (which has `rl_device`).
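A hypothetical sketch of that config-side fix, assuming a Hydra/OmegaConf-style training YAML where `rl_device` lives in the top-level config; the exact interpolation path depends on how deeply `params.config` is nested, so treat it as an assumption:

```yaml
# Assumed train-config fragment: forward the top-level rl_device into the
# rl_games section so that config.get('device', 'cuda:0') finds it.
params:
  config:
    device: ${....rl_device}  # relative path up to the top-level config is an assumption
```

With something like this in place, `rl_device=cpu` on the command line would propagate into `rlg_config_dict` without patching `rl_games`.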
---
Hi, thanks for the fix and discussion. Your solution works well with a GPU device such as `cuda:0`. However, when using `cpu` it will crash before the first update of the policy, and also in the testing case.
---
NOTE: I edited the solution above to include this change. It doesn't fix the other issue though; I believe that one comes from code in `rl_games/common/a2c_common.py`, which would need more work to fix.