Conversation
@j3soon Hi, this is an interesting issue. How many steps need to be run to reproduce the OOM issue? Does it actually occur in practice?
Great finding. Do you think the issue comes from the actions not being detached from the graph?
Yes. PyTorch maintains a computation graph during the forward pass to record the tensor operations. During training, the computation graph is released once we perform backpropagation on the loss (i.e., call `loss.backward()`).
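To illustrate, here is a minimal self-contained sketch (not from the original PR) showing that a tensor produced by differentiable ops keeps a reference to its graph via `grad_fn`, and that the graph is freed once `backward()` is called:

```python
import torch

# Any differentiable op records itself in a computation graph.
x = torch.randn(4, requires_grad=True)
y = (x * 2).sum()

print(y.grad_fn)  # <SumBackward0 ...> -- the graph is still attached to `y`

# backward() consumes the graph and frees its buffers by default
# (retain_graph=False), releasing the memory the graph held.
y.backward()
```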
@j3soon Fantastic finding. I will try it.
`actions` should either be detached from the computation graph or converted into a NumPy array before being stored in the replay buffer. In the original code, the entire computation graph for generating each action is never released, which consumes an unnecessary amount of memory and may cause an OOM if the program runs for a long time.
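A minimal sketch of the fix, assuming a toy policy network and a list-based replay buffer (`policy` and `replay_buffer` are hypothetical names, not the project's actual code):

```python
import torch
import torch.nn as nn

policy = nn.Linear(4, 2)  # stand-in for the actual policy network
replay_buffer = []

obs = torch.randn(4)
action = policy(obs)  # `action` carries a grad_fn, i.e. a reference to the graph

# Buggy pattern: storing the raw tensor keeps the entire forward graph
# (and the activations it references) alive for as long as the buffer holds it.
# replay_buffer.append(action)

# Fixed pattern: detach (or convert to NumPy) before storing, so the graph
# built for this forward pass can be garbage-collected.
replay_buffer.append(action.detach().cpu().numpy())
```

Either `action.detach()` or the full `.detach().cpu().numpy()` conversion works; the key point is that the stored value must not hold a `grad_fn` reference.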