Thank you for the improvements on PCL. I haven't checked the implementation details yet, but I think solving the memory issue is great as long as it won't make training slow. Can you show the training curves and computation speeds before and after this PR?
I rewrote the PCL agent to avoid memory issues when saving Variables inside a list / replay buffer. I didn't compare the training curve with the old one, but it seems to learn (the average_value increases and R gets bigger) on CartPole under the new parameters, and there is no memory issue when run with a large network / reasonably long trajectories.
Main methods are the following:

- `update`: takes a loss (as an array), logs the result as usual and calls the optimizer (the backprop is done before this function is called)
- `update_on_policy` and `update_from_replay`: sample a list of trajectories (from the replay buffer or the current episode), clear grads and compute the loss
- `compute_loss`: takes a list of trajectories and performs batch computation (the batch size is the number of episodes, which may not be efficient when there is only a single episode for the on-policy update). This function calls backward immediately and only returns an array for logging
- `_compute_path_consistency`: computes path consistency; this part of the code is almost unchanged

The new underlying data structure is a list of dicts that stores the current episode, plus a replay buffer that only stores (s, a, r) tuples. The old mu (action_distrib) is removed since it can be recomputed from the other items.
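To make the data layout above concrete, here is a minimal sketch in plain Python. It is not the code in this PR: `EpisodeBuffer`, `toy_model`, `path_consistency` and `episode_consistency` are hypothetical names, the model is a stand-in that returns floats instead of Variables, and the consistency formula is the usual one from the PCL formulation as I understand it, so `_compute_path_consistency` may differ in details. The point is only that the buffer keeps raw (s, a, r) data and that values and log-probabilities are recomputed at update time.

```python
import math
import random
from collections import deque


class EpisodeBuffer:
    """Hypothetical replay buffer: each entry is one episode, stored as a list of dicts."""

    def __init__(self, capacity):
        self.episodes = deque(maxlen=capacity)

    def append(self, episode):
        # Keep only state, action and reward; anything else (e.g. the old
        # action distribution) is dropped and recomputed when needed.
        self.episodes.append(
            [{"state": s["state"], "action": s["action"], "reward": s["reward"]}
             for s in episode]
        )

    def sample(self, n):
        return random.sample(list(self.episodes), min(n, len(self.episodes)))

    def __len__(self):
        return len(self.episodes)


def toy_model(state):
    """Stand-in for the network: uniform policy over two actions, V(s) = 0."""
    return [math.log(0.5), math.log(0.5)], 0.0


def path_consistency(values, log_probs, rewards, gamma, tau):
    """-V(s_0) + gamma^d V(s_d) + sum_j gamma^j (r_j - tau * log pi(a_j | s_j)).

    `values` has d + 1 entries; `log_probs` and `rewards` have d entries.
    """
    d = len(rewards)
    total = -values[0] + gamma ** d * values[d]
    for j in range(d):
        total += gamma ** j * (rewards[j] - tau * log_probs[j])
    return total


def episode_consistency(episode, model, gamma, tau):
    """Recompute values and log-probs from stored (s, a, r), then score the whole episode."""
    values, log_probs = [], []
    for step in episode:
        action_log_probs, value = model(step["state"])
        values.append(value)
        log_probs.append(action_log_probs[step["action"]])
    values.append(0.0)  # assumption: the episode ended in a terminal state
    rewards = [step["reward"] for step in episode]
    return path_consistency(values, log_probs, rewards, gamma, tau)


# Usage: store one finished episode, then compute a squared-consistency loss per sampled episode.
buffer = EpisodeBuffer(capacity=100)
buffer.append([
    {"state": (0.0,), "action": 0, "reward": 1.0},
    {"state": (1.0,), "action": 1, "reward": 1.0},
])
losses = [0.5 * episode_consistency(ep, toy_model, gamma=0.99, tau=0.01) ** 2
          for ep in buffer.sample(4)]
```

Since nothing stored in the buffer holds a reference to a computation graph, memory use in this sketch grows only with the raw trajectory data. The cost is an extra forward pass per sampled episode at update time.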
I also added a unified model in the example script and changed a couple of parameters.
Issues addressed: #109 #236 #240
I am not sure if the parameters are used correctly, but if they are correct, this PR also addresses #238