We should be able to set batches with supersets of reserved_keys; there's no reason not to allow that.

That would lead to duplicated initialization of parent classes, in particular `nn.Module`, which is problematic.
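The concern about duplicated parent initialization can be illustrated without Tianshou or torch. A toy stand-in (the `ModuleLike`/`Policy` classes below are hypothetical, used only to mimic how `nn.Module.__init__` resets its parameter registry):

```python
class ModuleLike:
    """Stand-in for nn.Module: __init__ resets the parameter registry."""

    def __init__(self):
        self._parameters = {}


class Policy(ModuleLike):
    def __init__(self):
        super().__init__()
        self._parameters["w"] = 0.5  # state registered after parent init


p = Policy()
assert "w" in p._parameters

# If some code path re-ran the parent initializer, the registered
# state would be silently wiped:
ModuleLike.__init__(p)
print(p._parameters)  # -> {}
```

This is why a code path that triggers a second initialization of `nn.Module` parents is problematic: registered parameters, buffers, and submodules would be lost.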
After a long investigation, we understood that this is fundamentally the wrong approach. `preprocess_batch` of the off-policy algo uses the learned value function to estimate returns, and the value function changes all the time with gradient updates. So it's not possible to just compute the returns once; or rather, doing so gives a poor estimate. It would work with purely Monte Carlo estimates of returns, but we don't control how an algorithm estimates returns, and we currently don't have a single algorithm implementation of that kind. Closing this PR.
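The distinction above can be sketched in a few lines (a toy illustration, not Tianshou code; the function names are made up): bootstrapped targets depend on the current value function and go stale after every gradient step, while Monte Carlo returns depend only on the data and could in principle be computed once per buffer.

```python
def bootstrapped_returns(rewards, next_values, gamma=0.99):
    """One-step TD targets r_t + gamma * V(s_{t+1}).

    These depend on the current value function, so a one-shot
    precomputation becomes stale as soon as V is updated.
    """
    return [r + gamma * v for r, v in zip(rewards, next_values)]


def monte_carlo_returns(rewards, gamma=0.99):
    """Discounted cumulative rewards; depend only on the stored data."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))


rewards = [1.0, 0.0, 1.0]
# Same data, two snapshots of the (changing) value function:
print(bootstrapped_returns(rewards, [0.5, 0.5, 0.0]))
print(bootstrapped_returns(rewards, [0.9, 0.1, 0.0]))  # different targets
print(monte_carlo_returns(rewards))  # fixed given the data
```

Since the algorithms in question bootstrap from a learned value function, precomputing returns in `process_buffer` cannot reproduce what repeated `preprocess_batch` calls would compute.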
For @arnaujc91

Background: currently the pre-processing of most offline learning algorithms is done in `_preprocess_batch`, which is highly suboptimal. Instead, it should be done in `process_buffer`. In this PR, a general class is introduced that implements `process_buffer` using `_preprocess_batch`, allowing an `OffPolicyAlgorithm` to be converted into an efficient `OfflineAlgorithm`.

Current status: when using it to improve the performance of `TD3BC`, the algorithm converges and tests pass, but the determinism test fails, meaning something changed in the processing or at least in the random number generation. If it's the latter, the failure is not a problem, but I currently don't see why anything rng-related should have changed. The implementation that changes the buffer's managed batch is rather hacky. I suspect something goes wrong with the indexing, but after 20 minutes of debugging I haven't yet pinned down what causes the determinism test to fail. Understanding this will require reading through `sample_indices` in `ReplayBuffer` and `ReplayBufferManager`. Note that there is an inconsistency in how `sample_indices(None)` is handled between the two, but it shouldn't play a role for this PR.

In the course of this PR, determinism snapshots should be created from the dev-v2 branch; then, after switching to this branch, the determinism test of `TD3BC` should succeed. See the docstring of `AlgorithmDeterminismTest` in `determinism_test.py` for further details.

The PR is finished when all relevant offline algorithms inherit from `OfflineAlgorithmFromOffPolicyAlgorithm` and either the determinism tests pass, or a source of different random number generation caused by the refactoring has been identified.
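The adapter idea can be sketched roughly as follows. This is a hypothetical illustration, not the PR's actual implementation: only the names `OfflineAlgorithmFromOffPolicyAlgorithm`, `process_buffer`, and `_preprocess_batch` come from the PR; the stand-in algorithm and the dict-based buffer are invented for the example.

```python
class DummyOffPolicy:
    """Stand-in off-policy algorithm whose _preprocess_batch
    computes per-sample returns (here a toy doubling)."""

    def _preprocess_batch(self, batch):
        batch["returns"] = [r * 2 for r in batch["rew"]]
        return batch


class OfflineAlgorithmFromOffPolicyAlgorithm:
    """Adapter: implements process_buffer by applying the wrapped
    algorithm's _preprocess_batch once to the buffer's full batch,
    instead of on every sampled minibatch."""

    def __init__(self, algo):
        self.algo = algo

    def process_buffer(self, buffer):
        # The buffer is modeled as a dict holding its managed batch;
        # replacing the managed batch in place is the part the PR
        # description calls "rather hacky".
        buffer["batch"] = self.algo._preprocess_batch(buffer["batch"])
        return buffer


buf = {"batch": {"rew": [1.0, 2.0]}}
OfflineAlgorithmFromOffPolicyAlgorithm(DummyOffPolicy()).process_buffer(buf)
print(buf["batch"]["returns"])  # -> [2.0, 4.0]
```

The design intent is that the expensive pre-processing runs once over the whole buffer up front, rather than being repeated for every minibatch during training; as the closing comment explains, this only yields correct results when the pre-processing does not depend on quantities that change during training, such as a learned value function.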