feature(xjy): add the rnd-related features #438
Open
xiongjyu wants to merge 17 commits into opendilab:main from
…d AdamW; add value_priority, adaptive policy entropy control, encoder-clip, label smoothing, latent representation analysis option, and cosine similarity loss.
puyuan1996
reviewed
Nov 10, 2025
…ync latest params
…event premature convergence
Experiment log document
https://ai.feishu.cn/wiki/CKOBwu0NJicxa8kkcvOcPQ3yn9H?from=from_copylink
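Summary
Goal
In environments with long-horizon, sparse extrinsic rewards such as Venture, UniZero/MuZero often fail to form effective exploration trajectories from extrinsic returns alone. This work tunes RND (Random Network Distillation), aligning the network architecture and key hyperparameters as closely as possible with OpenAI's original RND implementation, and makes full use of the intrinsic reward to improve the exploration ability of UniZero/MuZero.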
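For readers unfamiliar with RND, the idea behind the intrinsic reward can be sketched in a few lines. This is a toy NumPy illustration, not the model added in this PR: the class name `TinyRND`, the linear predictor, the single tanh layer, and all sizes and learning rates are assumptions chosen only to keep the example small. A fixed random target network embeds each observation, a predictor is trained to match that embedding on visited observations, and the prediction error serves as the novelty bonus.

```python
import numpy as np


class TinyRND:
    """Toy RND: fixed random target network, trainable linear predictor.

    Hypothetical illustration only; the sizes and the linear predictor are
    assumptions, not the architecture used in this PR.
    """

    def __init__(self, obs_dim, feat_dim=16, lr=0.05, seed=0):
        rng = np.random.default_rng(seed)
        # Target network: randomly initialised and never updated.
        self.w_target = rng.normal(size=(obs_dim, feat_dim))
        # Predictor: trained to imitate the target on visited observations.
        self.w_pred = np.zeros((obs_dim, feat_dim))
        self.lr = lr

    def intrinsic_reward(self, obs):
        # Per-observation squared prediction error, averaged over features.
        target = np.tanh(obs @ self.w_target)
        pred = obs @ self.w_pred
        return ((pred - target) ** 2).mean(axis=1)

    def update(self, obs):
        # One gradient step on the batch mean of the summed squared error.
        target = np.tanh(obs @ self.w_target)
        err = obs @ self.w_pred - target
        grad = 2.0 * obs.T @ err / len(obs)
        self.w_pred -= self.lr * grad


rnd = TinyRND(obs_dim=4)
rng = np.random.default_rng(1)
# "Familiar" observations cluster around one point, "novel" ones around another.
familiar = np.array([1.0, 0.0, 0.0, 0.0]) + 0.05 * rng.normal(size=(64, 4))
novel = np.array([0.0, 0.0, 0.0, 1.0]) + 0.05 * rng.normal(size=(64, 4))
for _ in range(200):
    rnd.update(familiar)
print("familiar:", rnd.intrinsic_reward(familiar).mean())
print("novel:   ", rnd.intrinsic_reward(novel).mean())
```

After training only on the familiar cluster, the bonus for those observations shrinks while the bonus for the unseen cluster stays large, which is the signal the agent uses to explore.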
Preparation
Test an open-source RND implementation (PyTorch version) and ablate its core details
Collector:
Learner:

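Conclusions:
Core implementation details
Model architecture
Training process
Monitoring metrics
Problems encountered and solutions
Data collection for the initial observation normalization is too slow
The original implementation collected the randomly sampled data through MuZeroCollector: every step still ran a forward pass of the policy network and only then sampled a random action. Because this step needs a large amount of data, the repeated interaction with the policy network made collection too slow.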
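One way to avoid that cost, sketched below, is to step the environment with uniformly random actions and feed the observations straight into running mean/variance statistics, without touching the policy network at all. This is a sketch under stated assumptions and not necessarily the fix used in this PR: `ToyEnv`, `warm_up_obs_stats`, and the `(obs, done)` step signature are hypothetical stand-ins, and `RunningMeanStd` is a plain reimplementation of the usual batched mean/variance merge.

```python
import numpy as np


class RunningMeanStd:
    """Per-dimension running mean and variance, merged batch by batch."""

    def __init__(self, shape, eps=1e-4):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = eps  # tiny prior count avoids division by zero

    def update(self, batch):
        b_mean, b_var, b_count = batch.mean(axis=0), batch.var(axis=0), batch.shape[0]
        delta = b_mean - self.mean
        total = self.count + b_count
        # Combine the two sets of second moments (parallel variance formula).
        m2 = (self.var * self.count + b_var * b_count
              + delta ** 2 * self.count * b_count / total)
        self.mean = self.mean + delta * b_count / total
        self.var = m2 / total
        self.count = total


class ToyEnv:
    """Hypothetical stand-in environment; observations ~ N(5, 2^2), action ignored."""

    def __init__(self, obs_dim=3, episode_len=50, seed=0):
        self.rng = np.random.default_rng(seed)
        self.obs_dim, self.episode_len, self.t = obs_dim, episode_len, 0

    def _obs(self):
        return 5.0 + 2.0 * self.rng.normal(size=self.obs_dim)

    def reset(self):
        self.t = 0
        return self._obs()

    def step(self, action):
        self.t += 1
        return self._obs(), self.t >= self.episode_len


def warm_up_obs_stats(env, n_actions, num_steps, batch_size=256, seed=0):
    """Collect observations under uniformly random actions, with no policy forward pass."""
    rng = np.random.default_rng(seed)
    obs = env.reset()
    stats = RunningMeanStd(shape=obs.shape)
    buffer = []
    for _ in range(num_steps):
        action = int(rng.integers(n_actions))  # sampled directly, no network call
        obs, done = env.step(action)
        buffer.append(obs)
        if done:
            env.reset()
        if len(buffer) == batch_size:
            stats.update(np.stack(buffer))
            buffer.clear()
    if buffer:  # flush the final partial batch
        stats.update(np.stack(buffer))
    return stats


stats = warm_up_obs_stats(ToyEnv(), n_actions=4, num_steps=2000)
print(stats.mean, np.sqrt(stats.var))
```

The resulting `stats.mean` and `stats.var` would then be used to whiten observations before they reach the RND networks; whether the observations are also clipped afterwards is a separate design choice not covered here.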
Comparison of experiments across variants
MuZero + RND experiments
Venture environment
collector
evaluator

learner


rnd_reward_model


UniZero + RND experiments
Venture environment
Collector:
Learner:
RND:
Zork1 environment
Collector:
Learner:
RND:

