feature(pu): add atari/dmc multitask and balance pipeline in ScaleZero paper #451

Merged
puyuan1996 merged 102 commits into main from dev-multitask-balance-clean-kvcachemanager on Jan 8, 2026
@puyuan1996 (Collaborator) commented Dec 3, 2025

This pull request implements the core components of the ScaleZero paper by introducing a multi-task, balanced training pipeline for Atari and DeepMind Control (DMC) environments.

To improve stability and performance in the multi-task setting, this PR makes several key improvements and bug fixes: it replaces BatchNorm, which was unstable here, with the more robust LayerNorm; fixes a critical bug in which the kv_cache was improperly overwritten, leaving the cached state incorrect; and corrects the state reset logic in _reset_eval() and _reset_collect() to ensure accurate evaluation.

Additionally, the PR introduces target-entropy control for better policy optimization, makes the number of MCTS simulations used during evaluation configurable, and integrates the relevant changes from the longrun PR #400 to keep the codebase consistent.
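
For readers unfamiliar with the BatchNorm/LayerNorm distinction, here is a minimal pure-Python sketch. It is not the code from this PR: the helper names and the toy feature vectors are hypothetical, and the motivation it illustrates (batch statistics mixing samples from tasks with different scales) is one commonly cited reason for preferring LayerNorm, assumed here rather than stated in the PR.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one sample using only that sample's own mean and variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def batch_norm(batch, eps=1e-5):
    """Normalize each feature using statistics computed across the batch
    (training-mode behaviour, without learned scale and shift)."""
    n, dims = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(dims)]
    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(dims)]
        for row in batch
    ]


# Hypothetical feature vectors from three tasks with very different scales.
task_a = [1.0, 2.0, 3.0]
task_b = [100.0, 250.0, 400.0]
task_c = [-5.0, 0.0, 20.0]

# With batch statistics, task_a's normalized features depend on its batch mates.
bn_with_b = batch_norm([task_a, task_b])[0]
bn_with_c = batch_norm([task_a, task_c])[0]

# With per-sample statistics, task_a's output is the same in any batch.
ln_a = layer_norm(task_a)
```

In this toy setup `bn_with_b` and `bn_with_c` differ even though the input sample is identical, while `ln_a` never depends on what else is in the batch.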

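The description does not spell out how target-entropy control is implemented. One common form, sketched below, is the SAC-style automatic adjustment of an entropy coefficient `alpha`: the coefficient grows when policy entropy falls below a target and shrinks when it rises above. Whether this PR uses that form is an assumption, and the function names, learning rate, and target value are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def update_log_alpha(log_alpha, policy_entropy, target_entropy, lr=0.01):
    """One gradient-descent step on the loss log_alpha * (H - H_target).

    The gradient with respect to log_alpha is (H - H_target), so log_alpha
    rises when the entropy is below the target and falls when it is above.
    """
    return log_alpha - lr * (policy_entropy - target_entropy)


num_actions = 4
# Hypothetical target: half of the maximum entropy for 4 actions.
target_entropy = 0.5 * math.log(num_actions)

peaked_policy = [0.97, 0.01, 0.01, 0.01]   # entropy well below the target
uniform_policy = [0.25, 0.25, 0.25, 0.25]  # entropy above the target

log_alpha_peaked = update_log_alpha(0.0, entropy(peaked_policy), target_entropy)
log_alpha_uniform = update_log_alpha(0.0, entropy(uniform_policy), target_entropy)

# The entropy bonus weight used in the policy loss would be exp(log_alpha).
alpha_peaked = math.exp(log_alpha_peaked)
alpha_uniform = math.exp(log_alpha_uniform)
```

Under these assumptions the peaked policy pushes `alpha` above 1 (more pressure to explore), and the uniform policy pushes it below 1.
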
puyuan1996 and others added 30 commits April 25, 2025 11:26
MCTS stage 3: Backup
At the end of the simulation, the statistics along the trajectory are updated.
"""
# search_depth is used for rope in UniZero
Member:

Why doesn't ctree_sampled branch on whether rope (timestep) is used?

Collaborator (Author):

sampled doesn't support rope yet; I've added it to the todo list.


# Clear caches if the current steps are a multiple of the clear interval
- if current_steps % clear_interval == 0:
+ if current_steps is not None and current_steps % clear_interval == 0:
Member:

How is this interval set?

Collaborator (Author):

Currently, if sample_type='transition', it is set heuristically based on game_segment_length.
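
The guarded condition in this diff can be illustrated in isolation. The sketch below wraps it in a hypothetical helper, `should_clear_caches`, which is not a function in the repository, and the interval of 200 is an arbitrary illustrative value rather than the heuristic derived from game_segment_length.

```python
def should_clear_caches(current_steps, clear_interval):
    """Return True when current_steps is a multiple of clear_interval.

    The `is not None` check matters: in Python, `None % clear_interval`
    raises TypeError, so the unguarded condition fails whenever the caller
    does not supply a step count.
    """
    return current_steps is not None and current_steps % clear_interval == 0


clear_interval = 200  # hypothetical value for illustration only

results = {
    "none": should_clear_caches(None, clear_interval),
    "on_interval": should_clear_caches(400, clear_interval),
    "off_interval": should_clear_caches(150, clear_interval),
}

# The unguarded form of the check raises when no step count is available.
try:
    None % clear_interval
    unguarded_raises = False
except TypeError:
    unguarded_raises = True
```

Note that a step count of 0 also satisfies the condition, since `0 % clear_interval == 0`.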


# Log mapping
self.logits_key_mapping = {
'policy': 'logits_policy',
Member:

I think clip should be applied in the encoder and the transformer backbone; the one on the head can be removed.

@puyuan1996 puyuan1996 merged commit 81db0b2 into main Jan 8, 2026
1 of 6 checks passed
@puyuan1996 puyuan1996 added the refactor Cleanup, formatting, or restructuring of existing code. label Jan 8, 2026

Labels

config: New or improved configuration
enhancement: New feature or request
refactor: Cleanup, formatting, or restructuring of existing code.
research: Research work in progress

4 participants