-
Notifications
You must be signed in to change notification settings - Fork 2.9k
[megatron] feat: use yaml to manage mbridge args #4584
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request refactors how parameters are passed to mbridge.save_weights by using a dictionary from the configuration. While this increases flexibility, the current implementation has some critical issues. It inadvertently changes the default behavior by dropping the distributed_filesystem=True parameter and changing memory_efficient from True to False. Additionally, it introduces a risk of a TypeError at runtime if mbridge_config is set to null in the configuration. My review includes suggestions to restore the original default behavior, fix the potential crash, and improve the configuration's robustness.
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request refactors how arguments are passed to mbridge.save_weights by introducing a mbridge_config dictionary. This is a good improvement for flexibility and maintainability. The implementation is clean and correctly handles the new configuration. However, I've noticed that in some of the generated configuration files, the value of memory_efficient is set to false, whereas it was previously hardcoded to true. This could lead to increased memory usage during checkpointing. My review includes comments highlighting these specific locations. Please verify if this change is intentional.
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request refactors the management of mbridge arguments by moving them from hardcoded values into a configurable dictionary in YAML files. This is a good improvement for flexibility. The implementation is mostly correct, but I've found a critical issue where the code does not handle the case where checkpoint_config is None, which could lead to a crash when saving checkpoints. I've provided suggestions to make the code more robust against this potential AttributeError.
robust code Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request refactors the management of mbridge arguments by moving them from hardcoded values into YAML configuration files. This is a good improvement for flexibility and maintainability. The changes across the configuration files and the CheckpointConfig dataclass are well-implemented. I've found one minor issue in megatron_checkpoint_manager.py regarding inconsistent and potentially unsafe access to the new configuration property, for which I've left a specific comment.
robust code Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request refactors the management of mbridge arguments by moving them into YAML configurations, which is a great step towards more flexible and maintainable code. My main feedback is about the handling of default values for mbridge_config. By setting the default to an empty dictionary in CheckpointConfig, the default behavior of save_weights is altered unless mbridge_config is explicitly defined in every relevant YAML file. This has led to code duplication across several configuration files and makes the system fragile. I've suggested setting the default values directly within the CheckpointConfig dataclass. This will ensure the previous behavior is maintained by default, eliminate the need for duplicated configuration blocks in the YAML files, and make the system more robust for future extensions.
robust code Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request refactors how mbridge arguments are managed by moving them into a mbridge_config dictionary within the checkpoint configuration. This is a good change that improves configurability and helps with version compatibility. I've found a critical issue in the implementation where the code does not handle the case where checkpoint_config might be None, which would lead to a crash. I've provided suggestions to fix this potential AttributeError.
|
cc @ISEEKYAN |
verl/trainer/config/actor/actor.yaml
Outdated
| async_save: False | ||
|
|
||
| # Mbridge config extension. | ||
| mbridge_config: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please don't add these yaml configs.
| log_with_rank(f"Saving HF model checkpoint to {local_path} with bridge", rank=self.rank, logger=logger) | ||
| hf_ckpt_path = get_hf_model_checkpoint_path(local_path) | ||
| if self.vanilla_bridge: | ||
| self.bridge.save_weights( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
recommend to modify like this:
import inspect
extended_args = {}
if "distributed_filesystem" in inspect.signature(self.bridge.save_weights):
extended_args["distributed_filesystem"] = True
extended_args["memory_efficient"] = True
self.bridge.save_weights(self.model, hf_ckpt_path, **extended_args)
| if self.vanilla_bridge: | ||
| self.bridge.save_weights( | ||
| self.model, hf_model_ckpt_path, distributed_filesystem=True, memory_efficient=True | ||
| self.model, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
|
In the argument passing section, some modifications were made to better support backward compatibility with older versions and adapt to newer versions of mbridge. |
|
We can use +checkpoint.mbridge_config.memory_efficient to add/set mbridge.save_weights func. |
What does this PR do?
Use dict to manage mbridge args, support modifying mbridge input args, and avoid mbridge version compatibility issues.
Checklist Before Starting
[{modules}] {type}: {description}(This will be checked by the CI){modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data,cfg,reward,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test[BREAKING]to the beginning of the title.[BREAKING][fsdp, megatron] feat: dynamic batchingTest
API and Usage Example
# Add code snippet or script demonstrating how to use thisDesign & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=alwaysci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)