
Debug regression #1196


Draft: wants to merge 1 commit into main


raytomato (Contributor)

This PR is created from the main branch with the recently merged PRs dropped, to check whether the CI issue was introduced by one of those PRs.

* Fix CI and add memory monitoring

* minor scripts improvements

* Update .github/workflows/pre-commit.yml

resolve boyue's comment.

Co-authored-by: Boyue Li <[email protected]>

* make monitor process non-critical upon failure

---------

Co-authored-by: Boyue Li <[email protected]>
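The commit above mentions adding memory monitoring and making the monitor process non-critical upon failure. The actual workflow change is not shown in this thread, so what follows is only a minimal Python sketch under stated assumptions. It reads `/proc/meminfo` (Linux only) and prints a report shaped like the `=== Memory Check ===` logs in this conversation. It treats "used" as `MemTotal - MemAvailable` and usage as `used / total`, which may differ from the real script. It catches every exception so that a broken monitor never fails the job it watches. The function names `read_meminfo`, `format_report` and `monitor` are hypothetical, and unlike the real logs the sketch always prints sizes in Gi.

```python
import sys
import threading
from datetime import datetime


def read_meminfo(path="/proc/meminfo"):
    """Parse /proc/meminfo into a dict; "kB" fields are converted to bytes."""
    info = {}
    with open(path) as f:
        for line in f:
            key, value = line.split(":", 1)
            parts = value.split()
            amount = int(parts[0])
            # Most fields look like "MemTotal:  16384000 kB"; a few have no unit.
            info[key] = amount * 1024 if parts[1:] == ["kB"] else amount
    return info


def format_report(used, free, total, now):
    """Render one check in the same shape as the CI log blocks (sizes in GiB)."""
    gib = 1024 ** 3
    return (
        f"=== Memory Check: {now:%Y-%m-%d %H:%M:%S} ===\n"
        f"Memory: {used / gib:.1f}Gi used, {free / gib:.1f}Gi free, "
        f"{total / gib:.1f}Gi total\n"
        f"Memory usage: {100 * used / total:.1f}%"
    )


def monitor(interval_s=30.0, stop=None, read=read_meminfo):
    """Print a report every interval_s seconds until `stop` is set.

    Any exception is reported on stderr and ends the loop instead of
    propagating, so the monitor stays non-critical for the CI job.
    """
    stop = stop or threading.Event()
    try:
        while not stop.is_set():
            info = read()
            total = info["MemTotal"]
            used = total - info["MemAvailable"]  # assumption: used = total - available
            print(format_report(used, info["MemFree"], total, datetime.now()), flush=True)
            stop.wait(interval_s)
    except Exception as exc:  # non-critical: log and stop, never re-raise
        print(f"memory monitor stopped: {exc!r}", file=sys.stderr)
```

In a test runner, one would start `monitor` in a daemon `threading.Thread`, run the tests, then set the `stop` event. Running it in-process rather than as a separate shell loop is just one design choice; a background process with an exit status that the workflow ignores would serve the same purpose.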
raytomato (Contributor, Author) commented May 21, 2025

Drop 041050d: https://github.com/apple/axlearn/actions/runs/15154143718/job/42605629824

Peak memory usage is around 71%, slightly lower.

=== Memory Check: 2025-05-21 05:25:29 ===
Memory: 9.8Gi used, 147Mi free, 15Gi total
Memory usage: 62.8%
=== Memory Check: 2025-05-21 05:25:59 ===
Memory: 10Gi used, 135Mi free, 15Gi total
Memory usage: 67.6%
=== Memory Check: 2025-05-21 05:26:29 ===
Memory: 11Gi used, 160Mi free, 15Gi total
Memory usage: 71.0%
=== Memory Check: 2025-05-21 05:26:59 ===
Memory: 11Gi used, 107Mi free, 15Gi total
Memory usage: 71.3%
=== Memory Check: 2025-05-21 05:27:29 ===
Memory: 11Gi used, 182Mi free, 15Gi total
Memory usage: 70.9%
=== Memory Check: 2025-05-21 05:27:59 ===
Memory: 8.8Gi used, 2.5Gi free, 15Gi total
Memory usage: 56.4%

raytomato (Contributor, Author) commented May 21, 2025

Drop 041050d and 8d4dedf: https://github.com/apple/axlearn/actions/runs/15154303034/job/42606047318?pr=1196

=== Memory Check: 2025-05-21 05:38:29 ===
Memory: 11Gi used, 126Mi free, 15Gi total
Memory usage: 74.0%
=== Memory Check: 2025-05-21 05:38:59 ===
Memory: 11Gi used, 158Mi free, 15Gi total
Memory usage: 76.1%

raytomato (Contributor, Author)

These are the only two PRs merged yesterday. The difference in peak memory usage does not seem large enough to cause the failure, which matches what we would expect given what these two commits change.
