Update compiler flags #148
base: main
🚀 Attempted to deploy 🖥️
It takes ~2 hours to build the code... I don't think this is what we expect? Or is there something I'm missing?
It’s possibly just a transient Gadi thing? That is a bit long, but I don’t think it’s related to changing the compiler flags.
!redeploy
I thought the long esmf build issue had already been resolved - is it still a problem, or is there a workaround to speed it up?
With the help of https://github.com/ACCESS-NRI/esmf-trace, I was able to extract timestep-level performance data rather than relying on coarse wall-clock runtime. This allows each test case to be much shorter (only 2 days of simulation for the tests below) while still producing reliable performance statistics. I tested three compiler configurations, each repeated five times.
In each run, the first timestep appears very fast (not shown in the plot below). I’m not entirely certain of the cause, but it likely reflects nuopc bookkeeping (happy to take comments from others on this). The second timestep is then the first step where the full dynamics are active, and it is consistently the slowest “real” step; the fms timing supports this. Similarly, the final timestep is noticeably slower due to I/O.
To avoid biasing the statistics with these startup/shutdown artifacts, I exclude steps 1–2 and the final timestep, leaving 189 consistent timesteps per run for the statistical comparison.
From the box plot, compiler_flag_2 consistently achieves the lowest mean and median timestep runtime, while also showing tighter variance and no strong outliers, in contrast to the other two compiler setups.
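To make the trimming concrete, here is a minimal sketch of the filtering and per-run statistics described above, assuming the per-timestep durations have already been extracted with esmf-trace. The 900 s coupling step is an assumed value, used only to illustrate that a 2-day run would yield 192 timesteps, i.e. 189 after trimming.

```python
from statistics import mean, median, pstdev

def trim_timesteps(durations):
    """Drop timesteps 1-2 (init bookkeeping and the slow first full-dynamics
    step) and the final timestep (slow due to I/O), as described above."""
    return durations[2:-1]

# Assumed 900 s coupling step: a 2-day run gives 192 timesteps,
# and trimming three of them leaves the 189 used for the comparison.
n_steps = 2 * 86400 // 900  # 192
assert len(trim_timesteps(list(range(n_steps)))) == 189

def summarize(durations):
    """Per-run summary statistics over the trimmed timesteps."""
    kept = trim_timesteps(durations)
    return {"mean": mean(kept), "median": median(kept), "std": pstdev(kept)}
```

The same `summarize` would be applied to each of the five repeats per compiler configuration before building the box plot.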
Below is the setup for the three configurations using experiment-generator:

```yaml
model_type: access-om3
repository_url: [email protected]:ACCESS-NRI/access-om3-configs.git
start_point: "7851c1e" # Control commit hash for new branches
test_path: "." # All control and perturbation experiment repositories will be created here; can be relative, absolute or ~ (user-defined)
repository_directory: mom6-cice6-compiler_flags_25km_ryf # Local directory name for the central repository (user-defined)
control_branch_name: ctrl

Control_Experiment:

Perturbation_Experiment:
  Parameter_block_compiler_flags_2days_from_scratch:
    branches:
      - "compiler_flag_1" # default
      - "compiler_flag_2" # pr148-1
      - "compiler_flag_3" # pr148-7
    config.yaml:
      manifest:
        reproduce:
          exe: False
      modules:
        use:
          - - PRESERVE
          - - /g/data/vk83/prerelease/modules
            - /g/data/vk83/modules
          - - /g/data/vk83/prerelease/modules
            - /g/data/vk83/modules
        load:
          - - PRESERVE
          - - access-om3/pr148-1 # not adding model-tools/mppnccombine-fast because of the positional indexing
          - - access-om3/pr148-7 # not adding model-tools/mppnccombine-fast because of the positional indexing
      env:
        ESMF_RUNTIME_PROFILE: "on"
        ESMF_RUNTIME_TRACE: "on"
        ESMF_RUNTIME_TRACE_PETLIST: "0 1 291-292 1975"
        ESMF_RUNTIME_PROFILE_OUTPUT: "SUMMARY"
      repeat: True
      nuopc.runconfig:
        CLOCK_attributes:
          restart_n: 2
          restart_option: ndays
          stop_n: 2
          stop_option: ndays
```
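For reference, the ESMF profiling controls from the env block above can also be exported from a Python launcher script; a minimal sketch, with the values copied verbatim from the config:

```python
import os

# ESMF runtime profiling/tracing controls, as in the env block above.
esmf_env = {
    "ESMF_RUNTIME_PROFILE": "on",             # enable per-PET timing profile
    "ESMF_RUNTIME_TRACE": "on",               # write an event trace
    "ESMF_RUNTIME_TRACE_PETLIST": "0 1 291-292 1975",  # trace only these PETs
    "ESMF_RUNTIME_PROFILE_OUTPUT": "SUMMARY",          # aggregate summary output
}
os.environ.update(esmf_env)
```

Restricting the trace to a small PET list keeps the trace output manageable while still covering representative PETs from each component.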
And this is the yaml plan for the experiment-runner:

```yaml
test_path: /g/data/tm70/ml0072/COMMON/git_repos/access-experiment-generator/performance_runnings_ncmas/om3 # All control and perturbation experiment repositories
repository_directory: mom6-cice6-compiler_flags_25km_ryf # Local directory name for the central repository, where the running_branches are forked from
keep_uuid: True
running_branches: # List of experiment branches to run
  - compiler_flag_1 # default
  - compiler_flag_2 # pr148-1
  - compiler_flag_3 # pr148-7
nruns: # Number of runs for each branch; must match the order of running_branches
  - 5
  - 5
  - 5
startfrom_restart:
  - cold
  - cold
  - cold
```
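Since `nruns` and `startfrom_restart` are positional, a quick sanity check of the plan is easy to sketch. The dict below mirrors the YAML above (a real script would load the file with `yaml.safe_load` instead); `total_runs` is a hypothetical helper, not part of experiment-runner:

```python
# Mirror of the experiment-runner plan above (normally loaded from the YAML).
plan = {
    "running_branches": ["compiler_flag_1", "compiler_flag_2", "compiler_flag_3"],
    "nruns": [5, 5, 5],
    "startfrom_restart": ["cold", "cold", "cold"],
}

def total_runs(plan):
    """Validate positional alignment of the lists, then return the run count."""
    n = len(plan["running_branches"])
    assert len(plan["nruns"]) == n, "nruns must match running_branches order/length"
    assert len(plan["startfrom_restart"]) == n
    return sum(plan["nruns"])

print(total_runs(plan))  # 3 branches x 5 repeats = 15 runs
```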
I would expect all available repro tests had been done using
Ah right - it runs the tests in the ci.json file - see https://github.com/ACCESS-NRI/access-om3-configs/blob/69b60c0c5719270d83a27a8d70b6d67b47e0f62b/config/ci.json#L26. That is some confusing re-use of terminology though!
So can I do
@minghangli-uni - @anton-seaice is right regarding not being able to apply flags in the

Nope, unfortunately you can't choose which tests are run from within the PR.
In the most recent failed build, from It says: @manodeep says:

Even though

Looks like the line is calling the archiver. Instead of

Can you test @minghangli-uni? It looks like it should be

(and is maybe missing from https://github.com/ACCESS-NRI/spack-config/blob/main/common/gadi/linux/compilers.yaml?) But the best way to test it is not clear to me. There is a CMAKE_AR variable, but I don't know how to set that from a
Both It seems likely that the previous failure was still caused by the filesystem rather than the build setup itself?


Same flags as in #82 (comment)
🚀 The latest prerelease access-om3/pr148-18 at 927e963 is here: #148 (comment) 🚀