Long initialization times for GFS forecasts #3027
Replies: 65 comments
-
Related: Issue #2755
-
I was able to replicate @GeorgeVandenberghe-NOAA's set-up for the GFS forecast: all of my task counts match, except for the mediator PETs, which I set to 1200 instead of 1600 due to a copying mistake on my end. I don't get George's speed for the full forecast or for the initialization time: George had a 425 s init time, mine was 510 s. I have another run going so we can see whether that's runtime variability or something associated with the two different runs. Some differences between our runs:
-
@JessicaMeixner-NOAA For us to investigate the issue, would you please provide a GFS forecast run directory and model version? Thanks
-
The global-workflow is the best place to get run directories. Unfortunately its develop branch cannot update the UFS right now, so I'm working off of branches of both g-w and the UFS (to include bug fixes we want in the retro test while they make their way through the queue). It's my understanding that you want something from develop branches, so we'll have to wait a while for that. If someone is okay with branches, I'm running this g-w branch: https://github.com/JessicaMeixner-NOAA/global-workflow/tree/retrotest12 with the yaml file: https://github.com/JessicaMeixner-NOAA/global-workflow/blob/retrotest12/dev/ci/cases/gfsv17/C1152mx025_S2SW.yaml if anyone wants to recreate things. I hopefully will have some updates with Dusan's trace output here soon.
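For anyone recreating this, checking out that branch and case file could look like the following (URLs and paths are from the links above; the local directory name is just illustrative):

```sh
# Clone the g-w branch used for these runs
git clone --recursive -b retrotest12 \
    https://github.com/JessicaMeixner-NOAA/global-workflow gw-retrotest12

# The CI case describing the C1152mx025 S2SW configuration
cat gw-retrotest12/dev/ci/cases/gfsv17/C1152mx025_S2SW.yaml
```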
-
EnKF forecast directory: /lfs/h2/emc/stmp/jessica.meixner/RUNDIRS/trace03/es.2024111506/enkfgdasefcs001.2024111506/fcst.141668
GDAS forecast directory: /lfs/h2/emc/stmp/jessica.meixner/RUNDIRS/trace03/gdas.2024111506/gdasfcst.2024111506/fcst.3289122
Both have trace outputs generated with @DusanJovic-NOAA's branch of the UFS.
EnKF trace output:
GDAS trace output:
2 min of the 14 is spent waiting for MOM6 to initialize for the EnKF. Not sure if this is something that could be improved on, @jiandewang @sanAkel. It is also long in the GDAS forecast, but there only ~35 seconds of it has the other components waiting. The mediator has about 30 seconds where it's initializing data at the very end; my guess is that can't be moved up in any way. There's also some time between the model components' field realize and the "Data Initialize" routine that I'm not sure about, because it seems empty. There's also lots of time at the end of the forecast waiting for the write grid component to finish. In the middle of the runs you can see where we are waiting for restarts to be written for all of the components. I will hopefully have this for the GFS forecast soon as well.
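For reference, this kind of ESMF profile/trace output is normally enabled through ESMF's runtime environment variables; a minimal sketch (where these get set in the job card depends on the workflow):

```sh
# Set in the job environment before launching the model executable
export ESMF_RUNTIME_PROFILE=ON             # collect per-region timings
export ESMF_RUNTIME_PROFILE_OUTPUT=SUMMARY # write ESMF_Profile.summary
export ESMF_RUNTIME_TRACE=ON               # emit binary trace events for viewers like the one used above
```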
-
Looking back at some of my other ESMF profile summaries, for example here: /scratch3/NCEPDEV/climate/Jessica.Meixner/scalingoutput_20250728/time05/gdas.20241115/12/model/atmos/history/ESMF_Profile.summary it looks like increasing the number of nodes for MOM6 could help decrease the initialization time. That being said, I don't think MOM6 needs more for the run time, but if it could help with init time, that would be good.
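A quick way to pull the per-component Init timers out of one of those summaries (plain text search; the exact region names are whatever appears in the file):

```sh
SUMMARY=/scratch3/NCEPDEV/climate/Jessica.Meixner/scalingoutput_20250728/time05/gdas.20241115/12/model/atmos/history/ESMF_Profile.summary
# Show timing lines for initialization regions
grep -E 'Init' "$SUMMARY" | head -40
```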
-
I'm afraid there's not much that can be done given the way things are set up:
-
I'm seeing mixed results with the ocean initialization. @sanAkel, what have you seen reduce the ocean initialization time: fewer or more PETs? How do we toggle the restart splitting? We can try to see if that impacts the model performance.
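On the restart-splitting question: in stock MOM6 this is normally controlled by the PARALLEL_RESTARTFILES entry in MOM_input (worth verifying against MOM_parameter_doc for the version in use); a sketch of how to check a run directory:

```sh
# From the ocean run directory: see how restart writing is configured
grep -i PARALLEL_RESTARTFILES MOM_input MOM_parameter_doc.all 2>/dev/null
# True  -> restarts are split across multiple files (per IO group/PE)
# False -> a single combined restart file is written
```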
-
The reality on restarts is that the wave model takes longer to write them, so that would only help for EnKF jobs. The work it would take to improve restart performance in the wave model is not realistic for v17, so that's not currently being pursued.
-
Suggest increasing cores.
-
Any node counts in mind? Doubling from 100 to 200 had no impact on initialization time in my recent runs.
-
The optimal number of cores for OCN is between 300 and 360, with not much runtime change between those numbers. The optimal number for ICE is about 240. The optimal number for WAVE is between 3500 and 4000 single-thread ranks or 7000 double-thread ranks. We have not looked at core counts beyond about 60,000 for ATM, but we are likely not at the pure ATM scalability limits if history file and restart time can be eliminated.
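For context, these per-component counts show up in ufs.configure as PET ranges; checking what a given run actually used is a one-liner (the bounds shown below are hypothetical, just sized to match the counts above):

```sh
# Inspect the component PET layout of a run directory
grep petlist_bounds ufs.configure
# Hypothetical output consistent with the counts above:
#   ATM_petlist_bounds: 0 7679
#   OCN_petlist_bounds: 7680 8039    # 360 OCN PETs
#   ICE_petlist_bounds: 8040 8279    # 240 ICE PETs
#   WAV_petlist_bounds: 8280 11779   # 3500 WAV PETs
```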
-
@GeorgeVandenberghe-NOAA - Those are optimal numbers for the GFS forecast, correct? We have 3 different forecasts: gfs, gdas, and enkfgdas (which doesn't even have waves).
-
These are optimal numbers for the C1152 GFS forecast.
-
So I've looked into my logs a bit more and I need to do a clean test, but it might simply be that the latest MOM6 update tripled MOM6's initialization time. If that's the case, then hopefully that's something that can be tracked down and improved. FYI: @sanAkel @jiandewang
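One way to sanity-check the "MOM6 update tripled the init time" hypothesis before the clean test is to compare the OCN init timers between the two runs' profile summaries (generic text search; the run directory names below are hypothetical):

```sh
# Compare OCN initialization timings before/after the MOM6 update
for d in run_before_update run_after_update; do
  echo "== $d =="
  grep -E 'OCN' "$d"/ESMF_Profile.summary | grep -i init
done
```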
-
@DusanJovic-NOAA - I was reading through #1554 and was curious if you've already tested the reproducibility and other operational requirements with these new build options? I'm assuming they should all pass, but thought I'd double check.
-
I guess I'm just trying to follow up on why we didn't adopt this before for all components.
-
I checked the reproducibility while I was testing the gfs fcst configuration. Output files were identical between successive runs. I also ran a full regression test on Ursa and all tests passed against newly created baselines. New baselines were needed for all Intel non-debug tests.
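For reference, the regression-test run Dusan describes is driven by rt.sh in the tests/ directory of ufs-weather-model; a typical baseline-creating invocation looks roughly like this (flags per rt.sh's usage message; <account> is site-specific):

```sh
cd ufs-weather-model/tests
# -a: batch account, -c: create new baselines, -l: list of tests to run
./rt.sh -a <account> -c -l rt.conf
```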
-
Long ago, in the days of yore, when we could add
-
*Do I need to add -DFASTER=ON to the cmake below?*

```sh
cmake .. -DAPP=S2SWA -D32BIT=ON \
  -DCCPP_SUITES=FV3_GFS_v17_p8_ugwpv1,FV3_GFS_v17_coupled_p8_ugwpv1,FV3_global_nest_v1 \
  -DPDLIB=ON -DCMAKE_BUILD_TYPE=Release -DMPI=ON
make -j 8 VERBOSE=1
```

This is how I build the UFS:

```sh
git clone --recursive https://github.com/DusanJovic-NOAA/ufs-weather-model ufast
cd ufast
git checkout faster
cd modulefiles ; module use `/bin/pwd` ; module load ufs_wcoss2.intel ; cd ..
rm -rf build ; mkdir build ; cd build
cmake .. -DAPP=S2SWA -D32BIT=ON \
  -DCCPP_SUITES=FV3_GFS_v17_p8_ugwpv1,FV3_GFS_v17_coupled_p8_ugwpv1,FV3_global_nest_v1 \
  -DPDLIB=ON -DCMAKE_BUILD_TYPE=Release -DMPI=ON
make -j 8 VERBOSE=1
```
-
Yes, just add that to what you already have in the cmake command.
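A quick way to confirm the option took effect after re-running cmake (a generic CMake cache check, assuming FASTER is defined as an ordinary CMake option):

```sh
# From the build directory: the cache entry should show the new value
grep -i '^FASTER' CMakeCache.txt
# e.g. FASTER:BOOL=ON
# A VERBOSE=1 build will also show any extra optimization flags on the compile lines.
```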
-
```sh
cmake .. -DFASTER=ON -DAPP=S2SWA -D32BIT=ON \
  -DCCPP_SUITES=FV3_GFS_v17_p8_ugwpv1,FV3_GFS_v17_coupled_p8_ugwpv1,FV3_global_nest_v1 \
  -DPDLIB=ON -DCMAKE_BUILD_TYPE=Release -DMPI=ON
```

So I do this in red?
-
I don't see the red marking, but that cmake command should do it.
-
@JessicaMeixner-NOAA I was thinking of converting this issue to a GitHub Discussion. It seems more like a general discussion and Q&A around the long initialization times rather than an issue proposing a fix. If there is a PR planned to address the long initialization times, however, I could see keeping it as an issue. What do you think?
-
@dpsarmie - should we close this issue or convert it to a discussion?
-
Keep it as an issue until we have applied all of the fixes we intend to apply.
-
@GeorgeVandenberghe-NOAA In that case it sounds like we should be converting this to a discussion, and when developers are putting in specific fixes, they should open an issue with the specific problem/solution. It's not clear from this issue description what specific fixes are proposed. It's also not clear that they would be applied directly in the ufs-weather-model repository (components and apps are both discussed here to varying degrees). Normally a GitHub Issue body should contain some combination of description/solution/alternatives/steps for reproducing. This is a great conversation, but it's not really outlining in a clear or concise manner which specific problem(s) will be addressed (and what solutions are proposed). It's a robust discussion, but a discussion nonetheless.
-
@JessicaMeixner-NOAA I just noticed that moving this to a discussion closed this on your GFSv17 project board. You may have to create specific issues for the planned work and add them to your board, since I'm pretty sure Discussions can't be added to a project. If that's going to be a problem, I can convert this back to an issue, but as a general practice, I think we'd like to aim for conversations (even these more technical ones) to reside under Discussions and for Issues to be proposed code changes.
-
@gspetro-NOAA I'll go ahead and link PRs/issues in this discussion related to the GFS runtime so that there's a git history.
-
Here are the relevant PRs and issues that were worked on in order to resolve this runtime issue:

*Adding the -O3 compiler option for all subroutines*
The "-DFASTER" option was available for some but not all subroutines.

*Allow FV3 to use saved route handles*
FV3 has the capability to generate and use saved route handles, but the required option was not available in the namelist templates. The option was added and an RT was changed to use the route-handle functionality for testing purposes.

*Change the number of PETs in the mediator*
Testing has shown that we can increase the number of PETs for CMEPS to ~4800. Previously there was a hard cap of 1200 (300 per thread x 4 threads). For the GFSv17 configuration on WCOSS2, 4800 cores was found to be the optimal number; increasing the number of PETs above 4800 created a slowdown.

*Remove unused fields in UFSATM module_cplfields.F90*
A speedup in initialization was also seen when removing unused fields in CMEPS.

*Route handles*
CMEPS also has the capability to generate and use saved route handles, similar to FV3. However, testing showed a slowdown in overall runtime even though there was a reduction in CMEPS initialization time. This is currently being worked on by folks at EMC and ESMF.
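To check whether a given run picked up the larger mediator decomposition, the MED PET range in ufs.configure is the place to look (the bounds below are hypothetical, sized for 4800 mediator PETs):

```sh
grep MED_petlist_bounds ufs.configure
# e.g. MED_petlist_bounds: 0 4799   # 4800 mediator PETs
```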
-
In GFS, we have 3 forecasts: enkfgdas (C384mx025 S2S), gdas (C1152mx025 S2SW), and gfs (C1152mx025 S2SW). For the enkfgdas and gdas forecasts we can hit our targeted resource/time (although more improvements are always welcome!). However, for GFS we are still working towards reducing the runtime. In particular, we have noticed that the initialization time seems long.
For GFS forecasts:
Depending on the PET counts, I have had initialization times range from 325 to 556 s (5.4 to 9.2 min). (Note: the PET counts for various components changed between these runs, likely contributing to the variation.)
For GDAS forecasts:
This is the same set-up as the GFS forecast, although the PET counts and the desired forecast length are different.
GDAS initialization times in recent runs vary from 251 to 282 s (over 4 minutes).
For enkfgdas forecasts:
Initialization times were seen from 133 to 235 s. This configuration has lower atmosphere model resolution and no waves.
Data Locations
The original locations of these runs are on WCOSS2 and can be found at: /lfs/h2/emc/ptmp/Jessica.Meixner/comroot
I have copied the ESMF profiles, forecast logs and a few of the configuration files to Ursa here: /scratch3/NCEPDEV/climate/Jessica.Meixner/scalingoutput_20250728
Some Notes:
Next steps for me: