Integrating inline ML emulators with Fortran E3SM #7545
Replies: 8 comments 19 replies
-
To start the discussion, here is a core dump of what I know. I am just an old scientist, so I may be horribly uninformed. Please correct me and add.
Here is what I have learned. Examples I know of, and some comments:
- pytorch-fortran: a Fortran wrapper for reading TorchScript files, but it has not been updated in a while.
- FTorch (Cambridge): actively developed now, and for ESMs (there is now an implementation in CESM).
- [Fiats](https://github.com/berkeleylab/fiats) (Berkeley): this can do training in Fortran (can it read networks trained in other frameworks like PyTorch?).

Do people know of other methods? Who would like to share what they are doing? What else is going on? What methods are people using? Can we develop a pro/con list and some broad consensus on a standard way forward? For inference, is FTorch the way to go? (Training could be done with Fiats.)
-
Thanks for starting the discussion @andrewgettelman. My objective is to train NNs to emulate some of our physics processes in ELM-FATES. Photosynthesis, for example, is our most expensive calculation, so I'd like to see if I can speed up our model by having an alternative NN-based solve. So far I have used PyTorch to train a few different network architectures. In this vignette I:
The step I'm currently working on is to export the trained model from PyTorch and then import it into ELM-FATES for the inference (forward) steps. @rouson and I plan to intercompare FTorch and Fiats.
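For the Fortran side of that handoff, here is a rough sketch of what the inference call might look like with FTorch, loosely following its SimpleNet example. The model file name and array sizes are placeholders, and the exact procedure names and argument lists differ between FTorch releases (older versions use `torch_module_*` rather than `torch_model_*`), so treat this as illustrative rather than a working implementation:

```fortran
! Illustrative sketch only: placeholder file name and shapes, and FTorch
! procedure names/arguments vary by release.
program fates_nn_inference_sketch
   use, intrinsic :: iso_fortran_env, only: real32
   use ftorch   ! provides torch_model, torch_tensor, torch_kCPU, ...
   implicit none

   type(torch_model) :: model
   type(torch_tensor), dimension(1) :: inputs, outputs
   integer, parameter :: layout(1) = [1]            ! map Fortran dims onto the Torch tensor
   real(real32), dimension(8), target :: x = 0.0    ! placeholder input features
   real(real32), dimension(1), target :: y          ! placeholder prediction

   ! Load the TorchScript file exported from PyTorch (path is a placeholder).
   call torch_model_load(model, "photosynthesis_net.pt", torch_kCPU)

   ! Wrap the Fortran arrays as Torch tensors and run one forward pass.
   call torch_tensor_from_array(inputs(1),  x, layout, torch_kCPU)
   call torch_tensor_from_array(outputs(1), y, layout, torch_kCPU)
   call torch_model_forward(model, inputs, outputs)

   print *, "NN prediction:", y

   call torch_delete(inputs(1))
   call torch_delete(outputs(1))
   call torch_delete(model)
end program fates_nn_inference_sketch
```

On the Python side the corresponding step is just saving the trained network as TorchScript (torch.jit.script or torch.jit.trace, then saving the scripted module), which is the format FTorch loads.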
-
EAMxx is not Fortran, but just to add a data point: we are planning to add ML emulators to EAMxx, and our planned pipeline is to use models saved as TorchScript files. We are planning to support embedding PyTorch models as well as C++ models generated by "translating" a TorchScript file into an equivalent Kokkos implementation (via the LAPIS package). So, nothing to share on our end for the Fortran-Torch conversation; I just wanted to note that on the C++ atmosphere implementation we are also looking into using (Py)Torch.
-
@ambrad made the good point that FTorch is a wrapper around libtorch, which provides a C++ API. So you only have to build against one thing (libtorch), and both the C++ and Fortran parts of the model could use the ML pieces.
-
Is there any chance we could move this to a call? There's a lot of ground to cover, and I think that might be more efficient than recapitulating everything that my collaborators and I have written elsewhere. If anyone is interested, just click 'like' on this comment and I'll reach out to schedule. We have several papers that include comparisons between three all-Fortran solutions and three Fortran APIs with C++ back ends, including FTorch. The introduction to our recent workshop paper [1] covers this and includes the use of Fiats for inference with a neural network trained in PyTorch and exported via nexport. Additional examples related to training and inference in the context of atmospheric simulations are in a Jupyter notebook that will appear in conference proceedings soon [2]. Finally, there's a Journal of Open Source Software paper in review [3].
-
Re: Fortran standards. As a practical matter, we are limited to whatever is supported by all of the default compilers used across our supported production and test machines.
-
Okay, I've almost got an FTorch example with EAM running, but I could use a bit of help, so I will ask generally for those who have worked with FTorch (especially hoping @jonbob, @rgknox and/or @andrewdnolan can comment). It seems that I am not able to pass data correctly between the FTorch Fortran code and the libtorch C++ code. When I try to pass the device name/number (torch_kCPU) it gets corrupted and then the libtorch code dies. It seems like I have not compiled FTorch correctly. Would anyone be willing to share how they did it? What I did was this: [...] But I had to update the definition of [...] to [...]. What did I not do correctly here?
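For reference, the data-passing step in the upstream FTorch examples generally looks like the sketch below: the device constant (torch_kCPU) is taken from the ftorch module itself, and a layout array tells FTorch how the column-major Fortran dimensions map onto the Torch tensor. The shapes here are made up, and the procedure names and argument lists vary between FTorch versions, so this is only a sketch of the pattern, not a diagnosis of the build problem:

```fortran
! Sketch of passing a rank-2 Fortran array through FTorch; shapes are arbitrary
! and the interfaces may differ slightly between FTorch releases.
subroutine call_net_2d(model, in_data, out_data)
   use, intrinsic :: iso_fortran_env, only: real32
   use ftorch   ! torch_kCPU and the tensor/model wrappers come from this module
   implicit none
   type(torch_model), intent(in) :: model
   real(real32), intent(in),  target :: in_data(:,:)    ! e.g. (ncol, nfeatures)
   real(real32), intent(out), target :: out_data(:,:)   ! e.g. (ncol, noutputs)

   type(torch_tensor), dimension(1) :: inputs, outputs
   integer, parameter :: layout(2) = [1, 2]   ! keep Fortran dimension order

   ! torch_kCPU is a device-type constant defined by the ftorch module; use it
   ! from there rather than redefining it locally.
   call torch_tensor_from_array(inputs(1),  in_data,  layout, torch_kCPU)
   call torch_tensor_from_array(outputs(1), out_data, layout, torch_kCPU)
   call torch_model_forward(model, inputs, outputs)

   call torch_delete(inputs(1))
   call torch_delete(outputs(1))
end subroutine call_net_2d
```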
-
Just an update. With a bunch of help (from @singhbalwinder, @mahf708, @rgknox, Olawale, and ChatGPT) I was able to get FTorch running in EAM with the latest E3SM master: https://github.com/andrewgettelman/E3SM/tree/SimpleNet_interim_v2 This tag has a version of the FTorch 'SimpleNet' example installed inside the P3 microphysics in EAM. It loads a model in the init step, then calls it at run time. It's for GNU and CPU only right now (that was how I was able to get everything to build and link). FTorch seems to be a bit dependent on compiler and library versions, but if I can eventually sort that out, someone who knows what they are doing can probably produce a more robust solution. I will integrate this with the warm-rain networks for cloud microphysics in E3SM, and then we should be able to make a network available to both the Fortran and C++ code, as discussed with @agsalin.
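For anyone wanting to mirror that init/run split, the sketch below shows the general shape of the pattern: a module-level FTorch model handle loaded once at initialization and reused every physics time step. The module and routine names are invented for illustration (they are not the names used in the tag above), and FTorch procedure names vary between releases:

```fortran
! Hypothetical module: names are illustrative only, not taken from the
! SimpleNet_interim_v2 tag. Shows "load at init, call at run time".
module p3_nn_sketch
   use, intrinsic :: iso_fortran_env, only: real32
   use ftorch
   implicit none
   private
   public :: p3_nn_init, p3_nn_run

   type(torch_model), save :: nn_model     ! persists between time steps
   logical, save :: nn_loaded = .false.

contains

   subroutine p3_nn_init(model_path)
      ! Called once from the physics init phase.
      character(len=*), intent(in) :: model_path
      call torch_model_load(nn_model, trim(model_path), torch_kCPU)
      nn_loaded = .true.
   end subroutine p3_nn_init

   subroutine p3_nn_run(x, y)
      ! Called each time step; x and y are placeholder 1-D state vectors.
      real(real32), intent(in),  target :: x(:)
      real(real32), intent(out), target :: y(:)
      type(torch_tensor), dimension(1) :: inputs, outputs
      integer, parameter :: layout(1) = [1]

      if (.not. nn_loaded) error stop "p3_nn_init must be called before p3_nn_run"

      call torch_tensor_from_array(inputs(1),  x, layout, torch_kCPU)
      call torch_tensor_from_array(outputs(1), y, layout, torch_kCPU)
      call torch_model_forward(nn_model, inputs, outputs)

      call torch_delete(inputs(1))
      call torch_delete(outputs(1))
   end subroutine p3_nn_run

end module p3_nn_sketch
```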
-
Several groups have, or are developing, implementations that use machine learning emulators (neural networks) in the Fortran E3SM code. The E3SM project is focusing on the C++ code bases for emulation; there is, however, a need for projects using Fortran E3SM to be able to add ML components. Ideally these could be added in a format that the C++ code could also use (the same emulator, but with a C++ interface).
What are people doing in this regard?
There has been some discussion of ongoing efforts, and this thread is intended as a central place to collect and share information on what is happening.