
Conversation

@bartgol (Contributor) commented Dec 3, 2025

Add a version of the PyTorch emulator CldFracNet that is purely C++, using LAPIS-generated source code.

[BFB]


A few comments:

  • I extracted the ML version of cldfrac into its own process, with a different name. It is NOT to be used in real applications; it is just an example of how to wrap PyTorch models in an atm process. In fact, it is only compiled when tests are enabled.
  • The compare-nc-files script has been upgraded to accept a tolerance (see the sketch after this list). This is needed because the Kokkos-generated model gives slightly different answers than the py-based one, since I don't have control over how PyTorch optimizes its execution (or at least I don't know how to control it).
  • I included the py script I used to convert the PyTorch model to cpp. It requires LAPIS to be installed. The instructions on the LAPIS repo are quite simple, and I was able to install and run it flawlessly.
  • I thought about generating cld_frac_net.*pp on the fly, but that requires an installation of LAPIS, which our CI does not have. More importantly, the LAPIS installation requires a Kokkos installation, which we don't have at config time (yet). We can think about how to pipeline this phase a bit better, if this is deemed a no-no. For now, those files are added to the repo. They add up to ~500kb of extra storage. Obviously, reviewers should not pay attention to these two files, as they are auto-generated.
  • The ML tests are only enabled on CPU, for now.
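
For illustration, below is a minimal sketch of what the tolerance-based comparison mentioned above can look like. This is not the actual compare-nc-files implementation (variable selection and file handling in the real script differ); it only shows the idea of replacing a bit-for-bit check with np.allclose:

```python
# Minimal sketch of a tolerance-based comparison of two NetCDF files.
# NOT the actual compare-nc-files script; it only illustrates replacing
# an exact (BFB) check with a relative/absolute tolerance.
import sys

import numpy as np
from netCDF4 import Dataset

def compare_nc_files(file1, file2, tol=0.0):
    """Return True if all common numeric variables agree within the given tolerance."""
    ok = True
    with Dataset(file1) as ds1, Dataset(file2) as ds2:
        common = set(ds1.variables) & set(ds2.variables)
        for name in sorted(common):
            v1 = np.asarray(ds1.variables[name][:])
            v2 = np.asarray(ds2.variables[name][:])
            if not np.issubdtype(v1.dtype, np.number):
                continue  # skip non-numeric variables (e.g. strings)
            if v1.shape != v2.shape:
                print(f"{name}: shape mismatch {v1.shape} vs {v2.shape}")
                ok = False
                continue
            if not np.allclose(v1, v2, rtol=tol, atol=tol):
                worst = np.max(np.abs(v1 - v2))
                print(f"{name}: max abs diff {worst} exceeds tol {tol}")
                ok = False
    return ok

if __name__ == "__main__":
    f1, f2, tol = sys.argv[1], sys.argv[2], float(sys.argv[3])
    sys.exit(0 if compare_nc_files(f1, f2, tol) else 1)
```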

@bartgol self-assigned this Dec 3, 2025
@bartgol added the BFB (PR leaves answers BFB), EAMxx (C++ based E3SM atmosphere model, aka SCREAM), and AI and emulators labels Dec 3, 2025
@mahf708 (Contributor) left a comment

I would find a way to remove these files from this PR:
cld_frac_net.cpp
cld_frac_net.hpp
cld_frac_net_weights.pth

Additionally, since this is a demo, it would be nice to add some docs, and to give users some details about what the two LAPIS cmd calls are doing (e.g., would one need different calls for each arch? would one need some sort of clang/llvm installation to deal with the MLIR stuff?)

The rest is fine

@bartgol (Contributor, Author) commented Dec 3, 2025

Agreed regarding the docs, I need to add an md section in the dev docs folder. Regarding the files to remove, I am not entirely sure how to proceed. The hpp/cpp files could in principle be regenerated during the config phase, provided that LAPIS is already installed and available (not that heavy to install, unless LLVM/MLIR also need to be installed). The weights file cannot be generated, and would have to be saved in some input data location. I am not sure if that's the optimal solution though. I have to think about this. What were you envisioning?

@mahf708 (Contributor) commented Dec 3, 2025

Weights (and/or model architectures) should be saved externally. The inputdata server is an option. We can also think about saving them to Hugging Face or some special repo with git-lfs. For now, I think the easiest thing is the inputdata server.

For the hpp/cpp files, I think generating them on-the-fly is likely the only option. I would be against integrating low-level stuff into the repo (but I am only one person, so happy to be overridden). You can imagine how much of this stuff would end up in the repo... If someone wants to run this test/demo, I would force the logic in cmake to trigger complaints if LAPIS isn't available and abort. Another potential option is making a submod for this low-level stuff? Idk, but this is a pretty serious downside of LAPIS as a framework for stitching stuff together (it should be packaged and vendored into a small binary or python package, maybe)
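
To make the "complain and abort" idea concrete, a rough sketch is below. It is written as a stand-alone Python check rather than CMake (the actual logic would live in CMake via find_program/find_package and message(FATAL_ERROR ...)); the lapis-opt executable and lapis module names are hypothetical placeholders, since the real entry points depend on how LAPIS is installed:

```python
# Rough sketch of a config-time check that aborts when LAPIS is not available.
# The executable/module names below are HYPOTHETICAL placeholders; the real
# entry points depend on the LAPIS installation. In practice this logic would
# live in CMake (find_program / message(FATAL_ERROR ...)).
import importlib.util
import shutil
import sys

def require_lapis():
    # Option 1: look for a command-line tool on PATH (name is a placeholder).
    exe = shutil.which("lapis-opt")
    # Option 2: look for an importable python module (name is a placeholder).
    mod = importlib.util.find_spec("lapis")
    if exe is None and mod is None:
        sys.exit(
            "ERROR: the ML cldfrac demo requires LAPIS to regenerate "
            "cld_frac_net.hpp/cpp, but no LAPIS installation was found. "
            "Install LAPIS (and its LLVM/torch_mlir dependencies) or "
            "disable the ML tests."
        )
    return exe or mod.origin

if __name__ == "__main__":
    print(f"Found LAPIS at: {require_lapis()}")
```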

@mahf708 (Contributor) commented Dec 3, 2025

one other (less ideal) option: just write docs for how to generate these files and thus run the test/demo

@bartgol (Contributor, Author) commented Dec 3, 2025

LAPIS per se is a relatively small package to build/install (about 30 object files), but it does seem to require some peculiar versions of LLVM and torch_mlir, which forces one to manually build those (adding up to about 5k files to build, or about 20 min). I tried to use LLVM modules already installed on our systems, including some pip-installed torch_mlir, but they don't seem to work. I have to ping the LAPIS devs to see if they are working to fix this, or if we are creating a dependency on very specific versions of these (large) libs.

Meanwhile, we could install the desired LLVM version in our CI container, along with LAPIS, and generate the hpp/cpp on the fly (using the py script shipped in this PR). Looking at the full LAPIS installation on my workstation, the folder is ~3.5G (most of which is LLVM), so not a huge increase in the CPU container size. As for the weights file, I suppose the input data server is probably the most reasonable location.

I'm going to talk to the LAPIS folks regarding the llvm/torch-mlir issues, and see if this can be simplified. As it stands, the manual LLVM installation process is relatively simple (just follow exactly what their readme file says), but it may be hard to maintain in the long term, or may cause issues if people already have llvm/torch installations that conflict (though the py venv should help with this).

@mahf708 (Contributor) commented Dec 3, 2025

One thing to bring up: they should be able to package all of these peculiar details into a tiny python package. This is pretty doable and arguably the best thing to do here. It may require some python packaging wizardry (something I would be happy to help with, along with my bots). The idea is: package as much of this (including the cli stuff) behind python, and then keep the extensions (i.e., compiled binaries) hidden. As long as LAPIS only requires those hidden libs, all will work super smoothly, but if it ever needs to interact with other compiled stuff, things can get tricky (this last part was one of the main motivations for conda, as a pip alternative, fwiw). I think this should've been the way... let me know what you gather from them, and we can explore our own path for this PR if they are still unfunded
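
For the sake of illustration, here is a rough sketch of the kind of thin wrapper such a package could expose. The package, binary, and function names are all hypothetical and do not correspond to actual LAPIS code; this only illustrates the "keep the compiled pieces hidden behind python" idea:

```python
# Hypothetical sketch of a thin python wrapper around a bundled compiled tool.
# Nothing here corresponds to actual LAPIS code; package, binary, and function
# names are made up to illustrate the packaging idea.
import subprocess
from importlib import resources

def convert_model(torch_script_path, output_cpp_path):
    """Run the vendored converter binary shipped inside the package."""
    # The compiled binary is shipped as package data, invisible to users.
    exe = resources.files("lapis_pkg").joinpath("bin/lapis-convert")
    subprocess.run(
        [str(exe), str(torch_script_path), "-o", str(output_cpp_path)],
        check=True,
    )
```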
