
Conversation

@bartgol (Contributor) commented Dec 3, 2025

Add a version of the PyTorch emulator CldFracNet that is purely C++, using LAPIS-generated source code.

[BFB]


A few comments:

  • I extracted the ML version of cldfrac into its own process, with a different name. It is NOT to be used in real applications; it is just an example of how to wrap PyTorch models in an atm process. In fact, it is only compiled when tests are enabled.
  • The compare-nc-files script has been upgraded to accept a tolerance (see the sketch after this list). This is needed because the Kokkos-generated model gives slightly different answers than the py-based one, since I don't have control over how PyTorch optimizes its execution (or at least I don't know how to control it).
  • I included the py script I used to convert the PyTorch model to cpp. It requires LAPIS to be installed. The instructions on the LAPIS repo are quite simple, and I was able to install and run it flawlessly.
  • I thought about generating cld_frac_net.*pp on the fly, but that requires an installation of LAPIS, which our CI does not have. More importantly, the LAPIS installation requires a Kokkos installation, which we don't have at config time (yet). We can think about how to pipeline this phase a bit better, if this is deemed a no-no. For now, those files are added to the repo. They add up to ~500kb of extra storage. Obviously, reviewers should not pay attention to these two files, as they are auto-generated.
  • The ML tests are only enabled on CPU, for now.
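
For illustration, below is a minimal sketch of what the tolerance-based comparison mentioned above can look like. This is not the actual compare-nc-files implementation (variable selection and file handling in the real script differ); it only shows the idea of replacing a bit-for-bit check with np.allclose:

```python
# Minimal sketch of a tolerance-based comparison of two NetCDF files.
# NOT the actual compare-nc-files script; it only illustrates replacing
# an exact (BFB) check with a relative/absolute tolerance.
import sys

import numpy as np
from netCDF4 import Dataset

def compare_nc_files(file1, file2, tol=0.0):
    """Return True if all common numeric variables agree within the given tolerance."""
    ok = True
    with Dataset(file1) as ds1, Dataset(file2) as ds2:
        common = set(ds1.variables) & set(ds2.variables)
        for name in sorted(common):
            v1 = np.asarray(ds1.variables[name][:])
            v2 = np.asarray(ds2.variables[name][:])
            if not np.issubdtype(v1.dtype, np.number):
                continue  # skip non-numeric variables (e.g. strings)
            if v1.shape != v2.shape:
                print(f"{name}: shape mismatch {v1.shape} vs {v2.shape}")
                ok = False
                continue
            if not np.allclose(v1, v2, rtol=tol, atol=tol):
                worst = np.max(np.abs(v1 - v2))
                print(f"{name}: max abs diff {worst} exceeds tol {tol}")
                ok = False
    return ok

if __name__ == "__main__":
    f1, f2, tol = sys.argv[1], sys.argv[2], float(sys.argv[3])
    sys.exit(0 if compare_nc_files(f1, f2, tol) else 1)
```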

@bartgol self-assigned this Dec 3, 2025
@bartgol added the BFB (PR leaves answers BFB), EAMxx (C++ based E3SM atmosphere model, aka SCREAM), and AI and emulators labels Dec 3, 2025
@mahf708 (Contributor) left a comment

I would find a way to remove these files from this PR:
cld_frac_net.cpp
cld_frac_net.hpp
cld_frac_net_weights.pth

Additionally, since this is a demo, it would be nice to add some docs, and to give users some details about what the two LAPIS cmd calls are doing (e.g., would one need different calls for each arch? would one need some sort of clang/llvm installation to deal with the MLIR stuff?)

The rest is fine

@bartgol (Contributor, Author) commented Dec 3, 2025

Agreed regarding the docs, I need to add an md section in the dev docs folder. Regarding the files to remove, I am not entirely sure how to proceed. The hpp/cpp files could in principle be regenerated during the config phase, provided that LAPIS is already installed and available (not that heavy to install, unless LLVM/MLIR also need to be installed). The weights file cannot be generated, and would have to be saved in some input data location. I am not sure if that's the optimal solution though. I have to think about this. What were you envisioning?

@mahf708 (Contributor) commented Dec 3, 2025

Weights (and/or model architectures) should be saved externally. The inputdata server is an option. We can also think about saving them to Hugging Face or some special repo with git-lfs. For now, I think the easiest thing is the inputdata server.

For the hpp/cpp files, I think generating them on-the-fly is likely the only option. I would be against integrating low-level stuff into the repo (but I am only one person, so happy to be overridden). You can imagine how much of this stuff would end up in the repo... If someone wants to run this test/demo, I would force the logic in cmake to trigger complaints if LAPIS isn't available and abort. Another potential option is making a submod for this low-level stuff? Idk, but this is a pretty serious downside of LAPIS as a framework for stitching stuff together (it should be packaged and vendored into a small binary or python package, maybe)
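
To make the "complain and abort" idea concrete, a rough sketch is below. It is written as a stand-alone Python check rather than CMake (the actual logic would live in CMake via find_program/find_package and message(FATAL_ERROR ...)); the lapis-opt executable and lapis module names are hypothetical placeholders, since the real entry points depend on how LAPIS is installed:

```python
# Rough sketch of a config-time check that aborts when LAPIS is not available.
# The executable/module names below are HYPOTHETICAL placeholders; the real
# entry points depend on the LAPIS installation. In practice this logic would
# live in CMake (find_program / message(FATAL_ERROR ...)).
import importlib.util
import shutil
import sys

def require_lapis():
    # Option 1: look for a command-line tool on PATH (name is a placeholder).
    exe = shutil.which("lapis-opt")
    # Option 2: look for an importable python module (name is a placeholder).
    mod = importlib.util.find_spec("lapis")
    if exe is None and mod is None:
        sys.exit(
            "ERROR: the ML cldfrac demo requires LAPIS to regenerate "
            "cld_frac_net.hpp/cpp, but no LAPIS installation was found. "
            "Install LAPIS (and its LLVM/torch_mlir dependencies) or "
            "disable the ML tests."
        )
    return exe or mod.origin

if __name__ == "__main__":
    print(f"Found LAPIS at: {require_lapis()}")
```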

@mahf708 (Contributor) commented Dec 3, 2025

one other (less ideal) option: just write docs for how to generate these files and thus run the test/demo

@bartgol (Contributor, Author) commented Dec 3, 2025

LAPIS per se is a relatively small package to build/install (about 30 object files), but it does seem to require some peculiar versions of LLVM and torch_mlir, which forces one to manually build those (adding up to about 5k files to build, or about 20 min). I tried to use LLVM modules already installed on our systems, including some pip-installed torch_mlir, but they don't seem to work. I have to ping the LAPIS devs to see if they are working to fix this, or if we are creating a dependency on very specific versions of these (large) libs.

Meanwhile, we could install the desired LLVM version in our CI container, along with LAPIS, and generate the hpp/cpp on the fly (using the py script shipped in this PR). Looking at the full LAPIS installation on my workstation, the folder is ~3.5G (most of which is LLVM), so not a huge increase in the CPU container size. As for the weights file, I suppose the input data server is probably the most reasonable location.

I'm going to talk to the LAPIS folks regarding the llvm/torch-mlir issues, and see if this can be simplified. As it stands, the manual LLVM installation process is relatively simple (just follow exactly what their readme file says), but it may be hard to maintain in the long term, or may cause issues if people already have llvm/torch installations that conflict (though the py venv should help with this).

@mahf708 (Contributor) commented Dec 3, 2025

One thing to bring up: they should be able to package all of these peculiar details into a tiny python package. This is pretty doable and arguably the best thing to do here. It may require some python packaging wizardry (something I would be happy to help with, along with my bots). The idea is: package as much of this (including the cli stuff) behind python, and then keep the extensions (i.e., compiled binaries) hidden. As long as LAPIS only requires those hidden libs, all will work super smoothly, but if it ever needs to interact with other compiled stuff, things can get tricky (this last part was one of the main motivations for conda, as a pip alternative, fwiw). I think this should've been the way... let me know what you gather from them, and we can explore our own path for this PR if they are still unfunded
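
For the sake of illustration, here is a rough sketch of the kind of thin wrapper such a package could expose. The package, binary, and function names are all hypothetical and do not correspond to actual LAPIS code; this only illustrates the "keep the compiled pieces hidden behind python" idea:

```python
# Hypothetical sketch of a thin python wrapper around a bundled compiled tool.
# Nothing here corresponds to actual LAPIS code; package, binary, and function
# names are made up to illustrate the packaging idea.
import subprocess
from importlib import resources

def convert_model(torch_script_path, output_cpp_path):
    """Run the vendored converter binary shipped inside the package."""
    # The compiled binary is shipped as package data, invisible to users.
    exe = resources.files("lapis_pkg").joinpath("bin/lapis-convert")
    subprocess.run(
        [str(exe), str(torch_script_path), "-o", str(output_cpp_path)],
        check=True,
    )
```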
