EAMxx: add version of CldFracNet emulator that uses LAPIS-generated code #7917
base: master
…ated c++ emulator
…orch model Requires installation of lapis in the PATH
We need a tolerance based test
mahf708
left a comment
I would find a way to remove these files from this PR:
cld_frac_net.cpp
cld_frac_net.hpp
cld_frac_net_weights.pth
Additionally, since this is a demo, it would be nice to add some docs, and to give users here some details about what the two lapis cmd calls are doing (e.g., would one need different calls for each arch? would one need some sort of clang/llvm installation to deal with the mlir stuff?)
The rest is fine
Agreed regarding the docs; I need to add an md section in the dev docs folder. Regarding the files to remove, I am not entirely sure how to proceed. The hpp/cpp files could in principle be regenerated during the config phase, provided that LAPIS is already installed and available (not that heavy to install, unless LLVM/MLIR also need to be installed). The weights file cannot be generated, and would have to be saved in some input data location. I am not sure if that's the optimal solution though. I have to think about this. What were you envisioning?
Weights (and/or model architectures) should be saved externally. The inputdata server is an option. We can also think about saving them to Hugging Face or some special repo with git-lfs. For now, I think the easiest thing is the inputdata server. For the hpp/cpp files, I think generating them on-the-fly is likely the only option. I would be against integrating low-level stuff into the repo (but I am only one person, so happy to be overridden). You can imagine how much of this stuff would end up in the repo... If someone wants to run this test/demo, I would force the logic in cmake to complain if LAPIS isn't available and abort. Another potential option is making a submodule for this low-level stuff? Idk, but this is a pretty serious downside of LAPIS as a framework stitching stuff together (it should be packaged and vendored into a small binary or python package maybe).
One other (less ideal) option: just write docs for how to generate these files and thus run the test/demo.
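As a rough illustration of the "complain and abort if LAPIS isn't available" idea, here is a minimal Python sketch of a pre-generation check that a config step could run before trying to regenerate the hpp/cpp files. The tool names in `REQUIRED_TOOLS` are hypothetical placeholders (the actual LAPIS executables may be named differently), and the helper names are made up for this example; the only assumption taken from the PR is that lapis must be found in the PATH.

```python
import shutil
import sys

# Hypothetical executable names: replace with whatever the LAPIS install provides.
REQUIRED_TOOLS = ("lapis-opt", "lapis-translate")


def find_missing_tools(tools, which=shutil.which):
    """Return the tools from `tools` that cannot be located on PATH."""
    return [tool for tool in tools if which(tool) is None]


def check_lapis_available(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return 0 if all tools are found, 1 (after printing an error) otherwise."""
    missing = find_missing_tools(tools, which)
    if missing:
        print(
            "ERROR: cannot generate the emulator sources; "
            "missing from PATH: " + ", ".join(missing),
            file=sys.stderr,
        )
        return 1
    return 0
```

A config step could call `sys.exit(check_lapis_available())` from a small script and treat a nonzero exit code as a fatal configuration error. The `which` parameter exists only so the lookup can be stubbed out in tests.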
LAPIS per se is a relatively small package to build/install (about 30 object files), but it does seem to require some peculiar versions of LLVM and torch_mlir, which forces one to manually build those (which add up to about 5k files to build, or about 20min). I tried to use llvm modules already installed on our systems, including some pip installed torch_mlir, but they seem to not work. I have to ping the LAPIS devs to see if they are working to fix this, or if we are creating a dependency on a very specific version of these (large) libs. Meanwhile, we could install the desired LLVM version in our CI container, along with LAPIS, and generate the hpp/cpp on the fly (using the py script shipped in this PR). Looking at the full lapis installation on my workstation, the folder is ~3.5G (most of which is LLVM), so not a huge increase in the CPU container size. As for the weights file, I suppose the input data server is probably the most reasonable location. I'm going to talk to LAPIS folks regarding the llvm/torch-mlir issues, and see if this can be simplified. As it stands, the manual LLVM installation process is relatively simple (just follow exactly what their readme file says), but may be hard to maintain in the long term or may cause issues if people already have llvm/torch installations that conflict (though the py venv should help with this).
One thing to bring up: they should be able to package all of these peculiar details into a tiny python package. This is pretty doable and arguably the best thing to do here. It may require some python package wizardry (something I would be happy to help with, along with my bots). The idea is: package as much of this (including the cli stuff) behind python, and then keep the extensions (i.e., compiled binaries) hidden. As long as LAPIS only requires those hidden libs, all will work super smoothly, but if it ever needs to interact with other compiled stuff, things can get tricky (this last part was one of the main motivations for conda, as a pip alternative, fwiw). I think this should've been the way... let me know what you gather from them, and we can explore our own path for this PR if they are still unfunded
Add a version of the pytorch emulator CldFracNet that is purely c++, using LAPIS-generated source code
[BFB]
A few comments:
The cld_frac_net.*pp files could be generated on the fly, but that requires an installation of lapis, which our CI does not have. But more importantly, the lapis installation does require a kokkos installation, which we don't have at config time (yet). We can think about how to pipeline this phase a bit better, if this is deemed a no-no. For now, those files are added to the repo. They add up to ~500kb of extra storage. Obviously, reviewers should not pay attention to these two files, as they are auto-generated.