Reverted README for now.

vbharadwaj-bk · vbharadwaj-bk · commit 5984d6ded966 · 2025-06-04T22:07:08.000-07:00
diff --git a/.github/workflows/docs.yaml b/.github/workflows/docs.yaml
@@ -0,0 +1,53 @@
+name: Deploy documentation to Github Pages 
+on:
+  workflow_dispatch:
+
+# Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages
+permissions: write-all
+
+# Allow only one concurrent deployment, skipping runs queued between the run in-progress and latest queued.
+# However, do NOT cancel in-progress runs as we want to allow these production deployments to complete.
+concurrency:
+  group: "pages"
+  cancel-in-progress: false
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+      - name: Setup Pages
+        uses: actions/configure-pages@v3
+      - name: Set up Python 3.10
+        uses: actions/setup-python@v5
+        with:
+         python-version: "3.10"
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install sphinx furo
+      - name: Build website
+        run: |
+          cd docs
+          sphinx-build -M dirhtml docs docs/_build
+          
+      - name: Fix permissions
+        run: |
+          chmod -c -R +rX "docs/_build/dirhtml" | while read line; do
+            echo "::warning title=Invalid file permissions automatically fixed::$line"
+          done
+      - name: Upload artifact
+        uses: actions/upload-pages-artifact@v3
+        with:
+          path: './docs/_build/dirhtml'
+  deploy:
+    environment:
+      name: github-pages
+      url: ${{ steps.deployment.outputs.page_url }}
+    runs-on: ubuntu-latest
+    needs: build
+    steps:
+      - name: Deploy to GitHub Pages
+        id: deployment
+        uses: actions/deploy-pages@v4
diff --git a/README.md b/README.md
@@ -2,7 +2,8 @@
 [![OEQ CUDA C++ Extension Build Verification](https://github.com/PASSIONLab/OpenEquivariance/actions/workflows/verify_extension_build.yml/badge.svg?event=push)](https://github.com/PASSIONLab/OpenEquivariance/actions/workflows/verify_extension_build.yml)
 [![License](https://img.shields.io/badge/License-BSD_3--Clause-blue.svg)](https://opensource.org/licenses/BSD-3-Clause)
 
-[[Examples]](#show-me-some-examples) 
+[[Examples]](#show-me-some-examples) [[Installation]](#installation)
+[[Supported Tensor Products]](#tensor-products-we-accelerate)
 [[Citation and Acknowledgements]](#citation-and-acknowledgements)
 
 OpenEquivariance is a CUDA and HIP kernel generator for the Clebsch-Gordon tensor product, 
@@ -128,6 +129,197 @@ print(torch.norm(Z))
 `deterministic=False`, the `sender` and `receiver` indices can have
 arbitrary order.
 
+**New:** If you're working in FP32 precision and want
+higher accuracy during graph convolution, we offer a Kahan 
+summation variant of our deterministic algorithm:
+
+```python
+tp_conv_kahan = oeq.TensorProductConv(problem, torch_op=True, deterministic=True, kahan=True) 
+Z = tp_conv_kahan.forward(X, Y[receiver_perm], W[receiver_perm], edge_index[0], edge_index[1], sender_perm) 
+print(torch.norm(Z))
+```
+
+## Installation 
+We currently support Linux systems only. 
+Before installation and the first library import, 
+ensure that the command 
+`c++ --version` returns GCC 9+; if not, set the
+`CC` and `CXX` environment variables to point to
+valid compilers. On NERSC Perlmutter,
+`module load gcc` will set up your environment
+correctly. 
+
+To install, run
+```bash
+pip install git+https://github.com/PASSIONLab/OpenEquivariance
+```
+After installation, the very first library
+import will trigger a build of a C++ extension we use,
+which takes longer than usual.
+All subsequent imports will not retrigger compilation.
+
+## Replicating our benchmarks 
+To run our benchmark suite, you'll also need the following packages: 
+- `e3nn`, 
+- `cuEquivariance`
+- `cuEquivariance-torch` 
+- `cuEquivariance-ops-torch-cu11` OR `cuEquivariance-ops-torch-cu12` 
+- `matplotlib` (to reproduce our figures) 
+
+You can get all the necessary dependencies via our optional dependencies `[bench]`
+
+```bash
+pip install "git+https://github.com/PASSIONLab/OpenEquivariance[bench]"
+```
+
+We conducted our benchmarks on an NVIDIA A100-SXM-80GB GPU at
+Lawrence Berkeley National Laboratory. Your results may differ 
+a different GPU.
+
+The file `tests/benchmark.py` can reproduce the figures in 
+our paper an A100-SXM4-80GB GPU. 
+Run it with the following invocations: 
+```bash
+python tests/benchmark.py -o outputs/uvu uvu --plot
+python tests/benchmark.py -o outputs/uvw uvw --plot
+python tests/benchmark.py -o outputs/roofline roofline --plot
+python tests/benchmark.py -o outputs/conv conv --plot --data data/molecular_structures
+python tests/benchmark.py -o outputs/kahan_conv kahan_conv --data data/molecular_structures/
+```
+
+If your GPU has limited memory, you might want to try
+the `--limited-memory` flag to disable some expensive
+tests and / or reduce the batch size with `-b`. Run
+`python tests/benchmark.py --help` for a full list of flags.
+
+Here's a set
+of invocations for an A5000 GPU:
+
+```bash
+python tests/benchmark.py -o outputs/uvu uvu --limited-memory --plot
+python tests/benchmark.py -o outputs/uvw uvw -b 25000 --plot
+python tests/benchmark.py -o outputs/roofline roofline --plot
+python tests/benchmark.py -o outputs/conv conv --data data/molecular_structures --limited-memory
+```
+Note that for GPUs besides the one we used in our 
+testing, the roofline slope / peak will be incorrect, and your results
+may differ from the ones we've reported. The plots for the convolution fusion
+experiments also require a GPU with a minimum of 40GB of memory. 
+
+## Testing Correctness
+See the `dev` dependencies in `pyproject.toml`; you'll need `e3nn`,
+`pytest`, `torch_geometric`, and `pytest-check` installed. You can test batch 
+tensor products and fused convolution tensor products as follows:
+```bash
+pytest tests/batch_test.py 
+pytest tests/conv_test.py 
+```
+Browse the file to select specific tests.
+
+## Compilation with JITScript, Export, and AOTInductor 
+OpenEquivariance supports model compilation with
+`torch.compile`, JITScript, `torch.export`, and AOTInductor. 
+Demo the C++ model exports with
+```bash
+pytest tests/export_test.py 
+```
+NOTE: the AOTInductor test (and possibly export) fail 
+unless you are using a Nightly
+build of PyTorch past 4/10/2025 due to incomplete support for 
+TorchBind in earlier versions.
+
+## Running MACE
+**NOTE**: If you're revisiting this page, the repo containing
+our up-to-date MACE integration has changed! See the instructions
+below; we use a branch off a fork of MACE to facilitate
+PRs into the main codebase.
+
+We have modified MACE to use our accelerated kernels instead
+of the standard e3nn backend. Here are the steps to replicate
+our MACE benchmark:
+
+1. Install `oeq` and our modified version of MACE:
+```bash
+pip uninstall mace-torch
+pip install git+https://github.com/PASSIONLab/OpenEquivariance
+pip install git+https://github.com/vbharadwaj-bk/mace_oeq_integration.git@oeq_experimental
+```
+
+2. Download the `carbon.xyz` data file, available at <https://portal.nersc.gov/project/m1982/equivariant_nn_graphs/>. 
+   This graph has 158K edges. With the original e3nn backend, you would need a GPU with 80GB
+   of memory to run the experiments. `oeq` provides a memory-efficient equivariant convolution, so we expect
+   the test to succeed.
+
+3. Benchmark OpenEquivariance: 
+```bash
+python tests/mace_driver.py carbon.xyz -o outputs/mace_tests -i oeq
+```
+
+4. If you have a GPU with 80GB of memory OR supply a smaller molecular graph
+   as the input file, you can run the full benchmark that includes `e3nn` and `cue`: 
+```bash
+python tests/mace_driver.py carbon.xyz -o outputs/mace_tests -i e3nn cue oeq
+```
+
+## Tensor products we accelerate
+
+| Operation                | CUDA     | HIP |
+|--------------------------|----------|-----|
+| UVU                      | ✅        | ✅    |
+| UVW                      | ✅        | ✅    |
+| UVU + Convolution        | ✅        | ✅    |
+| UVW + Convolution        | ✅        | ✅    |
+| Symmetric Tensor Product | ✅ (beta) | ✅ (beta)  |
+
+e3nn supports a variety of connection modes for CG tensor products. We support 
+two that are commonly used in equivariant graph neural networks:
+"uvu" and "uvw". Our JIT compiled kernels should handle:
+
+1. Pure "uvu" tensor products, which are most efficient when the input with higher
+multiplicities is the first argument. Our results are identical to e3nn when irreps in
+the second input have multiplicity 1, and otherwise identical up to a reordering
+of the input weights.
+
+2. Pure "uvw" tensor products, which are currently more efficient when the input with
+higher multiplicities is the first argument. Our results are identical to e3nn up to a reordering
+of the input weights. 
+
+Our code includes correctness checks, but the configuration space is large. If you notice
+a bug, let us know in a Github issue. We'll try our best to correct it or document the problem here.
+
+We do not (yet) support:
+
+- Mixing different instruction types in the same tensor product. 
+- Instruction types besides "uvu" and "uvw".
+- Non-trainable instructions: all of your instructions must have weights associated. 
+
+If you have a use case for any of the unsupported features above, let us know.
+
+We have recently added beta support for symmetric
+contraction acceleration. Because this is a kernel 
+specific to MACE, we require e3nn as dependency
+to run it, and there is currently no support for
+compile / export (coming soon!), we 
+do not expose it in the package
+toplevel. You can test out our implementation by
+running 
+
+```python
+from openequivariance.implementations.symmetric_contraction import SymmetricContraction as OEQSymmetricContraction
+```
+
+## Multidevice / Stream Support
+To use OpenEquivariance on multiple GPUs of a single
+compute node, we currently require that all GPUs 
+share the same compute capability. This is because
+our kernels are compiled based on the shared memory
+capacity of the numerically first visible GPU card. 
+On heterogeneous systems, you can still 
+use OpenEquivariance on all GPUs that match the
+compute capability of the first visible device.
+
+We are working on support for CUDA streams!
+
 ## Citation and Acknowledgements
 If you find this code useful, please cite our paper:
 
@@ -137,7 +329,9 @@ author={Vivek Bharadwaj and Austin Glover and Aydin Buluc and James Demmel},
 title={An Efficient Sparse Kernel Generator for O(3)-Equivariant Deep Networks}, 
 booktitle = {SIAM Conference on Applied and Computational Discrete Algorithms (ACDA25)},
 chapter = {},
-url={https://arxiv.org/abs/2501.13986}
+url={https://arxiv.org/abs/2501.13986},
+publisher={Society for Industrial and Applied Mathematics},
+year={2025}
 }
 ```
 
@@ -154,4 +348,4 @@ Copyright (c) 2025, The Regents of the University of California, through Lawrenc
 
 If you have questions about your rights to use or distribute this software, please contact Berkeley Lab's Intellectual Property Office at IPO@lbl.gov.
 
-NOTICE. This Software was developed under funding from the U.S. Department of Energy and the U.S. Government consequently retains certain rights. As such, the U.S. Government has been granted for itself and others acting on its behalf a paid-up, nonexclusive, irrevocable, worldwide license in the Software to reproduce, distribute copies to the public, prepare derivative works, and perform publicly and display publicly, and to permit others to do so.
+NOTICE. This Software was developed under funding from the U.S. Department of Energy and the U.S. Government consequently retains certain rights. As such, the U.S. Government has been granted for itself and others acting on its behalf a paid-up, nonexclusive, irrevocable, worldwide license in the Software to reproduce, distribute copies to the public, prepare derivative works, and perform publicly and display publicly, and to permit others to do so.