myl7
diff --git a/.gitignore b/.gitignore
Lines changed: 6 additions & 2 deletions
diff --git a/.gitmodules b/.gitmodules
Lines changed: 18 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 6 additions & 0 deletions
diff --git a/doc/bench_aes128_soft.md b/doc/bench_aes128_soft.md
Lines changed: 25 additions & 20 deletions
@@ -3,8 +3,9 @@
 .idea
 
 # Build outputs
-/build
-/cmake-build-*
+build
+cmake-build-*
+target
 
 # Doc outputs
 /doc/html
@@ -18,3 +19,6 @@ superpowers
 # Benchmarks
 perf.data
 perf.data.old
+
+# Bazel symlinks
+/third_party/distributed_point_functions/bazel-*
@@ -0,0 +1,18 @@
+[submodule "libfss"]
+	path = third_party/libfss/libfss
+	url = https://github.com/frankw2/libfss.git
+[submodule "libdpf"]
+	path = third_party/libdpf
+	url = https://github.com/weikengchen/libdpf.git
+[submodule "distributed_point_functions"]
+	path = third_party/distributed_point_functions/distributed_point_functions
+	url = https://github.com/google/distributed_point_functions.git
+[submodule "GPU-DPF"]
+	path = third_party/GPU-DPF/GPU-DPF
+	url = https://github.com/facebookresearch/GPU-DPF.git
+[submodule "EzPC"]
+	path = third_party/EzPC/EzPC
+	url = https://github.com/mpc-msri/EzPC.git
+[submodule "fss-v0.7.0"]
+	path = third_party/fss-v0.7.0/fss-v0.7.0
+	url = https://github.com/myl7/fss.git
@@ -255,6 +255,12 @@ See `samples/dpf_dcf_gpu.cu` for the complete working example.
 
 You may see warnings like "integer constant is so large that it is unsigned" during compilation. These cannot be easily suppressed but are harmless and can be safely ignored.
 
+### nvcc 12.8: `Uint` as a `__global__` kernel template argument
+
+With nvcc 12.8, compilation fails when `fss::group::Uint<__uint128_t, ...>` is used as a template argument to a `__global__` kernel: the generated host stub file contains a 128-bit integer literal that g++ cannot parse. `__device__` functions are not affected (no stub is generated for them).
+
+Workaround: wrap the type in a plain aggregate struct that satisfies `Groupable` but has no `__uint128_t` non-type template parameter in its name. The struct must have no user-declared constructors to remain an aggregate. See `third_party/fss/bench.cu` for an example.
+
 ## Benchmarks
 
 Microbenchmarks for DPF/DCF `Gen`/`Eval` using [Google Benchmark](https://github.com/google/benchmark), covering both CPU (AES-128 MMO PRG) and GPU (ChaCha PRG) paths.
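
The nvcc 12.8 workaround added to the README in this hunk can be sketched in plain C++ as follows. This is only an illustration under stated assumptions: `UintSketch` is a hypothetical stand-in for `fss::group::Uint` (whose real template parameters are elided in the text), `U128Group` is a made-up wrapper name, and the operations shown (`+` and unary `-`) are a guess at what `Groupable` requires, since the concept's definition is not part of this diff. The actual wrapper lives in `third_party/fss/bench.cu`.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for fss::group::Uint: the second (non-type) template
// parameter has the value type itself, which is what puts a 128-bit literal
// into the instantiated type name. This sketch only supports kMod == 0,
// meaning "wrap naturally at the width of T".
template <typename T, T kMod = 0>
struct UintSketch {
    static_assert(kMod == 0, "sketch only implements natural wraparound");
    T val;

    friend constexpr UintSketch operator+(UintSketch a, UintSketch b) {
        return UintSketch{static_cast<T>(a.val + b.val)};
    }
    constexpr UintSketch operator-() const {
        return UintSketch{static_cast<T>(T{0} - val)};
    }
};

// Plain aggregate wrapper: its type name carries no template arguments, so a
// kernel templated on U128Group would not spell out a __uint128_t constant.
// No constructors are declared, which keeps the struct an aggregate.
struct U128Group {
    UintSketch<__uint128_t> inner;

    friend constexpr U128Group operator+(U128Group a, U128Group b) {
        return U128Group{a.inner + b.inner};
    }
    constexpr U128Group operator-() const { return U128Group{-inner}; }
};

static_assert(std::is_aggregate<U128Group>::value,
              "wrapper must stay an aggregate");

// Small host-side check that the wrapper forwards the group operations.
inline void demo_wrapper() {
    U128Group x{{5}};
    U128Group zero = x + (-x);
    assert(zero.inner.val == 0);
}
```

In a real build the wrapper, rather than the `Uint` instantiation, would be passed as the kernel's template argument. Whether this avoids the stub-file error depends on nvcc's behavior as described in the README text, which the sketch does not reproduce.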
 
@@ -2,62 +2,67 @@
 
 Benchmark comparing two software AES-128 implementations used as PRGs
 with Matyas-Meyer-Oseas (MMO) mode: `out = AES(key, seed) XOR seed`.
+Both use mul=2 (two outputs per call, matching DPF usage).
 
-- `fss::prg::Aes128Soft` (`include/fss/prg/aes128_mmo_soft.cuh`):
+- `fss::prg::Aes128Soft<2>` (`include/fss/prg/aes128_mmo_soft.cuh`):
   T-table optimization. Combines SubBytes + MixColumns into 4 uint32_t
   Te0 lookups per round. Tables (1024B Te0 + 256B sbox) in `__shared__`
   memory on GPU.
 
-- `torchcsprng::Aes128Mmo` (`third_party/torchcsprng/aes128_mmo_soft.cuh`):
+- `torchcsprng::Aes128Mmo<2>` (`third_party/torchcsprng/torchcsprng/aes128_mmo_soft.cuh`):
   Textbook byte-by-byte. Separate SubBytes (16 sbox lookups), ShiftRows
   (byte shuffles), and MixColumns (xtime per byte) per round. No lookup
   tables beyond the 256B sbox.
   Ported from [meta-pytorch/csprng](https://github.com/meta-pytorch/csprng).
 
 Both pre-expand round keys in the constructor (key setup cost excluded).
 
-Benchmark source: `third_party/bench_aes128_soft.cu`.
-Build: `cmake -S third_party -B build/third_party -DCMAKE_BUILD_TYPE=Release`.
+## Settings
+
+| Setting | `fss::prg::Aes128Soft<2>` | `torchcsprng::Aes128Mmo<2>` |
+|---------|--------------------------|------------------------------|
+| Algorithm | T-table: Te0[256] + sbox[256] | Textbook: sbox[256] only |
+| GPU tables | `__shared__` memory (1280B/block) | none |
+| Bench name | `fss/{CPU,GPU}/AesSoft` | `torchcsprng/{CPU,GPU}/AesSoft` |
+| Test (CPU) | DPF Eval (BytesGroup, in_bits=20) | raw PRG Gen (single call) |
+| Test (GPU) | DPF Eval (BytesGroup, in_bits=20), 2^20 parallel | raw PRG Gen, 2^20 parallel |
+| GPU threads | 256/block | 256/block |
+| Source | `third_party/fss/bench.cu` | `third_party/torchcsprng/bench.cu` |
+| Build | `cmake -S third_party/fss -B build/fss -DCMAKE_BUILD_TYPE=Release` | `cmake -S third_party/torchcsprng -B build/torchcsprng -DCMAKE_BUILD_TYPE=Release` |
 
 ## Hardware
 
 - GPU: NVIDIA A30 (sm_80), 24 GB HBM2e, driver 580.126.09
-- CPU: 2x AMD EPYC 7413 (96 threads total), 2.3 GHz base
+- CPU: 2x AMD EPYC 7352 (96 threads total), 2.3 GHz base
 - CUDA: 12.8, nvcc V12.8.93
 - Build flags: `-O3`, `CMAKE_CUDA_ARCHITECTURES=80`
 
 ## GPU Results
 
-1M parallel AES-128-MMO operations. CUDA event timing (5 reps, median).
+2^20 (about 1M) parallel AES-128-MMO raw PRG Gen operations. CUDA event timing (5 reps, median).
 
 | Benchmark | Time (ms) | Throughput | Speedup |
 |-----------|-----------|------------|---------|
-| Aes128Soft mul=1 | 0.784 | 1.337 G/s | 69.9x |
-| torchcsprng mul=1 | 54.82 | 19.13 M/s | 1x |
-| Aes128Soft mul=2 | 2.463 | 425.7 M/s | 44.1x |
-| torchcsprng mul=2 | 108.63 | 9.65 M/s | 1x |
+| Aes128Soft | 2.463 | 425.7 M/s | 44x |
+| torchcsprng | 108.63 | 9.65 M/s | 1x |
 
 ### GPU Resource Usage
 
 `nvcc -O3 -arch=sm_80 --ptxas-options=-v`:
 
 | Kernel | Regs/thread | Shared mem/block | Max occupancy |
 |--------|-------------|------------------|---------------|
-| Aes128Soft mul=1 | 72 | 1280 B | 896 threads/SM (43%) |
-| Aes128Soft mul=2 | 72 | 1280 B | 896 threads/SM (43%) |
-| torchcsprng mul=1 | 128 | 0 | 512 threads/SM (25%) |
-| torchcsprng mul=2 | 122 | 0 | 512 threads/SM (25%) |
+| Aes128Soft | 72 | 1280 B | 896 threads/SM (43%) |
+| torchcsprng | 122 | 0 | 512 threads/SM (25%) |
 
 ## CPU Results
 
 Single-threaded latency on the same machine (5 reps, median).
 
 | Benchmark | Time/op | Throughput | Speedup |
 |-----------|---------|------------|---------|
-| Aes128Soft mul=1 | 70.7 ns | 14.15 M/s | 2.91x |
-| torchcsprng mul=1 | 206 ns | 4.86 M/s | 1x |
-| Aes128Soft mul=2 | 140 ns | 7.13 M/s | 3.00x |
-| torchcsprng mul=2 | 421 ns | 2.38 M/s | 1x |
+| Aes128Soft | 140 ns | 7.13 M/s | 3.0x |
+| torchcsprng | 421 ns | 2.38 M/s | 1x |
 
 ## Analysis
 
@@ -66,11 +71,11 @@ Single-threaded latency on the same machine (5 reps, median).
    separate sbox lookups + 4 xtime calls + 12 XORs per column.
 
 2. On GPU, register pressure is the main factor. The textbook approach
-   uses 122-128 registers per thread, capping occupancy at 25% (512
+   uses 122 registers per thread, capping occupancy at 25% (512
    threads/SM). The T-table variant uses 72 registers, allowing 43%
    occupancy and better latency hiding.
 
 3. The 1280B shared memory cost for Te0 + sbox is negligible (A30 has
    164 KB shared memory per SM).
 
-Aes128Soft is 3x faster on CPU and 44-70x faster on GPU.
+Aes128Soft is 3x faster on CPU and 44x faster on GPU.
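
The occupancy and throughput figures in the updated tables can be reproduced with a few lines of arithmetic. The sketch below is a sanity check, not part of the benchmark source. It assumes the published compute capability 8.0 limits (65536 registers and 2048 resident threads per SM), assumes registers per thread are rounded up to a multiple of 8 on allocation, and assumes the operation count is exactly 2^20, as the Settings table states. The function names are made up for this example. It also ignores block-size granularity, so with 256-thread blocks the number of resident threads in practice may be lower than the register-limited maximum.

```cpp
#include <cassert>
#include <cmath>

// Assumed sm_80 limits.
constexpr int kRegsPerSm = 65536;
constexpr int kMaxThreadsPerSm = 2048;
constexpr int kWarpSize = 32;
constexpr int kRegAllocUnit = 8;  // assumed per-thread allocation granularity

// Register-limited resident threads per SM, counted in whole warps.
constexpr int reg_limited_threads(int regs_per_thread) {
    const int rounded =
        (regs_per_thread + kRegAllocUnit - 1) / kRegAllocUnit * kRegAllocUnit;
    const int warps = kRegsPerSm / (rounded * kWarpSize);
    const int threads = warps * kWarpSize;
    return threads < kMaxThreadsPerSm ? threads : kMaxThreadsPerSm;
}

// Throughput in millions of operations per second for `ops` operations
// completed in `time_ms` milliseconds.
constexpr double throughput_mops(double time_ms, double ops = 1048576.0) {
    return ops / (time_ms * 1e-3) / 1e6;
}

// 72 registers: 28 warps fit, i.e. 896 threads (896 / 2048 is about 43.75%).
static_assert(reg_limited_threads(72) == 896, "Aes128Soft row");
// 122 registers round up to 128: 16 warps fit, i.e. 512 threads (25%).
static_assert(reg_limited_threads(122) == 512, "torchcsprng row");
```

Under these assumptions, `throughput_mops(2.463)` and `throughput_mops(108.63)` come out at roughly 425.7 and 9.65, matching the GPU results table, and their ratio gives the quoted 44x speedup.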