Commit d17a4b1

Add 3rd party benchmarks
1 parent 5843bad commit d17a4b1

54 files changed: 5709 additions, 231 deletions

.gitignore

Lines changed: 6 additions & 2 deletions
@@ -3,8 +3,9 @@
 .idea

 # Build outputs
-/build
-/cmake-build-*
+build
+cmake-build-*
+target

 # Doc outputs
 /doc/html
@@ -18,3 +19,6 @@ superpowers
 # Benchmarks
 perf.data
 perf.data.old
+
+# Bazel symlinks
+/third_party/distributed_point_functions/bazel-*

.gitmodules

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+[submodule "libfss"]
+path = third_party/libfss/libfss
+url = https://github.com/frankw2/libfss.git
+[submodule "libdpf"]
+path = third_party/libdpf
+url = https://github.com/weikengchen/libdpf.git
+[submodule "distributed_point_functions"]
+path = third_party/distributed_point_functions/distributed_point_functions
+url = https://github.com/google/distributed_point_functions.git
+[submodule "GPU-DPF"]
+path = third_party/GPU-DPF/GPU-DPF
+url = https://github.com/facebookresearch/GPU-DPF.git
+[submodule "EzPC"]
+path = third_party/EzPC/EzPC
+url = https://github.com/mpc-msri/EzPC.git
+[submodule "fss-v0.7.0"]
+path = third_party/fss-v0.7.0/fss-v0.7.0
+url = https://github.com/myl7/fss.git

README.md

Lines changed: 6 additions & 0 deletions
@@ -255,6 +255,12 @@ See `samples/dpf_dcf_gpu.cu` for the complete working example.

 You may see warnings like "integer constant is so large that it is unsigned" during compilation. These cannot be easily suppressed but are harmless and can be safely ignored.

+### nvcc 12.8: `Uint` as a `__global__` kernel template argument
+
+nvcc 12.8 fails to compile the stub file when `fss::group::Uint<__uint128_t, ...>` is used as a template argument to a `__global__` kernel: it emits a 128-bit integer literal that g++ cannot parse. `__device__` functions are not affected (no stub is generated for them).
+
+Workaround: wrap the type in a plain aggregate struct that satisfies `Groupable` but has no `__uint128_t` non-type template parameter in its name. The struct must have no user-declared constructors to remain an aggregate. See `third_party/fss/bench.cu` for an example.
+
 ## Benchmarks

 Microbenchmarks for DPF/DCF `Gen`/`Eval` using [Google Benchmark](https://github.com/google/benchmark), covering both CPU (AES-128 MMO PRG) and GPU (ChaCha PRG) paths.
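
Reviewer sketch for the README workaround above. This is a minimal host-only C++17 illustration, not the code in `third_party/fss/bench.cu`. The `Uint` template is a hypothetical stand-in for `fss::group::Uint`, the names `U128Group`, `add_shares`, and `check_wrapper` are invented for the example, and the two operators are only an assumption about what `Groupable` requires.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

#if defined(__SIZEOF_INT128__)
using Wide = __uint128_t;    // GCC/Clang extension, as in the README note
#else
using Wide = std::uint64_t;  // fallback so the sketch compiles on other compilers
#endif

// Hypothetical stand-in for fss::group::Uint<__uint128_t, ...>.
// The non-type parameter `Mod` becomes part of the type's name, which is
// what would be spelled out in a generated kernel stub.
template <typename T, T Mod>
struct Uint {
    T val;
    // Arithmetic wraps modulo 2^bits; handling of Mod is omitted in this sketch.
    friend Uint operator+(Uint a, Uint b) { return {static_cast<T>(a.val + b.val)}; }
    friend Uint operator-(Uint a) { return {static_cast<T>(T(0) - a.val)}; }
};

// Plain aggregate wrapper: its own name carries no 128-bit template argument.
// It declares no constructors, so it stays an aggregate.
struct U128Group {
    Uint<Wide, 0> inner;
    friend U128Group operator+(U128Group a, U128Group b) { return {a.inner + b.inner}; }
    friend U128Group operator-(U128Group a) { return {-a.inner}; }
};

static_assert(std::is_aggregate_v<U128Group>, "wrapper must remain an aggregate");
static_assert(std::is_trivially_copyable_v<U128Group>, "cheap to pass by value");

// Host-side stand-in for a `template <typename Group> __global__` kernel:
// it is instantiated with U128Group instead of Uint<__uint128_t, ...>.
template <typename Group>
Group add_shares(Group a, Group b) { return a + b; }

// Small self-check: addition and negation forward to the wrapped value.
inline void check_wrapper() {
    U128Group sum = add_shares(U128Group{{5}}, U128Group{{7}});
    assert(sum.inner.val == 12);
    U128Group zero = sum + (-sum);
    assert(zero.inner.val == 0);
}
```

The `static_assert` lines guard the property the README calls out: adding a constructor to `U128Group` later would make the aggregate check fail at compile time.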

doc/bench_aes128_soft.md

Lines changed: 25 additions & 20 deletions
@@ -2,62 +2,67 @@

 Benchmark comparing two software AES-128 implementations used as PRGs
 with Matyas-Meyer-Oseas (MMO) mode: `out = AES(key, seed) XOR seed`.
+Both use mul=2 (two outputs per call, matching DPF usage).

-- `fss::prg::Aes128Soft` (`include/fss/prg/aes128_mmo_soft.cuh`):
+- `fss::prg::Aes128Soft<2>` (`include/fss/prg/aes128_mmo_soft.cuh`):
 T-table optimization. Combines SubBytes + MixColumns into 4 uint32_t
 Te0 lookups per round. Tables (1024B Te0 + 256B sbox) in `__shared__`
 memory on GPU.

-- `torchcsprng::Aes128Mmo` (`third_party/torchcsprng/aes128_mmo_soft.cuh`):
+- `torchcsprng::Aes128Mmo<2>` (`third_party/torchcsprng/torchcsprng/aes128_mmo_soft.cuh`):
 Textbook byte-by-byte. Separate SubBytes (16 sbox lookups), ShiftRows
 (byte shuffles), and MixColumns (xtime per byte) per round. No lookup
 tables beyond the 256B sbox.
 Ported from [meta-pytorch/csprng](https://github.com/meta-pytorch/csprng).

 Both pre-expand round keys in the constructor (key setup cost excluded).

-Benchmark source: `third_party/bench_aes128_soft.cu`.
-Build: `cmake -S third_party -B build/third_party -DCMAKE_BUILD_TYPE=Release`.
+## Settings
+
+| Setting | `fss::prg::Aes128Soft<2>` | `torchcsprng::Aes128Mmo<2>` |
+|---------|--------------------------|------------------------------|
+| Algorithm | T-table: Te0[256] + sbox[256] | Textbook: sbox[256] only |
+| GPU tables | `__shared__` memory (1280B/block) | none |
+| Bench name | `fss/{CPU,GPU}/AesSoft` | `torchcsprng/{CPU,GPU}/AesSoft` |
+| Test (CPU) | DPF Eval (BytesGroup, in_bits=20) | raw PRG Gen (single call) |
+| Test (GPU) | DPF Eval (BytesGroup, in_bits=20), 2^20 parallel | raw PRG Gen, 2^20 parallel |
+| GPU threads | 256/block | 256/block |
+| Source | `third_party/fss/bench.cu` | `third_party/torchcsprng/bench.cu` |
+| Build | `cmake -S third_party/fss -B build/fss -DCMAKE_BUILD_TYPE=Release` | `cmake -S third_party/torchcsprng -B build/torchcsprng -DCMAKE_BUILD_TYPE=Release` |

 ## Hardware

 - GPU: NVIDIA A30 (sm_80), 24 GB HBM2e, driver 580.126.09
-- CPU: 2x AMD EPYC 7413 (96 threads total), 2.3 GHz base
+- CPU: 2x AMD EPYC 7352 (96 threads total), 2.3 GHz base
 - CUDA: 12.8, nvcc V12.8.93
 - Build flags: `-O3`, `CMAKE_CUDA_ARCHITECTURES=80`

 ## GPU Results

-1M parallel AES-128-MMO operations. CUDA event timing (5 reps, median).
+1M parallel AES-128-MMO raw PRG Gen operations. CUDA event timing (5 reps, median).

 | Benchmark | Time (ms) | Throughput | Speedup |
 |-----------|-----------|------------|---------|
-| Aes128Soft mul=1 | 0.784 | 1.337 G/s | 69.9x |
-| torchcsprng mul=1 | 54.82 | 19.13 M/s | 1x |
-| Aes128Soft mul=2 | 2.463 | 425.7 M/s | 44.1x |
-| torchcsprng mul=2 | 108.63 | 9.65 M/s | 1x |
+| Aes128Soft | 2.463 | 425.7 M/s | 44x |
+| torchcsprng | 108.63 | 9.65 M/s | 1x |

 ### GPU Resource Usage

 `nvcc -O3 -arch=sm_80 --ptxas-options=-v`:

 | Kernel | Regs/thread | Shared mem/block | Max occupancy |
 |--------|-------------|------------------|---------------|
-| Aes128Soft mul=1 | 72 | 1280 B | 896 threads/SM (43%) |
-| Aes128Soft mul=2 | 72 | 1280 B | 896 threads/SM (43%) |
-| torchcsprng mul=1 | 128 | 0 | 512 threads/SM (25%) |
-| torchcsprng mul=2 | 122 | 0 | 512 threads/SM (25%) |
+| Aes128Soft | 72 | 1280 B | 896 threads/SM (43%) |
+| torchcsprng | 122 | 0 | 512 threads/SM (25%) |

 ## CPU Results

 Single-threaded latency on the same machine (5 reps, median).

 | Benchmark | Time/op | Throughput | Speedup |
 |-----------|---------|------------|---------|
-| Aes128Soft mul=1 | 70.7 ns | 14.15 M/s | 2.91x |
-| torchcsprng mul=1 | 206 ns | 4.86 M/s | 1x |
-| Aes128Soft mul=2 | 140 ns | 7.13 M/s | 3.00x |
-| torchcsprng mul=2 | 421 ns | 2.38 M/s | 1x |
+| Aes128Soft | 140 ns | 7.13 M/s | 3.0x |
+| torchcsprng | 421 ns | 2.38 M/s | 1x |

 ## Analysis

@@ -66,11 +71,11 @@ Single-threaded latency on the same machine (5 reps, median).
 separate sbox lookups + 4 xtime calls + 12 XORs per column.

 2. On GPU, register pressure is the main factor. The textbook approach
-uses 122-128 registers per thread, capping occupancy at 25% (512
+uses 122 registers per thread, capping occupancy at 25% (512
 threads/SM). The T-table variant uses 72 registers, allowing 43%
 occupancy and better latency hiding.

 3. The 1280B shared memory cost for Te0 + sbox is negligible (A30 has
 164 KB shared memory per SM).

-Aes128Soft is 3x faster on CPU and 44-70x faster on GPU.
+Aes128Soft is 3x faster on CPU and 44x faster on GPU.
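
Reviewer sketch: the occupancy and throughput figures in this file can be cross-checked with a few lines of arithmetic. The code below assumes the sm_80 limits of 65,536 32-bit registers and 2,048 resident threads per SM with 32-thread warps, and further assumes registers are allocated per warp in units of 256. It ignores block size and shared memory, so it gives the register-limited ceiling only. The function names are illustrative and do not come from the repository.

```cpp
#include <cassert>
#include <cmath>

constexpr int kRegsPerSm = 65536;       // 32-bit registers per SM on sm_80
constexpr int kMaxThreadsPerSm = 2048;  // resident-thread limit per SM on sm_80
constexpr int kWarpSize = 32;
constexpr int kRegAllocUnit = 256;      // assumption: per-warp allocation granularity

// Register-limited ceiling on resident threads per SM.
// Expects regs_per_thread >= 1.
constexpr int reg_limited_threads(int regs_per_thread) {
    int regs_per_warp = regs_per_thread * kWarpSize;
    // Round the per-warp register count up to the allocation unit.
    regs_per_warp = (regs_per_warp + kRegAllocUnit - 1) / kRegAllocUnit * kRegAllocUnit;
    int threads = (kRegsPerSm / regs_per_warp) * kWarpSize;
    return threads < kMaxThreadsPerSm ? threads : kMaxThreadsPerSm;
}

// Operations per second for `ops` operations completing in `ms` milliseconds.
constexpr double ops_per_second(double ops, double ms) { return ops / (ms * 1e-3); }

// The two rows of the GPU Resource Usage table.
static_assert(reg_limited_threads(72) == 896, "Aes128Soft row");
static_assert(reg_limited_threads(122) == 512, "torchcsprng row");
```

Under these assumptions, 896 of 2,048 threads is 43.75% and 512 of 2,048 is 25%. Taking "1M" as 2^20 operations, `ops_per_second(1048576, 2.463)` is about 425.7 M/s and `ops_per_second(1048576, 108.63)` is about 9.65 M/s, a ratio of roughly 44, consistent with the GPU Results table.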
