Commit f237fa0

Josu San Martin authored and committed
Add security review for in-process result authentication
1 parent 283c9b6 commit f237fa0

File tree

6 files changed: +235 −8 lines changed

README.md

Lines changed: 4 additions & 0 deletions
```diff
@@ -75,3 +75,7 @@ again to minimize the window of opportunity for cheating by writing results from
 small effect on performance, as during the tail of the user kernel blocks of the test kernel are already put on the SMs
 and generate memory traffic. In the checking kernel, the order in which blocks are checked is randomized, so that it is
 not a viable strategy to only write the later blocks of the result from an unsynchronized stream.
+
+## Security Review
+
+A repository-level security note for the remaining in-process trust-boundary issues is documented in [SECURITY_REVIEW.md](SECURITY_REVIEW.md).
```

SECURITY_REVIEW.md

Lines changed: 211 additions & 0 deletions
# Security Review: Result Authentication and In-Process Secret Exposure

## Summary

`pygpubench` is a substantial improvement over a pure-Python evaluator, but it still keeps trusted benchmark state inside the same process as untrusted submission code.

That remaining trust-boundary issue appears to allow a malicious submission to forge benchmark output without running the intended kernel: it can recover the child process's result-authentication secret from memory and write attacker-controlled results into the inherited result channel before the real benchmark loop completes.

This document is intentionally disclosure-oriented:

- it does **not** include exploit payloads
- it does **not** include step-by-step reproduction code
- it focuses on the issue, impact, and remediation options
## Main Finding

The current isolated benchmarking path still relies on a secret known to the worker process itself.

At a high level:

1. the parent process creates a benchmark subprocess
2. a result-authentication secret is delivered to the child
3. the child stores that secret in process memory
4. untrusted Python submission code runs in that same process
5. the child still has access to the result channel used to report benchmark results

That means the worker can potentially:

- recover the secret from its own address space
- emit forged benchmark results carrying the correct authentication material
- cause the parent to accept attacker-controlled timings

In other words, the integrity of the result channel still depends on trusting code inside the process being benchmarked.
## Why This Matters

If the child process can authenticate arbitrary forged results, then:

- reported timings no longer prove that the benchmarked kernel actually ran
- reported error counts no longer prove that correctness checks actually passed
- benchmark output can be made arbitrarily small or otherwise attacker-controlled

This is a benchmark-integrity failure, not just a local implementation bug.
## Additional Weaknesses Observed

The signature/authentication issue is the primary concern. Several secondary findings make exploitation easier or increase future risk:

### 1. GC-visible benchmark metadata

At import time, Python-visible objects can still reveal useful benchmark structure such as:

- number of repeats
- output tensor metadata
- tolerance information

Even if tensor payloads are protected better than before, metadata leakage reduces attacker uncertainty.

### 2. Warmup predictability

If warmup always uses a deterministic case or stable pointer pattern, a malicious kernel may distinguish warmup from measured iterations and adapt its behavior accordingly.

### 3. NaN wildcard handling

Any checker behavior that treats NaN in expected data as “accept anything” is dangerous. Even if not immediately exploitable through the current path, it creates a latent bypass if expected-output addresses or copies become observable later.

### 4. Overly broad in-process capability

Untrusted Python code still runs with:

- arbitrary `ctypes` access
- process-memory visibility
- inherited file descriptors / pipes
- normal Python runtime introspection

That combination is enough to make “secret inside the same process” a weak design.
## What `pygpubench` Already Improves

This report should not obscure the fact that `pygpubench` already fixes important problems that affect naive Python evaluators.

Compared to an in-process pure-Python benchmark harness, `pygpubench` materially improves resistance against:

- Python monkeypatching of timer objects
- direct patching of Python reference/evaluator functions
- trivial caching of user-visible tensors
- some stream-ordering and L2-cache based reward hacking

So the right framing is:

- the architecture is **better**
- but the remaining secret/result-channel design is still not strong enough for adversarial benchmarking
## Root Cause

The benchmark subprocess is simultaneously:

- the environment running untrusted code
- the holder of trusted result-authentication state

As long as the child both:

1. possesses the authentication material, and
2. can write to the channel accepted by the parent,

the scheme is vulnerable in principle.

The security model still assumes the worker can be trusted with some benchmark-control state. In an adversarial benchmark, it cannot.
## Recommended Fixes

### 1. Do not keep the authentication secret alive inside untrusted execution

Any key, signature, token, or HMAC material used to authenticate results should not remain recoverable after untrusted Python code starts executing.

At minimum:

- generate the key in trusted code
- consume it before importing the submission
- explicitly overwrite it in the child

This reduces the straightforward memory-recovery path.
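The consume-then-destroy pattern can be sketched as follows. This is a minimal illustration, not the project's actual key handling: the function name and key format are hypothetical. A `bytearray` is used so the long-lived copy of the key can be overwritten in place, though CPython may still hold short-lived transient copies elsewhere:

```python
import hashlib
import hmac

def seal_results_then_destroy_key(secret: bytearray, results_blob: bytes) -> bytes:
    """Authenticate already-captured result material, then zeroize the key
    so it is no longer recoverable once untrusted code starts running."""
    tag = hmac.new(bytes(secret), results_blob, hashlib.sha256).digest()
    # Overwrite the only long-lived, mutable copy of the key in this process.
    # (CPython can still hold short-lived copies; this removes the obvious one.)
    for i in range(len(secret)):
        secret[i] = 0
    return tag
```

In this sketch the worker would call this before importing the submission module, so that by the time user code executes there is no recoverable key left in the obvious place.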
### 2. Move result authentication to a trusted boundary

A stronger fix is to ensure that the worker process never has the ability to authenticate arbitrary forged results.

Two good directions:

- trusted validator/orchestrator process owns the authentication
- worker emits only raw events/results, never authenticated final records

The parent should authenticate data that the worker cannot forge by construction.
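One way to realize this split can be sketched as below. The class and method names (`TrustedOrchestrator`, `accept`) are illustrative assumptions, not the existing API; the point is only that the key lives exclusively in the parent, and the worker ships raw, unauthenticated events:

```python
import hashlib
import hmac
import json

class TrustedOrchestrator:
    """Only the parent holds the key; the worker never sees it and so can
    never produce an authenticated final record on its own."""

    def __init__(self, key: bytes):
        self._key = key

    def accept(self, raw_event: bytes) -> dict:
        # Raw worker output is untrusted input; parse it defensively.
        record = json.loads(raw_event)
        # Authentication happens here, in trusted code, after acceptance policy.
        record["tag"] = hmac.new(self._key, raw_event, hashlib.sha256).hexdigest()
        return record
```

With this shape, even a fully compromised worker can at most emit bogus raw events, which the orchestrator is free to sanity-check before tagging; it cannot forge a record the downstream consumer will accept as authenticated.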
### 3. Reduce in-process attack surface

Possible hardening measures:

- restrict inherited file descriptors
- tighten seccomp / syscall policy where practical
- minimize procfs visibility where practical
- reduce or remove unnecessary writable/redirectable channels in the worker

These are secondary mitigations, not substitutes for a correct trust boundary.
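A descriptor-narrowing helper might look like the following sketch. `restrict_fds` is a hypothetical name, and a real worker would need to take care around descriptors the CUDA driver itself holds open:

```python
import os

def restrict_fds(candidates: list[int], keep: set[int]) -> list[int]:
    """Close every candidate descriptor not explicitly kept, and mark kept
    ones non-inheritable so they do not leak into further subprocesses."""
    closed = []
    for fd in candidates:
        if fd in keep:
            os.set_inheritable(fd, False)
            continue
        try:
            os.close(fd)
            closed.append(fd)
        except OSError:
            pass  # already closed or not a valid descriptor
    return closed
```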
### 4. Remove or minimize metadata leakage

Before importing the submission:

- drop Python references that are no longer needed
- clean up transient objects that reveal benchmark layout
- avoid keeping expected-output metadata visible through ordinary Python object traversal
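A minimal pre-import scrub might look like this. It is illustrative only: `scrub_namespace` and the attribute names are assumptions, not the project's real state layout, and dropping references does not guarantee the underlying objects are gone, only that they are no longer reachable through ordinary traversal:

```python
import gc

def scrub_namespace(namespace: dict,
                    sensitive: tuple = ("expected", "repeats", "tolerances")) -> None:
    """Drop references that describe benchmark layout, then force a
    collection so object traversal after the user import finds less."""
    for name in sensitive:
        namespace.pop(name, None)
    gc.collect()
```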
### 5. Randomize warmup and pre-measurement behavior

Avoid deterministic warmup patterns that allow the kernel to distinguish:

- warmup
- estimated timing
- real benchmark passes
### 6. Fail closed on suspicious checker states

Examples:

- NaN in expected outputs should be treated as a benchmark-generation failure, not a wildcard
- malformed or incomplete child output should fail hard
- any mismatch in result structure should fail hard
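The fail-closed NaN rule can be illustrated with a host-side checker sketch. This is pure Python for clarity; the repository's actual check runs in a CUDA kernel, and the exact tolerance form here is an assumption:

```python
import math

def check_outputs(expected, actual, rtol=1e-3, atol=1e-5):
    """Count elementwise mismatches, failing closed: NaN in *expected* means
    the benchmark generator itself is broken, so it is an error, never a
    wildcard that accepts anything."""
    errors = 0
    for e, a in zip(expected, actual):
        if math.isnan(e):
            errors += 1          # malformed expectation: fail closed
        elif math.isnan(a) or abs(e - a) > atol + rtol * abs(e):
            errors += 1          # wrong or NaN result from the kernel
    return errors
```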
## Suggested Architectural Direction

### Option A: Trusted validator split

Use a three-role model:

- orchestrator: trusted, owns benchmark policy and result acceptance
- worker: untrusted, only runs the kernel
- validator: trusted, checks correctness and/or authenticates the final result

The worker should not be able to independently produce a parent-acceptable final benchmark record.

### Option B: Transitional hardening

If a full redesign is not immediately feasible:

1. remove recoverable result-authentication state before user import
2. aggressively reduce inherited descriptors/capabilities
3. clear Python-visible benchmark metadata before import
4. add tamper detection for result-channel anomalies

This would still be weaker than a proper split-process trust model, but materially better than the current design.
192+
193+
## Recommended Next Steps
194+
195+
1. treat this as a security bug affecting adversarial benchmark integrity
196+
2. review the result-authentication path end-to-end
197+
3. patch the child secret lifetime / ownership problem first
198+
4. then follow up with capability reduction and metadata cleanup
199+
5. publish a brief security note once the fix lands
200+
201+
## Scope Of This Document

This file is intended to support remediation planning inside the repository.

It does **not**:

- provide exploit code
- attribute methods to any third party
- claim that every theoretical path has been weaponized

It only records that the current result-authentication design remains vulnerable because trusted benchmark state is still exposed inside the untrusted worker process.

csrc/binding.cpp

Lines changed: 1 addition & 1 deletion
```diff
@@ -15,7 +15,7 @@ void do_bench(int result_fd, int input_fd, const std::string& kernel_qualname, c
     auto config = read_benchmark_parameters(input_fd);
     BenchmarkManager mgr(result_fd, config.Signature, config.Seed, discard, nvtx, landlock);
     auto [args, expected] = mgr.setup_benchmark(nb::cast<nb::callable>(test_generator), test_kwargs, config.Repeats);
-    mgr.do_bench_py(kernel_qualname, args, expected, reinterpret_cast<cudaStream_t>(stream));
+    mgr.do_bench_py(kernel_qualname, args, std::move(expected), reinterpret_cast<cudaStream_t>(stream));
 }
```
csrc/check.cu

Lines changed: 6 additions & 3 deletions
```diff
@@ -41,9 +41,12 @@ __global__ void check_approx_match_kernel(unsigned* result, const Float* expecte
     cuda::atomic_ref<unsigned, cuda::thread_scope_device> res(*result);
     float a = static_cast<float>(expected[idx]);

-    // Nan is expected is wildcard for arbitrary results
-    if (isnan(a))
+    // NaN in the expected output must not behave as a wildcard.
+    // Treat it as a mismatch so malformed expectations fail closed.
+    if (isnan(a)) {
+        ++res;
         return;
+    }

 #if __CUDA_ARCH__ >= 900
     cudaGridDependencySynchronize();
@@ -180,4 +183,4 @@ void canaries(void* data, size_t size, unsigned seed, cudaStream_t stream) {
     int grid_size = max_blocks * num_sms;
     void *pArgs[] = { &data, &size, &seed};
     CUDA_CHECK(cudaLaunchCooperativeKernel(&canaries_kernel, grid_size, block_size, pArgs, smem, stream));
-}
+}
```

csrc/manager.cpp

Lines changed: 12 additions & 3 deletions
```diff
@@ -240,7 +240,7 @@ BenchmarkManager::ShadowArgument& BenchmarkManager::ShadowArgument::operator=(Sh
     return *this;
 }

-void BenchmarkManager::do_bench_py(const std::string& kernel_qualname, const std::vector<nb::tuple>& args, const std::vector<nb::tuple>& expected, cudaStream_t stream) {
+void BenchmarkManager::do_bench_py(const std::string& kernel_qualname, const std::vector<nb::tuple>& args, std::vector<nb::tuple> expected, cudaStream_t stream) {
     if (args.size() < 5) {
         throw std::runtime_error("Not enough test cases to run benchmark");
     }
@@ -284,6 +284,11 @@ void BenchmarkManager::do_bench_py(const std::string& kernel_qualname, const std
         }
     }

+    // The benchmark loop only needs the unmanaged output copies after this point.
+    // Release Python-held expected tuples before importing untrusted code.
+    expected.clear();
+    expected.shrink_to_fit();
+
     // clean up as much python state as we can
     trigger_gc();

@@ -295,10 +300,14 @@ void BenchmarkManager::do_bench_py(const std::string& kernel_qualname, const std
     // after this, we cannot trust python anymore
     nb::callable kernel = kernel_from_qualname(kernel_qualname);

+    std::random_device warmup_rd;
+    std::mt19937 warmup_rng(warmup_rd());
+    std::uniform_int_distribution<int> warmup_dist(0, static_cast<int>(args.size()) - 1);
+
     // ok, first run for compilations etc
     nvtx_push("warmup");
     CUDA_CHECK(cudaDeviceSynchronize());
-    kernel(*args.at(0));
+    kernel(*args.at(warmup_dist(warmup_rng)));
     CUDA_CHECK(cudaDeviceSynchronize());
     nvtx_pop();
@@ -312,7 +321,7 @@ void BenchmarkManager::do_bench_py(const std::string& kernel_qualname, const std
     // this is only potentially problematic for in-place kernels;
     CUDA_CHECK(cudaDeviceSynchronize());
     clear_cache(stream);
-    kernel(*args.at(0));
+    kernel(*args.at(warmup_dist(warmup_rng)));
     CUDA_CHECK(cudaDeviceSynchronize());
     std::chrono::high_resolution_clock::time_point cpu_end = std::chrono::high_resolution_clock::now();
     std::chrono::duration<double> elapsed_seconds = cpu_end - cpu_start;
```

csrc/manager.h

Lines changed: 1 addition & 1 deletion
```diff
@@ -31,7 +31,7 @@ class BenchmarkManager {
     BenchmarkManager(int result_fd, std::string signature, std::uint64_t seed, bool discard, bool nvtx, bool landlock);
     ~BenchmarkManager();
     std::pair<std::vector<nb::tuple>, std::vector<nb::tuple>> setup_benchmark(const nb::callable& generate_test_case, const nb::dict& kwargs, int repeats);
-    void do_bench_py(const std::string& kernel_qualname, const std::vector<nb::tuple>& args, const std::vector<nb::tuple>& expected, cudaStream_t stream);
+    void do_bench_py(const std::string& kernel_qualname, const std::vector<nb::tuple>& args, std::vector<nb::tuple> expected, cudaStream_t stream);
 private:
     struct Expected {
         enum EMode {
```
