Fix poisson sampling #122

MaheshThakur9152 · 2026-01-24T04:23:17Z

Summary

This PR fixes the implementation of CyclicPoissonSampling.batch_iterator to strictly align with the canonical definition of Poisson sampling used in Differential Privacy (independent Bernoulli trials).

Motivation

The previous implementation used a two-step process:

Sample a batch size $k \sim \text{Binomial}(n, p)$.
Select $k$ items uniformly without replacement.

While this yields the correct marginal probability, it introduces negative dependence between items in the same batch (selecting one item reduces the probability of others being selected). This violation of independence can invalidate privacy accounting assumptions (e.g., RDP amplification by sampling).

Changes

Logic Update: Refactored batch_selection.py to use a vectorized Bernoulli mask (rng.random() < p). This guarantees that item inclusions are statistically independent.
New Test: Added test_poisson_sampling_marginal_and_pairwise to tests/batch_selection_test.py. This test statistically validates that:
- Marginal inclusion probability $\approx p$.
- Pairwise inclusion probability $\approx p^2$ (confirming independence).

Verification

Ran the new statistical test test_poisson_sampling_marginal_and_pairwise, which now passes (previously failed with negative dependence).
Verified that existing tests pass to ensure no regressions.

Related Issues

Fixes #121

…mpling; add statistical tests

…endent Bernoulli semantics

amyssnippet

while this fix aligns with the theoretical definition of poisson sampling, using a boolean mask creates dynamic shapes how does this impact performance when the batch_iterator is used inside a jax.jit decorated training loop

amyssnippet · 2026-01-24T08:35:23Z

jax_privacy/batch_selection.py

+      # items (this preserves independent selection semantics before
+      # truncation).
+      if (
+          self.truncated_batch_size is not None


Does the truncation logic negate the privacy benefits of the independent bernoulli trials?

It does but it still provides benefits, see https://arxiv.org/abs/2508.15089

amyssnippet · 2026-01-24T08:39:17Z

these changes might respect the pov of math, but it might break the code performance

can you check and do a stress test of this for a sample model training, and the performance diff, before and after the changes??

… tests for padding

amyssnippet

maybe some pylint errors to solve in tests/batch_selection_test.py:239:0

amyssnippet

LGTM

amyssnippet · 2026-01-24T10:03:17Z

@ryan112358 take a loot at this, as you are a veteran to the project

MaheshThakur9152 · 2026-01-26T15:59:31Z

@ryan112358 can you please review the changes

ryan112358

Thank you for the contribution, and for critically evaluating the implementations of our core components. This is super crucial work to harden our library for real privacy applications where correctness is the first priority. I will assign a member of our team to take a closer look. But a couple high-level comments:

Producing variable batch sizes is expected and fine here since we are working with pure numpy in this file. The batches can later be padded to a fixed set of sizes for efficiency under jax.jit.
The current implementation is more efficient though, since it has O(B) complexity per step rather than O(N).
[minor] Being able to know the batch sizes in advance, before sampling the exact examples could be useful downstream applications so we can e.g., pre compile our train step for the batch sizes we expect to see.

Of course, these points are moot if the current implementation is actually incorrect, so we will get back to you on that.

MaheshThakur9152 · 2026-01-27T02:33:05Z

Thanks @ryan112358 for the review and the context.

I agree that the shift from O(B) to O(N) is a performance cost, but as you noted, it seems unavoidable to satisfy the strict independence requirement of Poisson sampling. The previous method's negative dependence effectively leaked information between samples, which risks invalidating RDP accounting.

Regarding the point on pre-compilation: since the batch size is inherently stochastic in Poisson sampling, we lose the ability to know it deterministically in advance, but the fixed_batch_size option I added should help bridge that gap for downstream JIT usage.

arung54 · 2026-01-27T02:50:52Z

Hi, I ran your test_poisson_sampling_marginal_and_pairwise without changing any of the code outside the test file and it did not fail. I'm not convinced the current implementation is incorrect.

The events that two different examples are sampled are pairwise dependent conditioned on knowing the value of k, but without conditioning they remain independent. See Lemma 1 of https://arxiv.org/pdf/2406.17298v3 for a proof that the current implementation is distributionally equivalent to Poisson sampling. We will add a reference to this proof as a comment for clarity.

Feel free to send a PR adding the test and also adding the option to pad the batch (though I think the fixed_batch_size and truncated_batch_size args are somewhat redundant here). Or, if you disagree with the above paper or think the current implementation has a bug that is not inherent to the sampling technique, feel free to reopen.

MaheshThakur9152 · 2026-01-27T03:07:02Z

@arung54 Understood. I reviewed the lemma and see how the variance in k cancels the negative dependence. My baseline test configuration must have been incorrect.

MaheshThakur9152 added 5 commits January 24, 2026 08:43

batch_selection: use independent Bernoulli trials for CyclicPoissonSa…

2cd26e0

…mpling; add statistical tests

Add statistical tests for Poisson sampling

e931360

Fix pylint line length violations

68ce9d6

batch_selection: update doctest for Poisson sampling to reflect indep…

ad95930

…endent Bernoulli semantics

Update BandMF doctest expected output

9a00363

amyssnippet reviewed Jan 24, 2026

View reviewed changes

batch_selection: add fixed_batch_size option for fixed-shape batches;…

8900824

… tests for padding

MaheshThakur9152 requested a review from amyssnippet January 24, 2026 08:53

amyssnippet suggested changes Jan 24, 2026

View reviewed changes

tests: remove superfluous parentheses in generator 'if' condition

1d44745

MaheshThakur9152 requested a review from amyssnippet January 24, 2026 09:15

amyssnippet approved these changes Jan 24, 2026

View reviewed changes

MaheshThakur9152 requested a review from amyssnippet January 24, 2026 10:02

amyssnippet approved these changes Jan 25, 2026

View reviewed changes

ryan112358 reviewed Jan 27, 2026

View reviewed changes

arung54 closed this Jan 27, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix poisson sampling #122

Fix poisson sampling #122

Uh oh!

MaheshThakur9152 commented Jan 24, 2026

Uh oh!

amyssnippet left a comment

Uh oh!

amyssnippet Jan 24, 2026

Uh oh!

arung54 Jan 27, 2026

Uh oh!

amyssnippet commented Jan 24, 2026

Uh oh!

amyssnippet left a comment

Uh oh!

amyssnippet left a comment

Uh oh!

amyssnippet commented Jan 24, 2026

Uh oh!

MaheshThakur9152 commented Jan 26, 2026

Uh oh!

ryan112358 left a comment •

edited

Loading

Uh oh!

MaheshThakur9152 commented Jan 27, 2026

Uh oh!

arung54 commented Jan 27, 2026 •

edited

Loading

Uh oh!

MaheshThakur9152 commented Jan 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Fix poisson sampling #122

Fix poisson sampling #122

Uh oh!

Conversation

MaheshThakur9152 commented Jan 24, 2026

Summary

Motivation

Changes

Verification

Related Issues

Uh oh!

amyssnippet left a comment

Choose a reason for hiding this comment

Uh oh!

amyssnippet Jan 24, 2026

Choose a reason for hiding this comment

Uh oh!

arung54 Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

amyssnippet commented Jan 24, 2026

Uh oh!

amyssnippet left a comment

Choose a reason for hiding this comment

Uh oh!

amyssnippet left a comment

Choose a reason for hiding this comment

Uh oh!

amyssnippet commented Jan 24, 2026

Uh oh!

MaheshThakur9152 commented Jan 26, 2026

Uh oh!

ryan112358 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

MaheshThakur9152 commented Jan 27, 2026

Uh oh!

arung54 commented Jan 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

MaheshThakur9152 commented Jan 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ryan112358 left a comment •

edited

Loading

arung54 commented Jan 27, 2026 •

edited

Loading