Multi-User Contribution Bounding for shared ownership datasets (#88) Split PR 2 #108
Conversation
…es (Ganesh et al., 2025)
```python
from jax_privacy.batch_selection import CyclicPoissonSampling
...
def get_safe_indices(user_mapping, k_limit):
```
Would be good to add types here
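For example (a minimal sketch; the concrete element types are assumptions based on the docstring quoted below):

```python
def get_safe_indices(
    user_mapping: dict[int, list[int]],
    k_limit: int,
) -> list[int]:
  ...
```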
```python
# limitations under the License.
"""Example of data selection for datasets with multi-user attribution.

Literature Reference: Ganesh et al. (2025), "It’s My Data Too: Private ML for
```
Consider adding a direct link to https://arxiv.org/abs/2503.03622
```python
Args:
  user_mapping: A dictionary mapping an example index to a list of its owners.
  k_limit: The maximum number of times any single user can contribute data.
```
Maybe add a check that this is >= 1
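For instance, a guard at the top of get_safe_indices (a sketch, not the PR's actual code):

```python
if k_limit < 1:
  raise ValueError(f"k_limit must be >= 1, got {k_limit!r}")
```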
```python
# This multi-pass loop implements the "Greedy with Duplicates" strategy.
# By allowing duplicates in multiple passes, we increase the total sample
# size N, which significantly improves the signal-to-noise ratio.
while True:
```
Optional: in pass i > 1, any element added must already have been added in pass i - 1. This gives two ways to speed up the code (sketch below):
- The algorithm needs at most k_limit passes, so you could prevent it from making the (k_limit + 1)-th pass.
- You could remove an example from sorted_indices once it can no longer be added, so that future passes over sorted_indices are shorter.
Since this is an example, optimizing the runtime isn't critical, but it could be nice.
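A sketch of how both speed-ups might look, assuming the loop structure of the excerpt above (everything beyond get_safe_indices, user_mapping, k_limit, sorted_indices, and safe_indices is my own naming, not the PR's actual code):

```python
import collections

def get_safe_indices(user_mapping, k_limit):
  counts = collections.Counter()  # Per-user contribution counts.
  safe_indices = []
  # Greedy order: examples with fewer owners first.
  sorted_indices = sorted(user_mapping, key=lambda i: len(user_mapping[i]))
  for _ in range(k_limit):  # At most k_limit passes are ever useful.
    added_this_pass = []
    for idx in sorted_indices:
      owners = user_mapping[idx]
      if all(counts[user] < k_limit for user in owners):
        safe_indices.append(idx)
        counts.update(owners)
        added_this_pass.append(idx)
      # An example that fails here can never succeed later, since
      # counts only grow; it is dropped from future passes.
    if not added_this_pass:
      break
    sorted_indices = added_this_pass
  return safe_indices
```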
```python
  return safe_indices


def generate_dummy_data(num_examples, num_users, max_owners):
```
Add types here
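e.g. (a sketch; the return type is an assumption, inferred from user_mapping's documented shape):

```python
def generate_dummy_data(
    num_examples: int,
    num_users: int,
    max_owners: int,
) -> dict[int, list[int]]:  # Assumed: returns a user_mapping-style dict.
  ...
```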
```python
print(f"\nRunning selection logic with k_limit = {k_limit}...")
safe_indices = get_safe_indices(user_mapping, k_limit)

# --- 3. Mandatory Correctness Verification ---
```
If the algorithm is implemented correctly this block should be unnecessary; I'd recommend deleting it (it might be reasonable to include something like this in a test file, but I don't think we need it here).
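Something along these lines could live in a test file (a sketch; the import path is hypothetical):

```python
import collections
# from multi_user_example import get_safe_indices  # Hypothetical path.

def test_selection_respects_k_limit():
  user_mapping = {0: [0], 1: [0, 1], 2: [1], 3: [0, 1], 4: [2]}
  k_limit = 2
  safe_indices = get_safe_indices(user_mapping, k_limit)
  # Count how many selected examples each user owns, duplicates included.
  contributions = collections.Counter(
      user for idx in safe_indices for user in user_mapping[idx])
  assert all(count <= k_limit for count in contributions.values())
```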
| ) | ||
|
|
||
| # Initialize a simple Flax model and optimizer | ||
| model = SimpleDenseNet() |
If we are not actually going to compute gradients, what is the need to even define and initialize the model? I would either delete the code relating to the model for clarity, or try to implement actual DP optimization (e.g. using https://github.com/google-deepmind/jax_privacy/blob/main/examples/dp_logistic_regression.py as a template)
This Pull Request introduces a standalone example script demonstrating Multi-User Contribution Bounding, as detailed in Ganesh et al. (2025) (https://arxiv.org/abs/2503.03622).
The example addresses the "multi-attribution" problem where single records (such as group chat messages or co-authored research papers) are owned by multiple users simultaneously. To provide a rigorous privacy guarantee for all owners, we must ensure no single user’s data is over-represented in the final training set.
Implementation Highlights:
✔️ Greedy Bounding Algorithm: I have implemented the Greedy with Duplicates strategy (Algorithm 3 from the paper). This approach prioritizes examples with lower cardinality (fewer owners) to maximize dataset utility while strictly adhering to a per-user contribution limit ($k$).
✔️ Duplicate Handling: Following the paper's findings that noise reduction often outweighs sampling bias in high-dimensional training, the logic allows for multiple passes over the dataset to maximize the sample size ($N$).
✔️ Poisson Integration: The script demonstrates how to integrate this pre-processing step with a standard DP training loop using CyclicPoissonSampling.
✔️ Correctness Verification: To verify the mathematical integrity of the selection process, I included the following automated checks in the script:
Strict Bound Enforcement: An explicit assertion block verifies the final selection against the user_mapping. The log confirms: [Verification] Verified k_limit check passed. Max user contribution: 15.
Utilization Metrics: The script reports the Utilization Rate, demonstrating that the "Duplicates" strategy successfully retrieves more than 100% of the unique indices in high-budget scenarios, significantly improving the signal-to-noise ratio for the DP optimizer.
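As an illustration of a utilization rate above 100% (hypothetical numbers, not the script's actual output): if the greedy passes select 1,200 index slots, duplicates included, from a pool of 1,000 unique examples, the utilization rate is 1200 / 1000 = 120%, because duplicate selections count toward the total sample size $N$.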