Multi-User Contribution Bounding for shared ownership datasets (#88) Split PR 2 #108
Conversation
…es (Ganesh et al., 2025)
```python
from jax_privacy.batch_selection import CyclicPoissonSampling
...
def get_safe_indices(user_mapping, k_limit):
```
Would be good to add types here
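For example (a minimal sketch; the concrete element types are assumptions based on the docstring quoted below):

```python
def get_safe_indices(
    user_mapping: dict[int, list[int]],
    k_limit: int,
) -> list[int]:
  ...
```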
```python
# limitations under the License.
"""Example of data selection for datasets with multi-user attribution.

Literature Reference: Ganesh et al. (2025), "It’s My Data Too: Private ML for
```
Consider adding a direct link to https://arxiv.org/abs/2503.03622
```python
Args:
  user_mapping: A dictionary mapping an example index to a list of its owners.
  k_limit: The maximum number of times any single user can contribute data.
```
Maybe add a check that this is >= 1
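For instance, a guard at the top of get_safe_indices (a sketch, not the PR's actual code):

```python
if k_limit < 1:
  raise ValueError(f"k_limit must be >= 1, got {k_limit!r}")
```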
```python
# This multi-pass loop implements the "Greedy with Duplicates" strategy.
# By allowing duplicates in multiple passes, we increase the total sample
# size N, which significantly improves the signal-to-noise ratio.
while True:
```
Optional: in pass i > 1, any element added must already have been added in pass i - 1. This gives two ways to speed up the code (sketch below):
- The algorithm needs at most k_limit passes, so you could prevent it from making the (k_limit + 1)-th pass.
- You could remove an example from sorted_indices once it can no longer be added, so that future passes over sorted_indices are shorter.
Since this is an example, optimizing the runtime isn't critical, but it could be nice.
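A sketch of how both speed-ups might look, assuming the loop structure of the excerpt above (everything beyond get_safe_indices, user_mapping, k_limit, sorted_indices, and safe_indices is my own naming, not the PR's actual code):

```python
import collections

def get_safe_indices(user_mapping, k_limit):
  counts = collections.Counter()  # Per-user contribution counts.
  safe_indices = []
  # Greedy order: examples with fewer owners first.
  sorted_indices = sorted(user_mapping, key=lambda i: len(user_mapping[i]))
  for _ in range(k_limit):  # At most k_limit passes are ever useful.
    added_this_pass = []
    for idx in sorted_indices:
      owners = user_mapping[idx]
      if all(counts[user] < k_limit for user in owners):
        safe_indices.append(idx)
        counts.update(owners)
        added_this_pass.append(idx)
      # An example that fails here can never succeed later, since
      # counts only grow; it is dropped from future passes.
    if not added_this_pass:
      break
    sorted_indices = added_this_pass
  return safe_indices
```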
```python
  return safe_indices


def generate_dummy_data(num_examples, num_users, max_owners):
```
Add types here
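e.g. (a sketch; the return type is an assumption, inferred from user_mapping's documented shape):

```python
def generate_dummy_data(
    num_examples: int,
    num_users: int,
    max_owners: int,
) -> dict[int, list[int]]:  # Assumed: returns a user_mapping-style dict.
  ...
```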
```python
print(f"\nRunning selection logic with k_limit = {k_limit}...")
safe_indices = get_safe_indices(user_mapping, k_limit)

# --- 3. Mandatory Correctness Verification ---
```
If the algorithm is implemented correctly this block should be unnecessary; I'd recommend deleting it (it might be reasonable to include something like this in a test file, but I don't think we need it here).
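Something along these lines could live in a test file (a sketch; the import path is hypothetical):

```python
import collections
# from multi_user_example import get_safe_indices  # Hypothetical path.

def test_selection_respects_k_limit():
  user_mapping = {0: [0], 1: [0, 1], 2: [1], 3: [0, 1], 4: [2]}
  k_limit = 2
  safe_indices = get_safe_indices(user_mapping, k_limit)
  # Count how many selected examples each user owns, duplicates included.
  contributions = collections.Counter(
      user for idx in safe_indices for user in user_mapping[idx])
  assert all(count <= k_limit for count in contributions.values())
```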
| ) | ||
|
|
||
| # Initialize a simple Flax model and optimizer | ||
| model = SimpleDenseNet() |
If we are not actually going to compute gradients, what is the need to even define and initialize the model? I would either delete the code relating to the model for clarity, or try to implement actual DP optimization (e.g. using https://github.com/google-deepmind/jax_privacy/blob/main/examples/dp_logistic_regression.py as a template)
This Pull Request introduces a standalone example script demonstrating Multi-User Contribution Bounding, as detailed in Ganesh et al. (2025) (https://arxiv.org/abs/2503.03622).
The example addresses the "multi-attribution" problem where single records (such as group chat messages or co-authored research papers) are owned by multiple users simultaneously. To provide a rigorous privacy guarantee for all owners, we must ensure no single user’s data is over-represented in the final training set.
Implementation Highlights:
✔️ Greedy Bounding Algorithm: I have implemented the Greedy with Duplicates strategy (Algorithm 3 from the paper). This approach prioritizes examples with lower cardinality (fewer owners) to maximize dataset utility while strictly adhering to a per-user contribution limit ($k$).
✔️ Duplicate Handling: Following the paper's findings that noise reduction often outweighs sampling bias in high-dimensional training, the logic allows for multiple passes over the dataset to maximize the sample size ($N$).
✔️ Poisson Integration: The script demonstrates how to integrate this pre-processing step with a standard DP training loop using CyclicPoissonSampling.
✔️ Correctness Verification: To verify the mathematical integrity of the selection process, I included the following automated checks in the script:
Strict Bound Enforcement: An explicit assertion block verifies the final selection against the user_mapping. The log confirms: [Verification] Verified k_limit check passed. Max user contribution: 15.
Utilization Metrics: The script reports the Utilization Rate, demonstrating that the "Duplicates" strategy successfully retrieves more than 100% of the unique indices in high-budget scenarios, significantly improving the signal-to-noise ratio for the DP optimizer.
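As an illustration of a utilization rate above 100% (hypothetical numbers, not the script's actual output): if the greedy passes select 1,200 index slots, duplicates included, from a pool of 1,000 unique examples, the utilization rate is 1200 / 1000 = 120%, because duplicate selections count toward the total sample size $N$.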