
Conversation

@debanganghosh08

This Pull Request introduces a standalone example script demonstrating Multi-User Contribution Bounding, as detailed in the referenced research paper (Ganesh et al., 2025).

The example addresses the "multi-attribution" problem, where a single record (such as a group chat message or a co-authored research paper) is owned by multiple users simultaneously. To provide a rigorous privacy guarantee for all owners, we must ensure that no single user's data is over-represented in the final training set.

Implementation Highlights:

✔️ Greedy Bounding Algorithm: I have implemented the Greedy with Duplicates strategy (Algorithm 3 from the paper). This approach prioritizes examples with lower cardinality (fewer owners) to maximize dataset utility while strictly adhering to a per-user contribution limit ($k$); a sketch of the selection loop appears below.
✔️ Duplicate Handling: Following the paper's finding that noise reduction often outweighs sampling bias in high-dimensional training, the logic allows multiple passes over the dataset to maximize the sample size ($N$).
✔️ Poisson Integration: The script demonstrates how to integrate this pre-processing step with a standard DP training loop using CyclicPoissonSampling.
✔️ Correctness Verification: To verify the integrity of the selection process, I included the following automated checks in the script. Strict Bound Enforcement: an explicit assertion block verifies the final selection against the user_mapping; the log confirms [Verification] Verified k_limit check passed. Max user contribution: 15.

Utilization Metrics: The script reports the Utilization Rate, showing that the "Duplicates" strategy retrieves more than 100% of the unique indices in high-budget scenarios, which significantly improves the signal-to-noise ratio for the DP optimizer.
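For reference, here is a minimal sketch of the selection loop described above. The names user_mapping, k_limit, and get_safe_indices mirror the script, but this body is illustrative only and not the exact implementation under review.

# Minimal sketch of the "Greedy with Duplicates" selection (illustrative only).
# user_mapping maps an example index to the list of users who own it;
# k_limit is the per-user contribution bound.
from collections import defaultdict


def get_safe_indices(user_mapping, k_limit):
  # Consider lower-cardinality examples (fewer owners) first so that more
  # examples fit under the per-user budget.
  sorted_indices = sorted(user_mapping, key=lambda idx: len(user_mapping[idx]))
  counts = defaultdict(int)  # user id -> contributions used so far
  safe_indices = []
  while True:
    added_this_pass = False
    for idx in sorted_indices:
      owners = user_mapping[idx]
      # Accept the example only if every owner still has budget left.
      if all(counts[u] < k_limit for u in owners):
        for u in owners:
          counts[u] += 1
        safe_indices.append(idx)
        added_this_pass = True
    # Repeated passes allow duplicates, growing the sample size N until no
    # example can be added without violating the per-user bound.
    if not added_this_pass:
      return safe_indices


# Example: with k_limit = 2, example 2 is selected twice because both of its
# owners still have budget after the first pass.
print(get_safe_indices({0: [0, 1], 1: [1], 2: [2, 3], 3: [0]}, k_limit=2))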

debanganghosh08 changed the title from "Multi-User Contribution Bounding for shared ownership datasets (#88)" to "[Example] Multi-User Contribution Bounding for shared ownership datasets (#88) Split PR 2" on Jan 21, 2026
debanganghosh08 changed the title from "[Example] Multi-User Contribution Bounding for shared ownership datasets (#88) Split PR 2" to "Multi-User Contribution Bounding for shared ownership datasets (#88) Split PR 2" on Jan 21, 2026
from jax_privacy.batch_selection import CyclicPoissonSampling


def get_safe_indices(user_mapping, k_limit):
Collaborator

Would be good to add types here
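For example, annotations along these lines (the concrete index and owner-id types are assumptions):

def get_safe_indices(
    user_mapping: dict[int, list[int]], k_limit: int
) -> list[int]:
  ...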

# limitations under the License.
"""Example of data selection for datasets with multi-user attribution.

Literature Reference: Ganesh et al. (2025), "It’s My Data Too: Private ML for
Collaborator

Consider adding a direct link to https://arxiv.org/abs/2503.03622


Args:
user_mapping: A dictionary mapping an example index to a list of its owners.
k_limit: The maximum number of times any single user can contribute data.
Collaborator

Maybe add a check that this is >= 1
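A minimal guard at the top of the function could look like this:

if k_limit < 1:
  raise ValueError(f"k_limit must be >= 1, got {k_limit}")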

# This multi-pass loop implements the "Greedy with Duplicates" strategy.
# By allowing duplicates in multiple passes, we increase the total sample
# size N, which significantly improves the signal-to-noise ratio.
while True:
Collaborator

Optional: In pass i > 1, any elements added must have already been added in pass i - 1. This means there are two ways to speed up the code:
- The algorithm needs at most k_limit passes (you could make it more efficient by preventing it from making the (k_limit + 1)-th pass).
- You could remove an example from sorted_indices if it's not able to be added (so that future passes over sorted_indices are shorter).

Since this is an example, optimizing the runtime isn't super important, but it could be nice; see the sketch below.
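A sketch of both suggestions, building on the selection loop sketched in the PR description (variable names are assumptions mirroring the script):

# Optimized sketch: cap the loop at k_limit passes and prune examples that
# can no longer be added. Illustrative only.
from collections import defaultdict


def get_safe_indices(user_mapping, k_limit):
  sorted_indices = sorted(user_mapping, key=lambda idx: len(user_mapping[idx]))
  counts = defaultdict(int)
  safe_indices = []
  # More than k_limit passes can never add anything, so stop there.
  for _ in range(k_limit):
    still_addable = []
    for idx in sorted_indices:
      owners = user_mapping[idx]
      if all(counts[u] < k_limit for u in owners):
        for u in owners:
          counts[u] += 1
        safe_indices.append(idx)
        still_addable.append(idx)  # may fit again in a later pass
      # An example rejected once can never be accepted later, so drop it.
    if not still_addable:
      break
    sorted_indices = still_addable
  return safe_indices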

return safe_indices


def generate_dummy_data(num_examples, num_users, max_owners):
Collaborator

Add types here
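For example (the return type is an assumption about what the helper produces):

def generate_dummy_data(
    num_examples: int, num_users: int, max_owners: int
) -> dict[int, list[int]]:
  ...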

print(f"\nRunning selection logic with k_limit = {k_limit}...")
safe_indices = get_safe_indices(user_mapping, k_limit)

# --- 3. Mandatory Correctness Verification ---
Collaborator

If the algorithm is implemented correctly this block should be unnecessary; I'd recommend deleting (this might be reasonable to e.g. include in a test file, but I don't think we need it here)
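If it is moved to a test file, the bound check could look roughly like this sketch (the module name and the absltest harness are assumptions):

from collections import Counter

from absl.testing import absltest

# Hypothetical import; adjust to wherever the example module ends up living.
from multi_user_contribution_bounding import get_safe_indices


class GetSafeIndicesTest(absltest.TestCase):

  def test_no_user_exceeds_k_limit(self):
    user_mapping = {0: [0, 1], 1: [1], 2: [2], 3: [0, 2]}
    k_limit = 2
    safe_indices = get_safe_indices(user_mapping, k_limit)
    # Count how many selected examples each user owns.
    counts = Counter(u for idx in safe_indices for u in user_mapping[idx])
    self.assertLessEqual(max(counts.values()), k_limit)


if __name__ == "__main__":
  absltest.main()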

)

# Initialize a simple Flax model and optimizer
model = SimpleDenseNet()
Collaborator

If we are not actually going to compute gradients, what is the need to even define and initialize the model? I would either delete the code relating to the model for clarity, or try to implement actual DP optimization (e.g. using https://github.com/google-deepmind/jax_privacy/blob/main/examples/dp_logistic_regression.py as a template)
