Skip to content

Behavior of censor_dims=true that I don't understand #568

@mhauru

Description

@mhauru

I have a case in which I don't understand the behavior of censor_dims. Namely, I run essentially the same query in two different ways but get different results. The data is a table with two columns category and id. The two ways are

  1. By doing a COUNT(*) ... GROUP BY category query with max_ids=100
  2. By doing a non-DP preprocessing of COUNT(*) AS counts ... GROUP BY category, id and then on the result of that a SUM(counts) ... GROUP BY category query with max_ids=2 (because category takes two different values) but upper=100 for counts.

I would expect the results be the same, and they roughly are, except for censor_dims which consistently censors the results in the first case, but not in the second.

I'm not sure if this is an issue with smartnoise-sql or with my understanding of DP/the censor dims mechanism, but I'm hoping to get some clarity on that. Here's a code snippet that illustrates:

import numpy as np
import pandas as pd
import random
import snsql
import yaml


def run_dp_query(df, query, epsilon, delta, metadata_yaml):
    metadata = yaml.load(metadata_yaml, Loader=yaml.SafeLoader)
    snsql_metadata = {"": {"": {"test_table": metadata}}}
    privacy = snsql.Privacy(epsilon=epsilon, delta=delta)
    reader = snsql.from_df(df, privacy=privacy, metadata=snsql_metadata)
    private_result = reader.execute(query)
    return private_result


def main():
    # Create some data
    random.seed(0)
    ids = list(range(100))
    df1 = pd.DataFrame(
        [
            {
                "id": random.choice(ids),
                "category": random.choice(["a", "b"]),
            }
            for i in range(10000)
        ]
    )

    # Set DP parameters
    # NOTE the extremely large value of epsilon.
    # The surprising behavior is there with all smaller values too.
    epsilon = 100.0
    delta = 0.0001

    # Run a DP query that counts values by category.
    query1 = """SELECT
          COUNT(*) AS counts, category
          FROM test_table
          GROUP BY category"""
    metadata_yaml1 = """
          max_ids: 100
          id:
            type: int
            private_id: true
          category:
            type: string"""

    private_result1 = run_dp_query(df1, query1, epsilon, delta, metadata_yaml1)
    print("DP counts the first way:")
    print(private_result1)

    # Run the same query in a different way, with some preprocessing.
    # Namely, first count the values in each category for each id without DP, then sum
    # over different ids with DP.
    df2 = df1.groupby(["id", "category"]).size().reset_index(name="counts")
    query2 = """SELECT
          SUM(counts) AS counts, category
          FROM test_table
          GROUP BY category"""
    metadata_yaml2 = """
          max_ids: 2
          id:
            type: int
            private_id: true
          category:
            type: string
          counts:
            type: int
            lower: 0
            upper: 100"""
    private_result2 = run_dp_query(df2, query2, epsilon, delta, metadata_yaml2)
    print("DP counts the second way:")
    print(private_result2)


main()

Typical output of running the above is

DP counts the first way:
[['counts', 'category']]
DP counts the second way:
[['counts', 'category'], [5071, 'a'], [4930, 'b']]

Tagging @fhoussiau because we had a chat about this earlier today.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions