I have a case in which I don't understand the behavior of censor_dims. Namely, I run essentially the same query in two different ways but get different results. The data is a table with two columns category and id. The two ways are
- By doing a
COUNT(*) ... GROUP BY category query with max_ids=100
- By doing a non-DP preprocessing of
COUNT(*) AS counts ... GROUP BY category, id and then on the result of that a SUM(counts) ... GROUP BY category query with max_ids=2 (because category takes two different values) but upper=100 for counts.
I would expect the results be the same, and they roughly are, except for censor_dims which consistently censors the results in the first case, but not in the second.
I'm not sure if this is an issue with smartnoise-sql or with my understanding of DP/the censor dims mechanism, but I'm hoping to get some clarity on that. Here's a code snippet that illustrates:
import numpy as np
import pandas as pd
import random
import snsql
import yaml
def run_dp_query(df, query, epsilon, delta, metadata_yaml):
metadata = yaml.load(metadata_yaml, Loader=yaml.SafeLoader)
snsql_metadata = {"": {"": {"test_table": metadata}}}
privacy = snsql.Privacy(epsilon=epsilon, delta=delta)
reader = snsql.from_df(df, privacy=privacy, metadata=snsql_metadata)
private_result = reader.execute(query)
return private_result
def main():
# Create some data
random.seed(0)
ids = list(range(100))
df1 = pd.DataFrame(
[
{
"id": random.choice(ids),
"category": random.choice(["a", "b"]),
}
for i in range(10000)
]
)
# Set DP parameters
# NOTE the extremely large value of epsilon.
# The surprising behavior is there with all smaller values too.
epsilon = 100.0
delta = 0.0001
# Run a DP query that counts values by category.
query1 = """SELECT
COUNT(*) AS counts, category
FROM test_table
GROUP BY category"""
metadata_yaml1 = """
max_ids: 100
id:
type: int
private_id: true
category:
type: string"""
private_result1 = run_dp_query(df1, query1, epsilon, delta, metadata_yaml1)
print("DP counts the first way:")
print(private_result1)
# Run the same query in a different way, with some preprocessing.
# Namely, first count the values in each category for each id without DP, then sum
# over different ids with DP.
df2 = df1.groupby(["id", "category"]).size().reset_index(name="counts")
query2 = """SELECT
SUM(counts) AS counts, category
FROM test_table
GROUP BY category"""
metadata_yaml2 = """
max_ids: 2
id:
type: int
private_id: true
category:
type: string
counts:
type: int
lower: 0
upper: 100"""
private_result2 = run_dp_query(df2, query2, epsilon, delta, metadata_yaml2)
print("DP counts the second way:")
print(private_result2)
main()
Typical output of running the above is
DP counts the first way:
[['counts', 'category']]
DP counts the second way:
[['counts', 'category'], [5071, 'a'], [4930, 'b']]
Tagging @fhoussiau because we had a chat about this earlier today.
I have a case in which I don't understand the behavior of
censor_dims. Namely, I run essentially the same query in two different ways but get different results. The data is a table with two columnscategoryandid. The two ways areCOUNT(*) ... GROUP BY categoryquery withmax_ids=100COUNT(*) AS counts ... GROUP BY category, idand then on the result of that aSUM(counts) ... GROUP BY categoryquery withmax_ids=2(becausecategorytakes two different values) butupper=100forcounts.I would expect the results be the same, and they roughly are, except for
censor_dimswhich consistently censors the results in the first case, but not in the second.I'm not sure if this is an issue with smartnoise-sql or with my understanding of DP/the censor dims mechanism, but I'm hoping to get some clarity on that. Here's a code snippet that illustrates:
Typical output of running the above is
Tagging @fhoussiau because we had a chat about this earlier today.