Behavior of `censor_dims=true` that I don't understand

I have a case in which I don't understand the behavior of `censor_dims`. Namely, I run essentially the same query in two different ways but get different results. The data is a table with two columns `category` and `id`. The two ways are 
1. By doing a `COUNT(*) ... GROUP BY category` query with `max_ids=100`
2. By doing a non-DP preprocessing of `COUNT(*) AS counts ... GROUP BY category, id` and then on the result of that a `SUM(counts) ... GROUP BY category` query with `max_ids=2` (because `category` takes two different values) but `upper=100` for `counts`.
 
I would expect the results be the same, and they roughly are, except for `censor_dims` which consistently censors the results in the first case, but not in the second.

I'm not sure if this is an issue with smartnoise-sql or with my understanding of DP/the censor dims mechanism, but I'm hoping to get some clarity on that. Here's a code snippet that illustrates:

```
import numpy as np
import pandas as pd
import random
import snsql
import yaml


def run_dp_query(df, query, epsilon, delta, metadata_yaml):
    metadata = yaml.load(metadata_yaml, Loader=yaml.SafeLoader)
    snsql_metadata = {"": {"": {"test_table": metadata}}}
    privacy = snsql.Privacy(epsilon=epsilon, delta=delta)
    reader = snsql.from_df(df, privacy=privacy, metadata=snsql_metadata)
    private_result = reader.execute(query)
    return private_result


def main():
    # Create some data
    random.seed(0)
    ids = list(range(100))
    df1 = pd.DataFrame(
        [
            {
                "id": random.choice(ids),
                "category": random.choice(["a", "b"]),
            }
            for i in range(10000)
        ]
    )

    # Set DP parameters
    # NOTE the extremely large value of epsilon.
    # The surprising behavior is there with all smaller values too.
    epsilon = 100.0
    delta = 0.0001

    # Run a DP query that counts values by category.
    query1 = """SELECT
          COUNT(*) AS counts, category
          FROM test_table
          GROUP BY category"""
    metadata_yaml1 = """
          max_ids: 100
          id:
            type: int
            private_id: true
          category:
            type: string"""

    private_result1 = run_dp_query(df1, query1, epsilon, delta, metadata_yaml1)
    print("DP counts the first way:")
    print(private_result1)

    # Run the same query in a different way, with some preprocessing.
    # Namely, first count the values in each category for each id without DP, then sum
    # over different ids with DP.
    df2 = df1.groupby(["id", "category"]).size().reset_index(name="counts")
    query2 = """SELECT
          SUM(counts) AS counts, category
          FROM test_table
          GROUP BY category"""
    metadata_yaml2 = """
          max_ids: 2
          id:
            type: int
            private_id: true
          category:
            type: string
          counts:
            type: int
            lower: 0
            upper: 100"""
    private_result2 = run_dp_query(df2, query2, epsilon, delta, metadata_yaml2)
    print("DP counts the second way:")
    print(private_result2)


main()
```

Typical output of running the above is
```
DP counts the first way:
[['counts', 'category']]
DP counts the second way:
[['counts', 'category'], [5071, 'a'], [4930, 'b']]
```
Tagging @fhoussiau because we had a chat about this earlier today.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Behavior of `censor_dims=true` that I don't understand #568

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Behavior of censor_dims=true that I don't understand #568

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Behavior of `censor_dims=true` that I don't understand #568