Use JSONB containment for GIN-friendly EQ filters #3646

Open
sambhav wants to merge 4 commits into Kinto:main from sambhav:gin-containment-optimization

@sambhav sambhav commented Feb 15, 2026

Summary

  • Rewrites _format_conditions to emit data @> '{"field": value}' (JSONB containment) instead of data->'field' = 'value'::jsonb for EQ filters on scalar data fields (str, int, float, bool, None)
  • Rewrites CONTAINS filters to use top-level containment (data @> '{"field": [values]}') instead of sub-expression containment (data->'field' @> '[values]'), removing the now-redundant jsonb_typeof guard
  • Array/object EQ values keep using arrow extraction to preserve exact equality semantics (@> uses superset matching for non-scalars)
  • Normalizes _format_sorting JSONB accessor expressions to match _format_conditions format (removes redundant parentheses)
  • Documents the recommended GIN index in the Storage class docstring
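
A minimal sketch of the EQ branching described in the summary, not the actual _format_conditions code. The function name, the filters_field_* placeholder names, and the omission of ::jsonb casts are assumptions made for illustration; only the filters_value_0 placeholder name appears in the tests quoted in this PR.

```python
import json

# JSON scalar types for which containment and equality coincide.
SCALAR_TYPES = (str, int, float, bool, type(None))


def format_eq_condition(field, value, index=0):
    """Return (sql_fragment, holders) for an equality filter on a data field.

    Hypothetical stand-in for the EQ branch of _format_conditions.
    """
    subfields = field.split(".")
    value_holder = f"filters_value_{index}"
    holders = {}

    if isinstance(value, SCALAR_TYPES):
        # Scalars: wrap the value in nested objects, then use top-level containment.
        containment_obj = value
        for subfield in reversed(subfields):
            containment_obj = {subfield: containment_obj}
        holders[value_holder] = json.dumps(containment_obj)
        return f"data @> :{value_holder}", holders

    # Arrays/objects: keep arrow extraction so equality stays exact.
    accessor = "data"
    for j, subfield in enumerate(subfields):
        field_holder = f"filters_field_{index}_{j}"  # assumed placeholder name
        holders[field_holder] = subfield
        accessor += f"->:{field_holder}"
    holders[value_holder] = json.dumps(value)
    return f"{accessor} = :{value_holder}", holders
```

Calling format_eq_condition("status", "active") yields a containment fragment, while format_eq_condition("tags", ["a"]) falls back to the arrow form.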

Why this matters

A GIN index on data built with the jsonb_path_ops operator class accelerates the @> containment operator (and the jsonpath operators @? and @@), and only when the operator is applied to the indexed data column itself. Previously, EQ used data->'field' = value and CONTAINS used data->'field' @> value; neither form can use a GIN index on data. By rewriting both to top-level data @> '{"field": value}', a single GIN index can accelerate scalar equality and array-contains queries across all collections and fields:

CREATE INDEX CONCURRENTLY idx_objects_data_gin
    ON objects USING gin (data jsonb_path_ops)
    WHERE NOT deleted;

Without this index, the rewritten queries are not expected to perform differently: for the values they apply to, they are semantically equivalent to the old form. The index is intentionally not auto-created; it's documented as an optional optimization for large deployments.

What the GIN index accelerates:

  • ?status=active → data @> '{"status": "active"}'
  • ?person.name=Alice → data @> '{"person": {"name": "Alice"}}'
  • ?contains_colors=red → data @> '{"colors": ["red"]}'

What it does NOT accelerate:

  • Range filters (min_, max_, gt_, lt_)
  • LIKE/text search
  • contains_any_ (uses && array overlap)
  • Sorting on JSONB fields
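
The two lists above can be summarized as a small mapping from a filter parameter to the containment literal it would produce, or to nothing when the GIN index cannot help. This is a rough sketch under assumptions: containment_for_param is a hypothetical name, the like_ prefix is assumed to be the spelling of the LIKE filter, the value is taken as already parsed, and other prefixes and the id/modified exclusion are ignored.

```python
import json

# Prefixes from the "does NOT accelerate" list ("like_" is an assumed spelling).
NON_GIN_PREFIXES = ("min_", "max_", "gt_", "lt_", "like_", "contains_any_")


def containment_for_param(name, value):
    """Return the JSON literal for `data @> ...`, or None if not GIN-eligible."""
    if name.startswith(NON_GIN_PREFIXES):
        return None
    if name.startswith("contains_"):
        # CONTAINS: the right-hand side is always an array under the field.
        field = name[len("contains_"):]
        nested = value if isinstance(value, list) else [value]
    else:
        # EQ: only scalars use containment; arrays/objects keep arrow extraction.
        if isinstance(value, (list, dict)):
            return None
        field, nested = name, value
    # Dotted names become nested objects, innermost first.
    for part in reversed(field.split(".")):
        nested = {part: nested}
    return json.dumps(nested)
```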

Test plan

  • 19 new unit tests for SQL generation (EQ scalars, EQ arrays/objects, CONTAINS, CONTAINS_ANY, nested fields, non-EQ operators, id/modified exclusion, sorting normalization)
  • 2 new integration tests for array/object exact equality
  • All 183 existing non-PostgreSQL storage tests pass
  • All 64 filter/sort resource tests pass
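
To illustrate why array/object equality needs its own tests, here is a simplified pure-Python model of how @> treats values nested inside an object. It is an approximation written for this explanation (jsonb_contains is a hypothetical name, and the top-level "array contains primitive" exception is deliberately left out), not PostgreSQL's implementation.

```python
def jsonb_contains(left, right):
    """Approximate `left @> right` for values nested inside an object."""
    if isinstance(right, dict):
        # Every key of the right object must exist on the left and be contained.
        return isinstance(left, dict) and all(
            key in left and jsonb_contains(left[key], val)
            for key, val in right.items()
        )
    if isinstance(right, list):
        # Every right element must be contained by some left element;
        # order and extra left elements are ignored.
        return isinstance(left, list) and all(
            any(jsonb_contains(l_item, r_item) for l_item in left)
            for r_item in right
        )
    if isinstance(left, (dict, list)):
        return False
    # JSON booleans are not numbers, unlike Python's bool/int relationship.
    if isinstance(left, bool) or isinstance(right, bool):
        return left is right
    return left == right


stored = {"tags": ["a", "b"]}
superset_match = jsonb_contains(stored, {"tags": ["a"]})  # containment matches
exact_match = stored["tags"] == ["a"]                     # equality does not
empty_match = jsonb_contains({"tags": ["a"]}, {"tags": []})
```

In this model an empty array is contained by any array, which is consistent with the commit note that EQ on empty arrays/objects must not use containment.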

🤖 Generated with Claude Code

sambhav and others added 4 commits February 15, 2026 20:57
Rewrite _format_conditions to emit `data @> '{"field": value}'` instead
of `data->'field' = 'value'::jsonb` for equality filters on scalar data
fields (str, int, float, bool, None). This is semantically equivalent
for scalars but enables GIN index acceleration when a
`gin(data jsonb_path_ops)` index exists on the objects table.

Array and object EQ values still use the arrow extraction path to
preserve exact equality semantics (containment uses superset matching
for non-scalars).

Also normalizes _format_sorting JSONB accessor expressions to match
_format_conditions format (removing redundant parentheses around
placeholders), ensuring expression indexes work for both filter and
sort queries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Rewrite CONTAINS filters to use `data @> '{"field": [values]}'` instead
of `data->'field' @> '[values]'`. Top-level containment allows a GIN
index on the data column to accelerate these queries. The jsonb_typeof
guard is no longer needed since containment already returns false when
the field is not an array.

Add documentation to the Storage class docstring describing the
recommended GIN index, what it accelerates, what it doesn't, and
approximate sizing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Expand the GIN index documentation with three options:

1. Recommended: partial index with WHERE resource_name = 'record'
   (smallest, scoped to actual records, works with psycopg2)
2. Basic: partial index with WHERE NOT deleted only
   (driver-independent fallback)
3. Composite: btree_gin extension with parent_id + resource_name
   in the GIN index (single index scan, no BitmapAnd needed)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add tests covering:
- CONTAINS fallback for id/modified fields (covers dead-code branch)
- EQ with falsy scalars: None, empty string, 0, False
- EQ with deeply nested fields (a.b.c)
- EQ with empty arrays/objects (must NOT use containment)
- CONTAINS with numeric arrays and object elements
- CONTAINS with nested fields

These tests verify that the @> rewrite produces identical behavior
to the old arrow extraction form across all data types and edge cases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@leplatrem leplatrem left a comment

Thank you

This seems useful indeed.

I left some comments/questions.
I was thinking we could ship the index migrations with this pull request.
But we could indeed do it in several steps:

  1. Merge this
  2. Deploy
  3. Create indexes manually on DB
  4. Validate improvements
  5. Create another PR that ships these indexes as migrations

WDYT?

This index is **not created automatically** because it can take significant
time on large tables. Create it manually using ``CONCURRENTLY`` to avoid
blocking reads and writes.

We've created indexes in previous migrations already. I think we could do it, and just add a warning in the release notes.

If you switch to a driver that uses server-side parameter binding
(e.g. psycopg3 defaults), the planner may not be able to prove the
partial condition is satisfied. In that case, fall back to the
basic index below.

Note: Kinto should always pin the version of psycopg that it uses. So this note is more addressed to us when upgrading to psycopg3 in the future. Let's put this in an issue!

This is the smallest and most efficient option. The
``resource_name = 'record'`` partial condition excludes bucket, collection,
and group metadata objects (which are never filtered by JSONB fields),

We have idx_objects_history_userid_and_resourcename which indexes the history objects.

CREATE INDEX idx_objects_history_userid_and_resourcename
  ON objects ((data->'user_id'), (data->'resource_name'))
  WHERE resource_name = 'history';

Browsing history in the Admin UI heavily relies on data-> filters.

Shall we drop it and replace it with one using gin?

CREATE INDEX CONCURRENTLY idx_objects_data_gin
ON objects USING gin (parent_id, resource_name, data jsonb_path_ops)
WHERE NOT deleted;

Shall we replace the idx_objects_resource_name_parent_id_deleted that we already have with this composite?

containment_obj = filtr.value
for subfield in reversed(subfields):
    containment_obj = {subfield: containment_obj}
holders[value_holder] = json.dumps(containment_obj)

Shall we extract these few lines into a small helper? (It would be reused by CONTAINS.)
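
One possible shape for such a helper, as a sketch only: build_containment_json is a hypothetical name, and how the CONTAINS branch prepares its value (shown here as an already-built list) is an assumption.

```python
import json


def build_containment_json(subfields, value):
    """Wrap `value` in nested objects, one level per subfield, and serialize it.

    Hypothetical helper; mirrors the loop quoted in the snippet above.
    """
    containment_obj = value
    for subfield in reversed(subfields):
        containment_obj = {subfield: containment_obj}
    return json.dumps(containment_obj)


# EQ on a nested scalar and CONTAINS on an array field share the same helper.
eq_literal = build_containment_json(["person", "name"], "Alice")
contains_literal = build_containment_json(["colors"], ["red"])
```

Both call sites would then reduce to a single assignment such as holders[value_holder] = build_containment_json(subfields, value).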

value = holders["filters_value_0"]
self.assertEqual(json.loads(value), {"count": 0})

def test_eq_with_false_uses_containment(self):

Merge with test_eq_boolean_uses_containment

]


class FormatConditionsContainmentTest(unittest.TestCase):

For better readability, please split this test into two: one for assertions about COMPARISON.EQ and another for COMPARISON.CONTAINS

sql, holders = storage._format_sorting(sorting, "id", "last_modified")
# Should be ->:sort_field_0_0 not ->(sort_field_0_0)
self.assertIn("->:sort_field_0_0", sql)
self.assertNotIn("->(:", sql)

This assertNotIn() refers to some code deleted by this PR. I don't think this is relevant.

sorting = [Sort("status", 1)]
sql, holders = storage._format_sorting(sorting, "id", "last_modified")
# Should be ->:sort_field_0_0 not ->(sort_field_0_0)
self.assertIn("->:sort_field_0_0", sql)

assert on holders["sort_field_0_0"] ?

sorting = [Sort("person.name", 1)]
sql, holders = storage._format_sorting(sorting, "id", "last_modified")
self.assertIn("->:sort_field_0_0->:sort_field_0_1", sql)
self.assertNotIn("->(:", sql)

ditto

assert on holders["sort_field_0_0"] and holders["sort_field_0_1"] ?
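
A sketch of what asserting on the holders could look like. Because the real test needs a Kinto storage instance, sort_accessor below is a hypothetical stand-in for the accessor-building part of _format_sorting; that each holder carries the corresponding subfield name is an assumption inferred from the placeholder naming, not something shown in this PR.

```python
def sort_accessor(field, index=0):
    """Build a `data->:holder...` accessor and its placeholders (stand-in only)."""
    holders = {}
    sql = "data"
    for j, subfield in enumerate(field.split(".")):
        name = f"sort_field_{index}_{j}"
        holders[name] = subfield  # assumed: holder value is the subfield name
        sql += f"->:{name}"
    return sql, holders


sql, holders = sort_accessor("person.name")
# Check the placeholder values as well as the SQL shape.
assert "->:sort_field_0_0->:sort_field_0_1" in sql
assert holders["sort_field_0_0"] == "person"
assert holders["sort_field_0_1"] == "name"
```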
