Use JSONB containment for GIN-friendly EQ filters#3646
Use JSONB containment for GIN-friendly EQ filters#3646sambhav wants to merge 4 commits intoKinto:mainfrom
Conversation
Rewrite _format_conditions to emit `data @> '{"field": value}'` instead
of `data->'field' = 'value'::jsonb` for equality filters on scalar data
fields (str, int, float, bool, None). This is semantically equivalent
for scalars but enables GIN index acceleration when a
`gin(data jsonb_path_ops)` index exists on the objects table.
Array and object EQ values still use the arrow extraction path to
preserve exact equality semantics (containment uses superset matching
for non-scalars).
Also normalizes _format_sorting JSONB accessor expressions to match
_format_conditions format (removing redundant parentheses around
placeholders), ensuring expression indexes work for both filter and
sort queries.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rewrite CONTAINS filters to use `data @> '{"field": [values]}'` instead
of `data->'field' @> '[values]'`. Top-level containment allows a GIN
index on the data column to accelerate these queries. The jsonb_typeof
guard is no longer needed since containment already returns false when
the field is not an array.
Add documentation to the Storage class docstring describing the
recommended GIN index, what it accelerates, what it doesn't, and
approximate sizing.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Expand the GIN index documentation with three options: 1. Recommended: partial index with WHERE resource_name = 'record' (smallest, scoped to actual records, works with psycopg2) 2. Basic: partial index with WHERE NOT deleted only (driver-independent fallback) 3. Composite: btree_gin extension with parent_id + resource_name in the GIN index (single index scan, no BitmapAnd needed) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add tests covering: - CONTAINS fallback for id/modified fields (covers dead-code branch) - EQ with falsy scalars: None, empty string, 0, False - EQ with deeply nested fields (a.b.c) - EQ with empty arrays/objects (must NOT use containment) - CONTAINS with numeric arrays and object elements - CONTAINS with nested fields These tests verify that the @> rewrite produces identical behavior to the old arrow extraction form across all data types and edge cases. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
leplatrem
left a comment
There was a problem hiding this comment.
Thank you
This seems useful indeed.
I left some comments/questions.
I was thinking we could ship the indexation migrations with this pull-request.
But we could indeed do it in several steps:
- Merge this
- Deploy
- Create indexes manually on DB
- Validate improvements
- Create another PR that ships these indexes as migrations
WDYT?
| This index is **not created automatically** because it can take significant | ||
| time on large tables. Create it manually using ``CONCURRENTLY`` to avoid | ||
| blocking reads and writes. |
There was a problem hiding this comment.
We've created indexes In previous migrations already. I think we could do it, and just add some warning in the release notes
| If you switch to a driver that uses server-side parameter binding | ||
| (e.g. psycopg3 defaults), the planner may not be able to prove the | ||
| partial condition is satisfied. In that case, fall back to the | ||
| basic index below. |
There was a problem hiding this comment.
Note: Kinto should always pin the version of psycopg that it uses. So this note is more adressed to us when upgrading to psycopg3 in the future. Let's put this in an issue!
| This is the smallest and most efficient option. The | ||
| ``resource_name = 'record'`` partial condition excludes bucket, collection, | ||
| and group metadata objects (which are never filtered by JSONB fields), |
There was a problem hiding this comment.
We have idx_objects_history_userid_and_resourcename which indexes the history objects.
CREATE INDEX idx_objects_history_userid_and_resourcename
ON objects ((data->'user_id'), (data->'resource_name'))
WHERE resource_name = 'history';
Browsing history in the Admin UI heavily relies on data-> filters.
Shall we drop it and replace it with one using gin?
| CREATE INDEX CONCURRENTLY idx_objects_data_gin | ||
| ON objects USING gin (parent_id, resource_name, data jsonb_path_ops) | ||
| WHERE NOT deleted; |
There was a problem hiding this comment.
Shall we replace the idx_objects_resource_name_parent_id_deleted that we already have with this composite?
| containment_obj = filtr.value | ||
| for subfield in reversed(subfields): | ||
| containment_obj = {subfield: containment_obj} | ||
| holders[value_holder] = json.dumps(containment_obj) |
There was a problem hiding this comment.
Shall we extract these few lines in a small helper? (reused with CONTAINS)
| value = holders["filters_value_0"] | ||
| self.assertEqual(json.loads(value), {"count": 0}) | ||
|
|
||
| def test_eq_with_false_uses_containment(self): |
There was a problem hiding this comment.
Merge with test_eq_boolean_uses_containment
| ] | ||
|
|
||
|
|
||
| class FormatConditionsContainmentTest(unittest.TestCase): |
There was a problem hiding this comment.
For better readability, please split this test into two: one for assertions about COMPARISON.EQ and another for COMPARISON.CONTAINS
| sql, holders = storage._format_sorting(sorting, "id", "last_modified") | ||
| # Should be ->:sort_field_0_0 not ->(sort_field_0_0) | ||
| self.assertIn("->:sort_field_0_0", sql) | ||
| self.assertNotIn("->(:", sql) |
There was a problem hiding this comment.
This assertNotIn() refers to some code deleted by this PR. I don't this this is relevant
| sorting = [Sort("status", 1)] | ||
| sql, holders = storage._format_sorting(sorting, "id", "last_modified") | ||
| # Should be ->:sort_field_0_0 not ->(sort_field_0_0) | ||
| self.assertIn("->:sort_field_0_0", sql) |
There was a problem hiding this comment.
assert on holders["sort_field_0_0"] ?
| sorting = [Sort("person.name", 1)] | ||
| sql, holders = storage._format_sorting(sorting, "id", "last_modified") | ||
| self.assertIn("->:sort_field_0_0->:sort_field_0_1", sql) | ||
| self.assertNotIn("->(:", sql) |
There was a problem hiding this comment.
ditto
assert on holders["sort_field_0_0"] and holders["sort_field_0_1"] ?
Summary
_format_conditionsto emitdata @> '{"field": value}'(JSONB containment) instead ofdata->'field' = 'value'::jsonbfor EQ filters on scalar data fields (str, int, float, bool, None)data @> '{"field": [values]}') instead of sub-expression containment (data->'field' @> '[values]'), removing the now-redundantjsonb_typeofguard@>uses superset matching for non-scalars)_format_sortingJSONB accessor expressions to match_format_conditionsformat (removes redundant parentheses)Why this matters
The
@>containment operator is the only JSONB operator that GIN indexes accelerate. Previously, EQ useddata->'field' = valueand CONTAINS useddata->'field' @> value— neither form can use a GIN index ondata. By rewriting both to top-leveldata @> '{"field": value}', a single GIN index accelerates all equality and array-contains queries across all collections and fields:Without this index, the query rewrites have zero performance impact — they're semantically equivalent to the old form. The index is intentionally not auto-created; it's documented as an optional optimization for large deployments.
What the GIN index accelerates:
?status=active→data @> '{"status": "active"}'?person.name=Alice→data @> '{"person": {"name": "Alice"}}'?contains_colors=red→data @> '{"colors": ["red"]}'What it does NOT accelerate:
min_,max_,gt_,lt_)contains_any_(uses&&array overlap)Test plan
🤖 Generated with Claude Code