Commit 152bdf8

Merge pull request #4 from bluedynamics/feature/fulltext-search-v2
Language-aware full-text search + Title/Description fix
2 parents 6eb5180 + b28ee7d commit 152bdf8

12 files changed

Lines changed: 623 additions & 16 deletions

ARCHITECTURE.md

Lines changed: 38 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -41,6 +41,7 @@ Indexes are discovered from ZCatalog at startup, not hardcoded.
4141
4. `IndexRegistry.sync_from_catalog()` reads `catalog._catalog.indexes`
4242
5. Each index's `meta_type` is mapped via `META_TYPE_MAP` to an `IndexType` enum
4343
6. `_register_dri_translators()` discovers `DateRecurringIndex` instances and registers `IPGIndexTranslator` utilities
44+
7. `_ensure_text_indexes()` creates GIN expression indexes for any dynamically discovered `TEXT`-type indexes with `idx_key is not None` (Title, Description, addon ZCTextIndex fields)
4445

4546
### IndexType Enum
4647

@@ -215,7 +216,9 @@ _HANDLERS = {
215216

216217
**PathIndex**: `idx->>'path' LIKE '/plone/folder/%'` (subtree), `idx->>'path_parent' = '/plone/folder'` (children), navtree breadcrumb queries.
217218

218-
**TextIndex**: `searchable_text @@ plainto_tsquery('simple', %(text)s)` for SearchableText; other text indexes treated as field match.
219+
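**TextIndex (SearchableText)**: `searchable_text @@ plainto_tsquery(pgcatalog_lang_to_regconfig(%(lang)s)::regconfig, %(text)s)` -- language-aware stemming. At query time the configuration is derived from the query's `Language` filter; at write time, from the object's `Language` field. Falls back to `'simple'` when no language is set or the language is unmapped.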
220+
221+
**TextIndex (Title/Description/addon)**: `to_tsvector('simple'::regconfig, COALESCE(idx->>'Title', '')) @@ plainto_tsquery('simple'::regconfig, %(text)s)` -- word-level matching on idx JSONB values, backed by GIN expression indexes. Uses `'simple'` config (no stemming) because expression indexes require a fixed regconfig.
219222

220223
## Transactional Writes
221224

@@ -239,12 +242,46 @@ Includes:
239242

240243
- `ALTER TABLE object_state ADD COLUMN IF NOT EXISTS ...` for catalog columns (`path`, `idx`, `searchable_text`)
241244
- `pgcatalog_to_timestamptz()` immutable wrapper for expression indexes
245+
- `pgcatalog_lang_to_regconfig()` maps Plone language codes (ISO 639-1) to PG text search configurations (e.g. `'de'` → `'german'`). Used at both write time (`to_tsvector`) and query time (`plainto_tsquery`). Returns `'simple'` for NULL, empty, or unmapped languages.
242246
- GIN index on `idx` JSONB
243247
- B-tree expression indexes on `idx` JSONB for path queries (`path`, `path_parent`, `path_depth`)
244248
- B-tree expression indexes for common sort/filter fields (modified, created, effective, expires, sortable_title, portal_type, review_state, UID)
245249
- Full-text GIN index on `searchable_text`
250+
- GIN expression indexes for Title/Description tsvector matching (`to_tsvector('simple', COALESCE(idx->>'Title', ''))`)
251+
- Dynamic GIN expression indexes for addon ZCTextIndex fields (created at startup by `_ensure_text_indexes()`)
246252
- rrule_plpgsql schema and functions (for DateRecurringIndex)
247253

254+
## Full-Text Search
255+
256+
Three tiers of text search, each with different characteristics:
257+
258+
### SearchableText (Language-Aware)
259+
260+
Uses the dedicated `searchable_text` TSVECTOR column with per-object language stemming:
261+
262+
- **Write path**: `to_tsvector(pgcatalog_lang_to_regconfig(idx->>'Language')::regconfig, text)` -- language extracted from the object's `Language` field in idx JSONB
263+
- **Query path**: `searchable_text @@ plainto_tsquery(pgcatalog_lang_to_regconfig(%(lang)s)::regconfig, %(text)s)` -- language from the query's `Language` filter
264+
- **Index**: GIN on `searchable_text` column
265+
- **Stemming**: Yes, for the 30 supported languages (falls back to `'simple'` for unknown/empty)
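
The query path above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the package's code: `build_searchable_text_clause` and the fixed parameter names `text` and `lang` are hypothetical. The only behavior taken from this commit is that `Language` may be a plain string or a dict with a `query` key, and that a missing language is passed as an empty string so the SQL function can fall back to `'simple'`.

```python
def build_searchable_text_clause(query):
    """Return (sql, params) for a SearchableText query.

    Hypothetical helper for illustration; parameter names are fixed here,
    whereas real code would likely generate unique ones per clause.
    """
    sql = (
        "searchable_text @@ plainto_tsquery("
        "pgcatalog_lang_to_regconfig(%(lang)s)::regconfig, %(text)s)"
    )
    lang = query.get("Language")
    # ZCatalog-style queries may wrap the value: {"query": "de"}
    if isinstance(lang, dict):
        lang = lang.get("query", "")
    params = {
        "text": str(query["SearchableText"]),
        # An empty string lets the SQL function fall back to 'simple'
        "lang": str(lang) if lang else "",
    }
    return sql, params


sql, params = build_searchable_text_clause(
    {"SearchableText": "Katzen", "Language": "de"}
)
print(params)
```

Passing the raw language code and mapping it inside SQL keeps a single mapping function on the database side for both the write path and the query path.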
266+
267+
### Title / Description (Word-Level)
268+
269+
Uses tsvector expression matching on idx JSONB values:
270+
271+
- **Write path**: Values stored as plain text in `idx->>'Title'` / `idx->>'Description'`
272+
- **Query path**: `to_tsvector('simple', COALESCE(idx->>'Title', '')) @@ plainto_tsquery('simple', %(text)s)`
273+
- **Index**: GIN expression indexes (pre-created in DDL)
274+
- **Stemming**: No (`'simple'` config) -- expression indexes require a fixed regconfig. Language-aware stemmed search for titles is available via SearchableText (which includes title text).
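
To illustrate what word-level matching with the `'simple'` config means in practice, here is a rough pure-Python approximation. It is only a sketch: `simple_tokens` and `matches_simple` are hypothetical names, and PostgreSQL's real text search parser recognizes many more token types (URLs, hyphenated words, numbers) than the `\w+` regex assumed here. It shows the two properties described above: every query word must be present, and there is no stemming.

```python
import re


def simple_tokens(text):
    """Lowercased word tokens; a rough stand-in for to_tsvector('simple', ...)."""
    return set(re.findall(r"\w+", (text or "").lower()))


def matches_simple(value, query):
    """True if every query word occurs in value (AND semantics, no stemming)."""
    wanted = simple_tokens(query)
    return bool(wanted) and wanted <= simple_tokens(value)


print(matches_simple("The Quick Brown Fox", "quick fox"))  # True
print(matches_simple("The Quick Brown Foxes", "fox"))      # False: no stemming
```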
275+
276+
### Addon ZCTextIndex Fields
277+
278+
Any addon that registers a ZCTextIndex in ZCatalog (via `catalog.xml`) is automatically supported:
279+
280+
1. `sync_from_catalog()` discovers the index → registered as `(IndexType.TEXT, idx_key, source_attrs)`
281+
2. `_ensure_text_indexes()` creates a GIN expression index at startup: `to_tsvector('simple', COALESCE(idx->>'{idx_key}', ''))`
282+
3. Value extracted into idx JSONB during indexing (idx_key is not None)
283+
4. `_handle_text()` generates tsvector expression matching -- zero addon code needed
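
Step 2 can be sketched as a pure function that builds the DDL string. The index name pattern and the statement text follow `_ensure_text_indexes()` in this commit; `build_text_index_ddl` is a hypothetical name, and the regex check is an assumed stand-in for the project's `validate_identifier()`, whose implementation is not shown in this diff.

```python
import re

# Assumed rule for a safe key; the real validate_identifier() may differ.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_text_index_ddl(idx_key):
    """Return (index_name, ddl) for one TEXT index key.

    Hypothetical helper for illustration only.
    """
    if not _IDENT.match(idx_key):
        raise ValueError(f"unsafe index key: {idx_key!r}")
    idx_name = f"idx_os_cat_{idx_key.lower()}_tsv"
    ddl = (
        f"CREATE INDEX IF NOT EXISTS {idx_name} "
        "ON object_state USING gin ("
        "to_tsvector('simple'::regconfig, "
        f"COALESCE(idx->>'{idx_key}', ''))) "
        "WHERE idx IS NOT NULL"
    )
    return idx_name, ddl


name, ddl = build_text_index_ddl("Title")
print(name)
print(ddl)
```

The key is interpolated into the statement rather than bound as a parameter, so validating it first matters. The indexed expression also has to match the expression used in the query for the planner to consider the index.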
284+
248285
## Query Optimizations
249286

250287
1. **orjson**: Registered as psycopg's JSONB deserializer for faster JSON parsing

CHANGES.md

Lines changed: 29 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,34 @@
11
# Changelog
22

3+
## 1.0.0b4
4+
5+
### Added
6+
7+
- **Language-aware full-text search**: SearchableText now uses per-object
8+
language for stemming. The `pgcatalog_lang_to_regconfig()` PL/pgSQL function
9+
maps Plone language codes (ISO 639-1, 30 languages) to PostgreSQL text search
10+
configurations (e.g. `"de"` → `german`). Falls back to `'simple'` for
11+
unmapped or missing languages. Non-multilingual sites are unaffected.
12+
13+
Python mirror: `columns.language_to_regconfig()` for testing/validation.
14+
15+
- **Title/Description text search**: Title and Description queries now use
16+
tsvector word-level matching instead of exact JSONB containment.
17+
`catalog(Title="Hello")` now correctly matches `"Hello World"`.
18+
Backed by GIN expression indexes with `'simple'` config (no stemming).
19+
20+
- **Automatic addon ZCTextIndex support**: Addon-registered ZCTextIndex fields
21+
are automatically discovered at startup. GIN expression indexes are created
22+
dynamically by `_ensure_text_indexes()`, and queries use tsvector matching --
23+
zero addon code needed.
24+
25+
### Fixed
26+
27+
- **Title/Description query broken**: Previously, querying Title or Description
28+
as ZCTextIndex used JSONB exact containment (`idx @> '{"Title":"Hello"}'`),
29+
which only matched exact values, not words within text. Now uses
30+
`to_tsvector`/`plainto_tsquery` for proper word-level matching.
31+
332
## 1.0.0b3
433

534
### Fixed

README.md

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -11,7 +11,7 @@ Requires [zodb-pgjsonb](https://github.com/bluedynamics/zodb-pgjsonb) as the ZOD
1111
- **Extensible** via `IPGIndexTranslator` named utilities for custom index types
1212
- **Dynamic index discovery** from ZCatalog at startup -- addons adding indexes via `catalog.xml` just work
1313
- **Transactional writes** -- catalog data written atomically alongside object state during ZODB commit
14-
- **Full-text search** via PostgreSQL `tsvector`/`tsquery`
14+
- **Full-text search** via PostgreSQL `tsvector`/`tsquery` -- language-aware stemming for SearchableText (30 languages), word-level matching for Title/Description/addon ZCTextIndex fields
1515
- **Zero ZODB cache pressure** -- no BTree/Bucket objects stored in ZODB
1616
- **Container-friendly** -- works on standard `postgres:17` Docker images, no extensions required
1717

@@ -51,6 +51,8 @@ Once installed, `portal_catalog` is replaced with `PlonePGCatalogTool`. All cata
5151
results = catalog(portal_type="Document", review_state="published")
5252
results = catalog(Subject={"query": ["Python", "Plone"], "operator": "or"})
5353
results = catalog(SearchableText="my search term")
54+
results = catalog(SearchableText="Katzen", Language="de") # language-aware stemming
55+
results = catalog(Title="quick fox") # word-level match (finds "The Quick Brown Fox")
5456
results = catalog(path={"query": "/plone/folder", "depth": 1})
5557

5658
# Recurring events (DateRecurringIndex)

src/plone/pgcatalog/columns.py

Lines changed: 55 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -269,6 +269,61 @@ def _convert_zope_datetime(dt):
269269
return dt.ISO8601()
270270

271271

272+
# --------------------------------------------------------------------------
273+
# Language → PG text search configuration mapping
274+
# --------------------------------------------------------------------------
275+
276+
# Plone language code (ISO 639-1) → PostgreSQL regconfig name.
277+
# Mirrors the SQL function pgcatalog_lang_to_regconfig() in schema.py.
278+
_LANG_TO_REGCONFIG = {
279+
"ar": "arabic",
280+
"hy": "armenian",
281+
"eu": "basque",
282+
"ca": "catalan",
283+
"da": "danish",
284+
"nl": "dutch",
285+
"en": "english",
286+
"fi": "finnish",
287+
"fr": "french",
288+
"de": "german",
289+
"el": "greek",
290+
"hi": "hindi",
291+
"hu": "hungarian",
292+
"id": "indonesian",
293+
"ga": "irish",
294+
"it": "italian",
295+
"lt": "lithuanian",
296+
"ne": "nepali",
297+
"nb": "norwegian",
298+
"nn": "norwegian",
299+
"no": "norwegian",
300+
"pt": "portuguese",
301+
"ro": "romanian",
302+
"ru": "russian",
303+
"sr": "serbian",
304+
"es": "spanish",
305+
"sv": "swedish",
306+
"ta": "tamil",
307+
"tr": "turkish",
308+
"yi": "yiddish",
309+
}
310+
311+
312+
def language_to_regconfig(lang):
313+
"""Map a Plone language code to a PG text search configuration name.
314+
315+
Args:
316+
lang: Plone language code (e.g. "de", "en-us") or None/""
317+
318+
Returns:
319+
PG regconfig name (e.g. "german", "english") or "simple"
320+
"""
321+
if not lang:
322+
return "simple"
323+
base = lang.lower().split("-")[0].split("_")[0]
324+
return _LANG_TO_REGCONFIG.get(base, "simple")
325+
326+
272327
# --------------------------------------------------------------------------
273328
# Path utilities
274329
# --------------------------------------------------------------------------

src/plone/pgcatalog/config.py

Lines changed: 56 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -310,7 +310,9 @@ def get_extra_columns(self):
310310
ExtraColumn("idx", "%(idx)s"),
311311
ExtraColumn(
312312
"searchable_text",
313-
"to_tsvector('simple'::regconfig, %(searchable_text)s)",
313+
"to_tsvector("
314+
"pgcatalog_lang_to_regconfig(%(idx)s::jsonb->>'Language')"
315+
"::regconfig, %(searchable_text)s)",
314316
),
315317
]
316318

@@ -324,9 +326,16 @@ def get_schema_sql(self):
324326
from plone.pgcatalog.schema import CATALOG_COLUMNS
325327
from plone.pgcatalog.schema import CATALOG_FUNCTIONS
326328
from plone.pgcatalog.schema import CATALOG_INDEXES
329+
from plone.pgcatalog.schema import CATALOG_LANG_FUNCTION
327330
from plone.pgcatalog.schema import RRULE_FUNCTIONS
328331

329-
return CATALOG_COLUMNS + CATALOG_FUNCTIONS + CATALOG_INDEXES + RRULE_FUNCTIONS
332+
return (
333+
CATALOG_COLUMNS
334+
+ CATALOG_FUNCTIONS
335+
+ CATALOG_LANG_FUNCTION
336+
+ CATALOG_INDEXES
337+
+ RRULE_FUNCTIONS
338+
)
330339

331340
def process(self, zoid, class_mod, class_name, state):
332341
# Look up pending data from the thread-local store (set by
@@ -393,10 +402,55 @@ def register_catalog_processor(event):
393402
storage.register_state_processor(processor)
394403
log.info("Registered CatalogStateProcessor on %s", storage)
395404
_sync_registry_from_db(db)
405+
_ensure_text_indexes(storage)
396406
else:
397407
log.debug("Storage %s does not support state processors", storage)
398408

399409

410+
def _ensure_text_indexes(storage):
411+
"""Create GIN expression indexes for dynamically discovered TEXT indexes.
412+
413+
For each TEXT-type index with idx_key != None (not SearchableText),
414+
creates a GIN expression index on to_tsvector('simple', idx->>'{key}')
415+
if it doesn't already exist. Uses an autocommit connection to avoid
416+
REPEATABLE READ lock conflicts.
417+
"""
418+
from plone.pgcatalog.columns import get_registry
419+
from plone.pgcatalog.columns import IndexType
420+
from plone.pgcatalog.columns import validate_identifier
421+
422+
registry = get_registry()
423+
text_indexes = [
424+
(name, idx_key)
425+
for name, (idx_type, idx_key, _) in registry.items()
426+
if idx_type == IndexType.TEXT and idx_key is not None
427+
]
428+
if not text_indexes:
429+
return
430+
431+
dsn = getattr(storage, "_dsn", None)
432+
if not dsn:
433+
return
434+
435+
import psycopg
436+
437+
try:
438+
with psycopg.connect(dsn, autocommit=True) as conn:
439+
for name, idx_key in text_indexes:
440+
validate_identifier(idx_key)
441+
idx_name = f"idx_os_cat_{idx_key.lower()}_tsv"
442+
conn.execute(
443+
f"CREATE INDEX IF NOT EXISTS {idx_name} "
444+
f"ON object_state USING gin ("
445+
f"to_tsvector('simple'::regconfig, "
446+
f"COALESCE(idx->>'{idx_key}', ''))) "
447+
f"WHERE idx IS NOT NULL"
448+
)
449+
log.info("Ensured GIN text index %s for %s", idx_name, name)
450+
except Exception:
451+
log.warning("Failed to create text expression indexes", exc_info=True)
452+
453+
400454
def _register_dri_translators(catalog):
401455
"""Discover DateRecurringIndex instances and register IPGIndexTranslator utilities.
402456

src/plone/pgcatalog/query.py

Lines changed: 24 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -171,6 +171,9 @@ def result(self):
171171
}
172172

173173
def process(self, query_dict):
174+
# Store full query dict for cross-index lookups (e.g. Language)
175+
self._query = query_dict
176+
174177
# Always filter for cataloged objects
175178
self.clauses.append("idx IS NOT NULL")
176179

@@ -388,19 +391,34 @@ def _handle_text(self, name, idx_key, spec):
388391
return
389392

390393
if idx_key is None:
391-
# SearchableText → tsvector full-text search
394+
# SearchableText → dedicated tsvector column, language-aware.
395+
# Uses pgcatalog_lang_to_regconfig() SQL function to map
396+
# Plone language codes to PG regconfig names at query time.
392397
p_text = self._pname("text")
393398
p_lang = self._pname("lang")
394399
self.clauses.append(
395-
f"searchable_text @@ plainto_tsquery(%({p_lang})s::regconfig, %({p_text})s)"
400+
f"searchable_text @@ plainto_tsquery("
401+
f"pgcatalog_lang_to_regconfig(%({p_lang})s)::regconfig, "
402+
f"%({p_text})s)"
396403
)
397404
self.params[p_text] = str(query_val)
398-
self.params[p_lang] = "simple"
405+
# Extract Language from the query dict if present
406+
lang_val = self._query.get("Language")
407+
if isinstance(lang_val, dict):
408+
lang_val = lang_val.get("query", "")
409+
self.params[p_lang] = str(lang_val) if lang_val else ""
399410
else:
400-
# Title / Description — treat as field exact match
411+
# Title / Description / addon ZCTextIndex →
412+
# tsvector expression on idx JSONB, 'simple' config.
413+
# Expression matches the GIN index created in schema.py /
414+
# _ensure_text_indexes() for index-backed queries.
401415
p = self._pname(name)
402-
self.clauses.append(f"idx @> %({p})s::jsonb")
403-
self.params[p] = Json({idx_key: query_val})
416+
self.clauses.append(
417+
f"to_tsvector('simple'::regconfig, "
418+
f"COALESCE(idx->>'{idx_key}', '')) "
419+
f"@@ plainto_tsquery('simple'::regconfig, %({p})s)"
420+
)
421+
self.params[p] = str(query_val)
404422

405423
# -- ExtendedPathIndex --------------------------------------------------
406424
