Commit dc7645f

Merge pull request #26 from bluedynamics/feature/async-tika-extraction
Add async Tika text extraction for binary content
2 parents 0edf24c + 248669a commit dc7645f

18 files changed

Lines changed: 1714 additions & 14 deletions

docs/sources/explanation/index.md

Lines changed: 1 addition & 0 deletions
@@ -13,6 +13,7 @@ titlesonly: true
1313
architecture
1414
why-postgresql
1515
fulltext-search
16+
tika-extraction
1617
performance
1718
security
1819
bm25-design
Lines changed: 218 additions & 0 deletions
@@ -0,0 +1,218 @@
1+
<!-- diataxis: explanation -->
2+
3+
# Tika Text Extraction Architecture
4+
5+
## The Problem
6+
7+
Plone's default search indexes text from text fields: Title,
8+
Description, and the HTML body of a Page or News Item. This works
9+
because Plone can read these fields directly, converting HTML to plain text where needed.
10+
11+
Binary files — PDFs, Word documents, spreadsheets, images — contain
12+
text that is locked inside proprietary or compressed formats. Plone
13+
cannot extract it natively. Without extraction, uploading a PDF titled
14+
"Q4 Financial Report" makes it findable by title, but the 50 pages of
15+
content inside the PDF are invisible to search.
16+
17+
Elasticsearch solves this with its Tika-based ingest attachment processor. plone.pgcatalog
18+
brings the same capability to PostgreSQL.
19+
20+
## Design Decisions
21+
22+
### Why Apache Tika?
23+
24+
Tika extracts text from over 1400 file formats via a single stateless
25+
HTTP API. It handles PDFs (including scanned ones via Tesseract OCR),
26+
Office documents, OpenDocument formats, images, and more. It is the
27+
same technology Elasticsearch uses internally.
28+
29+
### Why PostgreSQL as the Job Queue?
30+
31+
Redis or RabbitMQ would add operational complexity. Since plone.pgcatalog
32+
already depends on PostgreSQL, we use it as the queue too:
33+
34+
- **Transactional enqueue**: Jobs are inserted in the same transaction
35+
as the ZODB commit. If the transaction rolls back, the job disappears
36+
too. No orphaned jobs.
37+
- **LISTEN/NOTIFY**: PostgreSQL's built-in pub/sub wakes the worker
38+
instantly when a new job arrives. No polling delay.
39+
- **SKIP LOCKED**: Multiple workers can dequeue safely without
40+
contention. Each worker claims one job at a time; others skip locked
41+
rows.
42+
- **Visibility**: Queue state is queryable via standard SQL. No
43+
separate monitoring infrastructure needed.
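
The SKIP LOCKED claim described in the list above can be sketched as follows. This is a minimal illustration, not the project's actual query: the column names (`id`, `zoid`, `tid`, `status`, `attempts`), the `processing` status, and the `claim_one_job` helper are assumptions made for this example. Only the table name `text_extraction_queue` and the `pending` status come from this document. The helper accepts any DB-API style cursor, for example one obtained from psycopg.

```python
# Hypothetical claim query. Column names and the 'processing' status are
# assumptions for illustration; consult the real schema before reusing this.
CLAIM_SQL = """
UPDATE text_extraction_queue
   SET status = 'processing', attempts = attempts + 1
 WHERE id = (
        SELECT id
          FROM text_extraction_queue
         WHERE status = 'pending'
         ORDER BY id
         LIMIT 1
           FOR UPDATE SKIP LOCKED
       )
RETURNING id, zoid, tid
"""


def claim_one_job(cursor):
    """Claim at most one pending job; return its row, or None if none is free."""
    cursor.execute(CLAIM_SQL)
    return cursor.fetchone()
```

The inner `SELECT ... FOR UPDATE SKIP LOCKED` locks one pending row and silently passes over rows that another worker has already locked, so concurrent workers do not wait on each other.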
44+
45+
### Why Asynchronous?
46+
47+
Text extraction is slow — a large PDF can take seconds. Running it
48+
synchronously during `catalog_object()` would block the Zope request
49+
thread, making content saves unacceptably slow. The asynchronous
50+
approach keeps the synchronous path fast (Title/Description/body are
51+
indexed immediately) while extraction runs in the background.
52+
53+
### Why Not Store the Full Extracted Text?
54+
55+
The extracted text is not stored as a column. Instead, it is
56+
transformed into a tsvector (and optionally BM25 vectors) and merged
57+
into the existing `searchable_text` column. This is more space-efficient
58+
and matches how PostgreSQL full-text search works: the search engine
59+
operates on tsvectors, not raw text.
60+
61+
## Data Flow
62+
63+
```{mermaid}
64+
sequenceDiagram
65+
participant Plone as Plone (catalog_object)
66+
participant Proc as CatalogStateProcessor
67+
participant PG as PostgreSQL
68+
participant Worker as TikaWorker
69+
participant Tika as Apache Tika
70+
71+
Plone->>Proc: process(zoid, state)
72+
Note over Proc: Extract content_type<br/>from primary field
73+
Proc->>Proc: Accumulate candidate<br/>if extractable type
74+
Plone->>Proc: finalize(cursor)
75+
Proc->>PG: SELECT blob_state WHERE zoid IN (...)
76+
PG-->>Proc: rows with blob data
77+
Proc->>PG: INSERT INTO text_extraction_queue
78+
Note over PG: NOTIFY trigger fires
79+
80+
PG-->>Worker: NOTIFY text_extraction_ready
81+
Worker->>PG: UPDATE ... FOR UPDATE SKIP LOCKED<br/>RETURNING job
82+
Worker->>PG: SELECT data FROM blob_state
83+
PG-->>Worker: blob bytes
84+
Worker->>Tika: PUT /tika (blob bytes)
85+
Tika-->>Worker: extracted text
86+
Worker->>PG: SELECT pgcatalog_merge_extracted_text(zoid, text)
87+
Worker->>PG: UPDATE status = 'done'
88+
```
89+
90+
### Step-by-Step
91+
92+
1. **catalog_object()** extracts the object's MIME content type via
93+
`extract_content_type()` (tries `IPrimaryFieldInfo` first, then
94+
`content_type` attribute). The content type is included in the
95+
pending annotation.
96+
97+
2. **CatalogStateProcessor.process()** checks if `PGCATALOG_TIKA_URL`
98+
is set and the content type is in the extractable set. If so, the
99+
zoid is added to `self._tika_candidates`.
100+
101+
3. **CatalogStateProcessor.finalize()** runs in the same PostgreSQL
102+
transaction as the ZODB commit. It queries `blob_state` to find which
103+
candidates actually have blobs, then inserts jobs into
104+
`text_extraction_queue`. An `ON CONFLICT DO NOTHING` clause makes
105+
this idempotent.
106+
107+
4. The **NOTIFY trigger** on the queue table fires, sending a
108+
`text_extraction_ready` notification with the job ID.
109+
110+
5. The **TikaWorker** receives the notification (or wakes up on its
111+
poll interval). It dequeues one job using
112+
`UPDATE ... FOR UPDATE SKIP LOCKED RETURNING`, which atomically
113+
claims the job. Other workers skip this row.
114+
115+
6. The worker **fetches the blob** from `blob_state` (PG bytea) or S3
116+
(for S3-tiered blobs above the size threshold).
117+
118+
7. The worker sends the blob to **Tika** via `PUT /tika` with the
119+
content type header. Tika returns plain text.
120+
121+
8. The worker calls **`pgcatalog_merge_extracted_text(zoid, text)`**,
122+
a PL/pgSQL function that appends the extracted text to the
123+
existing `searchable_text` tsvector at weight `C`. When BM25 is
124+
active, the function also rebuilds BM25 vectors with the
125+
Title/Description/extracted text combined.
126+
127+
9. The job status is updated to `done`. On failure, the job returns
128+
to `pending` (up to `max_attempts` retries).
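
Steps 6 to 9 can be sketched as plain Python. This is a hedged outline, not the worker's real code: `build_tika_request`, `next_status`, `process_job`, the shape of the `job` dictionary, and the `failed` status are all assumptions for illustration. The standard library's `urllib.request` is used here only to keep the example dependency-free, whereas the document states that the actual worker uses `httpx`. The `pgcatalog_merge_extracted_text(zoid, text)` call and the `done` and `pending` statuses are taken from the steps above.

```python
import urllib.request

# The merge function name and argument order come from the text above.
MERGE_SQL = "SELECT pgcatalog_merge_extracted_text(%s, %s)"


def build_tika_request(tika_url, blob, content_type):
    """Build (but do not send) the PUT /tika request for one blob."""
    return urllib.request.Request(
        tika_url.rstrip("/") + "/tika",
        data=blob,
        method="PUT",
        headers={"Content-Type": content_type, "Accept": "text/plain"},
    )


def next_status(succeeded, attempts, max_attempts=3):
    """Decide the job status after an attempt ('failed' is an assumed name)."""
    if succeeded:
        return "done"
    return "pending" if attempts < max_attempts else "failed"


def process_job(cursor, job, fetch_blob, extract_text):
    """Run steps 6-8 for a claimed job; return the status to record in step 9.

    fetch_blob(zoid) and extract_text(blob, content_type) are injected so the
    blob source (PostgreSQL or S3) and the HTTP client stay out of this sketch.
    """
    max_attempts = job.get("max_attempts", 3)
    try:
        blob = fetch_blob(job["zoid"])
        text = extract_text(blob, job["content_type"])
        cursor.execute(MERGE_SQL, (job["zoid"], text))
    except Exception:
        # Any failure sends the job back for a retry until attempts run out.
        return next_status(False, job["attempts"], max_attempts)
    return next_status(True, job["attempts"], max_attempts)
```

Injecting the blob fetcher and the extractor keeps the retry decision testable without a database or a running Tika server.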
129+
130+
## Weight Hierarchy
131+
132+
The `searchable_text` tsvector uses PostgreSQL's four weight classes
133+
to rank content by importance:
134+
135+
| Weight | Content | BM25 Boost | Source |
136+
|--------|---------|-----------|--------|
137+
| **A** | Title | 3x (repeated 3 times) | Synchronous (catalog_object) |
138+
| **B** | Description | 1x | Synchronous (catalog_object) |
139+
| **C** | Extracted blob text | 1x | Asynchronous (Tika worker) |
140+
| **D** | Rich-text body | 1x | Synchronous (catalog_object) |
141+
142+
A search for "quantum computing" ranks a document with that phrase in
143+
the title higher than one where it only appears in an attached PDF.
144+
PostgreSQL's `ts_rank_cd()` applies these weights automatically; BM25
145+
scoring reflects the Title boost listed in the table.
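
To make the effect of the weight classes concrete, here is a toy calculation. It uses PostgreSQL's documented default weights for `ts_rank()` and `ts_rank_cd()` (0.1, 0.2, 0.4, 1.0 for D, C, B, A), but the scoring function itself is a deliberate simplification and not PostgreSQL's actual ranking formula, which also considers term frequency and proximity. `toy_score` is a hypothetical helper introduced only for this example.

```python
# PostgreSQL's documented default weights for ts_rank / ts_rank_cd.
DEFAULT_WEIGHTS = {"D": 0.1, "C": 0.2, "B": 0.4, "A": 1.0}


def toy_score(matched_labels, weights=DEFAULT_WEIGHTS):
    """Sum the weight of each matched lexeme (a toy model, not ts_rank_cd)."""
    return sum(weights[label] for label in matched_labels)


# "quantum computing": both terms found in the Title (weight A) ...
title_hit = toy_score(["A", "A"])
# ... versus both terms found only in extracted PDF text (weight C).
pdf_hit = toy_score(["C", "C"])
```

Under these defaults the title match scores five times higher than the match in extracted text. Custom weights can be passed to `ts_rank_cd()` as its first argument if a site needs a different balance.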
146+
147+
## Queue Table
148+
149+
The `text_extraction_queue` table is created when `PGCATALOG_TIKA_URL`
150+
is set. See {doc}`../reference/schema` for the full schema.
151+
152+
Key design choices:
153+
154+
- **UNIQUE(zoid, tid)**: Prevents duplicate jobs for the same object
155+
version.
156+
- **Partial index on `status = 'pending'`**: Makes dequeue queries
157+
fast regardless of how many completed jobs exist.
158+
- **NOTIFY trigger**: Fires on every INSERT, waking the worker
159+
instantly.
160+
- **attempts/max_attempts**: Built-in retry with configurable limit
161+
(default: 3). Failed jobs stay visible for debugging.
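
The interplay of the unique constraint, the partial index, and idempotent enqueueing can be tried out locally. The sketch below uses an in-memory SQLite database purely as a stand-in for PostgreSQL, because both support `UNIQUE` constraints, partial indexes, and `ON CONFLICT DO NOTHING`. The column names, index name, and the `enqueue` helper are assumptions for illustration and are not the project's real schema. The NOTIFY trigger has no SQLite equivalent and is omitted.

```python
import sqlite3

# SQLite stands in for PostgreSQL. Column and index names are assumptions.
SCHEMA = """
CREATE TABLE text_extraction_queue (
    id           INTEGER PRIMARY KEY,
    zoid         INTEGER NOT NULL,
    tid          INTEGER NOT NULL,
    status       TEXT    NOT NULL DEFAULT 'pending',
    attempts     INTEGER NOT NULL DEFAULT 0,
    max_attempts INTEGER NOT NULL DEFAULT 3,
    UNIQUE (zoid, tid)
);
-- Partial index: only pending rows are indexed, so it stays small.
CREATE INDEX queue_pending_idx
    ON text_extraction_queue (id) WHERE status = 'pending';
"""

ENQUEUE_SQL = """
INSERT INTO text_extraction_queue (zoid, tid) VALUES (?, ?)
ON CONFLICT DO NOTHING
"""


def enqueue(conn, zoid, tid):
    """Insert a job; a duplicate (zoid, tid) pair is silently ignored."""
    conn.execute(ENQUEUE_SQL, (zoid, tid))


def pending_count(conn):
    """Count jobs that are still waiting to be processed."""
    query = "SELECT COUNT(*) FROM text_extraction_queue WHERE status = 'pending'"
    return conn.execute(query).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
enqueue(conn, 42, 1000)  # new job
enqueue(conn, 42, 1000)  # same object version again: no second row
enqueue(conn, 42, 1001)  # a newer version of the same object: new job
```

After these three calls the queue holds two pending jobs, one per object version.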
162+
163+
## Worker Modes
164+
165+
### In-Process (Development)
166+
167+
When `PGCATALOG_TIKA_INPROCESS=true`, the worker runs as a daemon
168+
thread inside the Zope process. It opens its own PostgreSQL connection
169+
and HTTP client — it shares nothing with Zope's ZODB connections or
170+
transaction machinery.
171+
172+
The thread is marked `daemon=True`, meaning it dies automatically when
173+
the Zope process exits. No separate shutdown handling is needed.
174+
175+
This mode is convenient for development and small deployments. The
176+
trade-off is that extraction work competes with Zope for CPU and memory.
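
A daemon-thread worker of this kind can be sketched with the standard library alone. `start_inprocess_worker`, `run_once`, and `poll_interval` are hypothetical names for this example and do not describe the project's actual API. The stop event is optional here, since a daemon thread ends with the process, but it makes the loop easy to stop in tests.

```python
import threading


def start_inprocess_worker(run_once, poll_interval=5.0):
    """Start a daemon thread that calls run_once() until the stop event is set."""
    stop = threading.Event()

    def loop():
        while not stop.is_set():
            run_once()  # e.g. claim and process one job, if any
            stop.wait(poll_interval)  # sleep, but wake early when stopped

    thread = threading.Thread(target=loop, name="tika-worker", daemon=True)
    thread.start()
    return thread, stop
```

Because the thread is a daemon, the interpreter does not wait for it at exit, so a job that is mid-extraction at shutdown would simply be interrupted and retried later.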
177+
178+
### Standalone (Production)
179+
180+
The `pgcatalog-tika-worker` CLI runs as a separate process (or
181+
container). It depends only on `psycopg` and `httpx` — no Zope, no
182+
Plone, no ZODB. This makes it lightweight and easy to deploy.
183+
184+
Multiple workers can run concurrently. The `SKIP LOCKED` dequeue
185+
pattern ensures each job is claimed by only one worker at a time, even
186+
under concurrent load.
187+
188+
## Image Indexing
189+
190+
Tika integrates with Tesseract OCR, which can extract text from images
191+
(JPEG, PNG, TIFF, WebP, GIF). By default, plone.pgcatalog configures
192+
all common image types as extractable.
193+
194+
This means that after enabling Tika:
195+
196+
- A photo of a whiteboard becomes searchable by the text on the board
197+
- A scanned invoice becomes searchable by its content
198+
- An infographic becomes searchable by its labels and annotations
199+
200+
Plone does not make image blobs searchable by default (it has no
201+
extraction mechanism). With Tika, this happens automatically for all
202+
Image content types that have blobs.
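
The candidate check that decides whether an object is queued for extraction might look like the sketch below. The contents of `EXTRACTABLE_TYPES` and the `is_tika_candidate` helper are assumptions for illustration; the project's real default set may differ. Only the `PGCATALOG_TIKA_URL` variable and the image formats named in this section come from the document.

```python
import os

# Hypothetical default set; the project's real list may differ.
EXTRACTABLE_TYPES = frozenset({
    "application/pdf",
    "image/jpeg",
    "image/png",
    "image/tiff",
    "image/webp",
    "image/gif",
})


def is_tika_candidate(content_type, environ=os.environ):
    """Return True if Tika is configured and the MIME type is extractable."""
    if not environ.get("PGCATALOG_TIKA_URL"):
        return False  # Tika disabled: nothing is ever queued
    # Drop parameters such as "; charset=utf-8" and normalise case.
    base_type = (content_type or "").split(";", 1)[0].strip().lower()
    return base_type in EXTRACTABLE_TYPES
```

Passing `environ` explicitly keeps the check easy to exercise without touching the real process environment.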
203+
204+
## Interaction with Existing Search
205+
206+
Enabling Tika does not change how existing search works:
207+
208+
- **Title and Description** are still indexed synchronously during
209+
`catalog_object()`, with immediate availability.
210+
- **Rich-text body** (SearchableText from `portal_transforms`) is
211+
still indexed synchronously.
212+
- **Tika extraction** adds to the existing tsvector asynchronously.
213+
There is a brief window (seconds to minutes, depending on queue
214+
depth and Tika processing time) where the blob content is not yet
215+
searchable.
216+
217+
Sites that do not set `PGCATALOG_TIKA_URL` see no change in behavior,
218+
schema, or performance. The queue table is not even created.
