Fix propagate StaticContainers ThreadLocal context for hybrid page workers#266
Conversation
There was a problem hiding this comment.
Pull request overview
This PR addresses verapdf StaticContainers ThreadLocal initialization not being present on worker threads, enabling per-page parallel processing in the hybrid Java pipeline without NullPointerExceptions.
Changes:
- Added
StaticContainersThreadContextutility to capture/apply verapdfStaticContainersThreadLocalstate on worker threads. - Updated
HybridDocumentProcessor.processJavaPathto process pages usingparallelStream()and apply captured thread context per worker. - Added
StaticContainersThreadContextTestto validate worker-thread application before runningTextLineProcessor.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/StaticContainersThreadContext.java |
Introduces capture/apply helper for verapdf StaticContainers ThreadLocal-backed state. |
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/HybridDocumentProcessor.java |
Switches Java-path per-page processing to parallelStream() and reapplies captured thread context in workers. |
java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/StaticContainersThreadContextTest.java |
Adds regression coverage for worker-thread context application before TextLineProcessor. |
Comments suppressed due to low confidence (1)
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/HybridDocumentProcessor.java:281
processJavaPathnow runs per-page processing inparallelStream(), butTableBorderProcessor.processTableBorders()mutatesStaticContainers.getTableBordersCollection()(e.g.,removeTableBorder(border, pageNumber)), which becomes a sharedTableBordersCollectioninstance across worker threads afterStaticContainersThreadContext.apply(...). This introduces a real data race / nondeterministic behavior and potential concurrent modification exceptions. Consider either keeping the table-border stage sequential, or giving each worker a thread-local copy (or page-scoped view) ofTableBordersCollection, or synchronizing mutations so the collection isn't modified concurrently.
// Process each page through the standard Java pipeline.
// StaticContainers uses ThreadLocal fields, so worker threads must inherit context explicitly.
StaticContainersThreadContext.Snapshot threadContext = StaticContainersThreadContext.capture();
pageNumbers.parallelStream().forEach(pageNumber -> {
try {
StaticContainersThreadContext.apply(threadContext);
List<IObject> pageContents = workingContents.get(pageNumber);
pageContents = TableBorderProcessor.processTableBorders(pageContents, pageNumber);
pageContents = pageContents.stream()
.filter(x -> !(x instanceof LineChunk))
.collect(Collectors.toList());
pageContents = TextLineProcessor.processTextLines(pageContents);
pageContents = SpecialTableProcessor.detectSpecialTables(pageContents);
workingContents.set(pageNumber, pageContents);
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if (snapshot.document != null) { | ||
| StaticContainers.setDocument(snapshot.document); | ||
| } | ||
| if (snapshot.tableBordersCollection != null) { | ||
| StaticContainers.setTableBordersCollection(snapshot.tableBordersCollection); | ||
| } | ||
| if (snapshot.keepLineBreaks != null) { | ||
| StaticContainers.setKeepLineBreaks(snapshot.keepLineBreaks); | ||
| } | ||
| if (snapshot.isDataLoader != null) { | ||
| StaticContainers.setIsDataLoader(snapshot.isDataLoader); | ||
| } | ||
| if (snapshot.isIgnoreCharactersWithoutUnicode != null) { | ||
| StaticContainers.setIsIgnoreCharactersWithoutUnicode(snapshot.isIgnoreCharactersWithoutUnicode); | ||
| } |
There was a problem hiding this comment.
StaticContainersThreadContext.apply(...) only sets ThreadLocal-backed values when the captured value is non-null. On pooled worker threads (e.g., ForkJoinPool.commonPool() used by parallelStream()), this can leave stale ThreadLocal values from a previous document run when the current snapshot has null (notably tableBordersCollection, which can legitimately be null and is checked elsewhere). Apply should overwrite/clear ThreadLocal values to exactly match the snapshot (including setting null / removing) to avoid cross-document contamination.
| if (snapshot.document != null) { | |
| StaticContainers.setDocument(snapshot.document); | |
| } | |
| if (snapshot.tableBordersCollection != null) { | |
| StaticContainers.setTableBordersCollection(snapshot.tableBordersCollection); | |
| } | |
| if (snapshot.keepLineBreaks != null) { | |
| StaticContainers.setKeepLineBreaks(snapshot.keepLineBreaks); | |
| } | |
| if (snapshot.isDataLoader != null) { | |
| StaticContainers.setIsDataLoader(snapshot.isDataLoader); | |
| } | |
| if (snapshot.isIgnoreCharactersWithoutUnicode != null) { | |
| StaticContainers.setIsIgnoreCharactersWithoutUnicode(snapshot.isIgnoreCharactersWithoutUnicode); | |
| } | |
| // Always reapply the full snapshot so that ThreadLocal-backed values | |
| // are reset/cleared to exactly match the captured state, even when null. | |
| StaticContainers.setDocument(snapshot.document); | |
| StaticContainers.setTableBordersCollection(snapshot.tableBordersCollection); | |
| StaticContainers.setKeepLineBreaks(snapshot.keepLineBreaks); | |
| StaticContainers.setIsDataLoader(snapshot.isDataLoader); | |
| StaticContainers.setIsIgnoreCharactersWithoutUnicode(snapshot.isIgnoreCharactersWithoutUnicode); |
| try (ForkJoinPool pool = new ForkJoinPool(1)) { | ||
| List<IObject> processed = pool.submit(() -> { | ||
| StaticContainersThreadContext.apply(snapshot); | ||
| return TextLineProcessor.processTextLines(contents); | ||
| }).get(); | ||
|
|
There was a problem hiding this comment.
The new regression test validates the positive case (applying the snapshot avoids NPE), but it doesn't assert the pre-fix failure mode (that running TextLineProcessor.processTextLines(...) on a worker thread without applying the context throws). Adding that negative assertion would make the test stronger and ensure it actually guards against regressions where the context propagation call is accidentally removed.
Issue resolved by this Pull Request:
Resolves #264
Summary of changes:
StaticContainersThreadContextutility to capture/apply verapdfStaticContainersThreadLocal-backed values on worker threads.HybridDocumentProcessor.processJavaPathto process pages viaparallelStream()and re-apply captured context per worker before table/text processing.StaticContainersThreadContextTestcovering worker-thread application beforeTextLineProcessor.processTextLines.Validation:
mvn -pl opendataloader-pdf-core -Dtest=StaticContainersThreadContextTest,TextLineProcessorTest testerror: release version 11 not supported).Checklist: