Fix: propagate StaticContainers ThreadLocal context for hybrid page workers #266

Open
SergioChan wants to merge 1 commit into opendataloader-project:main from SergioChan:fix/threadlocal-context-hybrid-264

@SergioChan

Issue resolved by this Pull Request:
Resolves #264

Summary of changes:

  • Add StaticContainersThreadContext utility to capture/apply verapdf StaticContainers ThreadLocal-backed values on worker threads.
  • Update HybridDocumentProcessor.processJavaPath to process pages via parallelStream() and re-apply captured context per worker before table/text processing.
  • Add regression test StaticContainersThreadContextTest covering worker-thread application before TextLineProcessor.processTextLines.
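The capture/apply idea described in the summary can be sketched roughly as follows. This is a minimal illustration and not the PR's actual `StaticContainersThreadContext`: the class `ThreadContextSketch`, its two `ThreadLocal` fields, and the helper `documentSeenByWorker` are hypothetical stand-ins, and the real verapdf `StaticContainers` holds different types.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical stand-in for ThreadLocal-backed static state plus a capture/apply helper. */
final class ThreadContextSketch {
    // Assumed fields for illustration only; not the real verapdf StaticContainers fields.
    private static final ThreadLocal<String> DOCUMENT = new ThreadLocal<>();
    private static final ThreadLocal<Boolean> KEEP_LINE_BREAKS = new ThreadLocal<>();

    static void setDocument(String document) { DOCUMENT.set(document); }
    static void setKeepLineBreaks(Boolean value) { KEEP_LINE_BREAKS.set(value); }

    /** Immutable copy of the calling thread's values. */
    static final class Snapshot {
        final String document;
        final Boolean keepLineBreaks;

        Snapshot(String document, Boolean keepLineBreaks) {
            this.document = document;
            this.keepLineBreaks = keepLineBreaks;
        }
    }

    /** Reads the current thread's values into a snapshot. */
    static Snapshot capture() {
        return new Snapshot(DOCUMENT.get(), KEEP_LINE_BREAKS.get());
    }

    /** Writes the snapshot's values into the current thread's ThreadLocals. */
    static void apply(Snapshot snapshot) {
        DOCUMENT.set(snapshot.document);
        KEEP_LINE_BREAKS.set(snapshot.keepLineBreaks);
    }

    /** Returns the document value a fresh worker thread observes, with or without apply(). */
    static String documentSeenByWorker(Snapshot snapshot, boolean applyFirst) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            return pool.submit(() -> {
                if (applyFirst) {
                    apply(snapshot);
                }
                return DOCUMENT.get();
            }).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        setDocument("doc-A");
        setKeepLineBreaks(Boolean.TRUE);
        Snapshot snapshot = capture();
        System.out.println(documentSeenByWorker(snapshot, false)); // worker has no inherited value
        System.out.println(documentSeenByWorker(snapshot, true));  // worker sees the captured value
    }
}
```

The point of the sketch is only that a plain `ThreadLocal` is not inherited by pool threads, so the snapshot has to be applied inside the worker's task before any code reads the state.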

Validation:

  • Attempted: mvn -pl opendataloader-pdf-core -Dtest=StaticContainersThreadContextTest,TextLineProcessorTest test
  • Result: build cannot run in this environment because local JDK does not support project-required release target (error: release version 11 not supported).

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

Copilot AI review requested due to automatic review settings March 6, 2026 09:30

Copilot AI left a comment

Pull request overview

This PR fixes verapdf StaticContainers ThreadLocal state being uninitialized on worker threads, which allows per-page parallel processing in the hybrid Java pipeline without NullPointerExceptions.

Changes:

  • Added StaticContainersThreadContext utility to capture/apply verapdf StaticContainers ThreadLocal state on worker threads.
  • Updated HybridDocumentProcessor.processJavaPath to process pages using parallelStream() and apply captured thread context per worker.
  • Added StaticContainersThreadContextTest to validate worker-thread application before running TextLineProcessor.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/StaticContainersThreadContext.java | Introduces capture/apply helper for verapdf StaticContainers ThreadLocal-backed state. |
| java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/HybridDocumentProcessor.java | Switches Java-path per-page processing to parallelStream() and reapplies captured thread context in workers. |
| java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/StaticContainersThreadContextTest.java | Adds regression coverage for worker-thread context application before TextLineProcessor. |

Comments suppressed due to low confidence (1)

java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/HybridDocumentProcessor.java:281

  • processJavaPath now runs per-page processing in parallelStream(), but TableBorderProcessor.processTableBorders() mutates StaticContainers.getTableBordersCollection() (e.g., removeTableBorder(border, pageNumber)), which becomes a shared TableBordersCollection instance across worker threads after StaticContainersThreadContext.apply(...). This introduces a real data race / nondeterministic behavior and potential concurrent modification exceptions. Consider either keeping the table-border stage sequential, or giving each worker a thread-local copy (or page-scoped view) of TableBordersCollection, or synchronizing mutations so the collection isn't modified concurrently.
        // Process each page through the standard Java pipeline.
        // StaticContainers uses ThreadLocal fields, so worker threads must inherit context explicitly.
        StaticContainersThreadContext.Snapshot threadContext = StaticContainersThreadContext.capture();
        pageNumbers.parallelStream().forEach(pageNumber -> {
            try {
                StaticContainersThreadContext.apply(threadContext);
                List<IObject> pageContents = workingContents.get(pageNumber);
                pageContents = TableBorderProcessor.processTableBorders(pageContents, pageNumber);
                pageContents = pageContents.stream()
                    .filter(x -> !(x instanceof LineChunk))
                    .collect(Collectors.toList());
                pageContents = TextLineProcessor.processTextLines(pageContents);
                pageContents = SpecialTableProcessor.detectSpecialTables(pageContents);
                workingContents.set(pageNumber, pageContents);
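One of the options named in the suppressed comment, keeping the mutating stage sequential and parallelizing only the rest, might look like the sketch below. Everything here is hypothetical: `SharedBorders` stands in for a shared, non-thread-safe collection, the string results stand in for page contents, and whether the later stages of the real pipeline are free of shared mutation is an assumption this source does not confirm.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class TwoPhasePageProcessing {
    /** Hypothetical shared, non-thread-safe registry that one stage mutates. */
    static final class SharedBorders {
        private final List<Integer> borders = new ArrayList<>();

        SharedBorders(int pages) {
            for (int i = 0; i < pages; i++) {
                borders.add(i);
            }
        }

        void remove(int page) { borders.remove(Integer.valueOf(page)); }
        int size() { return borders.size(); }
    }

    static Map<Integer, String> process(List<Integer> pageNumbers, SharedBorders shared) {
        // Phase 1 (sequential): the only stage that mutates shared state.
        for (int page : pageNumbers) {
            shared.remove(page);
        }
        // Phase 2 (parallel): stages assumed to touch only per-page data.
        Map<Integer, String> results = new ConcurrentHashMap<>();
        pageNumbers.parallelStream()
            .forEach(page -> results.put(page, "page-" + page + "-processed"));
        return results;
    }

    public static void main(String[] args) {
        List<Integer> pages = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            pages.add(i);
        }
        SharedBorders shared = new SharedBorders(100);
        Map<Integer, String> results = process(pages, shared);
        System.out.println(results.size() + " pages processed, " + shared.size() + " borders left");
    }
}
```

The trade-off is that the sequential phase limits the speedup, but no worker ever mutates the shared collection concurrently.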

Comment on lines +39 to +53
if (snapshot.document != null) {
    StaticContainers.setDocument(snapshot.document);
}
if (snapshot.tableBordersCollection != null) {
    StaticContainers.setTableBordersCollection(snapshot.tableBordersCollection);
}
if (snapshot.keepLineBreaks != null) {
    StaticContainers.setKeepLineBreaks(snapshot.keepLineBreaks);
}
if (snapshot.isDataLoader != null) {
    StaticContainers.setIsDataLoader(snapshot.isDataLoader);
}
if (snapshot.isIgnoreCharactersWithoutUnicode != null) {
    StaticContainers.setIsIgnoreCharactersWithoutUnicode(snapshot.isIgnoreCharactersWithoutUnicode);
}

Copilot AI Mar 6, 2026

StaticContainersThreadContext.apply(...) only sets ThreadLocal-backed values when the captured value is non-null. On pooled worker threads (e.g., ForkJoinPool.commonPool() used by parallelStream()), this can leave stale ThreadLocal values from a previous document run when the current snapshot has null (notably tableBordersCollection, which can legitimately be null and is checked elsewhere). Apply should overwrite/clear ThreadLocal values to exactly match the snapshot (including setting null / removing) to avoid cross-document contamination.

Suggested change
- if (snapshot.document != null) {
-     StaticContainers.setDocument(snapshot.document);
- }
- if (snapshot.tableBordersCollection != null) {
-     StaticContainers.setTableBordersCollection(snapshot.tableBordersCollection);
- }
- if (snapshot.keepLineBreaks != null) {
-     StaticContainers.setKeepLineBreaks(snapshot.keepLineBreaks);
- }
- if (snapshot.isDataLoader != null) {
-     StaticContainers.setIsDataLoader(snapshot.isDataLoader);
- }
- if (snapshot.isIgnoreCharactersWithoutUnicode != null) {
-     StaticContainers.setIsIgnoreCharactersWithoutUnicode(snapshot.isIgnoreCharactersWithoutUnicode);
- }
+ // Always reapply the full snapshot so that ThreadLocal-backed values
+ // are reset/cleared to exactly match the captured state, even when null.
+ StaticContainers.setDocument(snapshot.document);
+ StaticContainers.setTableBordersCollection(snapshot.tableBordersCollection);
+ StaticContainers.setKeepLineBreaks(snapshot.keepLineBreaks);
+ StaticContainers.setIsDataLoader(snapshot.isDataLoader);
+ StaticContainers.setIsIgnoreCharactersWithoutUnicode(snapshot.isIgnoreCharactersWithoutUnicode);
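The stale-value scenario this comment describes can be reproduced in isolation. The sketch below is an illustration under assumptions, not the PR's code: `StaleThreadLocalDemo`, its single `TABLE_BORDERS` field, and both apply variants are hypothetical, and a single-thread executor is used so that both tasks are guaranteed to run on the same pooled thread.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class StaleThreadLocalDemo {
    // Hypothetical ThreadLocal standing in for one StaticContainers-style field.
    private static final ThreadLocal<String> TABLE_BORDERS = new ThreadLocal<>();

    /** Mirrors the null-guarded style: a null snapshot value leaves the old value in place. */
    static void applyIfNonNull(String value) {
        if (value != null) {
            TABLE_BORDERS.set(value);
        }
    }

    /** Always makes the thread's value match the snapshot, clearing it when the snapshot is null. */
    static void applyAlways(String value) {
        if (value == null) {
            TABLE_BORDERS.remove();
        } else {
            TABLE_BORDERS.set(value);
        }
    }

    /** Runs two "documents" on one pooled thread and returns what the second run observes. */
    static String secondRunSees(boolean overwrite) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            // First document leaves a value behind on the pooled thread.
            pool.submit(() -> TABLE_BORDERS.set("borders-from-document-A")).get();
            // Second document's snapshot has null for this field.
            return pool.submit(() -> {
                if (overwrite) {
                    applyAlways(null);
                } else {
                    applyIfNonNull(null);
                }
                return TABLE_BORDERS.get();
            }).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(secondRunSees(false)); // stale value from the first document
        System.out.println(secondRunSees(true));  // cleared to match the null snapshot
    }
}
```

With the null-guarded variant the second run still sees the first document's value, which is the cross-document contamination the comment warns about.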

Comment on lines +35 to +40
try (ForkJoinPool pool = new ForkJoinPool(1)) {
    List<IObject> processed = pool.submit(() -> {
        StaticContainersThreadContext.apply(snapshot);
        return TextLineProcessor.processTextLines(contents);
    }).get();


Copilot AI Mar 6, 2026

The new regression test validates the positive case (applying the snapshot avoids NPE), but it doesn't assert the pre-fix failure mode (that running TextLineProcessor.processTextLines(...) on a worker thread without applying the context throws). Adding that negative assertion would make the test stronger and ensure it actually guards against regressions where the context propagation call is accidentally removed.
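A negative assertion of the kind suggested here could be shaped like the following sketch. It does not use the project's classes: `NegativeCaseSketch`, its `CONTEXT` field, and `process` are hypothetical, and the assumption that the unfixed code path fails with a NullPointerException is taken from the pull request overview rather than verified against `TextLineProcessor`.

```java
import java.util.Objects;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class NegativeCaseSketch {
    // Hypothetical ThreadLocal context that the processor depends on.
    private static final ThreadLocal<String> CONTEXT = new ThreadLocal<>();

    static void applyContext(String value) { CONTEXT.set(value); }

    /** Hypothetical processor: throws NullPointerException when no context is set on this thread. */
    static int process(String text) {
        String context = Objects.requireNonNull(CONTEXT.get(), "context not applied on this thread");
        return context.length() + text.length();
    }

    /** Runs process() on a fresh worker; returns the failure cause, or null on success. */
    static Throwable failureOnWorker(boolean applyFirst) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            pool.submit(() -> {
                if (applyFirst) {
                    applyContext("ctx");
                }
                return process("line");
            }).get();
            return null;
        } catch (ExecutionException e) {
            return e.getCause(); // exception thrown inside the worker task
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(failureOnWorker(false)); // failure cause when context is missing
        System.out.println(failureOnWorker(true));  // null: no failure once context is applied
    }
}
```

In a JUnit 5 test the same check would typically be written with `assertThrows` around the `Future.get()` call, which surfaces an `ExecutionException` wrapping the worker's exception.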

Development

Successfully merging this pull request may close these issues.

bug: verapdf StaticContainers uses ThreadLocal — blocks per-page parallel processing

2 participants