Background
During test suite fixes, two categories of difference were discovered between the local tika-app-3.0.0.jar and the apache/tika:latest-full Docker image used in integration tests. Both are currently papered over with workarounds in the test suite rather than properly understood and resolved.
Issue 1 — Missing XMP metadata keys (PDF)
When parsing testPDF_childAttachments.pdf, the Tika server returns a set of XMP-namespaced metadata variants that the local JAR does not extract:
xmp:dc:creator
xmp:dc:description / xmp:dc:description:x-default
xmp:dc:title / xmp:dc:title:x-default
dc:description:x-default / dc:title:x-default
xmp:pdf:Producer
xmpMM:InstanceID
Questions to answer:
- Is this a version difference between the JAR and the server image?
- Is the server running additional XMP post-processing that we are not?
- Should tikara expose these fields, and if so, how?
Current workaround: PDF_XMP_SKIP_KEYS in test/test_parse.py adds these keys to expected_missing for PDF test cases.
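A minimal sketch of how such a skip list can be applied when comparing metadata key sets. The key names are taken from the list above. The helper unexpected_missing_keys and its parameters are hypothetical and only illustrate the idea; the actual code in test/test_parse.py may be structured differently.

```python
# Keys copied from the list in this issue. The real contents of
# PDF_XMP_SKIP_KEYS in test/test_parse.py are assumed, not confirmed.
PDF_XMP_SKIP_KEYS = frozenset({
    "xmp:dc:creator",
    "xmp:dc:description",
    "xmp:dc:description:x-default",
    "xmp:dc:title",
    "xmp:dc:title:x-default",
    "dc:description:x-default",
    "dc:title:x-default",
    "xmp:pdf:Producer",
    "xmpMM:InstanceID",
})


def unexpected_missing_keys(server_metadata, local_metadata,
                            expected_missing=PDF_XMP_SKIP_KEYS):
    """Hypothetical helper: server keys that are absent locally
    and are not covered by the skip list."""
    missing = set(server_metadata) - set(local_metadata)
    return missing - set(expected_missing)


# Usage: only keys outside the skip list are reported.
server = {"dc:title": "t", "xmp:dc:title": "t", "pdf:PDFVersion": "1.4"}
local = {"dc:title": "t"}
print(unexpected_missing_keys(server, local))
```

Resolving this issue would mean shrinking or removing the skip list so that a key missing locally fails the test again.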
Issue 2 — OCR content interleaved into main text stream
With Tesseract installed, the local Tika OCR-processes every embedded image in a document and interleaves the results (including garbage text and image filenames) into the main text stream at the positions where images appear. The Tika server comparison test uses X-Tika-Skip-Embedded: true, so the server output is clean text only.
This means the strict assert tika_content in our_content substring check no longer holds when OCR is active.
Questions to answer:
- Should tikara expose a way to disable OCR / skip embedded documents in parse()?
- Should OCR output from embedded images be surfaced separately rather than inline?
- Is there a ParseContext / TesseractOCRConfig knob we should expose in the public API?
Current workaround: test_parse_content_compare_with_tika_server uses SequenceMatcher similarity ratio (≥ 0.95) instead of a substring assertion.
Acceptance Criteria
- PDF_XMP_SKIP_KEYS and the SequenceMatcher workaround are replaced with proper assertions