
Investigate and resolve metadata/content differences between local JAR and Tika server #13

@baughmann


Background

During test suite fixes, two categories of difference were discovered between the local tika-app-3.0.0.jar and the apache/tika:latest-full Docker image used in integration tests. Both are currently papered over with workarounds in the test suite rather than properly understood and resolved.


Issue 1 — Missing XMP metadata keys (PDF)

When parsing testPDF_childAttachments.pdf, the Tika server returns a set of XMP-namespaced metadata variants that the local JAR does not extract:

  • xmp:dc:creator
  • xmp:dc:description / xmp:dc:description:x-default
  • xmp:dc:title / xmp:dc:title:x-default
  • dc:description:x-default / dc:title:x-default
  • xmp:pdf:Producer
  • xmpMM:InstanceID

Questions to answer:

  • Is this a version difference between the JAR and the server image?
  • Is the server running additional XMP post-processing that we are not?
  • Should tikara expose these fields, and if so, how?

Current workaround: PDF_XMP_SKIP_KEYS in test/test_parse.py adds these keys to expected_missing for PDF test cases.
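A minimal sketch of the kind of key comparison this workaround implies, not the actual code in test/test_parse.py. The helper name unexpected_missing_keys and the sample metadata dicts are hypothetical, and the contents of PDF_XMP_SKIP_KEYS are reconstructed from the list in this issue rather than copied from the test suite.

```python
# Keys the server returns but the local JAR does not, per the list in this issue.
# Assumption: the real PDF_XMP_SKIP_KEYS may differ from this reconstruction.
PDF_XMP_SKIP_KEYS = {
    "xmp:dc:creator",
    "xmp:dc:description",
    "xmp:dc:description:x-default",
    "xmp:dc:title",
    "xmp:dc:title:x-default",
    "dc:description:x-default",
    "dc:title:x-default",
    "xmp:pdf:Producer",
    "xmpMM:InstanceID",
}


def unexpected_missing_keys(server_meta, local_meta, expected_missing):
    """Return server keys that are absent locally and not explicitly tolerated."""
    return set(server_meta) - set(local_meta) - set(expected_missing)


# Hypothetical metadata, for illustration only.
server_meta = {"dc:title": "t", "xmp:dc:title": "t", "xmpMM:InstanceID": "placeholder"}
local_meta = {"dc:title": "t"}

# Without the skip list, the XMP variants show up as unexplained gaps.
print(unexpected_missing_keys(server_meta, local_meta, set()))
# With the skip list, the same gaps are silently tolerated.
print(unexpected_missing_keys(server_meta, local_meta, PDF_XMP_SKIP_KEYS))
```

Because the skip list suppresses every listed key unconditionally, it would also hide a regression in which the local JAR stops extracting a key it currently does not extract for a different reason, which is part of why a root cause is preferable.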


Issue 2 — OCR content interleaved into main text stream

With Tesseract installed, the local Tika runs OCR on every embedded image in a document and interleaves the results (including garbage text and image filenames) into the main text stream at the positions where the images appear. The Tika server comparison test sends X-Tika-Skip-Embedded: true, so the server output is clean text only.
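For reference, the server side of that comparison could look roughly like the request below. This is a sketch under assumptions: the helper name build_tika_text_request is hypothetical, the base URL assumes a Tika server on its default port 9998, and only the X-Tika-Skip-Embedded header is taken from this issue. The request is built but not sent.

```python
from urllib.request import Request


def build_tika_text_request(data, base_url="http://localhost:9998"):
    """Build (but do not send) a PUT /tika request asking for plain text only."""
    return Request(
        f"{base_url}/tika",  # assumption: server reachable at base_url
        data=data,
        method="PUT",
        headers={
            "Accept": "text/plain",
            # Header named in this issue; tells the server to skip embedded documents.
            "X-Tika-Skip-Embedded": "true",
        },
    )


req = build_tika_text_request(b"placeholder document bytes")
print(req.get_method(), req.full_url)
```

The local parse() call has no equivalent switch today, which is why the two outputs diverge as soon as embedded images are present.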

This means the strict assert tika_content in our_content substring check no longer holds when OCR is active.

Questions to answer:

  • Should tikara expose a way to disable OCR / skip embedded documents in parse()?
  • Should OCR output from embedded images be surfaced separately rather than inline?
  • Is there a ParseContext / TesseractOCRConfig knob we should expose in the public API?

Current workaround: test_parse_content_compare_with_tika_server uses SequenceMatcher similarity ratio (≥ 0.95) instead of a substring assertion.
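The difference between the two checks can be illustrated with a small sketch. The sample strings and the similarity helper are hypothetical, and autojunk=False is an assumption about how the ratio might be computed; the actual test may call SequenceMatcher differently.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio in [0, 1]; autojunk disabled so long texts are not skewed."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


# Hypothetical server output: clean text only.
tika_content = "Quarterly report\nRevenue grew in the third quarter.\n"

# Hypothetical local output: OCR residue interleaved where an image appeared.
our_content = (
    "Quarterly report\n"
    "image1.png\n"
    "l1 | ~ e\n"
    "Revenue grew in the third quarter.\n"
)

# The strict substring check fails because the server text is no longer contiguous.
print(tika_content in our_content)

# The ratio check tolerates the inserted residue up to a chosen threshold.
print(similarity(tika_content, our_content))
```

A ratio threshold keeps the test green, but it no longer distinguishes OCR noise from genuinely missing or corrupted text, so it is weaker than the original assertion.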


Acceptance Criteria

  • Root cause of the XMP key differences is documented
  • Root cause of the OCR interleaving behaviour is documented
  • At least one of: (a) the library behaviour is changed to match the server, (b) the API is extended to control the behaviour, or (c) the differences are accepted as intentional and documented
  • PDF_XMP_SKIP_KEYS and the SequenceMatcher workaround are replaced with proper assertions

Labels: bug, enhancement