Investigate and resolve metadata/content differences between local JAR and Tika server #13

## Background

During test suite fixes, two categories of difference were discovered between the local `tika-app-3.0.0.jar` and the `apache/tika:latest-full` Docker image used in integration tests. Both are currently papered over with workarounds in the test suite rather than properly understood and resolved.

---

## Issue 1 — Missing XMP metadata keys (PDF)

When parsing `testPDF_childAttachments.pdf`, the Tika server returns a set of XMP-namespaced metadata variants that the local JAR does not extract:

- `xmp:dc:creator`
- `xmp:dc:description` / `xmp:dc:description:x-default`
- `xmp:dc:title` / `xmp:dc:title:x-default`
- `dc:description:x-default` / `dc:title:x-default`
- `xmp:pdf:Producer`
- `xmpMM:InstanceID`

**Questions to answer:**
- Is this a version difference between the JAR and the server image? The `latest-full` tag is not pinned, so the server may be running a release other than 3.0.0.
- Is the server running additional XMP post-processing that we are not?
- Should `tikara` expose these fields, and if so, how?
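
One way to start on the version question is to compare the version embedded in the JAR filename with the string the server reports from its `/version` endpoint. The sketch below only covers the comparison; the helper names are hypothetical, and the banner format (`Apache Tika x.y.z`) is an assumption that should be checked against the actual server response.

```python
import re
from typing import Optional

# Matches the first dotted x.y.z version in a string.
_VERSION_RE = re.compile(r"(\d+\.\d+\.\d+)")


def extract_version(text: str) -> Optional[str]:
    """Return the first x.y.z version found in text, or None if there is none."""
    match = _VERSION_RE.search(text)
    return match.group(1) if match else None


def versions_match(jar_name: str, server_banner: str) -> bool:
    """True only if both strings carry a version and the versions are equal."""
    jar_version = extract_version(jar_name)
    server_version = extract_version(server_banner)
    return jar_version is not None and jar_version == server_version


# Illustrative inputs; the banner text is assumed, not captured from a real server.
print(versions_match("tika-app-3.0.0.jar", "Apache Tika 3.0.0"))
```

If the versions differ, pinning the Docker image to the JAR's release would be a cheap first experiment before looking at parser configuration.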

**Current workaround:** `PDF_XMP_SKIP_KEYS` in `test/test_parse.py` adds these keys to `expected_missing` for PDF test cases.
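
To make the effect of the skip list visible while investigating, the key comparison can be expressed as a small set difference. This is a minimal sketch, not the code in `test/test_parse.py`: `diff_metadata_keys` is a hypothetical helper, and the `PDF_XMP_SKIP_KEYS` value here is reconstructed from the list above rather than copied from the test suite.

```python
from typing import Dict, FrozenSet, List, Mapping

# Assumed contents, taken from the keys listed in this issue.
PDF_XMP_SKIP_KEYS: FrozenSet[str] = frozenset({
    "xmp:dc:creator",
    "xmp:dc:description",
    "xmp:dc:description:x-default",
    "xmp:dc:title",
    "xmp:dc:title:x-default",
    "dc:description:x-default",
    "dc:title:x-default",
    "xmp:pdf:Producer",
    "xmpMM:InstanceID",
})


def diff_metadata_keys(
    local: Mapping[str, object],
    server: Mapping[str, object],
    skip_keys: FrozenSet[str] = frozenset(),
) -> Dict[str, List[str]]:
    """Classify metadata keys that differ between local and server output."""
    local_keys = set(local)
    server_keys = set(server)
    missing = server_keys - local_keys  # present on the server, absent locally
    return {
        "unexpected_missing": sorted(missing - skip_keys),
        "skipped": sorted(missing & skip_keys),
        "local_only": sorted(local_keys - server_keys),
    }
```

Logging the `skipped` bucket per test file would show whether the gap is limited to the keys above or is wider than the skip list suggests.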

---

## Issue 2 — OCR content interleaved into main text stream

With Tesseract installed, the local Tika runs OCR on every embedded image in a document and interleaves the results (including garbage text and image filenames) into the main text stream at the positions where the images appear. The Tika server comparison test sends `X-Tika-Skip-Embedded: true`, so the server output contains only the clean body text.

This means the strict `assert tika_content in our_content` substring check no longer holds when OCR is active.
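
The failure mode can be reproduced with made-up strings: once OCR output lands between two paragraphs, the server text is no longer a contiguous substring of the local text, even though every server line is still present in order. The sketch below also shows one possible stricter replacement for the substring check, an ordered line-by-line containment test. The helper name and sample text are hypothetical, and the approach assumes OCR noise arrives on its own lines, which has not been verified.

```python
def lines_appear_in_order(expected: str, actual: str) -> bool:
    """True if every non-empty line of expected occurs in actual, in order."""
    actual_lines = iter(line.strip() for line in actual.splitlines())
    for want in (line.strip() for line in expected.splitlines()):
        if not want:
            continue
        # any() consumes the iterator up to and including the first match,
        # so later lines can only match further down the actual text.
        if not any(want == got for got in actual_lines):
            return False
    return True


# Invented sample data standing in for server and local output.
server_text = "Intro paragraph.\nClosing paragraph."
local_text = "Intro paragraph.\nimage1.png\nl1 |} ~~\nClosing paragraph."

print(server_text in local_text)                       # substring check
print(lines_appear_in_order(server_text, local_text))  # ordered-lines check
```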

**Questions to answer:**
- Should `tikara` expose a way to disable OCR / skip embedded documents in `parse()`?
- Should OCR output from embedded images be surfaced separately rather than inline?
- Is there a `ParseContext` / `TesseractOCRConfig` knob we should expose in the public API?

**Current workaround:** `test_parse_content_compare_with_tika_server` uses a `SequenceMatcher` similarity ratio (≥ 0.95) instead of a substring assertion.
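
For reference, the similarity check amounts to something like the following, assuming the test uses `difflib.SequenceMatcher` with default settings. The helper name, threshold constant, and sample strings are illustrative and not taken from the test suite.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.95  # the ratio quoted in the workaround


def similarity(a: str, b: str) -> float:
    """Return the SequenceMatcher ratio: 2 * matched chars / total chars."""
    return SequenceMatcher(None, a, b).ratio()


clean = "The quick brown fox jumps over the lazy dog. " * 4
lightly_noisy = clean + "img1.png"            # a little trailing OCR residue
heavily_noisy = "abcde" + "Z" * 10 + "fghij"  # noise dominating a short text

print(similarity(clean, lightly_noisy) >= SIMILARITY_THRESHOLD)
print(similarity("abcdefghij", heavily_noisy) >= SIMILARITY_THRESHOLD)
```

Because the ratio is relative to total length, a fixed threshold tolerates more absolute noise in long documents than in short ones, which is one reason the workaround is weaker than an exact assertion. `SequenceMatcher` also applies its `autojunk` heuristic by default when the second sequence is at least 200 items long, which can shift the ratio on real documents.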

---

## Acceptance Criteria

- [ ] Root cause of the XMP key differences is documented
- [ ] Root cause of the OCR interleaving behaviour is documented
- [ ] At least one of: (a) the library behaviour is changed to match the server, (b) the API is extended to control the behaviour, or (c) the differences are accepted as intentional and documented
- [ ] `PDF_XMP_SKIP_KEYS` and the `SequenceMatcher` workaround are replaced with proper assertions
