Summary
MarkItDown is intended to ship with the DKG node so document conversion works out of the box. During manual validation of chat attachments for PR #446 / issue #256, a PDF attachment reached the agent correctly but extraction was reported as skipped because the running WSL node had no registered PDF converter.
This is separate from attachment delivery: the file was imported, stored, and forwarded to the agent with metadata. The missing piece is that the MarkItDown binary was not present/registered in the running node, so application/pdf was not listed as an available extraction pipeline.
Observed Behavior
In a WSL checkout running the current branch:
ls -lah packages/cli/bin || true
command -v markitdown || true
curl -s http://127.0.0.1:9200/.well-known/skill.md | grep -i "Available extraction pipelines"
Output:
ls: cannot access 'packages/cli/bin': No such file or directory
- **Available extraction pipelines:** text/markdown
Attaching a PDF in the Node UI local-agent chat produced agent-visible metadata, but no extraction:
File: Invoice-8E6460E4-0046.pdf
Content type: application/pdf
Extraction status: skipped
Pipeline used: none
Triple count: 0
File hash: keccak256:e6c74d6210ee31e7aded39d754310b03b190ca725bfe9ff793e11ff481c1954b
Assertion: did:dkg:context-graph:file-attachment/assertion/.../Invoice-8E6460E4-0046.pdf
A manual attempt to build the converter locally then failed because the host did not have Python venv support installed:
pnpm --filter @origintrail-official/dkg run markitdown:build
Output:
> node ./scripts/bundle-markitdown-binaries.mjs --build-current-platform
MarkItDown bundle: Command failed: python3 -m venv /tmp/dkg-markitdown-build-.../venv
ERR_PNPM_RECURSIVE_RUN_FIRST_FAIL
Why This Matters
The value proposition of bundling MarkItDown is that operators and users should not need to install Python packages, python3-venv, PyInstaller, or MarkItDown manually before the node can process common document types such as PDF, DOCX, PPTX, XLSX, CSV, HTML, XML, and EPUB.
If the node starts with only text/markdown available, then PDF attachments/imports gracefully degrade to blob storage plus metadata. That is technically safe, but surprising for users because PDF support appears to be part of the shipped node capability.
Current Code Path / Likely Cause
Relevant code observed in the repo:
packages/cli/package.json runs this on install:
"postinstall": "node ./scripts/bundle-markitdown-binaries.mjs --quiet --current-platform --best-effort"
packages/cli/scripts/bundle-markitdown-binaries.mjs skips implicit release-asset download in workspace checkouts unless --all or --build-current-platform is used:
if (workspace && !opts.all && !opts.buildCurrentPlatform) {
  log('MarkItDown bundle: workspace checkout detected; skipping implicit release-asset download.');
  return;
}
- The daemon only registers the converter if isMarkItDownAvailable() succeeds at startup:
const extractionRegistry = new ExtractionPipelineRegistry();
if (isMarkItDownAvailable()) {
  extractionRegistry.register(new MarkItDownConverter());
}
isMarkItDownAvailable() requires either:
- a verified bundled binary and sidecars in packages/cli/bin, or
- a markitdown executable on PATH.
In this WSL/dev scenario, neither was present, so the daemon correctly exposed only text/markdown.
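The two-way availability rule above can be sketched as a pure decision function. This is a minimal sketch, not the repository's implementation: `describeMarkItDownAvailability` and its input fields `bundledBinaryVerified` and `onPath` are hypothetical names, the order in which the two sources are checked is an assumption, and the real isMarkItDownAvailable() may check more than this.

```javascript
// Hypothetical sketch of the availability rule; all names here are illustrative only.
function describeMarkItDownAvailability({ bundledBinaryVerified, onPath }) {
  // Assumption: a verified bundled binary is preferred over a PATH executable.
  if (bundledBinaryVerified) {
    return { available: true, source: 'bundled', reason: null };
  }
  if (onPath) {
    return { available: true, source: 'path', reason: null };
  }
  // Neither source is present: report why, instead of failing silently.
  return {
    available: false,
    source: null,
    reason:
      'no verified bundled binary in packages/cli/bin and no markitdown executable on PATH',
  };
}

// The WSL/dev scenario from this report: neither source is present.
const wslCheckout = describeMarkItDownAvailability({
  bundledBinaryVerified: false,
  onPath: false,
});
console.log(wslCheckout.available, wslCheckout.reason);
```

Returning a reason alongside the boolean is one way the daemon could surface why the converter was not registered.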
Expected Behavior
For supported platforms, a freshly installed/runnable DKG node should have MarkItDown available without requiring users to install Python tooling or manually build the converter.
At minimum, after normal install/build/start steps:
curl -s http://127.0.0.1:9200/.well-known/skill.md | grep -i "Available extraction pipelines"
should include application/pdf and the other MarkItDown-backed content types.
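The same check could be automated rather than eyeballed with grep. The following is a rough sketch under stated assumptions: `listExtractionPipelines` is a hypothetical helper, the line format is taken from the observed output in this report, and comma-separated content types on that line are an assumption because only a single-value example was observed.

```javascript
// Extract the content types from the "Available extraction pipelines" line of skill.md.
// Assumes the line format seen in the observed output; comma separation is an assumption.
function listExtractionPipelines(skillMarkdown) {
  const line = skillMarkdown
    .split('\n')
    .find((l) => /available extraction pipelines/i.test(l));
  if (!line) return [];
  // Drop everything up to the label's colon and any bold markers, keeping the value part.
  const value = line.replace(/^.*?pipelines:\**\s*/i, '');
  return value
    .split(',')
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// The line observed on the affected WSL node.
const observedLine = '- **Available extraction pipelines:** text/markdown';
const observedPipelines = listExtractionPipelines(observedLine);
console.log(observedPipelines.includes('application/pdf')); // false on the affected node
```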
A PDF import or chat attachment should produce:
- extraction.status = "completed"
- pipelineUsed = "application/pdf"
- an mdIntermediateHash
- dkg:markdownForm pointing to the markdown intermediate
- deterministic structural triples/provenance when the converted Markdown contains extractable structure
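These expectations could be captured as a small check over an import result. This is only a sketch: `checkPdfImportResult` is a hypothetical helper, and the object shape used here (`extraction.status`, `pipelineUsed`, `mdIntermediateHash` as plain fields) is an assumption based on the field names in this report, not a confirmed API response format. The dkg:markdownForm and triple expectations are not covered.

```javascript
// Return a list of problems; an empty list means the result matches the expectations above.
// The result shape is an assumption derived from the field names used in this report.
function checkPdfImportResult(result) {
  const problems = [];
  const status = result && result.extraction ? result.extraction.status : undefined;
  if (status !== 'completed') {
    problems.push(`extraction.status is ${JSON.stringify(status)}, expected "completed"`);
  }
  if (!result || result.pipelineUsed !== 'application/pdf') {
    problems.push('pipelineUsed is not "application/pdf"');
  }
  if (!result || typeof result.mdIntermediateHash !== 'string' || result.mdIntermediateHash.length === 0) {
    problems.push('mdIntermediateHash is missing');
  }
  return problems;
}

// Roughly what the affected node reported (status skipped, no pipeline); null is an assumed encoding of "none".
const skippedResult = { extraction: { status: 'skipped' }, pipelineUsed: null };
console.log(checkPdfImportResult(skippedResult));
```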
Suggested Fix Directions
- Ensure release/install artifacts actually include the verified platform MarkItDown binary plus .sha256 and .meta.json sidecars.
- Decide whether workspace/dev checkouts should download the current-platform release asset by default, build from source, or present a clear one-command setup path that does not silently leave PDF extraction unavailable.
- Improve daemon/operator visibility when MarkItDown is missing. For example, status/UI could explicitly say: "PDF/DOCX extraction unavailable: MarkItDown binary missing or failed validation." Currently the symptom is only that extraction pipelines omit those types and imports are skipped.
- Improve the markitdown:build failure message for common Linux/WSL prerequisites, especially missing python3-venv.
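As an illustration of the last point, a build script could map a known failure to an actionable hint. This is a sketch only: `hintForBuildFailure` is a hypothetical helper, not something in bundle-markitdown-binaries.mjs, and matching on the failed command text is an assumption about how the error would be detected.

```javascript
// Map a markitdown:build failure message to an operator-facing hint, or null if unrecognized.
// Hypothetical helper; matching on the command text is an assumption.
function hintForBuildFailure(errorMessage) {
  if (/python3 -m venv/.test(errorMessage)) {
    return (
      'Python venv support appears to be missing. On Debian/Ubuntu (including WSL), ' +
      'install the python3-venv package and rerun markitdown:build.'
    );
  }
  return null;
}

// The failure line observed in this report.
const observedFailure =
  'MarkItDown bundle: Command failed: python3 -m venv /tmp/dkg-markitdown-build-.../venv';
console.log(hintForBuildFailure(observedFailure));
```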
Acceptance Criteria
- /.well-known/skill.md lists application/pdf without manual Python/venv setup for normal installs.
- /api/assertion/import-file completes and returns pipelineUsed: "application/pdf" plus mdIntermediateHash.
- [...] skipped imports.
Related