MarkItDown converter is not available out of the box in workspace/dev node installs #467

@Jurij89

Summary

MarkItDown is intended to ship with the DKG node so document conversion works out of the box. During manual validation of chat attachments for PR #446 / issue #256, a PDF attachment reached the agent correctly but extraction was reported as skipped because the running WSL node had no registered PDF converter.

This is separate from attachment delivery: the file was imported, stored, and forwarded to the agent with metadata. The missing piece is that the MarkItDown binary was not present/registered in the running node, so application/pdf was not listed as an available extraction pipeline.

Observed Behavior

In a WSL checkout running the current branch:

ls -lah packages/cli/bin || true
command -v markitdown || true
curl -s http://127.0.0.1:9200/.well-known/skill.md | grep -i "Available extraction pipelines"

Output:

ls: cannot access 'packages/cli/bin': No such file or directory
- **Available extraction pipelines:** text/markdown

Attaching a PDF in the Node UI local-agent chat produced agent-visible metadata, but no extraction:

File: Invoice-8E6460E4-0046.pdf
Content type: application/pdf
Extraction status: skipped
Pipeline used: none
Triple count: 0
File hash: keccak256:e6c74d6210ee31e7aded39d754310b03b190ca725bfe9ff793e11ff481c1954b
Assertion: did:dkg:context-graph:file-attachment/assertion/.../Invoice-8E6460E4-0046.pdf

A manual attempt to build the converter locally then failed because the host did not have Python venv support installed:

pnpm --filter @origintrail-official/dkg run markitdown:build

Output:

> node ./scripts/bundle-markitdown-binaries.mjs --build-current-platform

MarkItDown bundle: Command failed: python3 -m venv /tmp/dkg-markitdown-build-.../venv
ERR_PNPM_RECURSIVE_RUN_FIRST_FAIL

Why This Matters

The value proposition of bundling MarkItDown is that operators and users should not need to install Python packages, python3-venv, PyInstaller, or MarkItDown manually before the node can process common document types such as PDF, DOCX, PPTX, XLSX, CSV, HTML, XML, and EPUB.

If the node starts with only text/markdown available, then PDF attachments/imports gracefully degrade to blob storage plus metadata. That is technically safe, but surprising for users because PDF support appears to be part of the shipped node capability.

Current Code Path / Likely Cause

Relevant code observed in the repo:

  • packages/cli/package.json runs this on install:
        "postinstall": "node ./scripts/bundle-markitdown-binaries.mjs --quiet --current-platform --best-effort"
  • packages/cli/scripts/bundle-markitdown-binaries.mjs skips implicit release-asset download in workspace checkouts unless --all or --build-current-platform is used:
        if (workspace && !opts.all && !opts.buildCurrentPlatform) {
          log('MarkItDown bundle: workspace checkout detected; skipping implicit release-asset download.');
          return;
        }
  • The daemon only registers the converter if isMarkItDownAvailable() succeeds at startup:
        const extractionRegistry = new ExtractionPipelineRegistry();
        if (isMarkItDownAvailable()) {
          extractionRegistry.register(new MarkItDownConverter());
        }
  • isMarkItDownAvailable() requires either:
    • a verified bundled binary and sidecars in packages/cli/bin, or
    • a markitdown executable on PATH.

In this WSL/dev scenario, neither was present, so the daemon correctly exposed only text/markdown.
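
To make the two-way check above easier to reason about, here is a minimal sketch of the decision as a pure function. This is not the actual isMarkItDownAvailable() implementation: resolveMarkItDown and the probe fields (binaryExists, sha256Matches, metaExists, onPath) are hypothetical names introduced for illustration, and the real check may validate more than this.

```javascript
// Hypothetical sketch; none of these names come from the DKG codebase.
// `probe` stands in for filesystem/PATH lookups so the logic reads in isolation.
function resolveMarkItDown(probe) {
  const { binaryExists, sha256Matches, metaExists, onPath } = probe;

  // Option 1: a bundled binary counts only if it and its sidecars check out.
  if (binaryExists && sha256Matches && metaExists) {
    return { available: true, source: 'bundled', reasons: [] };
  }

  // Option 2: fall back to a markitdown executable found on PATH.
  if (onPath) {
    return { available: true, source: 'path', reasons: [] };
  }

  // Neither option worked: collect reasons an operator could act on.
  const reasons = [];
  if (!binaryExists) {
    reasons.push('no bundled binary in packages/cli/bin');
  } else {
    if (!sha256Matches) reasons.push('bundled binary failed .sha256 validation');
    if (!metaExists) reasons.push('bundled binary is missing its .meta.json sidecar');
  }
  reasons.push('no markitdown executable on PATH');
  return { available: false, source: null, reasons };
}
```

Returning the reasons alongside the boolean would let the daemon log why the converter was not registered instead of silently skipping it.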

Expected Behavior

For supported platforms, a freshly installed/runnable DKG node should have MarkItDown available without requiring users to install Python tooling or manually build the converter.

At minimum, after normal install/build/start steps:

curl -s http://127.0.0.1:9200/.well-known/skill.md | grep -i "Available extraction pipelines"

should include application/pdf and the other MarkItDown-backed content types.
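
This check could be scripted rather than read off the grep output. The sketch below is illustrative only: parseExtractionPipelines and hasPdfPipeline are hypothetical helpers, and they assume the line format seen in the observed output, with multiple content types separated by commas (the separator is a guess, since only a single type was observed).

```javascript
// Hypothetical helpers, not part of the DKG codebase.
// Assumes skill.md contains a line such as
//   - **Available extraction pipelines:** text/markdown, application/pdf
// The comma separator is an assumption; only one type appears in the issue.
function parseExtractionPipelines(skillMd) {
  const match = /Available extraction pipelines:\**\s*(.*)$/im.exec(skillMd);
  if (!match) return [];
  return match[1]
    .split(',')
    .map((type) => type.trim())
    .filter((type) => type.length > 0);
}

function hasPdfPipeline(skillMd) {
  return parseExtractionPipelines(skillMd).includes('application/pdf');
}
```

On Node 18 or later, the input could come from the built-in fetch against http://127.0.0.1:9200/.well-known/skill.md, with the response body passed to hasPdfPipeline.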

A PDF import or chat attachment should produce:

  • extraction.status = "completed"
  • pipelineUsed = "application/pdf"
  • an mdIntermediateHash
  • dkg:markdownForm pointing to the markdown intermediate
  • deterministic structural triples/provenance when the converted Markdown contains extractable structure
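
The first three expectations above can be expressed as a small checker. This is a sketch under assumptions: checkPdfExtraction is a hypothetical helper, and the flat result shape (status, pipelineUsed, mdIntermediateHash as sibling fields) is chosen for illustration; the real import response may nest these differently.

```javascript
// Hypothetical checker; the flat `result` shape is an assumption.
// Returns a list of human-readable problems (empty when all checks pass).
function checkPdfExtraction(result) {
  const problems = [];
  if (result.status !== 'completed') {
    problems.push(`status is ${JSON.stringify(result.status)}, expected "completed"`);
  }
  if (result.pipelineUsed !== 'application/pdf') {
    problems.push(`pipelineUsed is ${JSON.stringify(result.pipelineUsed)}, expected "application/pdf"`);
  }
  if (typeof result.mdIntermediateHash !== 'string' || result.mdIntermediateHash.length === 0) {
    problems.push('mdIntermediateHash is missing');
  }
  return problems;
}
```

Fed the values from the observed behavior (status skipped, pipeline none, no intermediate hash), it reports all three problems.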

Suggested Fix Directions

  1. Ensure release/install artifacts actually include the verified platform MarkItDown binary plus .sha256 and .meta.json sidecars.
  2. Decide whether workspace/dev checkouts should download the current-platform release asset by default, build from source, or present a clear one-command setup path that does not silently leave PDF extraction unavailable.
  3. Improve daemon/operator visibility when MarkItDown is missing. For example, status/UI could explicitly say: "PDF/DOCX extraction unavailable: MarkItDown binary missing or failed validation." Currently the only symptom is that the listed extraction pipelines omit those types and imports are skipped.
  4. Improve the markitdown:build failure message for common Linux/WSL prerequisites, especially missing python3-venv.
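
For direction 4, one possible shape is a helper that maps the raw failure to an actionable hint. The sketch below is only a suggestion: explainBuildFailure is a hypothetical name, the wording of the hint is invented here, and the apt command assumes a Debian/Ubuntu host (which covers the default WSL distributions).

```javascript
// Hypothetical helper: turns a failed build command into an actionable hint.
// Returns null when the failure is not one it recognizes.
function explainBuildFailure(errorMessage) {
  // Matches failures like the one observed in this issue:
  //   Command failed: python3 -m venv /tmp/dkg-markitdown-build-.../venv
  if (/python3? -m venv/.test(errorMessage)) {
    return [
      'MarkItDown bundle: could not create a Python virtual environment.',
      'On Debian/Ubuntu (including WSL) the venv module is packaged separately:',
      '  sudo apt install python3-venv',
      'Then re-run: pnpm --filter @origintrail-official/dkg run markitdown:build',
    ].join('\n');
  }
  return null;
}
```

The build script could print this hint after the original error, keeping the raw message for debugging.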

Acceptance Criteria

  • On an installed node on a supported platform, MarkItDown-backed extraction types are available out of the box.
  • /.well-known/skill.md lists application/pdf without manual Python/venv setup for normal installs.
  • PDF import through /api/assertion/import-file completes and returns pipelineUsed: "application/pdf" plus mdIntermediateHash.
  • PDF attachment through Node UI local-agent chat produces completed import metadata and agent-visible document context.
  • If MarkItDown is intentionally unavailable in a dev/workspace environment, the node surfaces an actionable diagnostic rather than making users infer the missing converter from skipped imports.
  • Tests or release checks cover the presence/validation of bundled MarkItDown assets for supported platforms.
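
As a starting point for the last criterion, a release check could compare a directory listing against the expected binary and sidecar names. This is a sketch under assumptions: missingBundleAssets is a hypothetical helper, and it assumes sidecars are named by appending .sha256 and .meta.json to the binary file name, which the issue does not confirm. It checks presence only, not hash validity.

```javascript
// Hypothetical release check; sidecar naming (binary + suffix) is an assumption.
// fileNames: names found in the bundle directory.
// binaryNames: expected binary names, one per supported platform.
function missingBundleAssets(fileNames, binaryNames) {
  const present = new Set(fileNames);
  const missing = [];
  for (const binary of binaryNames) {
    for (const name of [binary, `${binary}.sha256`, `${binary}.meta.json`]) {
      if (!present.has(name)) missing.push(name);
    }
  }
  return missing;
}
```

In a real check, fileNames could come from listing packages/cli/bin in the packed artifact, and a non-empty result would fail the release.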
