MarkItDown converter is not available out of the box in workspace/dev node installs

## Summary

MarkItDown is intended to ship with the DKG node so document conversion works out of the box. During manual validation of chat attachments for PR #446 / issue #256, a PDF attachment reached the agent correctly but extraction was reported as `skipped` because the running WSL node had no registered PDF converter.

This is separate from attachment delivery: the file was imported, stored, and forwarded to the agent with metadata. The missing piece is that the MarkItDown binary was not present/registered in the running node, so `application/pdf` was not listed as an available extraction pipeline.

## Observed Behavior

In a WSL checkout running the current branch:

```bash
ls -lah packages/cli/bin || true
command -v markitdown || true
curl -s http://127.0.0.1:9200/.well-known/skill.md | grep -i "Available extraction pipelines"
```

Output:

```text
ls: cannot access 'packages/cli/bin': No such file or directory
- **Available extraction pipelines:** text/markdown
```

Attaching a PDF in the Node UI local-agent chat produced agent-visible metadata, but no extraction:

```text
File: Invoice-8E6460E4-0046.pdf
Content type: application/pdf
Extraction status: skipped
Pipeline used: none
Triple count: 0
File hash: keccak256:e6c74d6210ee31e7aded39d754310b03b190ca725bfe9ff793e11ff481c1954b
Assertion: did:dkg:context-graph:file-attachment/assertion/.../Invoice-8E6460E4-0046.pdf
```

A manual attempt to build the converter locally then failed because the host did not have Python venv support installed:

```bash
pnpm --filter @origintrail-official/dkg run markitdown:build
```

Output:

```text
> node ./scripts/bundle-markitdown-binaries.mjs --build-current-platform

MarkItDown bundle: Command failed: python3 -m venv /tmp/dkg-markitdown-build-.../venv
ERR_PNPM_RECURSIVE_RUN_FIRST_FAIL
```

## Why This Matters

The value proposition of bundling MarkItDown is that operators and users should not need to install Python packages, `python3-venv`, PyInstaller, or MarkItDown manually before the node can process common document types such as PDF, DOCX, PPTX, XLSX, CSV, HTML, XML, and EPUB.

If the node starts with only `text/markdown` available, then PDF attachments/imports gracefully degrade to blob storage plus metadata. That is technically safe, but surprising for users because PDF support appears to be part of the shipped node capability.

## Current Code Path / Likely Cause

Relevant code observed in the repo:

- `packages/cli/package.json` runs this on install:

```json
"postinstall": "node ./scripts/bundle-markitdown-binaries.mjs --quiet --current-platform --best-effort"
```

- `packages/cli/scripts/bundle-markitdown-binaries.mjs` skips implicit release-asset download in workspace checkouts unless `--all` or `--build-current-platform` is used:

```js
if (workspace && !opts.all && !opts.buildCurrentPlatform) {
  log('MarkItDown bundle: workspace checkout detected; skipping implicit release-asset download.');
  return;
}
```

- The daemon only registers the converter if `isMarkItDownAvailable()` succeeds at startup:

```ts
const extractionRegistry = new ExtractionPipelineRegistry();
if (isMarkItDownAvailable()) {
  extractionRegistry.register(new MarkItDownConverter());
}
```

- `isMarkItDownAvailable()` requires either:
  - a verified bundled binary and sidecars in `packages/cli/bin`, or
  - a `markitdown` executable on PATH.

In this WSL/dev scenario, neither was present, so the daemon correctly exposed only `text/markdown`.
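The either/or requirement above can be sketched as a small decision function. This is only an illustration of the described behavior, not the actual `isMarkItDownAvailable()` implementation: the `ConverterProbe` shape, the `resolveConverterSource` name, and the preference for the bundled binary over PATH are all assumptions made for the example.

```ts
// Hypothetical probe results; the real checks live inside isMarkItDownAvailable().
interface ConverterProbe {
  bundledBinaryExists: boolean; // platform binary found in packages/cli/bin
  sidecarsVerified: boolean;    // sidecar files present and validated
  onPath: boolean;              // a `markitdown` executable resolves on PATH
}

type ConverterSource = "bundled" | "path" | "none";

// Assumption: a verified bundled binary wins over a PATH executable.
function resolveConverterSource(probe: ConverterProbe): ConverterSource {
  if (probe.bundledBinaryExists && probe.sidecarsVerified) return "bundled";
  if (probe.onPath) return "path";
  return "none";
}

// The WSL/dev checkout from this report: no bin directory, nothing on PATH.
const wslProbe: ConverterProbe = {
  bundledBinaryExists: false,
  sidecarsVerified: false,
  onPath: false,
};
console.log(resolveConverterSource(wslProbe)); // "none"
```

Returning a named source rather than a boolean would also give the daemon something concrete to log at startup.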

## Expected Behavior

For supported platforms, a freshly installed/runnable DKG node should have MarkItDown available without requiring users to install Python tooling or manually build the converter.

At minimum, after normal install/build/start steps:

```bash
curl -s http://127.0.0.1:9200/.well-known/skill.md | grep -i "Available extraction pipelines"
```

should include `application/pdf` and the other MarkItDown-backed content types.
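For an automated version of this check, the pipelines line can be parsed instead of grepped. The sketch below is a hypothetical helper (`parseAvailablePipelines` is not a function in the repo), and it assumes that multiple content types are comma-separated on that line; the observed output only ever showed a single entry.

```ts
// Extracts content types from the "Available extraction pipelines" line of skill.md.
// Assumption: multiple types are comma-separated on one line.
function parseAvailablePipelines(skillMd: string): string[] {
  const match = skillMd.match(/Available extraction pipelines:\*{0,2}\s*(.+)/i);
  const list = match?.[1];
  if (!list) return [];
  return list
    .split(",")
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0);
}

// The line observed on the affected WSL node.
const observedLine = "- **Available extraction pipelines:** text/markdown";
const observedPipelines = parseAvailablePipelines(observedLine);
console.log(observedPipelines.includes("application/pdf")); // false
```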

A PDF import or chat attachment should produce:

- `extraction.status = "completed"`
- `pipelineUsed = "application/pdf"`
- an `mdIntermediateHash`
- `dkg:markdownForm` pointing to the markdown intermediate
- deterministic structural triples/provenance when the converted Markdown contains extractable structure
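The first three expectations could be asserted in a test along these lines. The `ImportResult` shape is inferred from the field names used in this issue and is an assumption; the real response may nest or name these fields differently, and `pdfExtractionProblems` is a hypothetical helper.

```ts
// Assumed result shape, inferred from the field names in this issue; the real API may differ.
interface ImportResult {
  extraction: {
    status: string;
    pipelineUsed?: string;
    mdIntermediateHash?: string;
  };
}

// Returns a list of human-readable problems; an empty list means the expectations hold.
function pdfExtractionProblems(result: ImportResult): string[] {
  const problems: string[] = [];
  const { status, pipelineUsed, mdIntermediateHash } = result.extraction;
  if (status !== "completed") {
    problems.push(`extraction.status is "${status}", expected "completed"`);
  }
  if (pipelineUsed !== "application/pdf") {
    problems.push(`pipelineUsed is "${pipelineUsed ?? "none"}", expected "application/pdf"`);
  }
  if (!mdIntermediateHash) {
    problems.push("mdIntermediateHash is missing");
  }
  return problems;
}

// The skipped import observed in this report fails all three checks.
const skippedImport: ImportResult = { extraction: { status: "skipped" } };
console.log(pdfExtractionProblems(skippedImport).length); // 3
```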

## Suggested Fix Directions

1. Ensure release/install artifacts actually include the verified platform MarkItDown binary plus `.sha256` and `.meta.json` sidecars.
2. Decide whether workspace/dev checkouts should download the current-platform release asset by default, build from source, or present a clear one-command setup path that does not silently leave PDF extraction unavailable.
3. Improve daemon/operator visibility when MarkItDown is missing. For example, status/UI could explicitly say: `PDF/DOCX extraction unavailable: MarkItDown binary missing or failed validation.` Currently the symptom is only that extraction pipelines omit those types and imports are skipped.
4. Improve the `markitdown:build` failure message for common Linux/WSL prerequisites, especially missing `python3-venv`.
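Directions 3 and 4 could look roughly like the following. Both helpers are hypothetical and not part of the repo: `converterDiagnostic` uses the presence of `application/pdf` as a stand-in for "MarkItDown is registered", and `buildFailureHint` assumes the failure message contains the failing `python3 -m venv` command, as it did in the output above.

```ts
// Direction 3: turn a missing converter into an explicit operator-facing message.
// Assumption: application/pdf being listed means MarkItDown is registered.
function converterDiagnostic(availablePipelines: string[]): string | null {
  if (availablePipelines.includes("application/pdf")) return null;
  return "PDF/DOCX extraction unavailable: MarkItDown binary missing or failed validation.";
}

// Direction 4: attach a prerequisite hint to a recognizable build failure.
function buildFailureHint(errorMessage: string): string | null {
  if (/python3 -m venv/.test(errorMessage)) {
    return (
      "Creating the Python virtual environment failed. On Debian/Ubuntu " +
      "(including WSL) this usually means the python3-venv package is not installed."
    );
  }
  return null;
}

// Inputs taken from the observations in this report.
console.log(converterDiagnostic(["text/markdown"]));
console.log(
  buildFailureHint(
    "MarkItDown bundle: Command failed: python3 -m venv /tmp/dkg-markitdown-build-.../venv",
  ),
);
```

Keeping both as pure functions of already-available data (the registered pipelines and the caught error message) means they can be unit tested without a running node or a Python toolchain.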

## Acceptance Criteria

- [ ] On a supported installed node, MarkItDown-backed extraction types are available out of the box.
- [ ] `/.well-known/skill.md` lists `application/pdf` without manual Python/venv setup for normal installs.
- [ ] PDF import through `/api/assertion/import-file` completes and returns `pipelineUsed: "application/pdf"` plus `mdIntermediateHash`.
- [ ] PDF attachment through Node UI local-agent chat produces completed import metadata and agent-visible document context.
- [ ] If MarkItDown is intentionally unavailable in a dev/workspace environment, the node surfaces an actionable diagnostic rather than making users infer the missing converter from `skipped` imports.
- [ ] Tests or release checks cover the presence/validation of bundled MarkItDown assets for supported platforms.

## Related

- Follow-up from PR #446 / issue #256 manual attachment validation.
- Related to the import-file and local-agent attachment flows because converter availability determines whether PDF attachments become document context or metadata-only skipped imports.

