Replace html2text (GPL-3.0) with markdownify (MIT) #586

Open
mrummuka wants to merge 3 commits into safishamsi:v5 from mrummuka:replace-html2text-with-markdownify

@mrummuka

Summary

Swaps the optional HTML→Markdown converter used by URL ingestion from html2text (GPL-3.0) to markdownify (MIT). Aligns the pdf and all extras with the project's MIT license and removes a copyleft dependency that affected anyone redistributing or embedding graphify.

Why

html2text is the only GPL-licensed dependency in the project. Its presence in the pdf/all extras creates copyleft obligations that conflict with the project's MIT license and impact downstream redistribution.

Changes

  • graphify/ingest.py: _html_to_markdown() now calls markdownify with heading_style=ATX, bullets='-', and strip=['img']. Script and style blocks are removed, contents included, by a regex before conversion, because markdownify's strip= removes tags but keeps their inner text. This keeps CSS/JS from leaking into the emitted markdown (a small security improvement).
  • pyproject.toml: html2text → markdownify in both pdf and all extras.
  • graphify/skill.md and skills/graphify/skill.md: user-facing description updated.
  • CHANGELOG.md: Unreleased entry added.
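
The conversion step described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code in graphify/ingest.py: the function name html_to_markdown_sketch, the exact regexes, and the shape of the fallback branch are assumptions. Only the markdownify options (heading_style=ATX, bullets='-', strip=['img']) come from the description above.

```python
import re

# Drops <script>/<style> elements together with their contents.
# (Assumed pattern; the PR's actual regex is not shown in the description.)
_SCRIPT_STYLE_RE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)


def html_to_markdown_sketch(html: str) -> str:
    """Hypothetical stand-in for _html_to_markdown(); not the PR's code."""
    # Pre-strip first: markdownify's strip= would keep the inner CSS/JS text.
    cleaned = _SCRIPT_STYLE_RE.sub("", html)
    try:
        from markdownify import ATX, markdownify
    except ImportError:
        # Assumed fallback: crude tag stripping when markdownify is absent.
        return re.sub(r"\s+", " ", re.sub(r"<[^>]+>", " ", cleaned)).strip()
    return markdownify(
        cleaned, heading_style=ATX, bullets="-", strip=["img"]
    ).strip()


sample = (
    "<h1>Title</h1><style>p { color: red }</style>"
    "<script>alert(1)</script>"
    "<p>See <a href='https://example.com'>docs</a>.</p>"
)
result = html_to_markdown_sketch(sample)
print(result)
```

Whichever branch runs, the heading text and link text survive while the script and style bodies do not appear in the output.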

Behaviour preserved:

  • ATX headings (# Title)
  • Inline links kept
  • Images dropped (matches prior ignore_images=True)
  • No body-width wrapping
  • Regex-strip fallback path retained when markdownify is unavailable
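
One way to exercise the retained fallback path, regardless of whether markdownify is installed, is to place a None entry in sys.modules so that the import raises ImportError. The sketch below assumes a converter with a try/except ImportError shape; convert is a hypothetical name and the fallback regexes are illustrative, not taken from the PR.

```python
import re
import sys
from unittest import mock


def convert(html: str) -> str:
    """Hypothetical converter using an optional markdownify import."""
    try:
        from markdownify import markdownify
    except ImportError:
        # Fallback: strip tags with a regex and collapse whitespace.
        return re.sub(r"\s+", " ", re.sub(r"<[^>]+>", " ", html)).strip()
    return markdownify(html).strip()


# A None entry in sys.modules makes `import markdownify` raise ImportError,
# whether or not the package is actually installed. patch.dict restores the
# original sys.modules contents when the block exits.
with mock.patch.dict(sys.modules, {"markdownify": None}):
    fallback_output = convert("<p>Hello <b>world</b></p>")

print(fallback_output)  # Hello world
```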

Tests

Adds tests/test_html_to_markdown.py with 12 tests:

  • 9 conversion tests (paragraphs, ATX headings, links, image removal, script/style stripping, bullet lists, no wrapping, empty input, malformed HTML)
  • Fallback path test that forces ImportError
  • End-to-end _fetch_webpage integration test
  • Regression guard that fails if html2text reappears in shipped code or pyproject.toml

Baseline: 293 passed
After:    305 passed (293 + 12 new), 0 failed, 0 regressions
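
A regression guard of the kind listed above could look something like this. It is only a sketch: find_forbidden, the file patterns, and the temporary-directory demo are assumptions made for illustration, and the real test in tests/test_html_to_markdown.py may scan different paths.

```python
import tempfile
from pathlib import Path

FORBIDDEN = "html2text"


def find_forbidden(root: Path, patterns=("*.py", "pyproject.toml")) -> list:
    """Return files under root (hypothetical layout) that mention FORBIDDEN."""
    hits = []
    for pattern in patterns:
        for path in sorted(root.rglob(pattern)):
            if FORBIDDEN in path.read_text(encoding="utf-8", errors="ignore"):
                hits.append(path)
    return hits


# Demo against a throwaway directory standing in for the shipped package.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "pyproject.toml").write_text('pdf = ["markdownify"]\n', encoding="utf-8")
    (root / "ingest.py").write_text("from markdownify import markdownify\n", encoding="utf-8")
    clean_hits = find_forbidden(root)

    # Simulate the dependency creeping back in.
    (root / "ingest.py").write_text("import html2text\n", encoding="utf-8")
    dirty_hits = [p.name for p in find_forbidden(root)]
```

In a real test suite the scan would need to exclude the test file itself, since the guard necessarily contains the forbidden string.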

License posture

| Dependency | Before | After |
| --- | --- | --- |
| HTML→MD converter | html2text (GPL-3.0) | markdownify (MIT) |

This removes the only copyleft dependency in the project.

@mrummuka changed the base branch from v4 to v5 April 28, 2026 09:33

Adds 12 tests covering paragraphs, headings (ATX), links, image
removal, script/style stripping, bullet lists, line wrapping, empty
input, malformed HTML, the regex-strip fallback path, an end-to-end
_fetch_webpage smoke test, and a regression guard ensuring the
GPL-3.0 'html2text' dependency does not creep back into shipped code
or pyproject.toml.

Three tests intentionally fail against the current html2text-based
implementation; they will turn green when html2text is replaced with
markdownify (MIT) in the following commit.

Swap the optional HTML→Markdown converter used by URL ingestion from
html2text to markdownify. Aligns the 'pdf' and 'all' extras with the
project's MIT license and removes a copyleft dependency that affected
anyone redistributing or embedding graphify.

Behaviour preserved:
- Headings rendered ATX style (# Title)
- Links kept inline
- Images dropped (matches prior ignore_images=True)
- No body-width wrapping
- Regex-strip fallback path retained when markdownify is unavailable

Script/style blocks are now pre-stripped (with content) before
conversion: markdownify's strip= removes tags but preserves their
inner text, which previously leaked CSS/JS into the output.

All 305 tests pass (293 existing + 12 new).

- CHANGELOG: add Unreleased entry documenting the license-motivated swap
- skill.md (both copies): update user-facing description from
  'converted to markdown via html2text' to '...via markdownify'

@mrummuka force-pushed the replace-html2text-with-markdownify branch from 0b48f08 to 3709454 April 28, 2026 09:38