Replace html2text (GPL-3.0) with markdownify (MIT) #586
Open
mrummuka wants to merge 3 commits into safishamsi:v5
Adds 12 tests covering paragraphs, headings (ATX), links, image removal, script/style stripping, bullet lists, line wrapping, empty input, malformed HTML, the regex-strip fallback path, an end-to-end _fetch_webpage smoke test, and a regression guard ensuring the GPL-3.0 'html2text' dependency does not creep back into shipped code or pyproject.toml. Three tests intentionally fail against the current html2text-based implementation; they will turn green when html2text is replaced with markdownify (MIT) in the following commit.
Swap the optional HTML→Markdown converter used by URL ingestion from html2text to markdownify. Aligns the 'pdf' and 'all' extras with the project's MIT license and removes a copyleft dependency that affected anyone redistributing or embedding graphify.
Behaviour preserved:
- Headings rendered ATX style (# Title)
- Links kept inline
- Images dropped (matches prior ignore_images=True)
- No body-width wrapping
- Regex-strip fallback path retained when markdownify is unavailable
Script/style blocks are now pre-stripped (with content) before conversion: markdownify's strip= removes tags but preserves their inner text, which previously leaked CSS/JS into the output.
All 305 tests pass (293 existing + 12 new).
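As a rough illustration of the conversion path this commit describes, the following is a minimal sketch rather than the code in graphify/ingest.py. The function name html_to_markdown_sketch, both regular expressions, and the whitespace handling in the fallback branch are assumptions made for the example; only the markdownify options (heading_style, bullets, strip) are taken from this PR.

```python
import re

# Assumed pattern: matches <script>...</script> and <style>...</style>
# blocks together with everything between the tags.
_SCRIPT_STYLE_RE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)
# Assumed pattern for the fallback: any remaining HTML tag.
_TAG_RE = re.compile(r"<[^>]+>")


def html_to_markdown_sketch(html: str) -> str:
    """Hypothetical stand-in for _html_to_markdown()."""
    # Drop script/style blocks *with* their contents before conversion,
    # so no CSS or JS text can reach the output on either branch.
    cleaned = _SCRIPT_STYLE_RE.sub("", html)
    try:
        from markdownify import markdownify
    except ImportError:
        # Fallback when markdownify is not installed: strip the remaining
        # tags and collapse runs of whitespace into single spaces.
        text = _TAG_RE.sub(" ", cleaned)
        return re.sub(r"\s+", " ", text).strip()
    # ATX headings (# Title), '-' bullets, images dropped.
    return markdownify(
        cleaned,
        heading_style="ATX",
        bullets="-",
        strip=["img"],
    ).strip()
```

Because the script/style removal happens before the import is attempted, both the markdownify branch and the regex fallback receive the same cleaned input.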
- CHANGELOG: add Unreleased entry documenting the license-motivated swap
- skill.md (both copies): update user-facing description from 'converted to markdown via html2text' to '...via markdownify'
0b48f08 to 3709454
Summary
Swaps the optional HTML→Markdown converter used by URL ingestion from html2text (GPL-3.0) to markdownify (MIT). Aligns the pdf and all extras with the project's MIT license and removes a copyleft dependency that affected anyone redistributing or embedding graphify.
Why
html2text is the only GPL-licensed dependency in the project. Its presence in the pdf/all extras creates copyleft obligations that conflict with the project's MIT license and impact downstream redistribution.
Changes
- graphify/ingest.py: _html_to_markdown() now uses markdownify with heading_style=ATX, bullets='-', and strip=['img']. Script/style blocks are pre-stripped (with content) via regex because markdownify's strip= removes tags but keeps their inner text; pre-stripping prevents CSS/JS from leaking into the emitted markdown (small security improvement).
- pyproject.toml: html2text → markdownify in both pdf and all extras.
- graphify/skill.md and skills/graphify/skill.md: user-facing description updated.
- CHANGELOG.md: Unreleased entry added.
Behaviour preserved:
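- Headings rendered ATX style (# Title)
- Links kept inline
- Images dropped (matches prior ignore_images=True)
- No body-width wrapping
- Regex-strip fallback path retained when markdownify is unavailable
Tests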
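Adds tests/test_html_to_markdown.py with 12 tests:
- Paragraphs, headings (ATX), links, image removal, script/style stripping, bullet lists, line wrapping, empty input, and malformed HTML
- Regex-strip fallback path on ImportError
- _fetch_webpage integration test
- Regression guard that fails if html2text reappears in shipped code or pyproject.toml
License posture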
html2text (GPL-3.0) → markdownify (MIT)
Removes the only copyleft dependency in the project.
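The regression guard mentioned in the test list could take roughly the following shape. This is a sketch under assumptions: the helper name find_forbidden_references, the file patterns it scans, and the plain substring match are illustrative and are not taken from tests/test_html_to_markdown.py.

```python
from pathlib import Path

# The dependency name that must not come back.
FORBIDDEN = "html2text"


def find_forbidden_references(root, patterns=("*.py", "pyproject.toml")):
    """Return files under `root` whose text mentions FORBIDDEN.

    Hypothetical helper: `root` is the directory to scan and `patterns`
    are glob patterns matched recursively against file names.
    """
    hits = []
    for pattern in patterns:
        for path in sorted(Path(root).rglob(pattern)):
            # Read leniently so an oddly encoded file cannot crash the guard.
            text = path.read_text(encoding="utf-8", errors="ignore")
            if FORBIDDEN in text:
                hits.append(path)
    return hits
```

A test would then assert that the returned list is empty. Pointing the helper at the package directory and pyproject.toml, rather than at the whole repository, avoids flagging the test file and the changelog, both of which necessarily contain the forbidden name.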