[Feature] openbb-sec: Custom HTML2Markdown Conversion#7361

Open
deeleeramone wants to merge 15 commits into develop from feature/sec-html2markdown

@deeleeramone
Contributor

This PR refactors the SEC provider to use a custom HTML-to-Markdown converter, instead of relying on multiple libraries as alternative strategies and then cleaning up their output.

It removes two dependencies:

  • inscriptis
  • trafilatura

The new converter, openbb_sec.utils.html2markdown.html_to_markdown, is geared specifically toward SEC HTML content. It has a much higher success rate, especially for table extraction, and handles a range of scenarios and filing formats. The existing obb.equity.fundamental.management_discussion_analysis() is also better equipped to handle cases such as:

  • Section is in a different part of the same document, or is attached as an exhibit.
  • Content is heavily nested.
  • Two-column, wide-page layouts.
  • Embedded images.
[Image: mda-jpm screenshot]

You can A/B test this against the existing implementation (use include_tables=True) with symbols such as:

  • DE
  • MSFT
  • IBM
  • JPM
  • GS
  • MS
  • WMT
  • NKE
  • NDAQ
  • CRM
  • XOM
  • CVX
  • BRK-A
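
One way to run the A/B comparison suggested above is to save the Markdown produced on each branch and compare the two strings offline. The following is a minimal sketch under that assumption; the helper names `count_table_rows` and `summarize_ab` are hypothetical and are not part of openbb-sec.

```python
import difflib


def count_table_rows(markdown: str) -> int:
    """Count lines that look like Markdown table rows (pipe at both ends)."""
    count = 0
    for line in markdown.splitlines():
        stripped = line.strip()
        if len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|"):
            count += 1
    return count


def summarize_ab(old_md: str, new_md: str) -> dict:
    """Summarize how two Markdown renderings of the same filing differ.

    Both arguments are plain strings; how they are obtained is up to the caller.
    """
    matcher = difflib.SequenceMatcher(None, old_md.splitlines(), new_md.splitlines())
    return {
        "old_table_rows": count_table_rows(old_md),
        "new_table_rows": count_table_rows(new_md),
        "old_chars": len(old_md),
        "new_chars": len(new_md),
        # 1.0 means the line sequences are identical, 0.0 means nothing in common.
        "line_similarity": round(matcher.ratio(), 3),
    }
```

The two strings could come from calling the existing function with include_tables=True on each branch. How the Markdown text is pulled out of the result object is left out here, since it depends on the installed version.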

@deeleeramone added the platform (OpenBB Platform), extensions (Extension-related), and v4 (PRs for v4) labels on Feb 16, 2026
@github-actions bot added the enhancement (Enhancement) label on Feb 16, 2026
Member

@piiq left a comment

To make this readable and understandable by both humans and AI, I suggest we split the huge converter function apart. I understand the complexity of what it's doing, but with a function that large there is a risk that whoever reads it will misunderstand or misread it.

@deeleeramone
Contributor Author

@piiq, html_to_markdown and convert_table have been reduced to under 1000 lines each. Most of the recursive inner functions have been updated to use module-level functions, with the necessary values passed in from the enclosing function.
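
As an illustration of the kind of refactor described above (not the actual code from this PR), a recursive inner function can be lifted to module level by passing the state it used to capture as explicit arguments. All names below, including `Node`, `to_text_nested`, and `to_text_flat`, are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Tiny stand-in for an HTML element; hypothetical, for illustration only."""

    tag: str
    text: str = ""
    children: list[Node] = field(default_factory=list)


# Before: a recursive inner function that captures `skip_tags` and `parts`.
def to_text_nested(root: Node, skip_tags: set[str]) -> str:
    parts: list[str] = []

    def walk(node: Node) -> None:
        if node.tag in skip_tags:
            return  # drop the whole subtree
        if node.text:
            parts.append(node.text)
        for child in node.children:
            walk(child)

    walk(root)
    return " ".join(parts)


# After: the walker lives at module level and receives its state explicitly.
def _walk(node: Node, skip_tags: set[str], parts: list[str]) -> None:
    if node.tag in skip_tags:
        return  # drop the whole subtree
    if node.text:
        parts.append(node.text)
    for child in node.children:
        _walk(child, skip_tags, parts)


def to_text_flat(root: Node, skip_tags: set[str]) -> str:
    parts: list[str] = []
    _walk(root, skip_tags, parts)
    return " ".join(parts)
```

Both versions return the same result. The module-level form makes the walker's inputs visible in its signature and lets it be tested on its own, which shortens the outer function.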

@deeleeramone requested a review from piiq on February 20, 2026 16:27