feat: add html filters for markdown export #55
Walkthrough
The changes introduce new command-line options that let users include or exclude specific HTML elements using CSS-like selectors during Markdown conversion. The documentation, CLI, and
Changes
Sequence Diagram(s)
sequenceDiagram
participant User
participant CLI
participant Scraper
participant BeautifulSoup
participant MarkItDown
User->>CLI: Run with --include/-i and/or --exclude/-x options
CLI->>Scraper: Instantiate with include_filters, exclude_filters
Scraper->>BeautifulSoup: Parse HTML
alt If include_filters set
Scraper->>BeautifulSoup: Find elements matching include_filters
Scraper->>BeautifulSoup: Rebuild soup with only included elements
end
loop For each selector in exclude_filters
Scraper->>BeautifulSoup: Find elements matching selector
Scraper->>BeautifulSoup: Remove matching elements
end
Scraper->>MarkItDown: Convert filtered HTML to Markdown
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~15 minutes
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- README.md (3 hunks)
- crawler_to_md/cli.py (3 hunks)
- crawler_to_md/scraper.py (4 hunks)
- tests/test_cli.py (4 hunks)
- tests/test_scraper.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build-and-push
🔇 Additional comments (17)
README.md (4)
38-38: LGTM! Clear feature documentation.
The feature description accurately reflects the new HTML element filtering capability using CSS-like selectors.
56-56: Good improvement in command-line documentation.
The updated usage line properly reflects the renamed `--exclude-url` option and maintains clarity.
68-68: Excellent clarification of the renamed option.
The renaming from `--exclude` to `--exclude-url` with the `-e` shorthand clearly indicates this is for URL filtering, avoiding confusion with the new element exclusion functionality.
73-74: Comprehensive documentation of new filtering options.
The documentation clearly explains both include and exclude selectors with appropriate examples and indicates they are repeatable options.
tests/test_scraper.py (1)
106-145: Excellent test coverage for filtering functionality.
The test comprehensively verifies the include/exclude filtering logic:
- Properly mocks external dependencies (tempfile, os.remove)
- Tests realistic HTML content with multiple elements
- Verifies that include filters work correctly (`p` elements)
- Verifies that exclude filters work correctly (`.remove` class)
- Confirms elements outside include filters are excluded (`span`)
- Uses appropriate mocking strategy for MarkItDown converter
The test design is solid and provides good coverage of the new filtering feature.
tests/test_cli.py (3)
65-66: Good compatibility update for existing tests.
The existing proxy test functions were properly updated to include the new `include_filters` and `exclude_filters` parameters, maintaining test compatibility.
208-262: Comprehensive CLI argument testing.
The new test function properly verifies that the CLI correctly passes include and exclude filter arguments to the Scraper constructor. The test uses appropriate mocking and assertions to validate the argument propagation.
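As a rough illustration of the argument-propagation check described above, the sketch below uses only the standard library. The `Scraper` class and `main` function here are hypothetical stand-ins, not the project's real code; only the option names and the `include_filters`/`exclude_filters` keyword names come from this review.

```python
import argparse
from unittest import mock


class Scraper:  # hypothetical stand-in for the project's Scraper class
    def __init__(self, include_filters=None, exclude_filters=None):
        self.include_filters = include_filters or []
        self.exclude_filters = exclude_filters or []


def main(argv):
    """Hypothetical CLI entry point that forwards filters to Scraper."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--include", "-i", action="append", default=[])
    parser.add_argument("--exclude", "-x", action="append", default=[])
    args = parser.parse_args(argv)
    return Scraper(include_filters=args.include, exclude_filters=args.exclude)


def test_cli_passes_filters():
    fake_scraper = mock.Mock()
    # Swap the module-level Scraper name for a mock while main() runs.
    with mock.patch.dict(globals(), {"Scraper": fake_scraper}):
        main(["--include", "p", "--exclude", ".remove"])
    # The mock records the call, so the forwarded arguments can be checked.
    fake_scraper.assert_called_once_with(
        include_filters=["p"], exclude_filters=[".remove"]
    )


test_cli_passes_filters()
```

Patching the name that `main` resolves at call time keeps the test independent of any real crawling or network access.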
264-318: Good coverage of short CLI options.
This test ensures that the short options `-i` and `-x` correctly map to the include and exclude filters, providing complete coverage of the CLI interface.
crawler_to_md/cli.py (3)
71-76: Excellent clarification through option renaming.
Renaming `--exclude` to `--exclude-url` significantly improves clarity by explicitly indicating this option is for URL filtering, distinguishing it from the new HTML element exclusion functionality.
116-135: Well-implemented new CLI options.
The new `--include` and `--exclude` options are properly configured:
- Use `action="append"` for repeatability
- Have appropriate short options (`-i`, `-x`)
- Include clear help text with examples
- Default to empty lists appropriately
The implementation follows best practices for CLI argument handling.
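For reference, a minimal sketch of repeatable options configured this way might look like the following. The `prog` name and help strings are assumptions for illustration; only the flag names, `action="append"`, and the empty-list defaults are taken from the review.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with repeatable include/exclude selector options."""
    parser = argparse.ArgumentParser(prog="crawler-to-md")  # assumed name
    # action="append" collects every occurrence of the flag into one list.
    parser.add_argument(
        "--include", "-i",
        action="append",
        default=[],
        help="CSS-like selector of elements to keep (repeatable)",
    )
    parser.add_argument(
        "--exclude", "-x",
        action="append",
        default=[],
        help="CSS-like selector of elements to remove (repeatable)",
    )
    return parser


args = build_parser().parse_args(["-i", "p", "--include", "#main", "-x", ".ads"])
print(args.include)  # ['p', '#main']
print(args.exclude)  # ['.ads']
```

With no flags given, both attributes fall back to empty lists, so downstream code can iterate over them without a `None` check.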
210-219: Proper parameter propagation to Scraper.
The Scraper constructor call correctly includes the new filtering parameters:
- `exclude_patterns` uses the renamed `args.exclude_url`
- New `include_filters` and `exclude_filters` parameters added
- Parameter names are consistent with the implementation
crawler_to_md/scraper.py (6)
28-30: Good parameter addition to constructor.
The new optional parameters for filtering are properly added to the constructor signature with appropriate defaults.
41-44: Clear documentation of new parameters.
The docstring properly documents the new filtering parameters with clear descriptions of their purpose and expected format.
60-61: Proper initialization of filter attributes.
The filter lists are correctly initialized with fallback to empty lists when None is provided.
78-94: Well-implemented CSS selector parsing.
The `_find_elements` method correctly handles the three main CSS selector types:
- ID selectors with `#` prefix using `soup.find(id=...)`
- Class selectors with `.` prefix using `soup.find_all(class_=...)`
- Tag selectors using `soup.find_all(selector)`
The implementation is clean and handles edge cases (e.g., element not found returns empty list).
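The three selector cases listed above could be sketched roughly as follows. This is a standalone approximation written from the review's description, not the PR's actual `_find_elements` code; the function name `find_elements` and the sample HTML are illustrative.

```python
from bs4 import BeautifulSoup


def find_elements(soup: BeautifulSoup, selector: str) -> list:
    """Return elements matching a simple '#id', '.class', or tag selector."""
    if selector.startswith("#"):
        # find() returns a single Tag or None, so normalize to a list.
        element = soup.find(id=selector[1:])
        return [element] if element is not None else []
    if selector.startswith("."):
        return list(soup.find_all(class_=selector[1:]))
    return list(soup.find_all(selector))


html = '<div id="main"><p class="note">a</p><p>b</p><span>c</span></div>'
soup = BeautifulSoup(html, "html.parser")
print(len(find_elements(soup, "#main")))  # 1
print(len(find_elements(soup, ".note")))  # 1
print(len(find_elements(soup, "p")))      # 2
print(find_elements(soup, "#missing"))    # []
```

Note that this handles only single simple selectors; compound forms such as `div.note` or `ul > li` would be treated as tag names and match nothing.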
192-194: LGTM! Correct exclude filter implementation.
The exclude filter logic properly finds and removes elements using the `decompose()` method, which is the correct way to remove elements from BeautifulSoup.
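A small sketch of this removal step, under stated assumptions: it uses BeautifulSoup's built-in `select()` for the lookup rather than the PR's own `_find_elements` helper, and the function name `drop_excluded` and the sample HTML are made up for illustration.

```python
from bs4 import BeautifulSoup


def drop_excluded(html: str, exclude_filters: list[str]) -> str:
    """Remove every element matching any selector in exclude_filters."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in exclude_filters:
        # select() is BeautifulSoup's built-in CSS lookup, used here in
        # place of the PR's own _find_elements helper (an assumption).
        for element in soup.select(selector):
            # decompose() detaches the element from the tree and destroys it.
            element.decompose()
    return str(soup)


html = '<div><p>keep</p><p class="remove">drop</p><nav>menu</nav></div>'
print(drop_excluded(html, [".remove", "nav"]))  # <div><p>keep</p></div>
```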
201-206: Good implementation of filtered HTML processing.
The filtered HTML is properly converted to string and written to the temporary file for MarkItDown processing, maintaining the existing workflow while applying the new filtering functionality.
Use a new BeautifulSoup instance and append copies of included elements to maintain valid HTML structure, including handling for body tags.
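One way this rebuild could look is sketched below. It is an assumption-laden illustration, not the PR's code: the `keep_only` name, the empty `<html><body>` skeleton, and the use of `select()` for lookup are all choices made here, and the PR's actual handling of body tags is not shown in this review.

```python
import copy

from bs4 import BeautifulSoup


def keep_only(html: str, include_filters: list[str]) -> str:
    """Build a fresh document containing copies of the matched elements."""
    source = BeautifulSoup(html, "html.parser")
    # Start from an empty document whose body will hold the kept elements.
    rebuilt = BeautifulSoup("<html><body></body></html>", "html.parser")
    for selector in include_filters:
        for element in source.select(selector):
            # Append a copy so the source tree is left untouched.
            rebuilt.body.append(copy.copy(element))
    return str(rebuilt)


print(keep_only("<div><p>one</p><span>x</span><p>two</p></div>", ["p"]))
# <html><body><p>one</p><p>two</p></body></html>
```

Copying matters because appending a tag that already lives in another tree moves it, which would change the source document while it is still being searched.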
Summary
Testing
pytest
https://chatgpt.com/codex/tasks/task_e_688f3068bb14832e8475991938423d7e
Summary by CodeRabbit
New Features
Documentation
Tests