feat: add html filters for markdown export by obeone · Pull Request #55 · obeone/crawler-to-md

obeone · 2025-08-05T19:18:41Z

Summary

- allow filtering HTML elements with CSS-like selectors via unified include/exclude options
- expose shorthand -i/-x for selector filters and rename URL exclusion to --exclude-url
- document selector usage and cover with tests

Testing

pytest

https://chatgpt.com/codex/tasks/task_e_688f3068bb14832e8475991938423d7e

Summary by CodeRabbit

New Features
- Added command-line options to include or exclude specific HTML elements during Markdown conversion using CSS-like selectors.
- Renamed the URL exclusion option to improve clarity.
Documentation
- Updated the README with descriptions and usage examples for the new and renamed command-line options.
Tests
- Added and updated tests to verify correct handling of the new include and exclude selector options in both the CLI and scraping logic.

coderabbitai · 2025-08-05T19:18:47Z

Walkthrough

The changes introduce new command-line options to allow users to include or exclude specific HTML elements using CSS-like selectors during Markdown conversion. The documentation, CLI, and Scraper class are updated to support and process these options. Corresponding tests are added and updated to verify the correct handling and filtering of HTML elements.

Changes

| Cohort / File(s) | Change Summary |
| --- | --- |
| Documentation Update: `README.md` | Updated to document new `--include`/`-i` and `--exclude`/`-x` options for filtering HTML elements by selector, and to clarify the renaming of `--exclude` to `--exclude-url`. Usage examples and descriptions were added. |
| CLI Argument Parsing: `crawler_to_md/cli.py` | Renamed `--exclude` to `--exclude-url` for URL filtering. Added `--include`/`-i` and `--exclude`/`-x` arguments for HTML element filtering. Updated argument parsing and `Scraper` instantiation to handle new options. |
| Scraper Filtering Logic: `crawler_to_md/scraper.py` | Extended `Scraper` to accept `include_filters` and `exclude_filters` parameters. Added logic to filter HTML content using these selectors before Markdown conversion. Added a private method for element selection and updated HTML handling accordingly. |
| CLI Tests: `tests/test_cli.py` | Added tests for new include/exclude CLI options (both long and short forms). Updated monkeypatched `Scraper.__init__` in proxy and new tests to accept and verify new filter parameters. |
| Scraper Filtering Test: `tests/test_scraper.py` | Added a test to verify that `Scraper` correctly applies include and exclude selectors to filter HTML content before conversion. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI
    participant Scraper
    participant BeautifulSoup
    participant MarkItDown

    User->>CLI: Run with --include/-i and/or --exclude/-x options
    CLI->>Scraper: Instantiate with include_filters, exclude_filters
    Scraper->>BeautifulSoup: Parse HTML
    alt If include_filters set
        Scraper->>BeautifulSoup: Find elements matching include_filters
        Scraper->>BeautifulSoup: Rebuild soup with only included elements
    end
    loop For each selector in exclude_filters
        Scraper->>BeautifulSoup: Find elements matching selector
        Scraper->>BeautifulSoup: Remove matching elements
    end
    Scraper->>MarkItDown: Convert filtered HTML to Markdown
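
The include-then-exclude order shown in the diagram can be sketched with the standard library alone. This is an illustration, not the PR's code: it swaps BeautifulSoup for `xml.etree.ElementTree` (so it only accepts well-formed markup), and `matches` and `apply_filters` are hypothetical helper names. It assumes the three selector forms the review mentions (`#id`, `.class`, bare tag name).

```python
import copy
import xml.etree.ElementTree as ET
from typing import List


def matches(element: ET.Element, selector: str) -> bool:
    """Return True if the element matches a '#id', '.class', or tag selector."""
    if selector.startswith("#"):
        return element.get("id") == selector[1:]
    if selector.startswith("."):
        return selector[1:] in element.get("class", "").split()
    return element.tag == selector


def apply_filters(markup: str, include: List[str], exclude: List[str]) -> str:
    """Keep only included elements (if any are given), then drop excluded ones."""
    root = ET.fromstring(markup)

    if include:
        # Rebuild the tree from copies of the matching elements only.
        kept = ET.Element("body")
        for selector in include:
            for element in root.iter():
                if matches(element, selector):
                    kept.append(copy.deepcopy(element))
        root = kept

    for selector in exclude:
        # ElementTree has no parent pointers, so removal goes through each parent.
        for parent in list(root.iter()):
            for child in list(parent):
                if matches(child, selector):
                    parent.remove(child)

    return ET.tostring(root, encoding="unicode")


html = '<body><p>keep</p><p class="remove">drop</p><span>outside</span></body>'
print(apply_filters(html, include=["p"], exclude=[".remove"]))
# prints: <body><p>keep</p></body>
```

Running the exclude pass after the include pass means an exclude selector can still prune elements that an include selector brought in, which matches the order in the diagram.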

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~15 minutes

Poem

In the garden of code where selectors bloom,
Rabbits hop and filter out the gloom.
Include the petals, exclude the weeds,
Markdown magic grows from filtered seeds.
With every test and doc anew,
The CLI now does what you ask it to do!
🐇✨

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6365dae and d17cee6.

📒 Files selected for processing (1)

crawler_to_md/scraper.py (5 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

crawler_to_md/scraper.py

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: build-and-push

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 99b8baa and 6365dae.

📒 Files selected for processing (5)

README.md (3 hunks)
crawler_to_md/cli.py (3 hunks)
crawler_to_md/scraper.py (4 hunks)
tests/test_cli.py (4 hunks)
tests/test_scraper.py (1 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: build-and-push

🔇 Additional comments (17)

README.md (4)

38-38: LGTM! Clear feature documentation.

The feature description accurately reflects the new HTML element filtering capability using CSS-like selectors.

56-56: Good improvement in command-line documentation.

The updated usage line properly reflects the renamed --exclude-url option and maintains clarity.

68-68: Excellent clarification of the renamed option.

The renaming from --exclude to --exclude-url with the -e shorthand clearly indicates this is for URL filtering, avoiding confusion with the new element exclusion functionality.

73-74: Comprehensive documentation of new filtering options.

The documentation clearly explains both include and exclude selectors with appropriate examples and indicates they are repeatable options.

tests/test_scraper.py (1)

106-145: Excellent test coverage for filtering functionality.

The test comprehensively verifies the include/exclude filtering logic:

- Properly mocks external dependencies (tempfile, os.remove)
- Tests realistic HTML content with multiple elements
- Verifies that include filters work correctly (p elements)
- Verifies that exclude filters work correctly (.remove class)
- Confirms elements outside include filters are excluded (span)
- Uses appropriate mocking strategy for MarkItDown converter

The test design is solid and provides good coverage of the new filtering feature.

tests/test_cli.py (3)

65-66: Good compatibility update for existing tests.

The existing proxy test functions were properly updated to include the new include_filters and exclude_filters parameters, maintaining test compatibility.

208-262: Comprehensive CLI argument testing.

The new test function properly verifies that the CLI correctly passes include and exclude filter arguments to the Scraper constructor. The test uses appropriate mocking and assertions to validate the argument propagation.

264-318: Good coverage of short CLI options.

This test ensures that the short options -i and -x correctly map to the include and exclude filters, providing complete coverage of the CLI interface.

crawler_to_md/cli.py (3)

71-76: Excellent clarification through option renaming.

Renaming --exclude to --exclude-url significantly improves clarity by explicitly indicating this option is for URL filtering, distinguishing it from the new HTML element exclusion functionality.

116-135: Well-implemented new CLI options.

The new --include and --exclude options are properly configured:

- Use action="append" for repeatability
- Have appropriate short options (-i, -x)
- Include clear help text with examples
- Default to empty lists appropriately

The implementation follows best practices for CLI argument handling.
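
A minimal `argparse` sketch of options configured this way is shown below. The flag names and `action="append"` come from the review; the help text, `metavar` values, the `build_parser` helper, and the assumption that `--exclude-url` is also repeatable are illustrative and may differ from `crawler_to_md/cli.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with repeatable selector filters (illustrative only)."""
    parser = argparse.ArgumentParser(prog="crawler-to-md")
    # URL filtering, renamed from --exclude so it no longer clashes with selectors.
    parser.add_argument(
        "--exclude-url", "-e", action="append", default=[], metavar="PATTERN",
        help="Skip URLs containing this pattern (repeatable).",
    )
    # HTML element filtering with CSS-like selectors.
    parser.add_argument(
        "--include", "-i", action="append", default=[], metavar="SELECTOR",
        help="Keep only elements matching this selector, e.g. '#main' (repeatable).",
    )
    parser.add_argument(
        "--exclude", "-x", action="append", default=[], metavar="SELECTOR",
        help="Remove elements matching this selector, e.g. '.ads' (repeatable).",
    )
    return parser


args = build_parser().parse_args(["-i", "p", "-i", "#main", "-x", ".remove"])
print(args.include, args.exclude, args.exclude_url)
# prints: ['p', '#main'] ['.remove'] []
```

With `action="append"`, each repetition of a flag adds one entry to the list, and `argparse` copies the default list before appending, so the shared `default=[]` is not mutated between parses.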

210-219: Proper parameter propagation to Scraper.

The Scraper constructor call correctly includes the new filtering parameters:

- exclude_patterns uses the renamed args.exclude_url
- New include_filters and exclude_filters parameters added
- Parameter names are consistent with the implementation
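
The mapping from parsed arguments to constructor keywords can be sketched as follows. The keyword names `exclude_patterns`, `include_filters`, and `exclude_filters` and the attribute `args.exclude_url` are taken from the review; `args.include`, `args.exclude`, and the helper `scraper_filter_kwargs` are assumptions made for illustration.

```python
import argparse
from typing import Any, Dict


def scraper_filter_kwargs(args: argparse.Namespace) -> Dict[str, Any]:
    """Translate CLI attributes into the filter keywords passed to Scraper."""
    return {
        # URL patterns now come from the renamed --exclude-url option.
        "exclude_patterns": args.exclude_url,
        # Selector filters come from --include/-i and --exclude/-x.
        "include_filters": args.include,
        "exclude_filters": args.exclude,
    }


namespace = argparse.Namespace(exclude_url=["/private"], include=["p"], exclude=[".remove"])
print(scraper_filter_kwargs(namespace))
```

Keeping URL patterns and selector filters under distinct keyword names avoids the ambiguity the old `--exclude` name would have caused.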

crawler_to_md/scraper.py (6)

28-30: Good parameter addition to constructor.

The new optional parameters for filtering are properly added to the constructor signature with appropriate defaults.

41-44: Clear documentation of new parameters.

The docstring properly documents the new filtering parameters with clear descriptions of their purpose and expected format.

60-61: Proper initialization of filter attributes.

The filter lists are correctly initialized with fallback to empty lists when None is provided.
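
The `None`-to-empty-list fallback described here is a common Python idiom. A minimal sketch, using a hypothetical `FilterSettings` class rather than the real `Scraper` constructor, might look like this:

```python
from typing import List, Optional


class FilterSettings:
    """Hypothetical stand-in showing only the filter-related constructor logic."""

    def __init__(
        self,
        include_filters: Optional[List[str]] = None,
        exclude_filters: Optional[List[str]] = None,
    ) -> None:
        # Default to None in the signature and build a fresh list per instance,
        # so instances never share one mutable default list.
        self.include_filters = include_filters or []
        self.exclude_filters = exclude_filters or []


settings = FilterSettings(include_filters=["p"])
print(settings.include_filters, settings.exclude_filters)
# prints: ['p'] []
```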

78-94: Well-implemented CSS selector parsing.

The _find_elements method correctly handles the three main CSS selector types:

- ID selectors with # prefix using soup.find(id=...)
- Class selectors with . prefix using soup.find_all(class_=...)
- Tag selectors using soup.find_all(selector)

The implementation is clean and handles edge cases (e.g., element not found returns empty list).
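
The three-way dispatch described above can be sketched as below. This is a reconstruction from the review text, not the PR's `_find_elements` method: `parse_selector` and `find_elements` are hypothetical names, and `soup` is assumed to be a BeautifulSoup object (or anything exposing compatible `find` and `find_all` methods).

```python
from typing import Any, List, Tuple


def parse_selector(selector: str) -> Tuple[str, str]:
    """Classify a selector as ('id', name), ('class', name), or ('tag', name)."""
    if selector.startswith("#"):
        return ("id", selector[1:])
    if selector.startswith("."):
        return ("class", selector[1:])
    return ("tag", selector)


def find_elements(soup: Any, selector: str) -> List[Any]:
    """Return all elements in `soup` matching a simple CSS-like selector."""
    kind, value = parse_selector(selector)
    if kind == "id":
        element = soup.find(id=value)
        # find() returns None when nothing matches; normalize to an empty list.
        return [element] if element is not None else []
    if kind == "class":
        return list(soup.find_all(class_=value))
    return list(soup.find_all(value))


print(parse_selector("#main"), parse_selector(".remove"), parse_selector("p"))
# prints: ('id', 'main') ('class', 'remove') ('tag', 'p')
```

A dispatch like this treats compound selectors such as `div.note` as a literal tag name, so they match nothing. BeautifulSoup's `select()` method accepts full CSS selectors and would be one way to lift that limitation.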

192-194: LGTM! Correct exclude filter implementation.

The exclude filter logic properly finds and removes elements using the decompose() method, which is the correct way to remove elements from BeautifulSoup.

201-206: Good implementation of filtered HTML processing.

The filtered HTML is properly converted to string and written to the temporary file for MarkItDown processing, maintaining the existing workflow while applying the new filtering functionality.

feat(filters): support css selectors for html include and exclude

6365dae

obeone added the codex label Aug 5, 2025 — with ChatGPT Codex Connector

coderabbitai Bot reviewed Aug 5, 2025

Comment thread crawler_to_md/scraper.py

fix(scraper): preserve HTML structure when applying include filters

d17cee6

Use a new BeautifulSoup instance and append copies of included elements to maintain valid HTML structure, including handling for body tags.

obeone merged commit 583e719 into main Aug 6, 2025
6 checks passed

obeone deleted the codex/add-options-for-html-element-inclusion branch August 6, 2025 00:37
