Add include URL CLI filter #63

Merged: obeone merged 1 commit into main from codex/add-include-url-argument-for-filtering on Nov 19, 2025

Conversation

@obeone (Owner) commented Nov 18, 2025

Summary

  • add a new --include-url/-I argument to restrict scraping to URLs containing specified strings
  • enforce include URL patterns during link validation and discovery
  • extend CLI and scraper tests to cover the new filtering behavior

Testing

  • pytest

Summary by CodeRabbit

Release Notes

  • New Features
    • Added --include-url (-I) command-line option to filter crawling to only URLs containing specified strings. This option can be specified multiple times for multiple inclusion patterns, providing complementary filtering alongside existing exclusion options.

@coderabbitai (Bot, Contributor) commented Nov 18, 2025

Walkthrough

A new CLI option --include-url (-I) is introduced to enable URL filtering by inclusion patterns. The parameter is threaded from CLI argument parsing through Scraper initialization and integrated into the link validation logic, complementing existing exclude patterns.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| CLI Argument Handling: crawler_to_md/cli.py | Added --include-url / -I CLI option with append action and default empty list; threaded args.include_url as include_url_patterns parameter to Scraper initialization. |
| Scraper Core Logic: crawler_to_md/scraper.py | Added include_url_patterns parameter to __init__, stored as instance attribute (default []), and integrated into is_valid_link to require at least one pattern match when patterns are provided. Updated docstring accordingly. |
| Tests: tests/test_cli.py, tests/test_scraper.py | Updated Scraper mock/fake initializers to accept include_url_patterns parameter; added test_cli_include_url_option to verify CLI passes include patterns to Scraper; updated existing test fixtures to thread new parameter through test scenarios. |

Sequence Diagram

sequenceDiagram
    actor User
    participant CLI as CLI Parser
    participant Scraper
    participant Validator as is_valid_link

    User->>CLI: --include-url /docs --include-url /api
    CLI->>CLI: Parse & collect patterns
    CLI->>Scraper: Initialize with include_url_patterns=["/docs", "/api"]
    Scraper->>Scraper: Store include_url_patterns
    
    loop During scraping
        Scraper->>Validator: is_valid_link(url)
        alt include_url_patterns provided
            Validator->>Validator: Check if url contains any pattern
            Validator-->>Scraper: Valid if match found
        else include_url_patterns empty
            Validator-->>Scraper: Valid (no restriction)
        end
    end

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

  • Homogeneous changes across files: parameter threading follows consistent pattern
  • Straightforward filtering logic: simple containment check in is_valid_link
  • No complex control flow modifications or interdependencies
  • Test updates are formulaic with repeated mock initialization patterns

Poem

🐰 A new filter hops into view,
Include-url patterns, tried and true!
URLs now dance in acceptance's light,
CLI whispers, Scraper holds tight.
Filters multiply, both exclude and include
Web crawling's now refined and renewed! 🕷️

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 44.83%, which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The PR title 'Add include URL CLI filter' directly matches the main objective: implementing a new --include-url CLI option for URL filtering. The title is concise, specific, and clearly summarizes the primary change. |

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 583e719 and 67010a4.

📒 Files selected for processing (4)
  • crawler_to_md/cli.py (2 hunks)
  • crawler_to_md/scraper.py (4 hunks)
  • tests/test_cli.py (6 hunks)
  • tests/test_scraper.py (11 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
tests/test_cli.py (2)
crawler_to_md/scraper.py (1)
  • Scraper (20-376)
crawler_to_md/cli.py (1)
  • main (20-269)
tests/test_scraper.py (1)
crawler_to_md/scraper.py (3)
  • is_valid_link (100-122)
  • Scraper (20-376)
  • fetch_links (124-174)
🪛 Ruff (0.14.5)
tests/test_cli.py

  • 61-61: Unused function argument: include_url_patterns (ARG001)
  • 106-106: Unused function argument: include_url_patterns (ARG001)
  • 151-151: Unused function argument: include_url_patterns (ARG001)
  • 225-225: Unused function argument: include_url_patterns (ARG001)
  • 282-282: Unused function argument: include_url_patterns (ARG001)
  • 336-336: Unused function argument: self (ARG001)
  • 337-337: Unused function argument: base_url (ARG001)
  • 338-338: Unused function argument: exclude_patterns (ARG001)
  • 340-340: Unused function argument: db_manager (ARG001)
  • 341-341: Unused function argument: rate_limit (ARG001)
  • 342-342: Unused function argument: delay (ARG001)
  • 343-343: Unused function argument: proxy (ARG001)
  • 344-344: Unused function argument: include_filters (ARG001)
  • 345-345: Unused function argument: exclude_filters (ARG001)
  • 356-356: Unused lambda argument: a (ARG005)
  • 356-356: Unused lambda argument: k (ARG005)
  • 357-357: Unused lambda argument: a (ARG005)
  • 357-357: Unused lambda argument: k (ARG005)
  • 358-358: Unused lambda argument: a (ARG005)
  • 358-358: Unused lambda argument: k (ARG005)

tests/test_scraper.py

  • 311-311: Unused function argument: monkeypatch (ARG001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build-and-push
🔇 Additional comments (11)
crawler_to_md/scraper.py (4)

25-25: LGTM! Parameter positioning is logical.

The include_url_patterns parameter is well-positioned after exclude_patterns, grouping URL filtering parameters together.


39-39: LGTM! Documentation is clear.

The docstring accurately describes the include_url_patterns parameter's purpose.


55-55: LGTM! Initialization follows established pattern.

The initialization is consistent with the exclude_patterns handling and properly defaults to an empty list.


114-117: LGTM! Include pattern validation is correctly implemented.

The validation logic requires a URL to contain at least one include pattern when include_url_patterns is non-empty; with no patterns, no include restriction applies. The use of any() short-circuits on the first match.
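
For readers without the diff at hand, the include check can be sketched roughly as below. This is an illustration only, not the code in crawler_to_md/scraper.py: the helper name `matches_include_patterns` is hypothetical, and plain substring matching is assumed from the PR summary ("URLs containing specified strings").

```python
from typing import Iterable


def matches_include_patterns(url: str, include_url_patterns: Iterable[str]) -> bool:
    """Return True when the URL passes the include filter.

    Hypothetical helper: an empty pattern list means "no restriction";
    otherwise the URL must contain at least one pattern as a substring.
    """
    patterns = list(include_url_patterns)
    if not patterns:
        return True
    # any() stops evaluating at the first pattern found in the URL.
    return any(pattern in url for pattern in patterns)
```

In the real is_valid_link this check presumably sits alongside the existing exclude-pattern handling, which is not reproduced here.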

crawler_to_md/cli.py (2)

77-83: LGTM! CLI argument properly configured.

The --include-url option is well-configured with action="append" to support multiple patterns. The help text clearly describes its purpose.
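
A minimal sketch of an argparse option configured this way, assuming nothing about the rest of cli.py: the `build_parser` function, the `example-crawler` program name, and the help text are placeholders; only the flag names, `action="append"`, and the empty-list default come from the review.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; only the --include-url option mirrors the PR.
    parser = argparse.ArgumentParser(prog="example-crawler")
    parser.add_argument(
        "--include-url",
        "-I",
        action="append",  # each occurrence adds one pattern to the list
        default=[],       # no flag given means no include restriction
        help="Only crawl URLs containing this string (repeatable).",
    )
    return parser


args = build_parser().parse_args(["--include-url", "/docs", "-I", "/api"])
print(args.include_url)
```

argparse stores the values under `args.include_url`, which matches the attribute the walkthrough says is passed to Scraper as include_url_patterns.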


220-220: LGTM! CLI argument correctly passed to Scraper.

The include_url_patterns argument is properly threaded from the CLI parser to the Scraper initialization.

tests/test_scraper.py (3)

28-38: LGTM! Existing test properly updated.

The test correctly includes the new include_url_patterns=[] parameter while preserving the original test logic.


40-47: LGTM! Include pattern test provides good coverage.

The test effectively validates that URLs matching the include pattern are accepted while non-matching URLs are rejected.


67-81: LGTM! Integration test thoroughly validates the feature.

This test confirms that fetch_links correctly applies include URL filtering when extracting links from HTML, ensuring only URLs matching the specified pattern are returned.
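
The behaviour this test exercises, extracting links from HTML and keeping only those that match an include pattern, might look like the following standard-library sketch. It is not the project's fetch_links: `extract_included_links` and `_LinkCollector` are hypothetical names, and the real implementation may parse HTML differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_included_links(
    html: str, base_url: str, include_url_patterns: list[str]
) -> set[str]:
    """Resolve links against base_url and apply the include filter."""
    collector = _LinkCollector()
    collector.feed(html)
    links = {urljoin(base_url, href) for href in collector.hrefs}
    if not include_url_patterns:
        return links  # no patterns: nothing is filtered out
    return {
        link for link in links
        if any(pattern in link for pattern in include_url_patterns)
    }
```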

tests/test_cli.py (2)

57-68: LGTM! Mock signature correctly updated.

The fake initializer properly includes the new include_url_patterns parameter to match the updated Scraper signature.


325-375: LGTM! CLI test thoroughly validates parameter passing.

This test effectively verifies that the --include-url option is correctly parsed and passed to the Scraper as include_url_patterns. The test structure is consistent with other CLI option tests in the file.
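
The general shape of such a test, a fake Scraper that records what the CLI hands it, can be sketched as below. `FakeScraper` and `run_cli` are hypothetical stand-ins; the actual test_cli_include_url_option patches the real Scraper and calls the real main.

```python
import argparse


class FakeScraper:
    """Stand-in that records the keyword arguments it was built with."""

    captured: dict = {}

    def __init__(self, **kwargs):
        FakeScraper.captured = kwargs


def run_cli(argv, scraper_cls):
    # Hypothetical, heavily reduced CLI entry point.
    parser = argparse.ArgumentParser()
    parser.add_argument("--include-url", "-I", action="append", default=[])
    args = parser.parse_args(argv)
    scraper_cls(include_url_patterns=args.include_url)


run_cli(["--include-url", "/docs", "-I", "/api"], FakeScraper)
print(FakeScraper.captured)
```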

@obeone obeone merged commit daca214 into main Nov 19, 2025
6 checks passed