[Bug]: URLPatternFilter removes adjacent slash from wildcard "/path/*" to make it "/path*" #1003

Open
@tedvalson

crawl4ai version

v0.5.0

Expected Behavior

When I want to match the pattern "/this/*", it should not match "/this_is_wrong/".

Current Behavior

I deep crawl with the pattern "https://langchain-ai.github.io/langgraph/*" and it will match links from "/langgraphjs/" as well.
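The mismatch can be illustrated without crawl4ai at all. The sketch below uses Python's standard `fnmatch` module as a stand-in for glob matching; it is not crawl4ai's implementation. The `stripped` pattern assumes, as the issue title describes, that the slash before the wildcard gets dropped. The `glob_match` helper and the `/tutorials/` URL are hypothetical and exist only for illustration.

```python
from fnmatch import fnmatchcase


def glob_match(url: str, pattern: str) -> bool:
    """Case-sensitive glob match; '*' matches any run of characters, including '/'."""
    return fnmatchcase(url, pattern)


# Pattern as written by the user.
pattern = "https://langchain-ai.github.io/langgraph/*"
# Assumed effective pattern once the slash before '*' is dropped (per the issue title).
stripped = "https://langchain-ai.github.io/langgraph*"

inside = "https://langchain-ai.github.io/langgraph/tutorials/"  # illustrative URL
outside = "https://langchain-ai.github.io/langgraphjs/"

print(glob_match(inside, pattern))    # True: expected match
print(glob_match(outside, pattern))   # False: expected behavior
print(glob_match(outside, stripped))  # True: the behavior reported here
```

With the slash kept, "/langgraphjs/" is rejected because the character after "langgraph" must be "/". Without it, the wildcard absorbs "js/" and the URL passes the filter.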

Is this reproducible?

Yes

Inputs Causing the Bug

Steps to Reproduce

Code snippets

The script below reproduces the problem, though running it is probably not necessary to see the issue.

import asyncio
from crawl4ai import (
    AsyncWebCrawler,
    CacheMode,
    ContentTypeFilter,
    CrawlerRunConfig,
    BFSDeepCrawlStrategy,
    FilterChain,
    URLPatternFilter,
)


async def crawl(url):
    filter_chain = FilterChain(
        [
            # Should not crawl anything in "/langgraphjs/", yet it does
            URLPatternFilter(patterns=[f"{url}*"]),
            ContentTypeFilter(allowed_types=["text/html"]),
        ]
    )

    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,
            filter_chain=filter_chain,
        ),
        verbose=True,
        cache_mode=CacheMode.ENABLED,
    )

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun(
            url,
            config=config,
        )

        for result in results:
            print(f"URL: {result.url}")


if __name__ == "__main__":
    asyncio.run(crawl("https://langchain-ai.github.io/langgraph/"))

OS

All

Python version

All

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

No response

Labels

🐞 Bug (Something isn't working), 📌 Root caused (identified the root cause of bug)
