Description
crawl4ai version
v0.5.0
Expected Behavior
A pattern such as "/this/*" should match only URLs under "/this/"; it should not match "/this_is_wrong/".
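As a reference point for the expected glob semantics, here is a minimal sketch using Python's standard-library `fnmatch`. It is only an illustration of the behavior I expect, not necessarily what crawl4ai uses internally; the helper name `matches` is hypothetical.

```python
from fnmatch import fnmatchcase

PATTERN = "https://langchain-ai.github.io/langgraph/*"


def matches(url: str, pattern: str = PATTERN) -> bool:
    """Return True if the URL matches the glob pattern (case-sensitive)."""
    return fnmatchcase(url, pattern)


# A page under /langgraph/ is in scope.
print(matches("https://langchain-ai.github.io/langgraph/tutorials/"))  # True
# A sibling path that merely shares the prefix is not.
print(matches("https://langchain-ai.github.io/langgraphjs/"))  # False
```

The literal "/" before the "*" is what separates "/langgraph/" from "/langgraphjs/", so it has to take part in the match.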
Current Behavior
When I deep crawl with the pattern "https://langchain-ai.github.io/langgraph/*", the crawler also follows links under "/langgraphjs/".
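One way this kind of over-matching can arise is if a "prefix*" pattern is reduced to a plain prefix check and the trailing "/" is stripped together with the "*". Whether crawl4ai v0.5.0 actually does this is an assumption on my part, not something confirmed here; the function `prefix_match` and its `strip_slash` flag are hypothetical and exist only to show the difference.

```python
def prefix_match(url: str, pattern: str, strip_slash: bool) -> bool:
    """Hypothetical prefix matcher: turn 'prefix*' into a startswith check."""
    if strip_slash:
        # Strips both '*' and '/', so ".../langgraph/*" becomes ".../langgraph".
        prefix = pattern.rstrip("*/")
    else:
        # Strips only '*', so the trailing '/' stays part of the prefix.
        prefix = pattern.rstrip("*")
    return url.startswith(prefix)


pattern = "https://langchain-ai.github.io/langgraph/*"
sibling = "https://langchain-ai.github.io/langgraphjs/"

print(prefix_match(sibling, pattern, strip_slash=True))   # True (over-matches)
print(prefix_match(sibling, pattern, strip_slash=False))  # False (expected)
```

If something like this is the cause, keeping the trailing slash in the stored prefix would be enough to exclude "/langgraphjs/".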
Is this reproducible?
Yes
Inputs Causing the Bug
Steps to Reproduce
Code snippets
The following script reproduces the problem, though it's probably not necessary to run it to see the issue.
import asyncio

from crawl4ai import (
    AsyncWebCrawler,
    CacheMode,
    ContentTypeFilter,
    CrawlerRunConfig,
    BFSDeepCrawlStrategy,
    FilterChain,
    URLPatternFilter,
)


async def crawl(url):
    filter_chain = FilterChain(
        [
            # Should not crawl anything in "/langgraphjs/", yet it does
            URLPatternFilter(patterns=[f"{url}*"]),
            ContentTypeFilter(allowed_types=["text/html"]),
        ]
    )
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,
            filter_chain=filter_chain,
        ),
        verbose=True,
        cache_mode=CacheMode.ENABLED,
    )
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun(
            url,
            config=config,
        )
        for result in results:
            print(f"URL: {result.url}")


if __name__ == "__main__":
    asyncio.run(crawl("https://langchain-ai.github.io/langgraph/"))
OS
All
Python version
All
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
No response