@thusdigital

Summary

  • Fix internal link parsing in crawl_recursive_internal_links function
  • Resolve issue where relative paths were being incorrectly processed during recursive crawling
  • Prevent duplicate URLs and failed crawls caused by malformed internal link handling

Problem

The crawler did not handle internal links correctly: relative paths were resolved incorrectly during recursive crawling. This produced duplicate URLs and failed crawl attempts, and left gibberish in the vector database.

Solution

  • Add proper URL correction logic for internal links
  • Check whether the normalized URL is a substring of the internal link's href
  • Extract the relative path by removing the base URL from the internal link
  • Reconstruct the proper URL by combining the base URL with the relative path
  • Maintain consistency with existing variable names
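
The steps above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the helper names `normalize_url` and `correct_internal_link` are hypothetical, and it assumes that a malformed href may contain the base URL more than once and that normalization only strips the `#fragment`.

```python
from urllib.parse import urldefrag, urljoin


def normalize_url(url: str) -> str:
    """Drop the #fragment so the same page is not queued twice (assumed behavior)."""
    return urldefrag(url)[0]


def correct_internal_link(page_url: str, href: str) -> str:
    """Hypothetical helper: rebuild a clean absolute URL for an internal link."""
    base = normalize_url(page_url)
    href = normalize_url(href)

    if base in href:
        # Keep only what follows the last occurrence of the base URL,
        # which discards any duplicated prefix in a malformed href.
        relative_path = href.rsplit(base, 1)[1]
        if not relative_path:
            return base
        # Recombine the base URL and the relative path with exactly one slash.
        return base.rstrip("/") + "/" + relative_path.lstrip("/")

    # The href does not contain the base URL: resolve it as a relative link.
    return urljoin(base, href)
```

For hrefs that are merely relative, `urllib.parse.urljoin` from the standard library already performs the resolution; the substring branch only matters when the href arrives with the base URL embedded in it.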

Changes

  • Enhanced URL resolution in crawl_recursive_internal_links function
  • Added detailed comments explaining the fix
  • Cleaned up debug logging and unnecessary comments

Test plan

  • Verify URL resolution works correctly for relative paths
  • Confirm no duplicate URLs are generated
  • Test recursive crawling completes successfully
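
One way to exercise this plan without network access is to crawl a small in-memory link graph. The sketch below is an assumption-laden stand-in, not the project's test suite: `FAKE_SITE` and `crawl_order` are hypothetical names, and the real `crawl_recursive_internal_links` function is not called.

```python
from urllib.parse import urldefrag, urljoin

# Hypothetical in-memory site: page URL -> hrefs found on that page.
FAKE_SITE = {
    "https://example.com/": ["docs/", "/docs/#top", "https://example.com/"],
    "https://example.com/docs/": ["guide", "../", "guide#install"],
    "https://example.com/docs/guide": ["/"],
}


def crawl_order(start_url, site, max_depth=3):
    """Breadth-first walk that records each normalized URL at most once."""
    visited = []
    seen = set()
    frontier = [urldefrag(start_url)[0]]
    for _ in range(max_depth):
        next_frontier = []
        for url in frontier:
            if url in seen:
                continue
            seen.add(url)
            visited.append(url)
            for href in site.get(url, []):
                # Resolve relative hrefs against the current page, drop fragments.
                target = urldefrag(urljoin(url, href))[0]
                if target not in seen:
                    next_frontier.append(target)
        frontier = next_frontier
        if not frontier:
            break
    return visited


visited = crawl_order("https://example.com/", FAKE_SITE)
```

Checking that `visited` contains no repeated entries and that the call returns covers the second and third items of the plan for this toy graph.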

by Seth Havens - thus(digital) ltd - thusdigital.com

Resolved issue where relative paths were being incorrectly processed during recursive crawling, causing duplicate URLs and failed crawls.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@thusdigital
Author

@coleam00 hope this helps - great work dude!

0xSylice added a commit to 0xSylice/mcp-crawl4ai-rag that referenced this pull request Aug 20, 2025
0xSylice added a commit to 0xSylice/mcp-crawl4ai-rag that referenced this pull request Aug 23, 2025