@thusdigital
Summary

  • Enhanced source_id extraction for GitHub repositories to use full repo path format
  • Refactored source_id extraction into a dedicated utility function with special GitHub handling

Changes

  • Added extract_source_id() function in utils.py with GitHub-specific logic
  • Updated all source_id extraction calls to use the new centralized function
  • GitHub repos now get source_ids like github.com/user/repo instead of just github.com
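A minimal sketch of what such a helper could look like, assuming it is built on `urllib.parse` and takes a single URL string. The body below is illustrative only and is not copied from this PR; how the real `extract_source_id()` treats `www.` prefixes or GitHub URLs without a repo segment is an assumption.

```python
from urllib.parse import urlparse


def extract_source_id(url: str) -> str:
    """Illustrative sketch: derive a source_id from a URL.

    GitHub URLs map to 'github.com/user/repo'; every other URL
    falls back to its domain.
    """
    parsed = urlparse(url)
    domain = parsed.netloc.lower()

    # Assumption: 'www.github.com' is treated the same as 'github.com'.
    if domain in ("github.com", "www.github.com"):
        # Drop empty segments produced by leading or trailing slashes.
        parts = [segment for segment in parsed.path.split("/") if segment]
        if len(parts) >= 2:
            user, repo = parts[0], parts[1]
            return f"github.com/{user}/{repo}"

    # Non-GitHub URLs, and GitHub URLs without a user/repo path, use the domain.
    return domain
```

With this sketch, any page inside a repository (for example a file under `/blob/main/`) resolves to the same repo-level source_id, which is what makes per-repository filtering possible.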

Benefits

  • More precise source filtering when querying RAG data from specific GitHub repositories
  • Consistent source_id extraction logic across the codebase
  • Cleaner separation of concerns with dedicated utility function

Test Results

  • Tested with mcp-mem0 repository ingestion
  • Confirmed source_id correctly shows as github.com/thusdigital/mcp-mem0
  • Hybrid search and filtering work correctly with the new format

by Seth Havens - thus(digital) ltd - thusdigital.com

🤖 Generated with Claude Code

Co-Authored-By: Claude [email protected]

thusdigital and others added 2 commits July 10, 2025 16:57
Resolved an issue where relative paths were processed incorrectly during recursive crawling, causing duplicate URLs and failed crawls.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
… domain

- Add extract_source_id() helper function to detect GitHub URLs
- For GitHub URLs, extract user/repo from path (e.g., 'github.com/user/repo')
- For other URLs, maintain existing domain-based behavior
- Update all source_id assignments in utils.py and crawl4ai_mcp.py
- Enables fine-grained filtering by specific GitHub repositories

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>