@theycallmeswift

This PR enhances the parse_github_repository tool with comprehensive documentation processing capabilities, storing repository documentation in Supabase for semantic search and RAG operations alongside the existing Neo4j code analysis.

✨ Key Features Added

📚 Documentation Processing

  • Multi-format support: Processes .md, .rst, .txt, and .ipynb files from GitHub repositories
  • Built-in Jupyter notebook conversion: A pure Python function parses the notebook JSON and converts it to markdown, replacing nbconvert and avoiding an external dependency
  • Smart content discovery: Automatically discovers documentation files while excluding test directories and large files
  • Intelligent chunking: Header-aware markdown chunking for better semantic retrieval
  • Code example extraction: Extracts and indexes code blocks for agentic RAG capabilities
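
For reviewers, the discovery step described above can be pictured roughly as follows. This is a minimal sketch, not the code in src/utils.py: the function name `discover_doc_files`, the excluded directory names, and the 1 MB size cap are all assumptions made for illustration; only the four file extensions come from this PR description.

```python
from pathlib import Path

DOC_EXTENSIONS = {".md", ".rst", ".txt", ".ipynb"}  # formats listed in this PR
EXCLUDED_DIRS = {"test", "tests"}                   # assumed exclusion list
MAX_FILE_BYTES = 1_000_000                          # assumed cap; the PR states no number


def discover_doc_files(repo_root: str) -> list[Path]:
    """Return documentation files under repo_root, skipping test dirs and large files."""
    root = Path(repo_root)
    found = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in DOC_EXTENSIONS:
            continue
        # Skip files nested under any excluded directory name.
        parent_parts = path.relative_to(root).parts[:-1]
        if any(part.lower() in EXCLUDED_DIRS for part in parent_parts):
            continue
        # Skip files above the (assumed) size limit.
        if path.stat().st_size > MAX_FILE_BYTES:
            continue
        found.append(path)
    return found
```

Checking the size with `stat()` before reading keeps large files from ever being loaded into memory.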

🔄 Enhanced Repository Processing

  • Dual processing pipeline: Simultaneously handles code analysis (Neo4j) and documentation storage (Supabase)
  • Unified repository sources: Creates consistent source IDs for linking code and documentation
  • Comprehensive metadata: Stores file paths, types, and repository information for better organization
  • Error handling: Graceful partial processing with detailed reporting

🧪 Comprehensive Test Infrastructure

  • E2E test suite: End-to-end tests for complete repository processing workflows
  • Unit test coverage: Extensive testing for documentation processing functions
  • Test utilities: Reusable database helpers and MCP client for consistent testing
  • Environment validation: Tests verify all required services are properly configured

📁 Files Changed

  • Core functionality: Enhanced src/utils.py with 500+ lines of documentation processing logic
  • MCP integration: Updated src/crawl4ai_mcp.py to integrate documentation processing into repository parsing
  • Test suite: Added comprehensive test infrastructure with E2E and unit tests
  • Configuration: Updated pyproject.toml with new test dependencies and project structure

🚀 Benefits

  1. Enhanced RAG capabilities: Repository documentation is now searchable and retrievable for better AI assistance
  2. Jupyter notebook support: Seamlessly processes and indexes notebook content for documentation sites
  3. Production ready: Comprehensive test coverage ensures reliability and maintainability
  4. Extensible architecture: Clean separation of concerns allows for easy future enhancements

🔗 Integration Points

  • Works seamlessly with existing Neo4j knowledge graph functionality
  • Leverages established Supabase storage patterns
  • Maintains compatibility with all existing MCP tools and workflows

🧪 Testing

  • All new functionality is covered by both unit and E2E tests
  • Tests validate integration with real Supabase and Neo4j instances
  • Comprehensive test utilities ensure consistent testing patterns

This enhancement significantly expands the MCP server's capabilities, making it a comprehensive solution for both code analysis and documentation processing from GitHub repositories.

Enhanced the repository extractor validation in parse_github_repository to also check for a valid driver. Added .cursor to .gitignore to prevent tracking editor-specific files.
Added support for processing repository documentation files and storing them in Supabase alongside Neo4j code analysis. The DirectNeo4jExtractor now accepts an optional Supabase client and processes documentation using the new process_repository_docs function. Documentation chunking, metadata, and code example extraction are handled in utils.py, and the parse_github_repository tool now returns both code and documentation processing results.
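
To illustrate what header-aware chunking and code example extraction could look like, here is a small sketch. It is not the implementation in utils.py: the names `chunk_markdown_by_headers` and `extract_code_blocks`, the `max_chars` default, and the returned dictionary shape are hypothetical.

```python
import re

FENCE = "`" * 3  # built programmatically so this example contains no literal fence


def chunk_markdown_by_headers(text: str, max_chars: int = 2000) -> list[str]:
    """Split markdown at headers, ignoring '#' lines inside fenced code."""
    chunks, current, in_fence = [], [], False
    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # A header outside a code fence closes the current chunk.
        if not in_fence and re.match(r"#{1,6}\s", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Hard-split any section that is still longer than max_chars.
    sized = []
    for chunk in chunks:
        for start in range(0, len(chunk), max_chars):
            sized.append(chunk[start:start + max_chars])
    return [c for c in sized if c]


def extract_code_blocks(text: str) -> list[dict]:
    """Collect fenced code blocks together with their declared language."""
    pattern = re.compile(FENCE + r"(\w*)\n(.*?)" + FENCE, re.DOTALL)
    return [
        {"language": m.group(1) or "text", "code": m.group(2).rstrip()}
        for m in pattern.finditer(text)
    ]
```

Tracking fence state matters because Python comments inside code blocks would otherwise be mistaken for headers and split a code example across chunks.
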
Replaces nbconvert with a pure Python function for converting Jupyter notebooks to markdown in documentation processing. Updates process_document_files to use the new function and adds comprehensive unit tests for documentation discovery, processing, and metadata extraction. Enhances .gitignore for test artifacts and configures test dependencies and pytest options in pyproject.toml.
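
A conversion without nbconvert might look roughly like the sketch below. It assumes the nbformat v4 layout (a top-level `cells` list whose `source` is a string or a list of line strings); the function name `notebook_to_markdown` and the choice to drop outputs and raw cells are assumptions, not a description of the merged code.

```python
import json

FENCE = "`" * 3  # built programmatically so this example contains no literal fence


def notebook_to_markdown(notebook_json: str) -> str:
    """Convert a Jupyter notebook (JSON text) to markdown using only the stdlib."""
    nb = json.loads(notebook_json)
    # Fall back to "python" when the kernel language is not recorded.
    language = nb.get("metadata", {}).get("kernelspec", {}).get("language", "python")
    parts = []
    for cell in nb.get("cells", []):
        source = cell.get("source", "")
        if isinstance(source, list):  # source may be stored as a list of lines
            source = "".join(source)
        source = source.strip()
        if not source:
            continue
        if cell.get("cell_type") == "markdown":
            parts.append(source)
        elif cell.get("cell_type") == "code":
            parts.append(f"{FENCE}{language}\n{source}\n{FENCE}")
        # Other cell types (e.g. raw) and cell outputs are ignored in this sketch.
    return "\n\n".join(parts)
```

Emitting code cells as fenced blocks means the same markdown chunking and code example extraction can run on notebooks and on ordinary markdown files.
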
Introduces end-to-end (E2E) test support with new helpers for database and MCP server interaction, adds E2E tests for GitHub repository parsing, and splits unit and E2E test configurations. Updates pyproject.toml with new dependencies, test markers, and coverage settings. Moves and refines unit test fixtures, and adds a Makefile for common development tasks.
Refactored all knowledge_graphs module imports to use relative imports for package compatibility. Improved create_repository_source_id in utils.py to normalize both SSH and HTTPS repository URLs to a consistent format, and updated related tests for consistency. Added ruff as a development dependency and updated .gitignore and pyproject.toml for ruff support. Cleaned up test and Makefile targets to clarify unit vs. e2e tests. Minor code and logging improvements throughout for clarity and maintainability.
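
As a rough picture of the SSH/HTTPS normalization, consider the sketch below. The helper name `normalize_repo_url` and the `host/owner/repo` output format are assumptions; the actual format produced by create_repository_source_id is not shown in this PR description.

```python
import re


def normalize_repo_url(url: str) -> str:
    """Map SSH and HTTPS repository URLs to 'host/owner/repo' (assumed format)."""
    url = url.strip()
    # scp-style SSH form, e.g. git@github.com:owner/repo.git
    match = re.match(r"^[\w.-]+@([\w.-]+):(.+)$", url)
    if match:
        host, path = match.groups()
    else:
        # URL form, e.g. https://github.com/owner/repo or ssh://git@github.com/owner/repo
        stripped = re.sub(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", "", url)
        stripped = stripped.split("@", 1)[-1]  # drop any "user@" prefix
        host, _, path = stripped.partition("/")
    path = path.strip("/")
    if path.endswith(".git"):
        path = path[: -len(".git")]
    return f"{host.lower()}/{path}"
```

With a single canonical form, the same repository cloned over SSH or HTTPS maps to one source ID, so code and documentation records stay linked.
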
Introduced construct_doc_url to standardize documentation URL creation using repository source IDs. Updated process_repository_docs to use the new function and improved formatting in utils.py. Adjusted the error-handling test to patch the correct function.