Feat: Add repository documentation storage and processing with comprehensive test suite #80

theycallmeswift · 2025-08-05T19:43:10Z

This PR enhances the parse_github_repository tool with comprehensive documentation processing capabilities, storing repository documentation in Supabase for semantic search and RAG operations alongside the existing Neo4j code analysis.

✨ Key Features Added

📚 Documentation Processing

Multi-format support: Processes .md, .rst, .txt, and .ipynb files from GitHub repositories
Built-in Jupyter notebook conversion: Custom JSON parsing to convert notebooks to markdown without external dependencies
Smart content discovery: Automatically discovers documentation files while excluding test directories and large files
Intelligent chunking: Header-aware markdown chunking for better semantic retrieval
Code example extraction: Extracts and indexes code blocks for agentic RAG capabilities
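
As a rough illustration of the discovery step, the sketch below walks a cloned repository and keeps only the supported documentation files. It is not the code added to src/utils.py: the function name `discover_doc_files`, the excluded directory names, and the 1 MB size cap are assumptions made for this example.

```python
from pathlib import Path

# Extensions named in the PR description.
DOC_EXTENSIONS = {".md", ".rst", ".txt", ".ipynb"}
# Hypothetical values: the PR does not state the excluded names or the size cap.
EXCLUDED_DIRS = {"test", "tests", "node_modules", ".git"}
MAX_FILE_BYTES = 1_000_000


def discover_doc_files(repo_root: Path) -> list[Path]:
    """Return documentation files under repo_root, skipping excluded dirs and large files."""
    found = []
    for path in sorted(repo_root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in DOC_EXTENSIONS:
            continue
        # Check only directory components below the repo root, not the root's own path.
        rel_dirs = path.relative_to(repo_root).parts[:-1]
        if any(part in EXCLUDED_DIRS for part in rel_dirs):
            continue
        if path.stat().st_size > MAX_FILE_BYTES:
            continue
        found.append(path)
    return found
```

Matching on paths relative to the repository root avoids accidentally excluding everything when the clone itself lives under a directory named `tests`.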

🔄 Enhanced Repository Processing

Dual processing pipeline: Simultaneously handles code analysis (Neo4j) and documentation storage (Supabase)
Unified repository sources: Creates consistent source IDs for linking code and documentation
Comprehensive metadata: Stores file paths, types, and repository information for better organization
Error handling: Graceful partial processing with detailed reporting
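
Graceful partial processing generally means that one unreadable file does not abort the whole run. A minimal sketch of that pattern follows; `process_files_with_report`, the `process_one` callback, and the report keys are hypothetical names for illustration and may not match what the PR returns.

```python
from typing import Callable


def process_files_with_report(
    paths: list[str], process_one: Callable[[str], int]
) -> dict:
    """Process each file independently and collect failures instead of raising."""
    report = {"processed": 0, "chunks": 0, "errors": []}
    for path in paths:
        try:
            # process_one is assumed to return the number of chunks it stored.
            report["chunks"] += process_one(path)
            report["processed"] += 1
        except Exception as exc:  # keep going so one bad file does not abort the run
            report["errors"].append({"file": path, "error": str(exc)})
    return report
```

The caller can then return the report alongside the Neo4j results, so a partially successful run is still visible as such.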

🧪 Comprehensive Test Infrastructure

E2E test suite: End-to-end tests for complete repository processing workflows
Unit test coverage: Extensive testing for documentation processing functions
Test utilities: Reusable database helpers and MCP client for consistent testing
Environment validation: Tests verify all required services are properly configured

📁 Files Changed

Core functionality: Enhanced src/utils.py with 500+ lines of documentation processing logic
MCP integration: Updated src/crawl4ai_mcp.py to integrate documentation processing into repository parsing
Test suite: Added comprehensive test infrastructure with E2E and unit tests
Configuration: Updated pyproject.toml with new test dependencies and project structure

🚀 Benefits

Enhanced RAG capabilities: Repository documentation is now searchable and retrievable for better AI assistance
Jupyter notebook support: Seamlessly processes and indexes notebook content for documentation sites
Production ready: Unit and E2E tests cover the new functionality, which supports reliability and maintainability
Extensible architecture: Clean separation of concerns allows for easy future enhancements

🔗 Integration Points

Works seamlessly with existing Neo4j knowledge graph functionality
Leverages established Supabase storage patterns
Maintains compatibility with all existing MCP tools and workflows

🧪 Testing

All new functionality is covered by both unit and E2E tests
Tests validate integration with real Supabase and Neo4j instances
Comprehensive test utilities ensure consistent testing patterns

With this change, the MCP server handles both code analysis and documentation processing for GitHub repositories.

Enhanced the repository extractor validation in parse_github_repository to also check for a valid driver. Added .cursor to .gitignore to prevent tracking editor-specific files.

Added support for processing repository documentation files and storing them in Supabase alongside Neo4j code analysis. The DirectNeo4jExtractor now accepts an optional Supabase client and processes documentation using the new process_repository_docs function. Documentation chunking, metadata, and code example extraction are handled in utils.py, and the parse_github_repository tool now returns both code and documentation processing results.
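
For readers unfamiliar with header-aware chunking and code example extraction, here is one way the two steps could look. This is a sketch under assumptions, not the utils.py implementation: the names `chunk_markdown_by_headers` and `extract_code_blocks`, the 2000-character limit, and the returned dictionary keys are all invented for this example.

```python
import re

FENCE = "`" * 3  # built programmatically so this example can sit inside a fenced block
HEADER_RE = re.compile(r"^#{1,6}\s+\S")
FENCE_RE = re.compile(
    rf"^{FENCE}(\w*)[ \t]*\n(.*?)^{FENCE}[ \t]*$", re.DOTALL | re.MULTILINE
)


def chunk_markdown_by_headers(text: str, max_chars: int = 2000) -> list[str]:
    """Split markdown at header lines, then hard-split any oversized section."""
    sections, current, in_fence = [], [], False
    for line in text.splitlines():
        if line.startswith(FENCE):
            in_fence = not in_fence
        # A '#' line inside a code fence is a comment, not a header.
        if HEADER_RE.match(line) and not in_fence and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    chunks = []
    for section in sections:
        for start in range(0, len(section), max_chars):
            piece = section[start:start + max_chars].strip()
            if piece:
                chunks.append(piece)
    return chunks


def extract_code_blocks(text: str) -> list[dict]:
    """Return fenced code blocks with their (possibly empty) language tag."""
    return [
        {"language": m.group(1), "code": m.group(2).rstrip("\n")}
        for m in FENCE_RE.finditer(text)
    ]
```

The hard split for oversized sections is deliberately naive and can cut through a code fence; a real implementation would likely prefer paragraph or fence boundaries.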

Replaces nbconvert with a pure-Python function for converting Jupyter notebooks to markdown in documentation processing. Updates process_document_files to use the new function and adds comprehensive unit tests for documentation discovery, processing, and metadata extraction. Extends .gitignore to cover test artifacts and configures test dependencies and pytest options in pyproject.toml.
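
A notebook is plain JSON, so the conversion can be done with the standard library alone. The sketch below shows the general idea; the function name `notebook_to_markdown` is hypothetical, and details such as ignoring raw cells and cell outputs are assumptions rather than a description of the function in this PR.

```python
import json

FENCE = "`" * 3  # built programmatically so this example can sit inside a fenced block


def notebook_to_markdown(notebook_json: str) -> str:
    """Convert .ipynb JSON text to markdown using only the standard library."""
    nb = json.loads(notebook_json)
    # nbformat 4 notebooks usually record the language under metadata.language_info.name.
    language = nb.get("metadata", {}).get("language_info", {}).get("name", "python")
    parts = []
    for cell in nb.get("cells", []):
        source = cell.get("source", "")
        # "source" may be a single string or a list of line strings.
        if isinstance(source, list):
            source = "".join(source)
        source = source.strip()
        if not source:
            continue
        if cell.get("cell_type") == "markdown":
            parts.append(source)
        elif cell.get("cell_type") == "code":
            parts.append(f"{FENCE}{language}\n{source}\n{FENCE}")
        # Raw cells and cell outputs are ignored in this sketch.
    return "\n\n".join(parts)
```

Emitting code cells as fenced blocks means the same code-block extraction used for markdown files also works on converted notebooks.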

Introduces end-to-end (E2E) test support with new helpers for database and MCP server interaction, adds E2E tests for GitHub repository parsing, and splits unit and E2E test configurations. Updates pyproject.toml with new dependencies, test markers, and coverage settings. Moves and refines unit test fixtures, and adds a Makefile for common development tasks.

Refactored all knowledge_graphs module imports to use relative imports for package compatibility. Improved create_repository_source_id in utils.py to normalize both SSH and HTTPS repository URLs to a consistent format, and updated related tests for consistency. Added ruff as a development dependency and updated .gitignore and pyproject.toml for ruff support. Cleaned up test and Makefile targets to clarify unit vs. e2e tests. Minor code and logging improvements throughout for clarity and maintainability.
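
To illustrate the SSH/HTTPS normalization, here is a minimal sketch. The exact output format of create_repository_source_id is not stated in this PR, so the `host/owner/repo` form and the function name `normalize_repo_url` are assumptions for this example.

```python
import re


def normalize_repo_url(url: str) -> str:
    """Map SSH and HTTPS Git URLs to a shared 'host/owner/repo' form."""
    url = url.strip()
    # scp-like SSH syntax: git@github.com:owner/repo.git
    match = re.match(r"^[\w.-]+@([\w.-]+):(.+)$", url)
    if match:
        host, path = match.groups()
    else:
        # URL syntax: https://github.com/owner/repo or ssh://git@github.com/owner/repo
        without_scheme = re.sub(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", "", url)
        without_user = without_scheme.split("@", 1)[-1]
        host, _, path = without_user.partition("/")
    path = path.strip("/")
    if path.endswith(".git"):
        path = path[: -len(".git")]
    return f"{host.lower()}/{path}"
```

With a shared form like this, the same repository cloned over SSH or HTTPS maps to one source ID, which is what lets code and documentation records be linked.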

Introduced construct_doc_url to standardize documentation URL creation using repository source IDs. Updated process_repository_docs to use the new function and improved formatting in utils.py. Adjusted test to patch the correct function for error handling.
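
The signature and URL layout of construct_doc_url are not shown in this PR, so the following is only a guess at the general shape. It assumes a source ID of the form `host/owner/repo`, a GitHub-style `/blob/<branch>/` path, and a default branch of `main`; the name `build_doc_url` is hypothetical.

```python
from urllib.parse import quote


def build_doc_url(source_id: str, relative_path: str, branch: str = "main") -> str:
    """Build a GitHub-style blob URL for a documentation file."""
    # Percent-encode spaces and other unsafe characters but keep path separators.
    clean_path = quote(relative_path.lstrip("/"), safe="/")
    return f"https://{source_id}/blob/{quote(branch, safe='/')}/{clean_path}"
```

Centralizing URL construction in one helper keeps stored document URLs consistent with the source IDs used elsewhere.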

theycallmeswift added 7 commits August 4, 2025 22:49

Improve repo extractor check and update .gitignore

e260713

Update README.md

8c3924d
