The src directory has been successfully restructured to support multiple package sources (GitHub, GitLab, SourceForge, etc.) through a modern, scalable architecture.
src/
├── main.py # Main entry point (replaces GitHub.py)
├── core/
│ └── processor.py # Unified package processor
├── sources/ # Source-specific implementations
│ ├── base/ # Base classes and interfaces
│ │ ├── __init__.py # Core interfaces and data models
│ │ ├── url_matcher.py # Base URL matching logic
│ │ └── filter_base.py # Base filtering logic
│ ├── github/ # GitHub source implementation
│ │ ├── __init__.py
│ │ └── github_source.py # Complete GitHub integration
│ ├── gitlab/ # GitLab source implementation
│ │ ├── __init__.py
│ │ └── gitlab_source.py # GitLab integration (template)
│ ├── sourceforge/ # SourceForge source implementation
│ │ ├── __init__.py
│ │ └── sourceforge_source.py # SourceForge integration (template)
│ ├── __init__.py # Source registration and exports
│ └── registry.py # Source factory and registry
├── utils/
│ └── file_utils.py # File operation utilities
└── monitoring/
└── logging_setup.py # Centralized logging configuration
- All sources implement the same
BasePackageSourceinterface - Consistent
PackageMetadata,ReleaseInfo, andRepositoryInfodata models - Auto-detection of source type from URLs
- Easy to add new sources (Bitbucket, custom repos, etc.)
- Plugin-like registration system
- Source-specific configuration and filtering
- Unified
UnifiedPackageProcessorhandles all sources - Async processing support (framework ready)
- Comprehensive error handling and statistics
- Source registry and factory pattern
- Configurable filtering per source
- Advanced URL pattern extraction
# Process packages from CSV with auto-source detection
python src/main.py input.csv
# Show available sources and their status
python src/main.py --status
# Use custom configuration
python src/main.py input.csv --config config.yamlfrom src.sources import auto_detect_and_process, get_factory, SourceType
# Auto-detect source and process URL
package = auto_detect_and_process("https://github.com/owner/repo")
# Use specific source
factory = get_factory()
github_source = factory.create_source(SourceType.GITHUB, config)
package = github_source.extract_package_info("https://github.com/owner/repo")A migration script is provided to help transition:
python scripts/migrate_to_multi_source.pyThis script:
- Creates backups of legacy files
- Marks old files as deprecated
- Creates example configuration
- Generates migration documentation
To add support for a new source (e.g., Bitbucket):
- Create source module:
src/sources/bitbucket/bitbucket_source.py - Implement interface: Extend
BasePackageSource - Register source: Add to
sources/__init__.py - Add configuration: Update config schema
Example:
class BitbucketSource(BasePackageSource):
@property
def source_type(self) -> SourceType:
return SourceType.BITBUCKET
def is_supported_url(self, url: str) -> bool:
return 'bitbucket.org' in url
def extract_package_info(self, url: str) -> Optional[PackageMetadata]:
# Implementation here
passExample multi-source configuration:
source_configs:
github:
github_tokens: ["ghp_token1", "ghp_token2"]
filter:
min_stars: 10
exclude_templates: true
gitlab:
gitlab_token: "glpat_token"
gitlab_instances: ["gitlab.com"]
sourceforge:
filter:
exclude_patterns: ["abandoned", "deprecated"]
processing:
async_enabled: true
max_concurrent: 10
output:
include_summary: true
format: "csv"- Easy addition of new package sources
- Independent source development and testing
- Configurable source-specific behavior
- Clear separation of concerns
- Consistent interfaces across sources
- Centralized configuration and logging
- Process single or multiple sources
- Source-specific filtering and processing
- Pluggable architecture
- Async processing framework
- Efficient source detection
- Optimized URL pattern matching
The new structure includes:
- Unit tests for each source
- Integration tests for the unified processor
- End-to-end tests for the complete pipeline
- Additional Sources: Bitbucket, npm, PyPI, etc.
- Advanced Filtering: ML-based package quality scoring
- Caching: Intelligent caching of API responses
- Monitoring: Enhanced metrics and performance tracking
- Parallel Processing: Multi-source concurrent processing
- Legacy files are preserved as
.deprecated - Migration guide provides step-by-step transition
- Old functionality is preserved in the new architecture
- Gradual migration path supported
This restructure provides a solid foundation for supporting multiple package sources while maintaining the existing functionality and providing clear paths for future expansion.