Skip to content

Latest commit

 

History

History
196 lines (151 loc) · 5.77 KB

File metadata and controls

196 lines (151 loc) · 5.77 KB

Multi-Source Architecture - Implementation Summary

Overview

The src directory has been successfully restructured to support multiple package sources (GitHub, GitLab, SourceForge, etc.) through a modern, scalable architecture.

New Structure

src/
├── main.py                     # Main entry point (replaces GitHub.py)
├── core/
│   └── processor.py            # Unified package processor
├── sources/                    # Source-specific implementations
│   ├── base/                   # Base classes and interfaces
│   │   ├── __init__.py         # Core interfaces and data models
│   │   ├── url_matcher.py      # Base URL matching logic
│   │   └── filter_base.py      # Base filtering logic
│   ├── github/                 # GitHub source implementation
│   │   ├── __init__.py
│   │   └── github_source.py    # Complete GitHub integration
│   ├── gitlab/                 # GitLab source implementation
│   │   ├── __init__.py
│   │   └── gitlab_source.py    # GitLab integration (template)
│   ├── sourceforge/            # SourceForge source implementation
│   │   ├── __init__.py
│   │   └── sourceforge_source.py  # SourceForge integration (template)
│   ├── __init__.py             # Source registration and exports
│   └── registry.py             # Source factory and registry
├── utils/
│   └── file_utils.py           # File operation utilities
└── monitoring/
    └── logging_setup.py        # Centralized logging configuration

Key Features

1. Unified Interface

  • All sources implement the same BasePackageSource interface
  • Consistent PackageMetadata, ReleaseInfo, and RepositoryInfo data models
  • Auto-detection of source type from URLs

2. Extensible Architecture

  • Easy to add new sources (Bitbucket, custom repos, etc.)
  • Plugin-like registration system
  • Source-specific configuration and filtering

3. Enhanced Processing

  • Unified UnifiedPackageProcessor handles all sources
  • Async processing support (framework ready)
  • Comprehensive error handling and statistics

4. Modern Tooling

  • Source registry and factory pattern
  • Configurable filtering per source
  • Advanced URL pattern extraction

Usage Examples

Basic Usage

# Process packages from CSV with auto-source detection
python src/main.py input.csv

# Show available sources and their status
python src/main.py --status

# Use custom configuration
python src/main.py input.csv --config config.yaml

Programmatic Usage

from src.sources import auto_detect_and_process, get_factory, SourceType

# Auto-detect source and process URL
package = auto_detect_and_process("https://github.com/owner/repo")

# Use specific source
factory = get_factory()
github_source = factory.create_source(SourceType.GITHUB, config)
package = github_source.extract_package_info("https://github.com/owner/repo")

Migration from Legacy Code

A migration script is provided to help transition:

python scripts/migrate_to_multi_source.py

This script:

  • Creates backups of legacy files
  • Marks old files as deprecated
  • Creates example configuration
  • Generates migration documentation

Adding New Sources

To add support for a new source (e.g., Bitbucket):

  1. Create source module: src/sources/bitbucket/bitbucket_source.py
  2. Implement interface: Extend BasePackageSource
  3. Register source: Add to sources/__init__.py
  4. Add configuration: Update config schema

Example:

class BitbucketSource(BasePackageSource):
    @property
    def source_type(self) -> SourceType:
        return SourceType.BITBUCKET
    
    def is_supported_url(self, url: str) -> bool:
        return 'bitbucket.org' in url
    
    def extract_package_info(self, url: str) -> Optional[PackageMetadata]:
        # Implementation here
        pass

Configuration

Example multi-source configuration:

source_configs:
  github:
    github_tokens: ["ghp_token1", "ghp_token2"]
    filter:
      min_stars: 10
      exclude_templates: true
  
  gitlab:
    gitlab_token: "glpat_token"
    gitlab_instances: ["gitlab.com"]
  
  sourceforge:
    filter:
      exclude_patterns: ["abandoned", "deprecated"]

processing:
  async_enabled: true
  max_concurrent: 10

output:
  include_summary: true
  format: "csv"

Benefits

1. Scalability

  • Easy addition of new package sources
  • Independent source development and testing
  • Configurable source-specific behavior

2. Maintainability

  • Clear separation of concerns
  • Consistent interfaces across sources
  • Centralized configuration and logging

3. Flexibility

  • Process single or multiple sources
  • Source-specific filtering and processing
  • Pluggable architecture

4. Performance

  • Async processing framework
  • Efficient source detection
  • Optimized URL pattern matching

Testing

The new structure includes:

  • Unit tests for each source
  • Integration tests for the unified processor
  • End-to-end tests for the complete pipeline

Future Enhancements

  1. Additional Sources: Bitbucket, npm, PyPI, etc.
  2. Advanced Filtering: ML-based package quality scoring
  3. Caching: Intelligent caching of API responses
  4. Monitoring: Enhanced metrics and performance tracking
  5. Parallel Processing: Multi-source concurrent processing

Backward Compatibility

  • Legacy files are preserved as .deprecated
  • Migration guide provides step-by-step transition
  • Old functionality is preserved in the new architecture
  • Gradual migration path supported

This restructure provides a solid foundation for supporting multiple package sources while maintaining the existing functionality and providing clear paths for future expansion.