Proposal: Restructure Project into Monorepo with Clear Library/Service Separation #23

@MNIKIEMA

📋 Summary

Restructure the codebase to eliminate code duplication, improve maintainability, and establish clear boundaries between reusable libraries and deployable services.

🔍 Problem Statement

Current Issues

  1. Code Duplication

    • Taxonomy definitions exist in two places:
      • src/wsl_library/domain/
      • rag_system/taxonomy/
    • This leads to:
      • ❌ Inconsistencies between versions
      • ❌ Double maintenance effort
      • ❌ Risk of divergence over time
  2. Unclear Separation of Concerns

    • wsl_library mixes domain models, extraction utilities, scraping code, and use cases
    • Difficult to understand what depends on what
    • Hard to test components in isolation
  3. Dependency Management Complexity

    • Multiple pyproject.toml files with overlapping dependencies:
      • Root pyproject.toml
      • rag_system/pipeline_scripts/pyproject.toml
      • rag_system/kotaemon/pyproject.toml
      • rag_system/pipeline_scripts/agentic_data_policies_extraction/requirements.txt
    • Mixed Poetry and requirements.txt formats
    • Version conflicts between packages
  4. Docker Build Issues

    • Dev/Prod stages are nearly identical (only -e flag differs)
    • Docker Compose references non-existent build target (target: full)
    • Unnecessary rebuilds due to poor layer caching
  5. Testing Difficulties

    • Cannot easily test libraries independently
    • Integration tests mixed with unit tests
    • No clear test strategy per component

💡 Proposed Solution

  • Start with local improvements inside src and rag_system separately
  • Then unify the two into the monorepo structure once each has been cleaned up
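A possible first local step is to measure how far the two taxonomy copies have already diverged. The sketch below assumes the taxonomy definitions are `.py` files and that both directories use the same relative file names; neither is confirmed in this issue. `file_digests` and `compare_trees` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def file_digests(root: Path) -> dict[str, str]:
    """Map each .py file under root (by relative path) to a SHA-256 of its bytes."""
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*.py"))
    }


def compare_trees(left: Path, right: Path) -> dict[str, list[str]]:
    """Classify files as present on one side only, differing, or identical."""
    a, b = file_digests(left), file_digests(right)
    shared = a.keys() & b.keys()
    return {
        "only_left": sorted(a.keys() - b.keys()),
        "only_right": sorted(b.keys() - a.keys()),
        "differing": sorted(name for name in shared if a[name] != b[name]),
        "identical": sorted(name for name in shared if a[name] == b[name]),
    }
```

Calling `compare_trees(Path("src/wsl_library/domain"), Path("rag_system/taxonomy"))` from the repository root would list which files need manual reconciliation before one copy can be deleted. The check is byte-level, so whitespace-only differences also show up as "differing".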

📊 Benefits

  • ✅ Single source of truth (no duplication)
  • ✅ Clear dependency graph
  • ✅ Easier to onboard new developers

⚠️ Risks and Mitigations

| Risk | Impact | Mitigation |
|---|---|---|
| Breaking existing workflows | High | Migration script with automated backups |
| Import statement errors | High | Automated search-replace + full test suite |
| Docker build failures | Medium | Test in isolated branch first |
| Lost work during migration | High | Git branch + multiple backups |
| Team confusion | Medium | Documentation + walkthrough session |
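The "automated search-replace" mitigation for import errors could look roughly like the sketch below. It is an illustration only: `rewrite_imports` is a hypothetical helper, and which package becomes the canonical home of the taxonomy is still undecided in this proposal. It handles `import x` and `from x import y` at the start of a line, but not comma-separated imports such as `import a, x`.

```python
import re


def rewrite_imports(source: str, old: str, new: str) -> str:
    """Rewrite `import old[.sub]` and `from old[.sub] import ...` to use `new`."""
    pattern = re.compile(
        # Group 1 keeps the indentation and the keyword; the lookahead makes sure
        # `old` ends at a module boundary (whitespace, a dot, or end of line).
        rf"^([ \t]*(?:from|import)[ \t]+){re.escape(old)}(?=[\s.]|$)",
        flags=re.MULTILINE,
    )
    return pattern.sub(lambda m: m.group(1) + new, source)
```

For example, `rewrite_imports(text, "rag_system.taxonomy", "wsl_library.domain")` would leave `import rag_system.taxonomy_extra` untouched because of the boundary check. A regex pass like this cannot see dynamic imports or module paths held in strings, which is why the table pairs it with a full test suite run.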

🤔 Discussion Points

  1. Timing: When is the best time to do this migration?
  2. Testing: Do we have sufficient test coverage to catch breakages?
  3. CI/CD: What pipeline changes are needed?
  4. Documentation: What documentation needs to be updated?
  5. Dependencies: Are there external dependencies on current structure?
  6. Kotaemon: Should it remain as a git submodule or be integrated differently?

📝 Open Questions

  • Do we want to publish any libraries to PyPI eventually?
  • Do we need separate CI/CD pipelines per service?
  • Should we version libraries independently or keep monorepo versioning?
  • Are there any external tools/scripts that depend on current paths?

🗳️ Voting Options

Please react to indicate your preference:

  • 👍 Approve - Let's do this restructure
  • 👀 Review needed - I have questions/concerns
  • 🔄 Iterate - Good idea but needs changes
  • 👎 Reject - Not worth the effort
