📋 Summary
Restructure the codebase to eliminate code duplication, improve maintainability, and establish clear boundaries between reusable libraries and deployable services.
🔍 Problem Statement
Current Issues
- Code Duplication
  - Taxonomy definitions exist in two places:
    src/wsl_library/domain/rag_system/taxonomy/
  - This leads to:
    - ❌ Inconsistencies between versions
    - ❌ Double maintenance effort
    - ❌ Risk of divergence over time
- Unclear Separation of Concerns
  - wsl_library mixes domain models, extraction utilities, scraping code, and use cases
  - Difficult to understand what depends on what
  - Hard to test components in isolation
- Dependency Management Complexity
  - Multiple pyproject.toml files with overlapping dependencies:
    - Root pyproject.toml
    - rag_system/pipeline_scripts/pyproject.toml
    - rag_system/kotaemon/pyproject.toml
    - rag_system/pipeline_scripts/agentic_data_policies_extraction/requirements.txt
  - Mixed Poetry and requirements.txt formats
  - Version conflicts between packages
- Docker Build Issues
  - Dev/Prod stages are nearly identical (only -e flag differs)
  - Docker Compose references non-existent build target (target: full)
  - Unnecessary rebuilds due to poor layer caching
- Testing Difficulties
  - Cannot easily test libraries independently
  - Integration tests mixed with unit tests
  - No clear test strategy per component
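One way to quantify the Code Duplication issue before restructuring is to compare the two taxonomy locations file by file. The sketch below is an illustration only, not part of the proposal: `compare_trees` and its arguments are hypothetical names, and it assumes both locations use the same relative file layout.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def compare_trees(dir_a: Path, dir_b: Path, pattern: str = "*.py") -> dict[str, list[str]]:
    """Compare files that share a relative path in two directory trees.

    Returns relative paths grouped under 'identical', 'diverged',
    'only_in_a' and 'only_in_b'. (Hypothetical helper for illustration.)
    """
    files_a = {p.relative_to(dir_a).as_posix(): p for p in dir_a.rglob(pattern) if p.is_file()}
    files_b = {p.relative_to(dir_b).as_posix(): p for p in dir_b.rglob(pattern) if p.is_file()}

    report: dict[str, list[str]] = {
        "identical": [],
        "diverged": [],
        "only_in_a": [],
        "only_in_b": [],
    }
    for rel in sorted(files_a.keys() | files_b.keys()):
        if rel not in files_b:
            report["only_in_a"].append(rel)
        elif rel not in files_a:
            report["only_in_b"].append(rel)
        elif file_digest(files_a[rel]) == file_digest(files_b[rel]):
            report["identical"].append(rel)
        else:
            report["diverged"].append(rel)
    return report
```

Called with the two taxonomy directories, the `diverged` entries show where the copies have already drifted apart, and the `identical` entries are the safest candidates to collapse first.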
💡 Proposed Solution
- We can start with local improvements in src and rag_system
- Then unify them afterwards
📊 Benefits
- ✅ Single source of truth (no duplication)
- ✅ Clear dependency graph
- ✅ Easier to onboard new developers
⚠️ Risks and Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| Breaking existing workflows | High | Migration script with automated backups |
| Import statement errors | High | Automated search-replace + full test suite |
| Docker build failures | Medium | Test in isolated branch first |
| Lost work during migration | High | Git branch + multiple backups |
| Team confusion | Medium | Documentation + walkthrough session |
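The "Automated search-replace" mitigation for import errors could look roughly like the sketch below. It is a minimal illustration under assumptions: `rewrite_imports` is a hypothetical helper, `wsl_core` is a made-up target package name (the issue does not name one), and only statements that begin a line with the package as the first imported name are handled, so `import os, wsl_library` is not covered.

```python
import re


def rewrite_imports(source: str, old_pkg: str, new_pkg: str) -> str:
    """Rewrite `import old_pkg...` and `from old_pkg... import ...` lines.

    The package is matched as a whole dotted prefix, so a different package
    such as `old_pkg_extra` is left untouched. Text that merely mentions the
    package elsewhere on a line (strings, comments) is not changed.
    """
    pattern = re.compile(
        rf"^([ \t]*(?:from|import)[ \t]+){re.escape(old_pkg)}(?=[\s.,]|$)",
        flags=re.MULTILINE,
    )
    # A function replacement avoids any escape processing of new_pkg.
    return pattern.sub(lambda m: m.group(1) + new_pkg, source)


# Usage with a hypothetical target package name:
sample = "from wsl_library.domain import taxonomy"
print(rewrite_imports(sample, "wsl_library", "wsl_core"))
# prints: from wsl_core.domain import taxonomy
```

A regex pass like this is deliberately conservative; running the full test suite afterwards, as the table suggests, remains the real safety net for dynamic imports and string-based module references.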
🤔 Discussion Points
- Timing: When is the best time to do this migration?
- Testing: Do we have sufficient test coverage to catch breakages?
- CI/CD: What pipeline changes are needed?
- Documentation: What documentation needs to be updated?
- Dependencies: Are there external dependencies on current structure?
- Kotaemon: Should it remain as a git submodule or be integrated differently?
📝 Open Questions
- Do we want to publish any libraries to PyPI eventually?
- Do we need separate CI/CD pipelines per service?
- Should we version libraries independently or keep monorepo versioning?
- Are there any external tools/scripts that depend on current paths?
🗳️ Voting Options
Please react to indicate your preference:
- 👍 Approve - Let's do this restructure
- 👀 Review needed - I have questions/concerns
- 🔄 Iterate - Good idea but needs changes
- ❌ Reject - Not worth the effort