🐛(ingestion-infrastructure) rename columns for ConcoursCleaner #511

Merged
Atheane merged 4 commits into main from fix/concours-cleaner-column-rename
May 15, 2026

@AntoineAugusti
Member

📝 Description

Fixes #510

Rename columns when loading CONCOURS as column names have changed.

🏷️ Type of change

  • 🐛 Bug fix
  • 🎢 New feature (non-breaking change that adds functionality)
  • 🥁 Breaking change (a modification or feature that would cause existing functionality to stop working as expected) requiring a documentation update
  • 📚 Documentation update
  • ♻️ Refactoring
  • 🔧 Technical change

🔧 Changes

List the main changes

🛸 Required dependencies for this change (if applicable)

🏝️ How to test (if applicable)

Steps to reproduce or test

📸 Screenshots (if applicable)

✅ Checklist

  • 💅 I have added or updated the appropriate tests.
  • 📝 I have updated or added the necessary documentation.
  • 🚀 I have considered the impact on performance, security, and user experience.
  • 👀 I have requested a review from a team member.

@AntoineAugusti AntoineAugusti requested a review from Atheane May 12, 2026 09:08
@AntoineAugusti AntoineAugusti added the bugfix PR: scoped bug fix label May 12, 2026
@AntoineAugusti
Member Author

./bin/manage clean_documents --type CONCOURS
direnv: loading ~/Documents/csplab/src/tycho/.envrc
INFO 2026-05-12 09:08:53,044 [ingestion] Enqueuing clean task for CONCOURS...
INFO 2026-05-12 09:08:53,044 [ingestion] Starting cleaning 1000 document type: CONCOURS
INFO 2026-05-12 09:08:53,881 [ingestion] ✅ Clean completed: 579/4832 documents of type CONCOURS cleaned
WARNING 2026-05-12 09:08:53,881 [ingestion] ⚠️ 4253 errors occurred
INFO 2026-05-12 09:08:53,881 [ingestion] ✅ Task enqueued successfully.

The behavior is correct. The 4253 "errors" aren't real failures. Most are rows that the cleaner intentionally filtered out:

| Filter | Rows dropped |
|---|---|
| Statut != VALIDE | 79 |
| Année de référence ≤ 2024 | 3,985 |
| Ministère des Armées | 57 |

After those filters, 711 rows remain (4,832 − 4,121). Deduplication (grouping by concours_id) and the removal of any rows with a null category bring this down to the 579 cleaned entities.
The "error" count in the use case is just len(raw_documents) - len(cleaned_entities), which conflates filtered rows with real failures.
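To make the arithmetic above explicit, here is a minimal sketch of a report that keeps intentionally skipped rows apart from real failures. `CleanReport` and its field names are hypothetical and are not part of the codebase. The counts are the ones from the run above, and `failed=0` is an assumption, since no real failure was identified in this run.

```python
from dataclasses import dataclass, field


@dataclass
class CleanReport:
    """Hypothetical report separating skipped rows from real failures."""

    total: int  # raw documents loaded
    cleaned: int  # entities produced by the cleaner
    skipped: dict[str, int] = field(default_factory=dict)  # filter -> rows dropped
    failed: int = 0  # rows that actually raised while being cleaned

    @property
    def naive_errors(self) -> int:
        # What the current log line reports: everything that is not a cleaned entity.
        return self.total - self.cleaned

    @property
    def after_filters(self) -> int:
        # Rows still present once the intentional filters have been applied.
        return self.total - sum(self.skipped.values())

    @property
    def merged_or_dropped(self) -> int:
        # Rows absorbed by deduplication or dropped for a null category.
        return self.after_filters - self.failed - self.cleaned


report = CleanReport(
    total=4832,
    cleaned=579,
    skipped={
        "Statut != VALIDE": 79,
        "Année de référence ≤ 2024": 3985,
        "Ministère des Armées": 57,
    },
)

print(report.naive_errors)       # 4253, the figure shown in the warning
print(report.after_filters)      # 711
print(report.merged_or_dropped)  # 132
```

With a split like this, the warning could report only `failed`, and the filter counts could be logged at INFO level.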

The column name update is working correctly — the DB already stored raw_data with French column names, and the cleaner now reads them properly.
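As an illustration of the kind of rename involved, here is a minimal sketch that maps French source column names to internal field names. The mapping, the internal names (`status`, `reference_year`) and the `rename_columns` helper are assumptions for the example. The actual mapping lives in concours_cleaner.py and may differ.

```python
# Hypothetical mapping from French source columns to internal field names.
COLUMN_RENAMES = {
    "Statut": "status",
    "Année de référence": "reference_year",
}


def rename_columns(raw_row: dict, renames: dict = COLUMN_RENAMES) -> dict:
    """Return a copy of raw_row with known columns renamed; others pass through."""
    return {renames.get(key, key): value for key, value in raw_row.items()}


# Example raw_data row as it might be stored in the DB (values are made up).
row = {"Statut": "VALIDE", "Année de référence": 2025, "concours_id": "C-1"}
renamed = rename_columns(row)
print(renamed)
```

Unknown columns pass through unchanged, so a future upstream rename shows up as a missing field downstream rather than as silently dropped data.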

@AntoineAugusti AntoineAugusti self-assigned this May 12, 2026
Comment thread src/tycho/infrastructure/gateways/ingestion/concours_cleaner.py Outdated
@AntoineAugusti AntoineAugusti force-pushed the fix/concours-cleaner-column-rename branch from c996a9c to 66e7bc2 Compare May 13, 2026 13:35
@Atheane Atheane merged commit 1cbd37f into main May 15, 2026
13 checks passed
@Atheane Atheane deleted the fix/concours-cleaner-column-rename branch May 15, 2026 09:09
This was referenced May 15, 2026

Development

Successfully merging this pull request may close these issues.

[bug] clean_documents --type CONCOURS raises an error

3 participants