Feature/back filler #136
Merged
Pull request overview
Introduces a new "backfilling" Argo workflow to the welearn-datastack Helm chart and Python codebase. It adds two new node entry points (batch generation + the per-batch backfiller) backed by SQL files on disk, a small query_utils module to load and parameter-validate those SQL files, the first concrete backfill query (populating welearn_document.doi from JSON details), plus Helm templates (WorkflowTemplate, ConfigMap, semaphore) and a welearn-database bump to ^1.4.4.
Changes:
- New `BackFiller` workflow nodes (`generate_to_backfill_batch.py`, `backfilling.py`) plus two SQL files for the DOI backfill, wired to a new `workflow-template-backfilling.yaml` (with semaphore + config).
- New `modules/query_utils.py` (`resolve_query`, `resolve_batched_query`, `resolve_query_on_given_ids`) and `validate_sql_query_param` helper in `validation.py`, with a unit test suite in `tests/test_query_utils.py`.
- Sets a Postgres `application_name` via `connect_args` in `create_sqlalchemy_engine`, and bumps `welearn-database` (and transitively `python-dotenv`) in `pyproject.toml` / `poetry.lock`.
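As a rough illustration of the "load a SQL file, then check its bind parameters" idea behind `query_utils.py`, here is a minimal sketch. The function names `check_param_in_query` and `load_query`, their signatures, and the `:name` placeholder style are assumptions made for this example; the actual `resolve_query` and `validate_sql_query_param` implementations are not shown on this page and may differ.

```python
from pathlib import Path
from typing import Sequence


def check_param_in_query(query: str, param_name: str) -> None:
    """Hypothetical substring check: fail if ':param_name' is absent from the SQL text."""
    placeholder = f":{param_name}"
    if placeholder not in query:
        raise ValueError(f"Query is missing bind parameter {placeholder!r}")


def load_query(sql_path: Path, required_params: Sequence[str] = ()) -> str:
    """Read a SQL file from disk and verify that every required placeholder is present."""
    query = sql_path.read_text(encoding="utf-8")
    for name in required_params:
        check_param_in_query(query, name)
    return query
```

A substring check like this only confirms that a placeholder appears somewhere in the file. It does not parse the SQL, so a placeholder inside a comment or string literal would also pass.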
Reviewed changes
Copilot reviewed 13 out of 15 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| welearn_datastack/utils_/database_utils.py | Adds get_main_script_name() and passes it as PG application_name. |
| welearn_datastack/nodes_workflow/BackFiller/generate_to_backfill_batch.py | New node: generates batch CSVs from a SQL query. |
| welearn_datastack/nodes_workflow/BackFiller/backfilling.py | New node: reads batch CSV, runs an ID-scoped backfill SQL. |
| welearn_datastack/nodes_workflow/BackFiller/batch_generator_queries/document_with_doi_column_null.sql | Selects docs whose DOI needs backfilling, gated by alembic revision. |
| welearn_datastack/nodes_workflow/BackFiller/back_filling_queries/update_doi_from_details.sql | UPDATE that extracts DOI from details, records status in tmp_document_doi_status. |
| `welearn_datastack/nodes_workflow/BackFiller/__init__.py` | New empty package marker. |
| welearn_datastack/modules/query_utils.py | New SQL-file resolver with bind-param helpers. |
| welearn_datastack/modules/validation.py | Adds validate_sql_query_param substring check. |
| tests/test_query_utils.py | Unit tests for the new query utilities. |
| pyproject.toml | Removes leftover merge conflict markers; bumps welearn-database to ^1.4.4. |
| poetry.lock | Locks welearn-database 1.4.4 / python-dotenv >=1.2.2. |
| k8s/welearn-datastack/values.yaml | Adds backfilling block (config, semaphore, resource sizing). |
| k8s/welearn-datastack/templates/backfilling/workflow-template-backfilling.yaml | New WorkflowTemplate with prepare/run/all steps. |
| k8s/welearn-datastack/templates/backfilling/semaphore.yaml | ConfigMap exposing the backfilling semaphore tokens. |
| k8s/welearn-datastack/templates/backfilling/config.yaml | Component config/secret resources via common helper. |
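The `application_name` change in `database_utils.py` can be pictured roughly as follows. This is a sketch under assumptions: `script_name_from_path`, `build_connect_args`, and `make_engine` are hypothetical names, and deriving the name from `sys.argv[0]` is a guess at what `get_main_script_name()` does. What is standard is that SQLAlchemy's `create_engine` accepts a `connect_args` dict and forwards it to the driver, and that `application_name` is a libpq connection parameter.

```python
import sys
from pathlib import Path
from typing import Dict, Optional


def script_name_from_path(path: str) -> str:
    """Hypothetical stand-in for get_main_script_name(): file stem of the entry script."""
    return Path(path).stem or "unknown"


def build_connect_args(script_path: Optional[str] = None) -> Dict[str, str]:
    """Build driver arguments that tag the Postgres session with the script name."""
    path = sys.argv[0] if script_path is None else script_path
    return {"application_name": script_name_from_path(path)}


def make_engine(url: str):
    """Create an engine whose connections report the calling script to Postgres."""
    # Imported lazily so the helpers above stay usable without SQLAlchemy installed.
    from sqlalchemy import create_engine

    return create_engine(url, connect_args=build_connect_args())
```

With a name set this way, sessions opened by a given node show up under that name in `pg_stat_activity`, which makes it easier to tell which workflow step owns a long-running query.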
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
sandragjacinto approved these changes on Jun 2, 2026
This pull request introduces a new "backfilling" workflow to the `welearn-datastack` stack, including configuration, workflow templates, and supporting utilities for SQL query resolution and validation. It also adds tests for the new query utilities and updates dependencies. The changes are grouped as follows:

Backfilling Workflow Integration:
- Adds a new WorkflowTemplate (`workflow-template-backfilling.yaml`) for backfilling, including steps for batch preparation and execution, resource management, and semaphore-based synchronization.
- Adds a backfilling block to `values.yaml` and supporting Helm templates for config and semaphore management.

Query Utility Enhancements:
- Adds `query_utils.py` with functions to resolve SQL queries from files, including parameter validation and support for batching and ID lists.
- Adds the `validate_sql_query_param` check to `validation.py`.

Testing:
- Adds unit tests for the new query utilities in `tests/test_query_utils.py`.

Backfilling SQL Query:
- Adds the first backfill query, which populates `welearn_document.doi` from the JSON details.

Dependency Updates:
- Bumps the `welearn-database` dependency to version `^1.4.4` and resolves a minor version conflict for `python-dotenv`.

Together, these changes add a configurable, tested backfilling workflow to the data stack.
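The batch preparation and execution steps mentioned above could look roughly like the sketch below: one step splits a list of document IDs into CSV batch files, and a later step reads one file back so an ID-scoped query can run on it. The helper names (`chunked`, `write_batches`, `read_batch`), the one-ID-per-row CSV layout, and the `batch_<n>.csv` file naming are all assumptions for illustration, not the actual node code.

```python
import csv
from pathlib import Path
from typing import Iterable, Iterator, List


def chunked(ids: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most batch_size IDs."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[str] = []
    for doc_id in ids:
        batch.append(doc_id)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # trailing partial batch
        yield batch


def write_batches(ids: Iterable[str], batch_size: int, out_dir: Path) -> List[Path]:
    """Write each batch to its own CSV file (one ID per row) and return the paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths: List[Path] = []
    for index, batch in enumerate(chunked(ids, batch_size)):
        path = out_dir / f"batch_{index}.csv"
        with path.open("w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows([doc_id] for doc_id in batch)
        paths.append(path)
    return paths


def read_batch(path: Path) -> List[str]:
    """Read the IDs back from a batch CSV, skipping empty rows."""
    with path.open(newline="", encoding="utf-8") as handle:
        return [row[0] for row in csv.reader(handle) if row]
```

Keeping each batch in its own file lets the workflow fan out one execution step per file, with the semaphore limiting how many of those steps run at once.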