Feature/back filler #136

Merged: lpi-tn merged 41 commits into main from Feature/BackFiller on Jun 3, 2026

Conversation

lpi-tn (Collaborator) commented May 28, 2026

This pull request introduces a new "backfilling" workflow to the welearn-datastack stack, including configuration, workflow templates, and supporting utilities for SQL query resolution and validation. It also adds comprehensive tests for the new query utilities and updates dependencies. The changes are grouped as follows:

Backfilling Workflow Integration:

  • Adds a complete Argo WorkflowTemplate (workflow-template-backfilling.yaml) for backfilling, including steps for batch preparation and execution, resource management, and semaphore-based synchronization.
  • Introduces configuration for backfilling in values.yaml and supporting Helm templates for config and semaphore management.

Query Utility Enhancements:

  • Implements query_utils.py with functions to resolve SQL queries from files, including parameter validation and support for batching and ID lists.
  • Adds a helper function to validate SQL query parameters in validation.py.
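The query-resolution idea described above can be sketched roughly as follows. This is not the code from query_utils.py or validation.py: the function names (`find_bind_params`, `check_params`, `load_query`), the `:name` placeholder convention, and the choice of `ValueError` are all assumptions made for illustration.

```python
import re
from pathlib import Path

# Matches ':name' bind parameters but skips PostgreSQL '::type' casts.
# Simplification: colons inside string literals or comments are not excluded.
_BIND_PARAM = re.compile(r"(?<!:):([A-Za-z_][A-Za-z0-9_]*)")


def find_bind_params(sql: str) -> set[str]:
    """Return the names of all ':name' placeholders found in the SQL text."""
    return set(_BIND_PARAM.findall(sql))


def check_params(sql: str, params: dict[str, object]) -> None:
    """Raise ValueError if supplied params do not match the placeholders."""
    expected = find_bind_params(sql)
    supplied = set(params)
    missing = expected - supplied
    unexpected = supplied - expected
    if missing or unexpected:
        raise ValueError(
            f"missing={sorted(missing)}, unexpected={sorted(unexpected)}"
        )


def load_query(path: Path, params: dict[str, object]) -> tuple[str, dict[str, object]]:
    """Read a SQL file (hypothetical layout) and validate its parameters."""
    sql = path.read_text(encoding="utf-8")
    check_params(sql, params)
    return sql, params
```

Validating before execution means a mistyped parameter name fails fast in the workflow node instead of surfacing as a database error partway through a batch.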

Testing:

  • Adds a comprehensive test suite for the new query utility functions, covering correct query resolution, parameter validation, and error handling.

Backfilling SQL Query:

  • Adds a new SQL script to update DOIs from document details, handling duplicates and tracking status.
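To make the "handling duplicates and tracking status" behaviour more concrete, here is a pure-Python sketch of one way such a backfill could be planned. The real logic lives in the SQL script and may differ: the `"doi"` key inside `details`, the lowercase normalisation, and the three status labels are assumptions, and `plan_doi_backfill` is a hypothetical name.

```python
def plan_doi_backfill(rows, existing_dois):
    """Decide, per document, whether a DOI can be written.

    rows: iterable of (doc_id, details_dict) pairs.
    existing_dois: DOIs already present in the table.
    Returns {doc_id: (status, doi_or_None)} where status is one of
    "updated", "duplicate", or "missing" (labels are illustrative).
    """
    seen = {d.strip().lower() for d in existing_dois}
    plan = {}
    for doc_id, details in rows:
        raw = (details or {}).get("doi")  # assumed key name
        if not isinstance(raw, str) or not raw.strip():
            plan[doc_id] = ("missing", None)
            continue
        doi = raw.strip().lower()
        if doi in seen:
            # Another document already owns this DOI, so skip the write.
            plan[doc_id] = ("duplicate", doi)
            continue
        seen.add(doi)
        plan[doc_id] = ("updated", doi)
    return plan
```

Recording a status for every document, including the skipped ones, is what lets a later run distinguish "not yet processed" from "processed but nothing to write".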

Dependency Updates:

  • Updates welearn-database dependency to version ^1.4.4 and resolves a minor version conflict for python-dotenv.

These changes collectively enable robust, configurable, and tested backfilling workflows in the data stack.

lpi-tn added 30 commits May 6, 2026 16:12

Copilot AI (Contributor) left a comment

Pull request overview

Introduces a new "backfilling" Argo workflow to the welearn-datastack Helm chart and Python codebase. It adds two new node entry points (batch generation + the per-batch backfiller) backed by SQL files on disk, a small query_utils module to load and parameter-validate those SQL files, the first concrete backfill query (populating welearn_document.doi from JSON details), plus Helm templates (WorkflowTemplate, ConfigMap, semaphore) and a welearn-database bump to ^1.4.4.

Changes:

  • New BackFiller workflow nodes (generate_to_backfill_batch.py, backfilling.py) plus two SQL files for the DOI backfill, wired to a new workflow-template-backfilling.yaml (with semaphore + config).
  • New modules/query_utils.py (resolve_query, resolve_batched_query, resolve_query_on_given_ids) and validate_sql_query_param helper in validation.py, with a unit test suite in tests/test_query_utils.py.
  • Sets a Postgres application_name via connect_args in create_sqlalchemy_engine, and bumps welearn-database (and transitively python-dotenv) in pyproject.toml / poetry.lock.
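The application_name change in the last bullet could look something like the sketch below. It only builds the `connect_args` dictionary; the helper names `script_name_for_pg` and `build_connect_args` are hypothetical, and the PR's own `get_main_script_name()` may derive the name differently.

```python
import sys
from pathlib import Path


def script_name_for_pg(argv0=None, default="unknown-script"):
    """Derive a short label from the entry-point script path.

    Assumption: the file stem of sys.argv[0] is a good enough identifier.
    """
    raw = sys.argv[0] if argv0 is None else argv0
    name = Path(raw).stem or default
    # PostgreSQL truncates application_name to 63 bytes in a standard
    # build, so trimming here keeps the value predictable for ASCII names.
    return name[:63]


def build_connect_args(argv0=None):
    """Return a dict suitable for SQLAlchemy's create_engine(connect_args=...)."""
    return {"application_name": script_name_for_pg(argv0)}
```

With a value like this set, each workflow node shows up under its own name in `pg_stat_activity`, which makes it easier to tell which backfill step holds a given connection.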

Reviewed changes

Copilot reviewed 13 out of 15 changed files in this pull request and generated 15 comments.

| File | Description |
| --- | --- |
| welearn_datastack/utils_/database_utils.py | Adds get_main_script_name() and passes it as PG application_name. |
| welearn_datastack/nodes_workflow/BackFiller/generate_to_backfill_batch.py | New node: generates batch CSVs from a SQL query. |
| welearn_datastack/nodes_workflow/BackFiller/backfilling.py | New node: reads batch CSV, runs an ID-scoped backfill SQL. |
| welearn_datastack/nodes_workflow/BackFiller/batch_generator_queries/document_with_doi_column_null.sql | Selects docs whose DOI needs backfilling, gated by alembic revision. |
| welearn_datastack/nodes_workflow/BackFiller/back_filling_queries/update_doi_from_details.sql | UPDATE that extracts DOI from details, records status in tmp_document_doi_status. |
| welearn_datastack/nodes_workflow/BackFiller/__init__.py | New empty package marker. |
| welearn_datastack/modules/query_utils.py | New SQL-file resolver with bind-param helpers. |
| welearn_datastack/modules/validation.py | Adds validate_sql_query_param substring check. |
| tests/test_query_utils.py | Unit tests for the new query utilities. |
| pyproject.toml | Removes leftover merge conflict markers; bumps welearn-database to ^1.4.4. |
| poetry.lock | Locks welearn-database 1.4.4 / python-dotenv >=1.2.2. |
| k8s/welearn-datastack/values.yaml | Adds backfilling block (config, semaphore, resource sizing). |
| k8s/welearn-datastack/templates/backfilling/workflow-template-backfilling.yaml | New WorkflowTemplate with prepare/run/all steps. |
| k8s/welearn-datastack/templates/backfilling/semaphore.yaml | ConfigMap exposing the backfilling semaphore tokens. |
| k8s/welearn-datastack/templates/backfilling/config.yaml | Component config/secret resources via common helper. |
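
The hand-off between the two BackFiller nodes in the file summary above (one writes batch CSVs, the other reads a single batch) might be sketched as follows. The file naming scheme, the single `id` column, and the function names `write_batches` and `read_batch` are assumptions for illustration, not the PR's actual format.

```python
import csv
from pathlib import Path


def write_batches(ids, out_dir: Path, batch_size: int) -> list[Path]:
    """Split a list of IDs into CSV files of at most batch_size rows each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for n, start in enumerate(range(0, len(ids), batch_size)):
        path = out_dir / f"batch_{n:05d}.csv"  # hypothetical naming scheme
        with path.open("w", newline="", encoding="utf-8") as fh:
            writer = csv.writer(fh)
            writer.writerow(["id"])  # header row
            writer.writerows([i] for i in ids[start:start + batch_size])
        paths.append(path)
    return paths


def read_batch(path: Path) -> list[str]:
    """Read the IDs back from one batch file, as the per-batch node would."""
    with path.open(newline="", encoding="utf-8") as fh:
        return [row["id"] for row in csv.DictReader(fh)]
```

Each batch file can then be handed to one workflow step, which keeps every UPDATE scoped to a bounded list of IDs.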

Comment thread welearn_datastack/nodes_workflow/BackFiller/backfilling.py
Comment thread welearn_datastack/nodes_workflow/BackFiller/backfilling.py
Comment thread welearn_datastack/modules/validation.py Outdated
Comment thread welearn_datastack/modules/query_utils.py
Comment thread welearn_datastack/modules/query_utils.py
Comment thread welearn_datastack/nodes_workflow/BackFiller/generate_to_backfill_batch.py Outdated
Comment thread welearn_datastack/nodes_workflow/BackFiller/backfilling.py
@lpi-tn lpi-tn merged commit 6873d84 into main Jun 3, 2026
7 checks passed
@lpi-tn lpi-tn deleted the Feature/BackFiller branch June 3, 2026 12:08