Upload final model training data#804

Merged
Damonamajor merged 57 commits into master from 798-upload-final-model-training-data
May 29, 2025

Conversation

Contributor

@Damonamajor Damonamajor commented May 5, 2025

Adds a dbt macro that creates `model.training_data`, which is unique by `run_id`, `card`, and `meta_card_doc_num`. We enforce this uniqueness with dbt tests.
The script needs to be run manually after each final model run, and it uploads via an incremental model, meaning data is only produced for new `run_id`s.
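The incremental behavior described here boils down to a set difference on `run_id`s: only runs not already present in the table get uploaded. A minimal standalone sketch of that rule (all names here are illustrative, not the actual pipeline code):

```python
def new_run_ids(final_run_ids, existing_run_ids):
    """Return the run_ids that still need uploading, preserving order."""
    existing = set(existing_run_ids)
    return [r for r in final_run_ids if r not in existing]


print(new_run_ids(["2025-01", "2025-02", "2025-03"], ["2025-01"]))
# → ['2025-02', '2025-03']
```

Rerunning with every run already in the table yields an empty list, so nothing is produced.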

@Damonamajor Damonamajor linked an issue May 5, 2025 that may be closed by this pull request
{% docs table_training_data %}

A table containing the training data from the final model runs. This is uploaded
manually at the end of modeling via the [`model-training_data.R`](https://github.com/ccao-data/data-architecture/tree/master/etl/scripts-ccao-data-warehouse-us-east-1/model/model-training_data.R) script.
Contributor Author

The link won't work until this is merged, I believe.

@Damonamajor Damonamajor marked this pull request as ready for review May 5, 2025 17:37
@Damonamajor Damonamajor requested a review from a team as a code owner May 5, 2025 17:37
Member

@jeancochrane jeancochrane left a comment


Looks great to me! Let's get @wrridgeway to review this before we merge, since it affects ETL scripts.

Member

@wrridgeway wrridgeway left a comment


Alright, I have one key concern about this. In the past, we have really only used our ETL scripts for two things: loading, cleaning, and transforming the data that becomes either the features for the model or open data. One of my core assumptions about this folder has been that this is all the scripts in it will do when we trigger them every year.

Perhaps it's time this `etl` folder becomes more flexible (with a folder-level README or something?), but I'm wondering whether this is where we want this script to live, or whether it should be integrated into the upload scripts in the modeling pipeline, where all the other model-relevant output is uploaded to AWS.

run_year <- format(Sys.Date(), "%Y")

# Connect to Athena
noctua_options(cache_size = 10)
Member


Let's scrub the cache syntax and use the unload option instead.

)

# Iterate through each run
for (i in seq_len(nrow(metadata))) {
Member


I would suggest using our usual

pwalk(metadata, \(...) {
  df <- tibble::tibble(...)

syntax here.

Damonamajor and others added 2 commits May 12, 2025 15:52
…_data.R

Co-authored-by: William Ridgeway <10358980+wrridgeway@users.noreply.github.com>
…_data.R

Co-authored-by: William Ridgeway <10358980+wrridgeway@users.noreply.github.com>
Damonamajor and others added 4 commits May 20, 2025 18:33
Co-authored-by: Jean Cochrane <jeancochrane@users.noreply.github.com>
Co-authored-by: Jean Cochrane <jeancochrane@users.noreply.github.com>
Co-authored-by: Jean Cochrane <jeancochrane@users.noreply.github.com>
@Damonamajor Damonamajor requested a review from jeancochrane May 22, 2025 14:31
Member

@jeancochrane jeancochrane left a comment


Finally done QCing this new model! With the changes I suggested below, everything works great.

I tested out the following types of runs:

  • Confirmed that running the model with no pre-existing data creates the table with all expected run IDs
  • Confirmed that deleting one run ID and rerunning the model with pre-existing data only creates data for the missing run ID
  • Confirmed that rerunning the model with all final run IDs changes nothing

I also confirmed that row counts and partitioning look correct in all three cases.

I'm excited to have our first incremental model; this is very cool!

Damonamajor and others added 7 commits May 22, 2025 15:06
Co-authored-by: Jean Cochrane <jeancochrane@users.noreply.github.com>
Co-authored-by: Jean Cochrane <jeancochrane@users.noreply.github.com>
Co-authored-by: Jean Cochrane <jeancochrane@users.noreply.github.com>
Co-authored-by: Jean Cochrane <jeancochrane@users.noreply.github.com>
@wrridgeway
Copy link
Member

This is rad! I'll leave the review to Jean; I just wanted to add some questions for my own edification.

Comment on lines +26 to +33
if dbt.is_incremental:
    # Anti-join out any run_ids already in the target
    existing = (
        session.table(f"{dbt.this.schema}.{dbt.this.identifier}")
        .select("run_id")
        .distinct()
    )
    metadata_df = metadata_df.join(existing, on="run_id", how="left_anti")
Member

Is the if statement just part of how these models are supposed to be built? Or do we expect it not to be true under some circumstances?

Member

@jeancochrane jeancochrane May 23, 2025


This block is a key part of what makes this an incremental dbt model. Basically, if the model is configured as incremental (which we do in the `dbt.config()` call above), then this block executes on every run as long as the table already exists, filtering for only the new rows that aren't represented in the table yet. That means future runs of the model should never overwrite old data; instead, they will only write data that has not yet been written to the table.

Happy to talk through this in more detail if it's helpful! But I'd start with reading the dbt docs about incremental models I linked above, since I think those docs do a pretty solid job of explaining it.
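To make the mechanism concrete, here is a standalone pandas sketch of the same left anti-join filtering (the real model uses the dbt Python `session` API instead; the data values here are made up):

```python
import pandas as pd

# Candidate rows for this run, and run_ids already present in the target table
metadata_df = pd.DataFrame({"run_id": ["r1", "r2", "r3"], "value": [10, 20, 30]})
existing = pd.DataFrame({"run_id": ["r1"]})

# Left anti-join: keep only rows whose run_id is NOT already in the target
merged = metadata_df.merge(existing, on="run_id", how="left", indicator=True)
new_rows = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

print(new_rows["run_id"].tolist())  # → ['r2', 'r3']
```

If every `run_id` were already present, `new_rows` would be empty, which matches the QC observation that rerunning with all final run IDs changes nothing.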

@Damonamajor Damonamajor requested a review from jeancochrane May 28, 2025 16:15
combination_of_columns:
  - run_id
  - meta_sale_document_num
  - meta_card_num
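For context, this uniqueness constraint presumably lives in a schema file along these lines (a sketch using the `dbt_utils.unique_combination_of_columns` test; the model name and file layout are assumptions, not the repo's actual config):

```yaml
models:
  - name: training_data
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - run_id
            - meta_sale_document_num
            - meta_card_num
```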
Contributor Author

For better or worse, outliers and cards are trimmed in stage 1 of the pipeline.

Member

@jeancochrane jeancochrane left a comment


I think this is finally ready to go, pending two small tweaks! Thanks for your persistence while we polished this up.

Damonamajor and others added 2 commits May 28, 2025 14:09
Co-authored-by: Jean Cochrane <jeancochrane@users.noreply.github.com>
Co-authored-by: Jean Cochrane <jeancochrane@users.noreply.github.com>
@Damonamajor Damonamajor merged commit 63bd65b into master May 29, 2025
8 of 9 checks passed
@Damonamajor Damonamajor deleted the 798-upload-final-model-training-data branch May 29, 2025 15:49