Conversation
…o-data/data-architecture into 798-upload-final-model-training-data
dbt/models/model/docs.md (Outdated)

```
{% docs table_training_data %}

A table containing the training data from the final model runs. This is uploaded
manually at the end of modeling via the [`S3 model-training_data.R`](https://github.com/ccao-data/data-architecture/tree/master/etl/scripts-ccao-data-warehouse-us-east-1/model/model-training_data.R)
```
Link won't work until this is merged, I believe.
jeancochrane left a comment
Looks great to me! Let's get @wrridgeway to review this before we merge, since it affects ETL scripts.
Alright, I have one key concern about this: we have really only used our ETL scripts for two things in the past, loading/cleaning/transforming the data that becomes either the features for the model or open data. One of my core assumptions about this folder has been that this is all the scripts in it will do when we trigger them every year.
Perhaps it's time this etl folder becomes more flexible (with a folder-level README or something?), but I'm wondering whether this is where we want this script to live, or whether it should be integrated into the upload scripts in the modeling pipeline, where all the other model-relevant output is uploaded to AWS.
```r
run_year <- format(Sys.Date(), "%Y")

# Connect to Athena
noctua_options(cache_size = 10)
```
Let's scrub the cache syntax and use the unload option instead.
etl/scripts-ccao-data-warehouse-us-east-1/model/model-training_data.R (Outdated, resolved)
etl/scripts-ccao-data-warehouse-us-east-1/model/model-training_data.R (Outdated, resolved)
```r
)

# Iterate through each run
for (i in seq_len(nrow(metadata))) {
```
I would suggest using our usual

```r
pwalk(metadata, \(...) {
  df <- tibble::tibble(...)
})
```

syntax here.
…_data.R Co-authored-by: William Ridgeway <10358980+wrridgeway@users.noreply.github.com>
…_data.R Co-authored-by: William Ridgeway <10358980+wrridgeway@users.noreply.github.com>
Co-authored-by: Jean Cochrane <jeancochrane@users.noreply.github.com>
jeancochrane left a comment
Finally done QCing this new model! With the changes I suggested below, everything works great.
I tested out the following types of runs:
- Confirmed that running the model with no pre-existing data creates the table with all expected run IDs
- Confirmed that deleting one run ID and rerunning the model with pre-existing data only creates data for the missing run ID
- Confirmed that rerunning the model with all final run IDs changes nothing
I also confirmed that row counts and partitioning look correct in all three cases.
I'm excited to have our first incremental model, this is very cool!
Co-authored-by: Jean Cochrane <jeancochrane@users.noreply.github.com>
This is rad, and I'll leave the review to Jean; I just wanted to add some questions for my own edification.
```python
if dbt.is_incremental:
    # anti-join out any run_ids already in the target
    existing = (
        session.table(f"{dbt.this.schema}.{dbt.this.identifier}")
        .select("run_id")
        .distinct()
    )
    metadata_df = metadata_df.join(existing, on="run_id", how="left_anti")
```
Is the if statement just part of how these models are supposed to be built? Or do we expect it to not be true under some circumstance?
This block is a key part of what makes this an incremental dbt model. Basically, if the model is configured as incremental (which we do in the dbt.config() call above), then this block executes on every run as long as the table already exists, filtering for only the new rows that aren't yet represented in the table. That means future runs of the model should never overwrite old data; instead, they will only write data that has not yet been written to the table.
Happy to talk through this in more detail if it's helpful! But I'd start with reading the dbt docs about incremental models I linked above, since I think those docs do a pretty solid job of explaining it.
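The anti-join above can also be sketched outside the dbt runtime. Here is a minimal pandas illustration of the same left-anti semantics (the run IDs and row counts are made up; pandas has no `how="left_anti"`, so it is emulated with an indicator merge, unlike the Spark-style `.join(..., how="left_anti")` in the model itself):

```python
import pandas as pd

# Hypothetical state: two run_ids already exist in the target table.
existing = pd.DataFrame({"run_id": ["2024-01", "2024-02"]})

# Hypothetical metadata for the current run, including one new run_id.
metadata = pd.DataFrame(
    {"run_id": ["2024-01", "2024-02", "2024-03"], "rows": [100, 120, 90]}
)

# Emulate a left-anti join: keep only metadata rows whose run_id is
# absent from the target, so re-runs never rewrite existing partitions.
merged = metadata.merge(existing, on="run_id", how="left", indicator=True)
new_runs = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

print(new_runs["run_id"].tolist())  # only the run_id not yet in the table
```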
```yaml
combination_of_columns:
  - run_id
  - meta_sale_document_num
  - meta_card_num
```
For better or worse, outliers and cards are trimmed in stage 1 of the pipeline.
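For context, a `combination_of_columns` key like the one above typically sits under a generic uniqueness test in a model's schema file. A sketch, assuming the project uses the `dbt_utils.unique_combination_of_columns` test (the model name and file layout here are illustrative, not the repo's actual schema):

```yaml
models:
  - name: training_data
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - run_id
            - meta_sale_document_num
            - meta_card_num
```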
jeancochrane left a comment
I think this is finally ready to go, pending two small tweaks! Thanks for your persistence while we polished this up.
Co-authored-by: Jean Cochrane <jeancochrane@users.noreply.github.com>
Adds a dbt model which creates model.training_data, unique by run_id, meta_sale_document_num, and meta_card_num. We ensure this uniqueness with dbt tests.
The model needs to be run manually following each final model run, and it uploads incrementally, meaning data is only produced for new run_ids.