Conversation
…o-data/data-architecture into 798-upload-final-model-training-data
dbt/models/model/docs.md (Outdated)

```
{% docs table_training_data %}

A table containing the training data from the final model runs. This is uploaded
manually at the end of modeling via the [`S3 model-training_data.R`](https://github.com/ccao-data/data-architecture/tree/master/etl/scripts-ccao-data-warehouse-us-east-1/model/model-training_data.R)
```
Link won't work until this is merged, I believe.
jeancochrane left a comment
Looks great to me! Let's get @wrridgeway to review this before we merge, since it affects ETL scripts.
Alright, I have one key concern about this: we have really only used our ETL scripts for two things in the past, loading/cleaning/transforming the data that becomes either the features for the model or open data. One of my core assumptions about this folder has been that this is all the scripts in it will do when we trigger them every year.
Perhaps it's time this etl folder becomes more flexible (with a folder-level README or something?), but I'm wondering whether this is where we want this script to live, or whether it should be integrated into the upload scripts in the modeling pipeline, where all the other model-relevant output is uploaded to AWS.
```r
run_year <- format(Sys.Date(), "%Y")

# Connect to Athena
noctua_options(cache_size = 10)
```
Let's scrub the cache syntax and use the unload option instead.
etl/scripts-ccao-data-warehouse-us-east-1/model/model-training_data.R (Outdated, resolved)
etl/scripts-ccao-data-warehouse-us-east-1/model/model-training_data.R (Outdated, resolved)
```r
)

# Iterate through each run
for (i in seq_len(nrow(metadata))) {
```
I would suggest using our usual

```r
pwalk(metadata, \(...) {
  df <- tibble::tibble(...)
})
```

syntax here.
…_data.R Co-authored-by: William Ridgeway <10358980+wrridgeway@users.noreply.github.com>
…_data.R Co-authored-by: William Ridgeway <10358980+wrridgeway@users.noreply.github.com>
Co-authored-by: Jean Cochrane <jeancochrane@users.noreply.github.com>
jeancochrane left a comment
Finally done QCing this new model! With the changes I suggested below, everything works great.
I tested out the following types of runs:
- Confirmed that running the model with no pre-existing data creates the table with all expected run IDs
- Confirmed that deleting one run ID and rerunning the model with pre-existing data only creates data for the missing run ID
- Confirmed that rerunning the model with all final run IDs changes nothing
I also confirmed that row counts and partitioning look correct in all three cases.
I'm excited to have our first incremental model, this is very cool!
Co-authored-by: Jean Cochrane <jeancochrane@users.noreply.github.com>
This is rad, and I'll leave the review to Jean; I just wanted to add some questions for my own edification.
```python
if dbt.is_incremental:
    # anti-join out any run_ids already in the target
    existing = (
        session.table(f"{dbt.this.schema}.{dbt.this.identifier}")
        .select("run_id")
        .distinct()
    )
    metadata_df = metadata_df.join(existing, on="run_id", how="left_anti")
```
Is the if statement just part of how these models are supposed to be built? Or do we expect it to not be true under some circumstance?
This block is a key part of what makes this an incremental dbt model. Basically, if the model is configured as incremental (which we do in the dbt.config() call above), then this block executes on every run as long as the table already exists, filtering for only the new rows that aren't yet represented in the table. That means future runs of the model should never overwrite old data; instead, they will only write data that has not yet been written to the table.
Happy to talk through this in more detail if it's helpful! But I'd start with reading the dbt docs about incremental models I linked above, since I think those docs do a pretty solid job of explaining it.
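The anti-join above can also be sketched outside the dbt runtime. Here is a minimal pandas illustration of the same left-anti semantics (the run IDs and row counts are made up; pandas has no `how="left_anti"`, so it is emulated with an indicator merge, unlike the Spark-style `.join(..., how="left_anti")` in the model itself):

```python
import pandas as pd

# Hypothetical state: two run_ids already exist in the target table.
existing = pd.DataFrame({"run_id": ["2024-01", "2024-02"]})

# Hypothetical metadata for the current run, including one new run_id.
metadata = pd.DataFrame(
    {"run_id": ["2024-01", "2024-02", "2024-03"], "rows": [100, 120, 90]}
)

# Emulate a left-anti join: keep only metadata rows whose run_id is
# absent from the target, so re-runs never rewrite existing partitions.
merged = metadata.merge(existing, on="run_id", how="left", indicator=True)
new_runs = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

print(new_runs["run_id"].tolist())  # only the run_id not yet in the table
```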
```yaml
combination_of_columns:
  - run_id
  - meta_sale_document_num
  - meta_card_num
```
For better or worse, outliers and cards are trimmed in stage 1 of the pipeline.
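For context, a `combination_of_columns` key like the one above typically sits under a generic uniqueness test in a model's schema file. A sketch, assuming the project uses the `dbt_utils.unique_combination_of_columns` test (the model name and file layout here are illustrative, not the repo's actual schema):

```yaml
models:
  - name: training_data
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - run_id
            - meta_sale_document_num
            - meta_card_num
```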
jeancochrane left a comment
I think this is finally ready to go, pending two small tweaks! Thanks for your persistence while we polished this up.
Co-authored-by: Jean Cochrane <jeancochrane@users.noreply.github.com>
Adds a dbt model which creates model.training_data, unique by run_id, meta_sale_document_num, and meta_card_num. We ensure this uniqueness with dbt tests.
The model needs to be run manually following each final model run, and it uploads incrementally, meaning data is only produced for new run_ids.