Merged (57 commits)
- `b019ad3` land_nbhd_rate_unique_by_town_nbhd_class_and_year (Damonamajor, Mar 25, 2025)
- `d5b50b5` Initial draft (Damonamajor, May 1, 2025)
- `43b601d` update docs (Damonamajor, May 2, 2025)
- `9bc6b2f` Merge branch 'master' of github.com:ccao-data/data-architecture (Damonamajor, May 2, 2025)
- `fe8f802` Add run_id to query (Damonamajor, May 5, 2025)
- `93d9cce` add table_ (Damonamajor, May 5, 2025)
- `dd494bf` Switch to training data (Damonamajor, May 5, 2025)
- `1b2a69f` Update schema.yml (Damonamajor, May 5, 2025)
- `ac554d2` Update model-training_data.R (Damonamajor, May 5, 2025)
- `4248b1f` Initial draft (Damonamajor, May 1, 2025)
- `370cece` update docs (Damonamajor, May 2, 2025)
- `78f1781` Add run_id to query (Damonamajor, May 5, 2025)
- `c1d2f3a` add table_ (Damonamajor, May 5, 2025)
- `d9b3e3b` Switch to training data (Damonamajor, May 5, 2025)
- `3f5aa58` Update schema.yml (Damonamajor, May 5, 2025)
- `036ce36` Update model-training_data.R (Damonamajor, May 5, 2025)
- `880b08f` Merge branch '798-upload-final-model-training-data' of github.com:cca… (Damonamajor, May 5, 2025)
- `55ea677` Remove unintended commit (Damonamajor, May 5, 2025)
- `4505249` Correct unique columns (Damonamajor, May 5, 2025)
- `b942b36` Error if > 2 (Damonamajor, May 5, 2025)
- `e6bcfb2` Update etl/scripts-ccao-data-warehouse-us-east-1/model/model-training… (Damonamajor, May 12, 2025)
- `6d85b17` Update etl/scripts-ccao-data-warehouse-us-east-1/model/model-training… (Damonamajor, May 12, 2025)
- `ca631d9` updated py script (Damonamajor, May 14, 2025)
- `23b6b6d` Merge branch '798-upload-final-model-training-data' of github.com:cca… (Damonamajor, May 14, 2025)
- `c4c933e` styler (Damonamajor, May 14, 2025)
- `42eb7d6` test2 (Damonamajor, May 14, 2025)
- `91fbedc` Fix ref (Damonamajor, May 14, 2025)
- `172c0ce` Updated push (Damonamajor, May 14, 2025)
- `71c57c2` updates (Damonamajor, May 15, 2025)
- `9fe0d17` Another attempt (Damonamajor, May 15, 2025)
- `9692b61` Resolve SSL errors and use Spark DataFrames for `model.training_data`… (jeancochrane, May 19, 2025)
- `6e7c7da` Remove default file_format (Damonamajor, May 19, 2025)
- `bef997c` Add unique key (Damonamajor, May 19, 2025)
- `96fc296` possible functional version (Damonamajor, May 20, 2025)
- `1d9ec1c` Functional version (Damonamajor, May 20, 2025)
- `afe1b13` Remove old script (Damonamajor, May 20, 2025)
- `fb748a8` Remove unique_key (Damonamajor, May 20, 2025)
- `5e6e3fe` Commenting (Damonamajor, May 20, 2025)
- `a9cdffa` Update docs.md (Damonamajor, May 20, 2025)
- `795b64a` Update dbt/models/model/schema.yml (Damonamajor, May 20, 2025)
- `3cbb70a` Update dbt/models/model/model.training_data.py (Damonamajor, May 20, 2025)
- `f7314c8` Update dbt/models/model/model.training_data.py (Damonamajor, May 20, 2025)
- `4b34ef4` lintr (Damonamajor, May 21, 2025)
- `9858c2f` Update dbt/models/model/docs.md (Damonamajor, May 22, 2025)
- `483613a` Update dbt/models/model/model.training_data.py (Damonamajor, May 22, 2025)
- `1b17f6f` Update dbt/models/model/model.training_data.py (Damonamajor, May 22, 2025)
- `15f7af5` Update dbt/models/model/model.training_data.py (Damonamajor, May 22, 2025)
- `e622c1a` update to training_data (Damonamajor, May 22, 2025)
- `06e2e76` update to training_data (Damonamajor, May 22, 2025)
- `ac8d278` Update schema (Damonamajor, May 22, 2025)
- `0b290a5` update schema (Damonamajor, May 27, 2025)
- `654dc6f` push attempt (Damonamajor, May 28, 2025)
- `9f479d5` update model (Damonamajor, May 28, 2025)
- `374c9f5` update naming (Damonamajor, May 28, 2025)
- `27feb6e` update naming (Damonamajor, May 28, 2025)
- `460bb54` Update dbt/models/model/schema.yml (Damonamajor, May 28, 2025)
- `1dfe88b` Update dbt/models/model/schema.yml (Damonamajor, May 28, 2025)
dbt/models/model/docs.md (13 additions, 0 deletions)

@@ -173,6 +173,19 @@ Wall time of each stage (train, assess, etc.) for each model run (`run_id`).
**Primary Key**: `year`, `run_id`
{% enddocs %}

# training_data

{% docs table_training_data %}

A table containing the training data from the final model runs.

We update this table once per assessment year after choosing the final model
runs for the year. As such, only final model run IDs should be present in this
table.

**Primary Key**: `run_id`, `meta_card_num`, `meta_sale_document_num`
{% enddocs %}

# vw_card_res_input

{% docs view_vw_card_res_input %}
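The docs above pin down a three-column primary key for `training_data`. As an illustration (not part of this PR's diff), uniqueness of that key can be checked with a small pure-Python helper; the row format here, a list of dicts, is a hypothetical stand-in for however the table is loaded:

```python
from collections import Counter

# Documented primary key for model.training_data
PRIMARY_KEY = ("run_id", "meta_card_num", "meta_sale_document_num")


def pk_violations(rows, key=PRIMARY_KEY):
    """Return key tuples that appear more than once in `rows`.

    `rows` is any iterable of dict-like records; an empty result means
    the primary key holds.
    """
    counts = Counter(tuple(row[col] for col in key) for row in rows)
    return {combo: n for combo, n in counts.items() if n > 1}
```

An empty dict from `pk_violations` means the documented key is in fact unique, which is the same invariant the dbt test added in `schema.yml` enforces.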
dbt/models/model/model.training_data.py (80 additions, 0 deletions)

@@ -0,0 +1,80 @@
```python
import functools

from pyspark.sql.functions import lit


def model(dbt, session):
    dbt.config(
        materialized="incremental",
        incremental_strategy="insert_overwrite",
        partitioned_by=["assessment_year", "run_id", "meta_township_code"],
        on_schema_change="append_new_columns",
    )

    # Build the base metadata DataFrame
    base_query = """
        SELECT
            run_id,
            year,
            assessment_year,
            dvc_md5_training_data
        FROM model.metadata
        WHERE run_type = 'final'
    """
    metadata_df = session.sql(base_query)

    if dbt.is_incremental:
        # Anti-join out any run_ids already in the target
        existing = (
            session.table(f"{dbt.this.schema}.{dbt.this.identifier}")
            .select("run_id")
            .distinct()
        )
        metadata_df = metadata_df.join(existing, on="run_id", how="left_anti")
```
**Review comment on lines +26 to +33 (Member):** Is the `if` statement just part of how these models are supposed to be built? Or do we expect it not to be true under some circumstance?

**jeancochrane (Member, May 23, 2025):** This block is a key part of what makes this an incremental dbt model. If the model is configured as incremental (which we do in the `dbt.config()` call above), then this block executes on every run as long as the table already exists, filtering for only new rows that aren't yet represented in the table. That means future runs of the model should never overwrite old data; instead, they will only write data that has not yet been written to the table.

Happy to talk through this in more detail if it's helpful! But I'd start with the dbt docs on incremental models linked above, since they do a solid job of explaining it.

```python
        # If there's nothing new, return an *empty* DataFrame
        if metadata_df.limit(1).count() == 0:
            print(">>> no new run_id found; skipping incremental update")
            # This returns zero rows but preserves the full target schema
            return session.table(
                f"{dbt.this.schema}.{dbt.this.identifier}"
            ).limit(0)

    # Collect remaining metadata
    metadata = metadata_df.toPandas()

    bucket = "ccao-data-dvc-us-east-1"
    all_dfs = []

    for _, row in metadata.iterrows():
        run_id = row["run_id"]
        year = int(row["year"])
        h = row["dvc_md5_training_data"]

        prefix = "" if year <= 2023 else "files/md5/"
        key = f"{prefix}{h[:2]}/{h[2:]}"
        s3p = f"{bucket}/{key}"

        print(f">>> reading all columns for run {run_id!r}")
        print(f"    → S3 key = {s3p}")
        df = session.read.parquet(f"s3://{s3p}")

        # Coerce booleans for mismatched types
        if "ccao_is_active_exe_homeowner" in df.columns:
            df = df.withColumn(
                "ccao_is_active_exe_homeowner",
                df["ccao_is_active_exe_homeowner"].cast("boolean"),
            )

        # Add run_id and assessment_year columns
        df = df.withColumn("run_id", lit(run_id)).withColumn(
            "assessment_year", lit(row["assessment_year"])
        )

        all_dfs.append(df)
        print(f"Processed run_id={run_id}, rows={df.count()}")

    # Union all the new runs together
    return functools.reduce(
        lambda x, y: x.unionByName(y, allowMissingColumns=True), all_dfs
    )
```
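One subtle piece of the loop above is the S3 key construction, which follows DVC's content-addressed cache layout: the first two hex characters of the MD5 hash become a directory, and the `files/md5/` prefix applies only to newer cache layouts (years after 2023 in this code). Factored out as a standalone helper (a sketch mirroring the loop's logic, not code from the PR; the cutoff year is taken from the loop, not from DVC itself):

```python
def dvc_object_key(md5_hash: str, year: int) -> str:
    """Build the in-bucket key for a DVC-cached object.

    Mirrors the prefix logic in the model loop: hashes for years
    through 2023 live at the bucket root; later years nest under
    files/md5/. The first two hex characters form a subdirectory.
    """
    prefix = "" if year <= 2023 else "files/md5/"
    return f"{prefix}{md5_hash[:2]}/{md5_hash[2:]}"
```

For example, a 2024 hash starting with `ab` resolves under `files/md5/ab/`, while the same hash for 2023 resolves under `ab/` at the bucket root.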
dbt/models/model/schema.yml (13 additions, 0 deletions)

@@ -230,6 +230,19 @@ models:
```yaml
      description: |
        Any notes or caveats associated with the model run

  - name: model.training_data
    description: '{{ doc("table_training_data") }}'
    config:
      tags:
        - load_manual
    tests:
      - unique_combination_of_columns:
          name: model_training_data_unique_card_doc_number_run_id
          combination_of_columns:
            - run_id
            - meta_sale_document_num
            - meta_card_num
```
**Review comment (Contributor Author):** For better or worse, outliers and cards are trimmed in stage 1 of the pipeline.

```yaml
  - name: model.vw_pin_shared_input
    description: '{{ doc("view_vw_pin_shared_input") }}'
    config:
```