Merged

57 commits
b019ad3
land_nbhd_rate_unique_by_town_nbhd_class_and_year
Damonamajor Mar 25, 2025
d5b50b5
Initial draft
Damonamajor May 1, 2025
43b601d
update docs
Damonamajor May 2, 2025
9bc6b2f
Merge branch 'master' of github.com:ccao-data/data-architecture
Damonamajor May 2, 2025
fe8f802
Add run_id to query
Damonamajor May 5, 2025
93d9cce
add table_
Damonamajor May 5, 2025
dd494bf
Switch to training data
Damonamajor May 5, 2025
1b2a69f
Update schema.yml
Damonamajor May 5, 2025
ac554d2
Update model-training_data.R
Damonamajor May 5, 2025
4248b1f
Initial draft
Damonamajor May 1, 2025
370cece
update docs
Damonamajor May 2, 2025
78f1781
Add run_id to query
Damonamajor May 5, 2025
c1d2f3a
add table_
Damonamajor May 5, 2025
d9b3e3b
Switch to training data
Damonamajor May 5, 2025
3f5aa58
Update schema.yml
Damonamajor May 5, 2025
036ce36
Update model-training_data.R
Damonamajor May 5, 2025
880b08f
Merge branch '798-upload-final-model-training-data' of github.com:cca…
Damonamajor May 5, 2025
55ea677
Remove unintended commit
Damonamajor May 5, 2025
4505249
Correct unique columns
Damonamajor May 5, 2025
b942b36
Error if > 2
Damonamajor May 5, 2025
e6bcfb2
Update etl/scripts-ccao-data-warehouse-us-east-1/model/model-training…
Damonamajor May 12, 2025
6d85b17
Update etl/scripts-ccao-data-warehouse-us-east-1/model/model-training…
Damonamajor May 12, 2025
ca631d9
updated py script
Damonamajor May 14, 2025
23b6b6d
Merge branch '798-upload-final-model-training-data' of github.com:cca…
Damonamajor May 14, 2025
c4c933e
styler
Damonamajor May 14, 2025
42eb7d6
test2
Damonamajor May 14, 2025
91fbedc
Fix ref
Damonamajor May 14, 2025
172c0ce
Updated push
Damonamajor May 14, 2025
71c57c2
updates
Damonamajor May 15, 2025
9fe0d17
Another attempt
Damonamajor May 15, 2025
9692b61
Resolve SSL errors and use Spark DataFrames for `model.training_data`…
jeancochrane May 19, 2025
6e7c7da
Remove default file_format
Damonamajor May 19, 2025
bef997c
Add unique key
Damonamajor May 19, 2025
96fc296
possible functional version
Damonamajor May 20, 2025
1d9ec1c
Functional version
Damonamajor May 20, 2025
afe1b13
Remove old script
Damonamajor May 20, 2025
fb748a8
Remove unique_key
Damonamajor May 20, 2025
5e6e3fe
Commenting
Damonamajor May 20, 2025
a9cdffa
Update docs.md
Damonamajor May 20, 2025
795b64a
Update dbt/models/model/schema.yml
Damonamajor May 20, 2025
3cbb70a
Update dbt/models/model/model.training_data.py
Damonamajor May 20, 2025
f7314c8
Update dbt/models/model/model.training_data.py
Damonamajor May 20, 2025
4b34ef4
lintr
Damonamajor May 21, 2025
9858c2f
Update dbt/models/model/docs.md
Damonamajor May 22, 2025
483613a
Update dbt/models/model/model.training_data.py
Damonamajor May 22, 2025
1b17f6f
Update dbt/models/model/model.training_data.py
Damonamajor May 22, 2025
15f7af5
Update dbt/models/model/model.training_data.py
Damonamajor May 22, 2025
e622c1a
update to training_data
Damonamajor May 22, 2025
06e2e76
update to training_data
Damonamajor May 22, 2025
ac8d278
Update schema
Damonamajor May 22, 2025
0b290a5
update schema
Damonamajor May 27, 2025
654dc6f
push attempt
Damonamajor May 28, 2025
9f479d5
update model
Damonamajor May 28, 2025
374c9f5
update naming
Damonamajor May 28, 2025
27feb6e
update naming
Damonamajor May 28, 2025
460bb54
Update dbt/models/model/schema.yml
Damonamajor May 28, 2025
1dfe88b
Update dbt/models/model/schema.yml
Damonamajor May 28, 2025
11 changes: 11 additions & 0 deletions dbt/models/model/docs.md
@@ -173,6 +173,17 @@ Wall time of each stage (train, assess, etc.) for each model run (`run_id`).
**Primary Key**: `year`, `run_id`
{% enddocs %}

# training_data

{% docs table_training_data %}

A table containing the training data from the final model runs. This is uploaded
manually at the end of modeling via the [`S3 model-training_data.R`](https://github.com/ccao-data/data-architecture/tree/master/etl/scripts-ccao-data-warehouse-us-east-1/model/model-training_data.R)
script.

> Damonamajor (PR author): Link won't work until it's merged, I believe.

**Primary Key**: `run_id`, `meta_card_num`, `meta_pin`
{% enddocs %}

# vw_card_res_input

{% docs view_vw_card_res_input %}
14 changes: 14 additions & 0 deletions dbt/models/model/schema.yml
@@ -187,6 +187,20 @@ sources:
- year
- run_id

  - name: training_data
    description: '{{ doc("table_training_data") }}'
    tags:
      - load_auto
    data_tests:
      - unique_combination_of_columns:
          name: model_training_data_unique_run_id_pin_card_num
          combination_of_columns:
            - run_id
            - meta_pin
            - meta_card_num
          config:
            error_if: ">2"

models:
  - name: model.final_model
    description: '{{ doc("table_final_model") }}'
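For context, the `unique_combination_of_columns` test on `training_data` asserts that no (`run_id`, `meta_pin`, `meta_card_num`) tuple repeats, and `error_if: ">2"` means one or two failing combinations surface as warnings rather than errors. A minimal pandas sketch of what the test counts (illustrative only; dbt compiles this check to SQL, and the DataFrame below is made up):

```python
import pandas as pd

def count_duplicate_combinations(df: pd.DataFrame, cols: list[str]) -> int:
    """Count column-value combinations appearing more than once, mirroring
    what a unique_combination_of_columns test reports as failures."""
    counts = df.groupby(cols).size()
    return int((counts > 1).sum())

# Hypothetical training-data rows: one (run_id, meta_pin, meta_card_num)
# combination is duplicated
df = pd.DataFrame({
    "run_id": ["2025-01-01-abc", "2025-01-01-abc", "2025-01-01-abc"],
    "meta_pin": ["1001", "1001", "1002"],
    "meta_card_num": [1, 1, 1],
})

n_dupes = count_duplicate_combinations(df, ["run_id", "meta_pin", "meta_card_num"])
print(n_dupes)  # 1 duplicated combination
# With error_if: ">2", a failure count of 1 or 2 warns instead of erroring
```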
85 changes: 85 additions & 0 deletions etl/scripts-ccao-data-warehouse-us-east-1/model/model-training_data.R
@@ -0,0 +1,85 @@
# Load libraries ----
library(arrow)
library(DBI)
library(dplyr)
library(glue)
library(noctua)
library(purrr)
library(stringr)

# Once original data has been uploaded,
# we should only need to upload the current year data.
run_year <- format(Sys.Date(), "%Y")

# Connect to Athena
noctua_options(cache_size = 10)
> Member review comment: Let's scrub the cache syntax and use the unload option instead.

conn <- dbConnect(noctua::athena(), rstudio_conn_tab = FALSE)
AWS_S3_WAREHOUSE_BUCKET <- "s3://ccao-data-warehouse-us-east-1"
output_bucket <- file.path(AWS_S3_WAREHOUSE_BUCKET, "model", "training_data")

# Query final model metadata
metadata <- dbGetQuery(
  conn,
  glue_sql(
    "
    SELECT
      run_id,
      year,
      dvc_md5_assessment_data,
      model_predictor_all_name
    FROM model.metadata
    WHERE run_type = 'final'
      AND year IN ({run_year*})
    ",
    .con = conn
  )
)

# Iterate through each run
for (i in seq_len(nrow(metadata))) {
> Member review comment: I would suggest using our usual
>
>     pwalk(metadata, \(...) {
>       df <- tibble::tibble(...)
>
> syntax here.

  run_id <- metadata$run_id[i]
  year <- metadata$year[i]
  dvc_hash <- metadata$dvc_md5_assessment_data[i]
  predictors_raw <- metadata$model_predictor_all_name[i]

  # Clean predictor names
  predictor_vars <- predictors_raw %>%
    str_remove_all("^\\[|\\]$") %>%
    str_split(",") %>%
    unlist() %>%
    trimws()

  # Build the DVC path, which depends on year (the path layout changed in 2023)
  dvc_path <- if (as.integer(year) <= 2023) {
    glue("s3://ccao-data-dvc-us-east-1/{substr(dvc_hash, 1, 2)}/{substr(dvc_hash, 3, 32)}") # nolint: line_length_linter
  } else {
    glue("s3://ccao-data-dvc-us-east-1/files/md5/{substr(dvc_hash, 1, 2)}/{substr(dvc_hash, 3, 32)}") # nolint: line_length_linter
  }

  message(glue("Processing run_id: {run_id}, year: {year}"))

  # Read and filter training data
  df <- open_dataset(dvc_path) %>%
    select(meta_pin, meta_card_num, all_of(predictor_vars)) %>%
    collect()

  # Ensure known type mismatches are cast consistently
  if ("ccao_is_active_exe_homeowner" %in% names(df)) {
    df <- df %>%
      mutate(
        ccao_is_active_exe_homeowner =
          as.logical(ccao_is_active_exe_homeowner)
      )
  }

  # Add run_id after cleaning types, then write partitioned output to S3
  df <- df %>%
    mutate(run_id = run_id) %>%
    group_by(run_id) %>%
    write_partitions_to_s3(
      output_bucket,
      is_spatial = FALSE,
      overwrite = TRUE
    )
}
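Two per-run steps in the script above are easy to get wrong: splitting the bracketed predictor string, and building the year-dependent DVC cache path. A Python sketch of the same logic (the hash and predictor string are made-up examples; the path layout follows the script's two `glue()` branches):

```python
import re

def parse_predictors(raw: str) -> list[str]:
    """Split a bracketed, comma-separated string like '[a, b, c]' into names,
    mirroring the str_remove_all/str_split/trimws pipeline in the R script."""
    return [p.strip() for p in re.sub(r"^\[|\]$", "", raw).split(",")]

def dvc_path(dvc_hash: str, year: int) -> str:
    """Build the S3 path to a DVC-cached file: the first two hex characters
    of the MD5 hash become a directory, and the layout changed after 2023."""
    prefix, rest = dvc_hash[:2], dvc_hash[2:32]
    if year <= 2023:
        return f"s3://ccao-data-dvc-us-east-1/{prefix}/{rest}"
    return f"s3://ccao-data-dvc-us-east-1/files/md5/{prefix}/{rest}"

print(parse_predictors("[meta_pin, meta_card_num, char_bldg_sf]"))
# ['meta_pin', 'meta_card_num', 'char_bldg_sf']
print(dvc_path("0123456789abcdef0123456789abcdef", 2024))
# s3://ccao-data-dvc-us-east-1/files/md5/01/23456789abcdef0123456789abcdef
```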