Upload final model training data #804
Merged
Commits
57 commits
b019ad3 land_nbhd_rate_unique_by_town_nbhd_class_and_year (Damonamajor)
d5b50b5 Initial draft (Damonamajor)
43b601d update docs (Damonamajor)
9bc6b2f Merge branch 'master' of github.com:ccao-data/data-architecture (Damonamajor)
fe8f802 Add run_id to query (Damonamajor)
93d9cce add table_ (Damonamajor)
dd494bf Switch to training data (Damonamajor)
1b2a69f Update schema.yml (Damonamajor)
ac554d2 Update model-training_data.R (Damonamajor)
4248b1f Initial draft (Damonamajor)
370cece update docs (Damonamajor)
78f1781 Add run_id to query (Damonamajor)
c1d2f3a add table_ (Damonamajor)
d9b3e3b Switch to training data (Damonamajor)
3f5aa58 Update schema.yml (Damonamajor)
036ce36 Update model-training_data.R (Damonamajor)
880b08f Merge branch '798-upload-final-model-training-data' of github.com:cca… (Damonamajor)
55ea677 Remove unintended commit (Damonamajor)
4505249 Correct unique columns (Damonamajor)
b942b36 Error if > 2 (Damonamajor)
e6bcfb2 Update etl/scripts-ccao-data-warehouse-us-east-1/model/model-training… (Damonamajor)
6d85b17 Update etl/scripts-ccao-data-warehouse-us-east-1/model/model-training… (Damonamajor)
ca631d9 updated py script (Damonamajor)
23b6b6d Merge branch '798-upload-final-model-training-data' of github.com:cca… (Damonamajor)
c4c933e styler (Damonamajor)
42eb7d6 test2 (Damonamajor)
91fbedc Fix ref (Damonamajor)
172c0ce Updated push (Damonamajor)
71c57c2 updates (Damonamajor)
9fe0d17 Another attempt (Damonamajor)
9692b61 Resolve SSL errors and use Spark DataFrames for `model.training_data`… (jeancochrane)
6e7c7da Remove default file_format (Damonamajor)
bef997c Add unique key (Damonamajor)
96fc296 possible functional version (Damonamajor)
1d9ec1c Functional version (Damonamajor)
afe1b13 Remove old script (Damonamajor)
fb748a8 Remove unique_key (Damonamajor)
5e6e3fe Commenting (Damonamajor)
a9cdffa Update docs.md (Damonamajor)
795b64a Update dbt/models/model/schema.yml (Damonamajor)
3cbb70a Update dbt/models/model/model.training_data.py (Damonamajor)
f7314c8 Update dbt/models/model/model.training_data.py (Damonamajor)
4b34ef4 lintr (Damonamajor)
9858c2f Update dbt/models/model/docs.md (Damonamajor)
483613a Update dbt/models/model/model.training_data.py (Damonamajor)
1b17f6f Update dbt/models/model/model.training_data.py (Damonamajor)
15f7af5 Update dbt/models/model/model.training_data.py (Damonamajor)
e622c1a update to training_data (Damonamajor)
06e2e76 update to training_data (Damonamajor)
ac8d278 Update schema (Damonamajor)
0b290a5 update schema (Damonamajor)
654dc6f push attempt (Damonamajor)
9f479d5 update model (Damonamajor)
374c9f5 update naming (Damonamajor)
27feb6e update naming (Damonamajor)
460bb54 Update dbt/models/model/schema.yml (Damonamajor)
1dfe88b Update dbt/models/model/schema.yml (Damonamajor)
85 changes: 85 additions & 0 deletions
etl/scripts-ccao-data-warehouse-us-east-1/model/model-training_data.R
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,85 @@ | ||
| # Load libraries ---- | ||
| library(arrow) | ||
| library(DBI) | ||
| library(dplyr) | ||
| library(glue) | ||
| library(noctua) | ||
| library(purrr) | ||
| library(stringr) | ||
| | ||
| # Once original data has been uploaded, | ||
| # we should only need to upload the current year data. | ||
| run_year <- format(Sys.Date(), "%Y") | ||
| | ||
| # Connect to Athena | ||
| noctua_options(cache_size = 10) | ||
| | ||
| conn <- dbConnect(noctua::athena(), rstudio_conn_tab = FALSE) | ||
| AWS_S3_WAREHOUSE_BUCKET <- "s3://ccao-data-warehouse-us-east-1" | ||
| output_bucket <- file.path(AWS_S3_WAREHOUSE_BUCKET, "model", "training_data") | ||
| | ||
| # Query final model metadata | ||
| metadata <- dbGetQuery( | ||
| conn, | ||
| glue_sql( | ||
| " | ||
| SELECT | ||
| run_id, | ||
| year, | ||
| dvc_md5_assessment_data, | ||
| model_predictor_all_name | ||
| FROM model.metadata | ||
| WHERE run_type = 'final' | ||
| AND year IN ({run_year*}) | ||
| ", | ||
| .con = conn | ||
| ) | ||
| ) | ||
| | ||
| # Iterate through each run | ||
| for (i in seq_len(nrow(metadata))) { | ||
| | ||
| run_id <- metadata$run_id[i] | ||
| year <- metadata$year[i] | ||
| dvc_hash <- metadata$dvc_md5_assessment_data[i] | ||
| predictors_raw <- metadata$model_predictor_all_name[i] | ||
| | ||
| # Clean predictor names | ||
| predictor_vars <- predictors_raw %>% | ||
| str_remove_all("^\\[|\\]$") %>% | ||
| str_split(",") %>% | ||
| unlist() %>% | ||
| trimws() | ||
| | ||
| # Build the DVC path for this run's year | ||
| # Runs through 2023 store objects at the bucket root; later runs use a files/md5/ prefix | ||
| dvc_path <- if (as.integer(year) <= 2023) { | ||
| glue("s3://ccao-data-dvc-us-east-1/{substr(dvc_hash, 1, 2)}/{substr(dvc_hash, 3, 32)}") # nolint: line_length_linter | ||
| } else { | ||
| glue("s3://ccao-data-dvc-us-east-1/files/md5/{substr(dvc_hash, 1, 2)}/{substr(dvc_hash, 3, 32)}") # nolint: line_length_linter | ||
| } | ||
| | ||
| message(glue("Processing run_id: {run_id}, year: {year}")) | ||
| | ||
| # Read and filter training data | ||
| df <- open_dataset(dvc_path) %>% | ||
| select(meta_pin, meta_card_num, all_of(predictor_vars)) %>% | ||
| collect() | ||
| | ||
| # Ensure known type mismatches are cast consistently | ||
| if ("ccao_is_active_exe_homeowner" %in% names(df)) { | ||
| df <- df %>% | ||
| mutate( | ||
| ccao_is_active_exe_homeowner = | ||
| as.logical(ccao_is_active_exe_homeowner) | ||
| ) | ||
| } | ||
| | ||
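| # Add run_id after cleaning types, then write one partition per run_id | ||
| # NOTE: write_partitions_to_s3() is not defined in this script; it is assumed to come from a shared helper file | ||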
| df <- df %>% | ||
| mutate(run_id = run_id) %>% | ||
| group_by(run_id) %>% | ||
| write_partitions_to_s3( | ||
| output_bucket, | ||
| is_spatial = FALSE, | ||
| overwrite = TRUE | ||
| ) | ||
| } | ||
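
The path-building and predictor-cleaning steps in the script above can be pulled into small functions and checked without touching S3. The following is a minimal sketch in base R, not the code merged in this PR: `build_dvc_path()` and `parse_predictors()` are hypothetical helper names, while the bucket name, the 2023 cutoff, and the bracketed comma-separated predictor format are taken from the diff.

```r
# Hypothetical helper: build the S3 location of a DVC-tracked object from its
# MD5 hash. Mirrors the script's rule: runs through 2023 live at the bucket
# root, later runs live under files/md5/.
build_dvc_path <- function(dvc_hash, year,
                           bucket = "s3://ccao-data-dvc-us-east-1") {
  prefix <- substr(dvc_hash, 1, 2) # first two hex characters
  rest <- substr(dvc_hash, 3, 32) # remaining thirty characters
  if (as.integer(year) <= 2023) {
    paste(bucket, prefix, rest, sep = "/")
  } else {
    paste(bucket, "files", "md5", prefix, rest, sep = "/")
  }
}

# Hypothetical helper: turn a string such as "[a, b, c]" into c("a", "b", "c").
parse_predictors <- function(predictors_raw) {
  stripped <- gsub("^\\[|\\]$", "", predictors_raw) # drop outer brackets
  trimws(unlist(strsplit(stripped, ",", fixed = TRUE)))
}

# Example usage with a made-up hash and made-up predictor names
example_hash <- "0123456789abcdef0123456789abcdef"
build_dvc_path(example_hash, "2024")
parse_predictors("[meta_township_code, char_yrblt]")
```

Keeping these two steps as pure functions would let the year cutoff and the string parsing be unit tested separately from the Athena query and the S3 write.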
Link won't work until it's merged I believe.