Description
Currently, the only way to query model training data is to extract a DVC hash from `model.metadata.dvc_md5_training_data` and use the hash to download the training data Parquet file directly from S3. This is cumbersome, and it also makes it impossible to query training data from the context of another Athena query, since loading data from a raw S3 path requires a scripted solution in R or Python.
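For reference, the current manual workflow looks roughly like the sketch below. The bucket name is a placeholder, and the cache key layout assumes DVC's standard content-addressed store (the prefix changed between DVC major versions, which is the pre-2023 path difference mentioned in the steps below):

```python
def dvc_cache_key(md5: str) -> str:
    """Build the S3 key for a file in a DVC cache.

    DVC stores cached files under <first 2 hash chars>/<remaining chars>;
    DVC 3.x adds a files/md5/ prefix, older versions do not.
    """
    return f"files/md5/{md5[:2]}/{md5[2:]}"

def download_training_data(dvc_md5: str, dest: str) -> None:
    """Download one training data Parquet file from the DVC cache on S3."""
    import boto3  # imported lazily; running this requires AWS credentials

    # "ccao-dvc-cache" is a hypothetical bucket name, not the real one
    boto3.client("s3").download_file("ccao-dvc-cache", dvc_cache_key(dvc_md5), dest)
```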
One simple solution would be to automatically save training data to Athena for every run, the same way we do with `model.assessment_card`, but that feels like overkill in this case because training data rarely differs between non-final model runs. Instead, let's manually upload training data for our final models, and update our SOPs to remind ourselves to continue doing this in the future as part of the model finalization process.
Steps here include:
- Write an S3 script in `data-architecture/etl/scripts-ccao-data-warehouse-us-east-1/model/model-training_data.parquet` with the following behavior:
  - Pulls the run ID and DVC training data hash for all final models from `model.metadata`
  - For each run ID:
    - Downloads training data using the DVC hash
      - Note that you'll have to handle pre-2023 models differently, due to the change in DVC paths
    - Saves the training data to `s3://ccao-data-warehouse-us-east-1/model/training_data`, partitioned by run ID
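The script's main loop could be sketched as follows, assuming the awswrangler library; the metadata column names and the `download_to_tmp` helper are hypothetical, and the query would still need a filter for final models only:

```python
def partition_path(base: str, run_id: str) -> str:
    """Hive-style partition prefix that the Glue crawler will pick up."""
    return f"{base}/run_id={run_id}/"

def upload_all_training_data() -> None:
    # lazy imports; running this requires AWS credentials
    import awswrangler as wr
    import pandas as pd

    meta = wr.athena.read_sql_query(
        "SELECT run_id, dvc_md5_training_data FROM model.metadata",
        database="model",
    )
    for row in meta.itertuples():
        # download_to_tmp is a hypothetical helper that resolves the DVC
        # hash to a local Parquet file, handling the pre-2023 path change
        df = pd.read_parquet(download_to_tmp(row.dvc_md5_training_data))
        df["run_id"] = row.run_id
        wr.s3.to_parquet(
            df,
            path="s3://ccao-data-warehouse-us-east-1/model/training_data",
            dataset=True,
            partition_cols=["run_id"],
        )
```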
- Create a Glue crawler called `ccao-data-warehouse-model-crawler` with the same settings as the other `ccao-data-warehouse-*-crawler` crawlers, except with the database `model` and the data source `s3://ccao-data-warehouse-us-east-1/model`
  - You may need S3 permissions to create a Glue crawler; if so, ping me
  - We should also persist this crawler config to https://github.com/ccao-data/aws-infrastructure/; let me know when you get to that step and we can pair on it
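The crawler can be created in the console, but a boto3 sketch of the equivalent call may help when we persist the config to aws-infrastructure later. The IAM role name is a placeholder; the database and S3 target come from this issue:

```python
def crawler_config() -> dict:
    """Settings for the new warehouse crawler (role name is a placeholder)."""
    return {
        "Name": "ccao-data-warehouse-model-crawler",
        "Role": "ccao-glue-crawler-role",  # placeholder IAM role
        "DatabaseName": "model",
        "Targets": {
            "S3Targets": [{"Path": "s3://ccao-data-warehouse-us-east-1/model"}]
        },
    }

def create_crawler() -> None:
    import boto3  # lazy import; requires AWS credentials and Glue permissions

    boto3.client("glue").create_crawler(**crawler_config())
```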
- Run the script and the crawler and confirm that the resulting `model.training_data` table is correctly structured
- Update the `finalize-annual-model` issue template: update the "Fetch the final data used to train the model using DVC" bullet to add a sub-bullet for running the `etl/` script you defined above for a specific run ID
- Add documentation for the table and its columns to `models/model/schema.yml` in `data-architecture`
  - Make sure to document this as a `source` under the `sources` key
  - Add a `unique_combination_of_columns` data test, too
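A rough shape for the schema entry, assuming the repo already pulls in dbt-utils; the description and the key columns (`run_id`, `meta_pin`) are guesses that should be checked against the actual table:

```yaml
sources:
  - name: model
    tables:
      - name: training_data
        description: Training data for final assessment models, partitioned by run ID
        tests:
          - dbt_utils.unique_combination_of_columns:
              combination_of_columns:
                - run_id
                - meta_pin  # hypothetical key column; confirm before merging
```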