Description
Currently, the only way to query model training data is to extract a DVC hash from `model.metadata.dvc_md5_training_data` and use the hash to download the training data Parquet file directly from S3. This is cumbersome, and it also makes it impossible to query training data from the context of another Athena query, since loading data from a raw S3 path requires a scripted solution in R or Python.
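For reference, the current manual workflow looks roughly like the sketch below. The bucket name is a placeholder, and the cache key layout assumes DVC's standard content-addressed store (the prefix changed between DVC major versions, which is the pre-2023 path difference mentioned in the steps below):

```python
def dvc_cache_key(md5: str) -> str:
    """Build the S3 key for a file in a DVC cache.

    DVC stores cached files under <first 2 hash chars>/<remaining chars>;
    DVC 3.x adds a files/md5/ prefix, older versions do not.
    """
    return f"files/md5/{md5[:2]}/{md5[2:]}"

def download_training_data(dvc_md5: str, dest: str) -> None:
    """Download one training data Parquet file from the DVC cache on S3."""
    import boto3  # imported lazily; running this requires AWS credentials

    # "ccao-dvc-cache" is a hypothetical bucket name, not the real one
    boto3.client("s3").download_file("ccao-dvc-cache", dvc_cache_key(dvc_md5), dest)
```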
One simple solution would be to automatically save training data to Athena for every run, the same way we do with `model.assessment_card`, but that feels like overkill in this case because training data rarely differs between non-final model runs. Instead, let's manually upload training data for our final models, and update our SOPs to remind ourselves to continue doing this in the future as part of the model finalization process.
Steps here include:
- Write an S3 script in `data-architecture/etl/scripts-ccao-data-warehouse-us-east-1/model/model-training_data.parquet` with the following behavior:
  - Pulls the run ID and DVC training data hash for all final models from `model.metadata`
  - For each run ID:
    - Downloads training data using the DVC hash
      - Note that you'll have to handle pre-2023 models differently, due to the change in DVC paths
    - Saves the training data to `s3://ccao-data-warehouse-us-east-1/model/training_data`, partitioned by run ID
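The script's main loop could be sketched as follows, assuming the awswrangler library; the metadata column names and the `download_to_tmp` helper are hypothetical, and the query would still need a filter for final models only:

```python
def partition_path(base: str, run_id: str) -> str:
    """Hive-style partition prefix that the Glue crawler will pick up."""
    return f"{base}/run_id={run_id}/"

def upload_all_training_data() -> None:
    # lazy imports; running this requires AWS credentials
    import awswrangler as wr
    import pandas as pd

    meta = wr.athena.read_sql_query(
        "SELECT run_id, dvc_md5_training_data FROM model.metadata",
        database="model",
    )
    for row in meta.itertuples():
        # download_to_tmp is a hypothetical helper that resolves the DVC
        # hash to a local Parquet file, handling the pre-2023 path change
        df = pd.read_parquet(download_to_tmp(row.dvc_md5_training_data))
        df["run_id"] = row.run_id
        wr.s3.to_parquet(
            df,
            path="s3://ccao-data-warehouse-us-east-1/model/training_data",
            dataset=True,
            partition_cols=["run_id"],
        )
```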
- Create a Glue crawler called `ccao-data-warehouse-model-crawler` with the same settings as the other `ccao-data-warehouse-*-crawler` crawlers, except with the database `model` and the data source `s3://ccao-data-warehouse-us-east-1/model`
  - You may need S3 permissions to create a Glue crawler; if so, ping me
  - We should also persist this crawler config to https://github.com/ccao-data/aws-infrastructure/; let me know when you get to that step and we can pair on it
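The crawler can be created in the console, but a boto3 sketch of the equivalent call may help when we persist the config to aws-infrastructure later. The IAM role name is a placeholder; the database and S3 target come from this issue:

```python
def crawler_config() -> dict:
    """Settings for the new warehouse crawler (role name is a placeholder)."""
    return {
        "Name": "ccao-data-warehouse-model-crawler",
        "Role": "ccao-glue-crawler-role",  # placeholder IAM role
        "DatabaseName": "model",
        "Targets": {
            "S3Targets": [{"Path": "s3://ccao-data-warehouse-us-east-1/model"}]
        },
    }

def create_crawler() -> None:
    import boto3  # lazy import; requires AWS credentials and Glue permissions

    boto3.client("glue").create_crawler(**crawler_config())
```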
- Run the script and the crawler and confirm that the resulting `model.training_data` table is correctly structured
- Update the `finalize-annual-model` issue template: update the "Fetch the final data used to train the model using DVC" bullet to add a sub-bullet for running the `etl/` script you defined above for a specific run ID
- Add documentation for the table and its columns to `models/model/schema.yml` in `data-architecture`
  - Make sure to document this as a `source` under the `sources` key
  - Add a `unique_combination_of_columns` data test, too
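A rough shape for the schema entry, assuming the repo already pulls in dbt-utils; the description and the key columns (`run_id`, `meta_pin`) are guesses that should be checked against the actual table:

```yaml
sources:
  - name: model
    tables:
      - name: training_data
        description: Training data for final assessment models, partitioned by run ID
        tests:
          - dbt_utils.unique_combination_of_columns:
              combination_of_columns:
                - run_id
                - meta_pin  # hypothetical key column; confirm before merging
```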