Upload final model training data #798

@jeancochrane

Description

Currently, the only way to query model training data is to extract a DVC hash from model.metadata.dvc_md5_training_data and use it to download the training data Parquet file directly from S3. This is cumbersome, and it also makes it impossible to query training data from within another Athena query, since loading data from a raw S3 path requires a scripted solution in R or Python.
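
For context, the manual workflow today looks roughly like the sketch below. The bucket name and DVC remote cache layout are assumptions for illustration, not our actual configuration:

```python
# Rough sketch of today's manual download step, for context only.
# The bucket name and the exact DVC cache layout are assumptions.

def dvc_cache_key(dvc_md5: str) -> str:
    """DVC keys remote objects by content hash: the first two hex chars
    of the MD5 form a prefix directory, and the remainder is the object
    name. This mirrors the DVC 3.x files/md5/ layout; older remotes
    omit the files/md5/ prefix."""
    return f"files/md5/{dvc_md5[:2]}/{dvc_md5[2:]}"

def download_training_data(dvc_md5: str, dest: str = "training_data.parquet") -> None:
    """Fetch a training data Parquet file straight from the DVC remote on S3."""
    import boto3  # imported lazily so the pure helper above has no dependencies

    s3 = boto3.client("s3")
    # "ccao-dvc-remote" is a hypothetical bucket name standing in for the real remote
    s3.download_file("ccao-dvc-remote", dvc_cache_key(dvc_md5), dest)
```

Needing this kind of ad hoc script for every query is exactly the friction the steps below are meant to remove.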

One simple solution would be to automatically save training data to Athena for every run, the same way we do with model.assessment_card, but that feels like overkill in this case because training data rarely differs between non-final model runs. Instead, let's manually upload training data for our final models, and update our SoPs to remind ourselves to continue doing this in the future as part of the model finalization process.

Steps here include:

  • Write an S3 script in data-architecture/etl/scripts-ccao-data-warehouse-us-east-1/model/model-training_data.parquet with the following behavior:
    • Pulls the run ID and DVC training data hash for all final models from model.metadata
    • For each run ID:
      • Downloads training data using the DVC hash
        • Note that you'll have to handle pre-2023 models differently, due to the change in DVC paths
      • Saves the training data to s3://ccao-data-warehouse-us-east-1/model/training_data, partitioned by run ID
  • Create a Glue crawler called ccao-data-warehouse-model-crawler with the same settings as the other ccao-data-warehouse-*-crawler crawlers, except with the database model and the data source s3://ccao-data-warehouse-us-east-1/model
  • Run the script and the crawler and confirm that the resulting model.training_data table is correctly structured
  • Update the finalize-annual-model issue template: amend the "Fetch the final data used to train the model using DVC" bullet with a sub-bullet for running the etl/ script defined above against a specific run ID
  • Add documentation for the table and its columns to models/model/schema.yml in data-architecture
    • Make sure to document this as a source under the sources key
    • Add unique_combination_of_columns data test, too
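
The core logic of the upload script could be sketched as follows. The key layouts on either side of the pre-/post-2023 cutover, the bucket names, and the destination path are assumptions drawn from the bullets above, not a working implementation:

```python
# Hedged sketch of the proposed etl/ upload script. Bucket names, key
# layouts, and the cutover year are assumptions based on the issue text.

WAREHOUSE_BUCKET = "ccao-data-warehouse-us-east-1"
CUTOVER_YEAR = 2023  # DVC path layout changed starting with 2023 models

def dvc_key_for_run(run_year: int, dvc_md5: str) -> str:
    """Build the S3 key for a run's training data in the DVC remote.

    Pre-2023 models used a different DVC path layout, so branch on the
    model year. Both layouts shown here are placeholders."""
    if run_year < CUTOVER_YEAR:
        return f"{dvc_md5[:2]}/{dvc_md5[2:]}"        # older DVC 2.x-style layout (assumed)
    return f"files/md5/{dvc_md5[:2]}/{dvc_md5[2:]}"  # newer DVC 3.x-style layout (assumed)

def warehouse_key(run_id: str) -> str:
    """Destination key under the warehouse bucket, partitioned by run ID
    in Hive style (run_id=...) so the Glue crawler picks up the partition."""
    return f"model/training_data/run_id={run_id}/training_data.parquet"

def upload_all_final_runs(final_runs: list[dict]) -> None:
    """For each final run (as pulled from model.metadata), copy training
    data from the DVC remote into the warehouse bucket."""
    import boto3  # lazy import keeps the pure helpers above dependency-free

    s3 = boto3.client("s3")
    for run in final_runs:
        src_key = dvc_key_for_run(run["year"], run["dvc_md5_training_data"])
        # "ccao-dvc-remote" is a hypothetical name for the DVC remote bucket
        s3.copy(
            {"Bucket": "ccao-dvc-remote", "Key": src_key},
            WAREHOUSE_BUCKET,
            warehouse_key(run["run_id"]),
        )
```

Partitioning the destination by run_id=... means the crawler registers run ID as a partition column, so model.training_data can be filtered to a single final model cheaply in Athena.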
