[Schema][Utilization] Add schema tables for job utilization #6183
Conversation
`workflow_id` UInt64,
`job_id` UInt64,
`run_attempt` UInt32,
`workflow_template_id` UInt64,
A q: what is this `workflow_template_id`? Having some comments on the role of some of the new columns would be nice, i.e.
- `tags`: I guess this is where I can add any tags about the time series data I add
- `job_name`: a custom name for the time series data, or is this the job name from GitHub?
- `json_data`: the time series data itself. There is a JSON datatype in ClickHouse, but it didn't work the last time I tried it (https://clickhouse.com/docs/en/sql-reference/data-types/newjson). Maybe we can do a quick check to see if this data type can be used now. Otherwise, using a string here is fine.
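A quick check might look like this sketch; the test table is hypothetical, and the setting name assumes a recent ClickHouse release where the JSON type is still gated as experimental:

```sql
-- Enable the experimental JSON type, then round-trip a small document.
SET allow_experimental_json_type = 1;

CREATE TABLE misc.json_type_check
(
    `data` JSON
)
ENGINE = MergeTree
ORDER BY tuple();

INSERT INTO misc.json_type_check VALUES ('{"cpu": 0.75, "gpu_util": [0.5, 0.9]}');

-- Subcolumn access works directly on the JSON type.
SELECT data.cpu FROM misc.json_type_check;
```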
- `workflow_template_id`: this was the workflow ID that identifies a specific workflow template (not the workflow run ID), but I changed it to record the `workflow_name` instead (e.g. "pull", "inductor").
- `job_name`: the job's name from GitHub (e.g. "test_inductor_"). `workflow_name` and `job_name` will be used mainly for aggregations.
- `tags`: that is correct, you can add any tags for the time series.
- `json_data`: the JSON data is stored as a string for now, not the JSON datatype, since the JSON type is not production-ready yet in ClickHouse.
There are some small remaining comments, but overall LGTM! Let me know when you need the tables created. We don't have a workflow to automate this atm, but I think Cat and I can create them manually for you (if I get admin permission from Cat).
# Overview

Add two tables in the misc database for utilization data:
- oss_ci_utilization_metadata: metadata table
- oss_ci_time_series: time-series table to store time-series data

Utilization Data Pipeline Steps:
1. Modify the monitor script for the final data model (Done)
2. Add an S3 bucket for ready-to-insert files (Done)
3. **Add ClickHouse database schemas (This PR)**
4. Set up logic in upload_artifact to process raw log data and insert clean data into the ready-to-insert S3 bucket. Note that we will generate two files, one for the metadata table and one for the time-series table; the metadata table uses single insertions, while the time-series table uses batch operations.
5. Set up the S3 replicator generator to insert into the tables

Doc Design:
https://docs.google.com/document/d/151uzLPpOTVcfdfDgFHmGqztiyWwHLI8OR-U3W9QH0lA/edit?tab=t.0

# Details

TTL (time to live): all records are set with a time to live of one year based on the created_at timestamp. This gives us flexibility to re-insert hot data in the future, since the data is backed up in S3.

Use the S3 replicator approach to insert data; see guidance:
https://github.com/pytorch/test-infra/wiki/How-to-add-a-new-custom-table-on-ClickHouse

See the data pipeline below:
![image](https://github.com/user-attachments/assets/87e1792b-6638-48d2-8613-efd7236f6426)
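A minimal sketch of how the one-year TTL could be declared (table abbreviated; the schema files in this PR are authoritative):

```sql
-- Sketch only: expire rows one year after insertion, keyed on created_at.
CREATE TABLE misc.oss_ci_time_series_ttl_sketch
(
    `created_at` DateTime('UTC')
    -- ... remaining columns ...
)
ENGINE = MergeTree
ORDER BY created_at
TTL created_at + toIntervalYear(1);
```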
Left a few optimizations I know of. CH has classes available online to learn more optimizations.
Also, it'll really help performance to create materialized views from these tables that preprocess most of the final query that you expect to run against these.
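For example, a minimal sketch of one such view; the view name and the hourly rollup are assumptions for illustration, not part of this PR:

```sql
-- Hypothetical hourly rollup so dashboards don't have to scan the raw series.
-- SummingMergeTree re-sums `samples` across part merges, keeping counts correct.
CREATE MATERIALIZED VIEW misc.oss_ci_utilization_hourly_mv
ENGINE = SummingMergeTree
ORDER BY (workflow_name, job_name, hour)
AS SELECT
    workflow_name,
    job_name,
    toStartOfHour(time_stamp) AS hour,
    count() AS samples
FROM misc.oss_ci_time_series
GROUP BY workflow_name, job_name, hour;
```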
`type` String,
`tags` Array(String) DEFAULT [],
`time_stamp` DateTime64(0,'UTC'),
`repo` String DEFAULT 'pytorch/pytorch',
Many of the types here can be replaced by LowCardinality versions. As per the docs, if we expect to never have more than 10,000 distinct values in a column, LowCardinality can offer significant wins.
-`repo` String DEFAULT 'pytorch/pytorch',
+`repo` LowCardinality(String) DEFAULT 'pytorch/pytorch',
Looking at this list, the following entries seem like good candidates for LowCardinality:
- type
- repo
- workflow_id (IIRC, every run of a workflow file, e.g. 'trunk', will have the same workflow id. Double check this though, because if this is a unique value every run then this should not be LowCardinality)
- run_attempt - this is the retry count, which will never reach anywhere near a thousand
- workflow_name
- job_name
TIL! I wonder if we could ALTER TABLE to have this in existing tables
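Something like this should work as a sketch (note that MODIFY COLUMN rewrites the stored column data, so it can be heavy on large existing tables):

```sql
-- Convert an existing String column to LowCardinality in place.
ALTER TABLE misc.oss_ci_time_series
    MODIFY COLUMN `repo` LowCardinality(String) DEFAULT 'pytorch/pytorch';
```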
-- This query creates the oss_ci_time_series table on ClickHouse
CREATE TABLE misc.oss_ci_time_series(
    -- created_at DateTime when the record is processed in db.
    `created_at` DateTime64(0,'UTC'),
Note: DateTime64 vs DateTime. DateTime64 offers sub-second precision (down to nanoseconds), while DateTime has second-level granularity.
Since you've specified the precision as 0 (per-second granularity), you might as well use DateTime instead.
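i.e., a sketch of the change:

-`created_at` DateTime64(0,'UTC'),
+`created_at` DateTime('UTC'),

DateTime stores whole seconds in 4 bytes versus 8 bytes for DateTime64, so this also halves the column's storage.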
-- type of time series, for instance, utilization log data is 'utilization'.
`type` String,
`tags` Array(String) DEFAULT [],
`time_stamp` DateTime64(0,'UTC'),
same DateTime comment
`time_stamp` DateTime64(0,'UTC'),
`repo` String DEFAULT 'pytorch/pytorch',
`workflow_id` UInt64,
`run_attempt` UInt32,
this could even be UInt16.
I would have said UInt8, but one day in the next three years a job will be restarted 256 times and stuff will break :P
`gpu_count` UInt32,
`cpu_count` UInt32,
UInt16s would be good here too
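e.g., a sketch:

-`gpu_count` UInt32,
-`cpu_count` UInt32,
+`gpu_count` UInt16,
+`cpu_count` UInt16,

UInt16 tops out at 65,535, which leaves plenty of headroom for CPU core or GPU counts on a single runner.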
(
    `created_at` DateTime64(0, 'UTC'),
    -- github info
    `repo` String DEFAULT 'pytorch/pytorch',
Plenty of LowCardinality opportunities here
-- type of time series, for instance, utilization log data is 'utilization'.
`type` String,
`tags` Array(String) DEFAULT [],
`time_stamp` DateTime64(0,'UTC'),
Note that there's also some potential to optimize sequential data like timestamps. However, I'm not sure if it actually counts as sequential if the table is ordered by a different key.
More info here: https://clickhouse.com/blog/working-with-time-series-data-and-functions-ClickHouse#codecs-to-optimize-sequences-storage
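For illustration, a sketch combining this with the earlier DateTime suggestion (whether Delta-style codecs actually help depends on the table's ORDER BY, as noted):

-`time_stamp` DateTime64(0,'UTC'),
+`time_stamp` DateTime('UTC') CODEC(DoubleDelta, ZSTD),

DoubleDelta targets slowly or monotonically increasing sequences such as timestamps, with ZSTD applied on top of the codec output.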
-- This query creates the oss_ci_time_series table on ClickHouse
nit: it's not a query