
[Schema][Utilization] Add schema tables for job utilization #6183

Merged: 19 commits into main, Jan 21, 2025

Conversation

@yangw-dev (Contributor) commented on Jan 17, 2025

Overview

Add two tables to the misc database for utilization data:

  • oss_ci_utilization_metadata: metadata table
  • oss_ci_time_series: time-series table storing the time-series data

Utilization Data Pipeline Steps:

  1. Modify the monitor script for the final data model (Done)
  2. Add an S3 bucket for ready-to-insert files (Done)
  3. Add the ClickHouse database schemas (This PR; see the sketch after this list)
  4. Set up logic in upload_artifact to process the raw log data and insert the clean data into the ready-to-insert S3 bucket
    • Note that we generate two files, one for the metadata table and one for the time-series table. The metadata table takes a single insertion, while the time-series table takes a batch operation.
  5. Set up the S3 replicator generator to insert into the tables
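
For reference, below is a rough sketch of the time-series DDL, pieced together from the column fragments quoted in the review comments. The engine and ordering key are assumptions for illustration only, not the final schema:

-- Illustrative sketch of oss_ci_time_series (not the final schema).
CREATE TABLE misc.oss_ci_time_series
(
    -- created_at: DateTime when the record is processed in the db
    `created_at` DateTime64(0, 'UTC'),
    -- type of time series, e.g. utilization log data is 'utilization'
    `type` String,
    `tags` Array(String) DEFAULT [],
    `time_stamp` DateTime64(0, 'UTC'),
    `repo` String DEFAULT 'pytorch/pytorch',
    `workflow_id` UInt64,
    `job_id` UInt64,
    `run_attempt` UInt32,
    `workflow_name` String,
    `job_name` String,
    -- JSON payload stored as a String for now (see review discussion)
    `json_data` String
)
-- Engine and sort key below are assumed for illustration.
ENGINE = MergeTree
ORDER BY (workflow_id, job_id, time_stamp);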

Design doc:
https://docs.google.com/document/d/151uzLPpOTVcfdfDgFHmGqztiyWwHLI8OR-U3W9QH0lA/edit?tab=t.0

Details

TTL (time to live)
All records are given a time-to-live of one year based on the created_at timestamp; this gives us flexibility to re-insert hot data in the future.
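
In ClickHouse DDL this could be expressed with a clause like the sketch below; the toDateTime() wrapper is an assumption here, since TTL expressions expect Date or DateTime while created_at is DateTime64:

-- Expire each row one year after it was inserted into the table.
TTL toDateTime(created_at) + INTERVAL 1 YEAR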

The data is backed up in S3. We use the S3 replicator approach to insert data; see the guidance here:
https://github.com/pytorch/test-infra/wiki/How-to-add-a-new-custom-table-on-ClickHouse

See the data pipeline below:
![data pipeline diagram](https://github.com/user-attachments/assets/87e1792b-6638-48d2-8613-efd7236f6426)


@yangw-dev yangw-dev requested a review from huydhn January 17, 2025 00:46
@yangw-dev yangw-dev marked this pull request as ready for review January 17, 2025 20:43
@yangw-dev yangw-dev requested a review from huydhn January 17, 2025 22:20
`workflow_id` UInt64,
`job_id` UInt64,
`run_attempt` UInt32,
`workflow_template_id` UInt64,
Contributor commented:

A question: what is this workflow_template_id? Having some comments on the role of the new columns would be nice, i.e.:

  • tags: I guess this is where I can add any tags about the time-series data I add?
  • job_name: a custom name for the time-series data, or is this the job name from GitHub?
  • json_data: the time-series data itself. There is a JSON data type in ClickHouse, but it didn't work the last time I tried it (https://clickhouse.com/docs/en/sql-reference/data-types/newjson). Maybe we can do a quick check to see if that data type is usable now; otherwise, using a String here is fine.

@yangw-dev (Contributor, Author) replied on Jan 21, 2025:

  • workflow_template_id (now workflow_name): workflow_template_id was the workflow ID that identifies a specific workflow template, not the workflow run ID. I changed it to record the workflow_name instead (e.g. "pull", "inductor").
  • job_name: the job's name from GitHub (e.g. "test_inductor_")

workflow_name and job_name will be used mainly for aggregations.

  • tags: that is correct; you can add any tags for the time series.
  • json_data: the JSON data is stored as a String for now, not the JSON data type, since the JSON type is not production-ready yet in ClickHouse.
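
Since json_data is a String, fields can still be pulled out at query time with ClickHouse's JSON functions. A hypothetical example (the gpu_avg_percent key is made up for illustration, not a field this PR defines):

-- Extract a numeric field from the string-encoded JSON payload.
SELECT
    time_stamp,
    JSONExtractFloat(json_data, 'gpu_avg_percent') AS gpu_avg_percent
FROM misc.oss_ci_time_series
WHERE type = 'utilization'
ORDER BY time_stamp;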

@huydhn (Contributor) left a comment:

There are some small remaining comments, but overall LGTM! Let me know when you need the tables created. We don't have a workflow to automate this at the moment, but I think Cat and I can create them manually for you (if I get the admin permission from Cat).

@yangw-dev yangw-dev merged commit f24053f into main Jan 21, 2025
4 checks passed
@yangw-dev yangw-dev deleted the addusageschema branch January 21, 2025 19:30
Camyll pushed a commit that referenced this pull request Jan 21, 2025
huydhn added a commit that referenced this pull request Jan 21, 2025
Camyll pushed a commit that referenced this pull request Jan 22, 2025
Camyll pushed a commit that referenced this pull request Jan 22, 2025
@ZainRizvi (Contributor) left a comment:

Left a few optimizations I know of; ClickHouse has classes available online to learn more.

Also, it will really help performance to create materialized views on these tables that preprocess most of the final queries you expect to run against them.
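
As a sketch of that idea (the aggregation and the gpu_avg_percent field are hypothetical, not something defined in this PR):

-- Hypothetical materialized view pre-aggregating GPU utilization per job.
CREATE MATERIALIZED VIEW misc.oss_ci_utilization_by_job_mv
ENGINE = AggregatingMergeTree
ORDER BY (workflow_name, job_name)
AS SELECT
    workflow_name,
    job_name,
    -- avgState stores a partial aggregate; finalize reads with avgMerge.
    avgState(JSONExtractFloat(json_data, 'gpu_avg_percent')) AS gpu_avg
FROM misc.oss_ci_time_series
GROUP BY workflow_name, job_name;

Reads would then use avgMerge(gpu_avg) to finalize the aggregate.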

`type` String,
`tags` Array(String) DEFAULT [],
`time_stamp` DateTime64(0,'UTC'),
`repo` String DEFAULT 'pytorch/pytorch',
Contributor commented:

Many of the String types here can be replaced with LowCardinality versions. Per the docs, if we expect a column to never have more than 10,000 distinct values, LowCardinality can offer significant wins.

Suggested change
`repo` String DEFAULT 'pytorch/pytorch',
`repo` LowCardinality(String) DEFAULT 'pytorch/pytorch',

Looking at this list, the following entries seem like good candidates for LowCardinality:

  • type
  • repo
  • workflow_id (IIRC, every run of a workflow file, e.g. 'trunk', has the same workflow ID. Double-check this though, because if it is a unique value every run, then it should not be LowCardinality)
  • run_attempt: this is the retry count, which will never get anywhere near a thousand
  • workflow_name
  • job_name

Contributor replied:

TIL! I wonder if we could ALTER TABLE to have this in existing tables
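
A sketch of what that could look like (ClickHouse does support modifying a column to LowCardinality in place, though rewriting the column can be expensive on large tables):

-- Convert an existing String column to LowCardinality in place.
ALTER TABLE misc.oss_ci_time_series
    MODIFY COLUMN `repo` LowCardinality(String) DEFAULT 'pytorch/pytorch';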

-- This query creates the oss_ci_time_series table on ClickHouse
CREATE TABLE misc.oss_ci_time_series(
-- created_at DateTime when the record is processed in db.
`created_at` DateTime64(0,'UTC'),
Contributor commented:

Note: DateTime64 vs DateTime: DateTime64 offers sub-second precision (down to nanoseconds), while DateTime has second-level granularity.

Since you've specified the precision as 0 (per-second granularity), you might as well use DateTime instead.
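
For example:

Suggested change
`created_at` DateTime64(0,'UTC'),
`created_at` DateTime('UTC'),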

-- type of time series, for instance, utilization log data is 'utilization'.
`type` String,
`tags` Array(String) DEFAULT [],
`time_stamp` DateTime64(0,'UTC'),
Contributor commented:

same DateTime comment

`time_stamp` DateTime64(0,'UTC'),
`repo` String DEFAULT 'pytorch/pytorch',
`workflow_id` UInt64,
`run_attempt` UInt32,
Contributor commented:

This could even be UInt16.

I would have said UInt8, but one day in the next three years a job will be restarted 256 times and stuff will break :P
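
For example:

Suggested change
`run_attempt` UInt32,
`run_attempt` UInt16,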

Comment on lines +15 to +16
`gpu_count` UInt32,
`cpu_count` UInt32,
Contributor commented:

UInt16s would be good here too
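
For example:

Suggested change
`gpu_count` UInt32,
`cpu_count` UInt32,
`gpu_count` UInt16,
`cpu_count` UInt16,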

(
`created_at` DateTime64(0, 'UTC'),
-- github info
`repo` String DEFAULT 'pytorch/pytorch',
Contributor commented:

Plenty of LowCardinality opportunities here

-- type of time series, for instance, utilization log data is 'utilization'.
`type` String,
`tags` Array(String) DEFAULT [],
`time_stamp` DateTime64(0,'UTC'),
Contributor commented:

Note that there's also some potential to optimize sequential data like timestamps. However, I'm not sure if it actually counts as sequential if the table is ordered by a different key.

More info here: https://clickhouse.com/blog/working-with-time-series-data-and-functions-ClickHouse#codecs-to-optimize-sequences-storage
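
For illustration, a codec can be declared inline on the column; DoubleDelta is the usual choice for monotonically increasing timestamps (whether it helps here depends on the sort key, as noted above):

-- Delta-encode timestamps, then compress the deltas with ZSTD.
`time_stamp` DateTime64(0,'UTC') CODEC(DoubleDelta, ZSTD),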

@@ -0,0 +1,35 @@
-- This query creates the oss_ci_time_series table on ClickHouse
Contributor commented:

nit: it's not a query
