[Schema][Utilization] Add schema tables for job utilization #6183
Conversation
`workflow_id` UInt64,
`job_id` UInt64,
`run_attempt` UInt32,
`workflow_template_id` UInt64,
A q: what is this `workflow_template_id`? Having some comments on the role of some of the new columns would be nice, i.e.
- `tags`: I guess this is where I can add any tags about the time series data I add
- `job_name`: a custom name for the time series data, or is this the job name from GitHub?
- `json_data`: the time series data itself. There is a JSON datatype in ClickHouse, but it didn't work the last time I tried it (https://clickhouse.com/docs/en/sql-reference/data-types/newjson). Maybe we can do a quick check to see if this data type can be used now. Otherwise, using a string here is fine.
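A quick check might look like this sketch; the test table is hypothetical, and the setting name assumes a recent ClickHouse release where the JSON type is still gated as experimental:

```sql
-- Enable the experimental JSON type, then round-trip a small document.
SET allow_experimental_json_type = 1;

CREATE TABLE misc.json_type_check
(
    `data` JSON
)
ENGINE = MergeTree
ORDER BY tuple();

INSERT INTO misc.json_type_check VALUES ('{"cpu": 0.75, "gpu_util": [0.5, 0.9]}');

-- Subcolumn access works directly on the JSON type.
SELECT data.cpu FROM misc.json_type_check;
```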
- `workflow_template_id`: this was the workflow ID that identifies a specific workflow template (not the workflow run ID), but I changed it to record the `workflow_name` instead (e.g. "pull", "inductor").
- `job_name`: the job's name from GitHub (e.g. "test_inductor_"). `workflow_name` and `job_name` will be used mainly for aggregations.
- `tags`: that is correct, you can add any tags for the time series.
- `json_data`: the JSON data is stored as a string for now, not the JSON datatype, since the JSON type is not production-ready yet in ClickHouse.
There are some small remaining comments, but overall LGTM! Let me know when you need the tables created. We don't have a workflow to automate this atm, but I think Cat and I can create them manually for you (if I get admin permission from Cat).
# Overview

Add two tables in the misc database for utilization data:
- oss_ci_utilization_metadata: metadata table
- oss_ci_time_series: time-series table to store time-series data

Utilization Data Pipeline Steps:
1. Modify the monitor script for the final data model (Done)
2. Add an S3 bucket for ready-to-insert files (Done)
3. **Add ClickHouse database schemas (This PR)**
4. Set up logic in upload_artifact to process raw log data and insert clean data into the ready-to-insert S3 bucket. Note that we will generate two files, one for the metadata table and one for the time-series table; the metadata table uses single insertions, while the time-series table uses batch operations.
5. Set up the S3 replicator generator to insert into the tables

Doc Design:
https://docs.google.com/document/d/151uzLPpOTVcfdfDgFHmGqztiyWwHLI8OR-U3W9QH0lA/edit?tab=t.0

# Details

TTL (time to live): all records are set with a time to live of one year based on the created_at timestamp. This gives us flexibility to re-insert hot data in the future, since the data is backed up in S3.

Use the S3 replicator approach to insert data; see guidance:
https://github.com/pytorch/test-infra/wiki/How-to-add-a-new-custom-table-on-ClickHouse

See the data pipeline below:
![image](https://github.com/user-attachments/assets/87e1792b-6638-48d2-8613-efd7236f6426)
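A minimal sketch of how the one-year TTL could be declared (table abbreviated; the schema files in this PR are authoritative):

```sql
-- Sketch only: expire rows one year after insertion, keyed on created_at.
CREATE TABLE misc.oss_ci_time_series_ttl_sketch
(
    `created_at` DateTime('UTC')
    -- ... remaining columns ...
)
ENGINE = MergeTree
ORDER BY created_at
TTL created_at + toIntervalYear(1);
```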
Left a few optimizations I know of. CH has classes available online to learn more optimizations.
Also, it'll really help performance to create materialized views from these tables that preprocess most of the final query that you expect to run against these.
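For example, a minimal sketch of one such view; the view name and the hourly rollup are assumptions for illustration, not part of this PR:

```sql
-- Hypothetical hourly rollup so dashboards don't have to scan the raw series.
-- SummingMergeTree re-sums `samples` across part merges, keeping counts correct.
CREATE MATERIALIZED VIEW misc.oss_ci_utilization_hourly_mv
ENGINE = SummingMergeTree
ORDER BY (workflow_name, job_name, hour)
AS SELECT
    workflow_name,
    job_name,
    toStartOfHour(time_stamp) AS hour,
    count() AS samples
FROM misc.oss_ci_time_series
GROUP BY workflow_name, job_name, hour;
```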
`type` String,
`tags` Array(String) DEFAULT [],
`time_stamp` DateTime64(0,'UTC'),
`repo` String DEFAULT 'pytorch/pytorch',
Many of the types here can be replaced by LowCardinality versions. As per the docs, if we expect to never have more than 10,000 distinct values in a column, LowCardinality can offer significant wins.
-`repo` String DEFAULT 'pytorch/pytorch',
+`repo` LowCardinality(String) DEFAULT 'pytorch/pytorch',
Looking at this list, the following entries seem like good candidates for LowCardinality:
- type
- repo
- workflow_id (IIRC, every run of a workflow file, e.g. 'trunk', will have the same workflow id. Double check this though, because if this is a unique value every run then this should not be LowCardinality)
- run_attempt - this is the retry count, which will never reach anywhere near a thousand
- workflow_name
- job_name
TIL! I wonder if we could ALTER TABLE to have this in existing tables
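Something like this should work as a sketch (note that MODIFY COLUMN rewrites the stored column data, so it can be heavy on large existing tables):

```sql
-- Convert an existing String column to LowCardinality in place.
ALTER TABLE misc.oss_ci_time_series
    MODIFY COLUMN `repo` LowCardinality(String) DEFAULT 'pytorch/pytorch';
```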
-- This query creates the oss_ci_time_series table on ClickHouse
CREATE TABLE misc.oss_ci_time_series(
    -- created_at DateTime when the record is processed in db.
    `created_at` DateTime64(0,'UTC'),
Note: DateTime64 vs DateTime. DateTime64 offers sub-second precision (down to nanoseconds), while DateTime has second-level granularity.
Since you've specified the precision as 0 (per-second granularity), you might as well use DateTime instead.
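i.e., a sketch of the change:

-`created_at` DateTime64(0,'UTC'),
+`created_at` DateTime('UTC'),

DateTime stores whole seconds in 4 bytes versus 8 bytes for DateTime64, so this also halves the column's storage.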
-- type of time series, for instance, utilization log data is 'utilization'.
`type` String,
`tags` Array(String) DEFAULT [],
`time_stamp` DateTime64(0,'UTC'),
same DateTime comment
`time_stamp` DateTime64(0,'UTC'),
`repo` String DEFAULT 'pytorch/pytorch',
`workflow_id` UInt64,
`run_attempt` UInt32,
this could even be UInt16.
I would have said UInt8, but one day in the next three years a job will be restarted 256 times and stuff will break :P
`gpu_count` UInt32,
`cpu_count` UInt32,
UInt16s would be good here too
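e.g., a sketch:

-`gpu_count` UInt32,
-`cpu_count` UInt32,
+`gpu_count` UInt16,
+`cpu_count` UInt16,

UInt16 tops out at 65,535, which leaves plenty of headroom for CPU core or GPU counts on a single runner.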
(
    `created_at` DateTime64(0, 'UTC'),
    -- github info
    `repo` String DEFAULT 'pytorch/pytorch',
Plenty of LowCardinality opportunities here
-- type of time series, for instance, utilization log data is 'utilization'.
`type` String,
`tags` Array(String) DEFAULT [],
`time_stamp` DateTime64(0,'UTC'),
Note that there's also some potential to optimize sequential data like timestamps. However, I'm not sure if it actually counts as sequential if the table is ordered by a different key.
More info here: https://clickhouse.com/blog/working-with-time-series-data-and-functions-ClickHouse#codecs-to-optimize-sequences-storage
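For illustration, a sketch combining this with the earlier DateTime suggestion (whether Delta-style codecs actually help depends on the table's ORDER BY, as noted):

-`time_stamp` DateTime64(0,'UTC'),
+`time_stamp` DateTime('UTC') CODEC(DoubleDelta, ZSTD),

DoubleDelta targets slowly or monotonically increasing sequences such as timestamps, with ZSTD applied on top of the codec output.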
-- This query creates the oss_ci_time_series table on ClickHouse
nit: it's not a query