Description
Please, what is the correct table and approach to generate a time series of CPU and GPU utilization (with respect to total cluster CPU and GPU availability) from the cluster-trace-gpu-v2020 dataset? Currently, I am joining the pai_machine_metric.csv and pai_machine_spec.csv tables and then calculating the 1-minute utilization over the joined dataframe, roughly as sketched below.
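The join itself looks roughly like this (a minimal sketch: machine_metric_df and machine_spec_df stand for the two tables already loaded with the column names from the trace documentation, and I am assuming the shared machine column is the join key):

import pandas as pd

# attach each machine's capacities (cap_cpu, cap_gpu) from pai_machine_spec
# to its per-worker measurement rows in pai_machine_metric
df = machine_metric_df.merge(machine_spec_df, on='machine', how='left')

With this joined dataframe df, I then compute the 1-minute utilization as: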
# generate timestamps at 1-minute resolution between the earliest start time and the latest end time
date_range = pd.date_range(df['start_time'].min(), df['end_time'].max(), freq='60s')

records = []
for d in date_range:
    # find all records in pai_machine_metric whose start/end interval contains the current timestamp d
    match_df = df[(df['start_time'] < d) & (df['end_time'] > d)].copy()
    # machine_cpu is a percentage of the machine's cap_cpu cores, so machine_cpu * cap_cpu / 100
    # gives the number of CPUs in use; dividing by the total available CPUs gives a value in [0, 1]
    cpu_utilization = (match_df['machine_cpu'] * match_df['cap_cpu'] / 100).sum() / match_df['cap_cpu'].sum()
    # machine_gpu / 100 is treated as the number of GPUs in use, divided by the total available GPUs
    gpu_utilization = (match_df['machine_gpu'] / 100).sum() / match_df['cap_gpu'].sum()
    records.append(dict(
        timestamp=d,
        cpu_utilization=cpu_utilization,
        gpu_utilization=gpu_utilization,
    ))

utilization_df = pd.DataFrame(records)
Is this a correct way to go about it, or should I be making use of a different table and/or approach?
Also, could you please clarify what the machine_load_1 variable in pai_machine_metric is reporting? Specifically, what is the load referring to?
Lastly, I am finding datapoints where cap_gpu is less than the machine_gpu/100 value, implying that more GPUs were utilized than are available on the machine. How should I interpret such datapoints?
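For example, the rows I am referring to can be picked out of the joined dataframe like this (a small snippet using the same df as above):

# rows where the implied number of busy GPUs exceeds the machine's GPU capacity
over_capacity = df[(df['machine_gpu'] / 100) > df['cap_gpu']]
print(len(over_capacity), 'of', len(df), 'rows have machine_gpu/100 > cap_gpu')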