How to generate CPU and GPU utilization time series for cluster-trace-gpu-v2020 dataset #226

Open
@kingsleynweye

Please, what is the correct table and approach to generate a time series of CPU and GPU utilization (with respect to total cluster CPU and GPU availability) from the cluster-trace-gpu-v2020 dataset? Currently, I am joining the pai_machine_metric.csv and pai_machine_spec.csv tables, then calculating the 1-minute utilization as:

import pandas as pd

# df is the join of pai_machine_metric.csv and pai_machine_spec.csv
# generate timestamps with relevant resolution with respect to earliest start time and latest end time
date_range = pd.date_range(df['start_time'].min(), df['end_time'].max(), freq='60s')
records = []

for d in date_range:
  # find all records in pai_machine_metric where current timestamp, d, is within start and end time
  match_df = df[(df['start_time'] < d) & (df['end_time'] > d)].copy()

  # calculate the total number of CPUs and GPUs being utilized at current timestamp
  # then divide by the total number of available CPUs and GPUs to get the utilization between [0, 1]
  cpu_utilization = (match_df['machine_cpu']*match_df['cap_cpu']/100).sum()/match_df['cap_cpu'].sum()
  gpu_utilization = (match_df['machine_gpu']/100).sum()/match_df['cap_gpu'].sum()
  records.append(dict(
    timestamp=d,
    cpu_utilization=cpu_utilization,
    gpu_utilization=gpu_utilization,
  ))

utilization_df = pd.DataFrame(records)

Is this a correct way to go about it, or should I be making use of a different table and/or approach?
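
For reference, the same per-timestamp sums can be computed without filtering the whole table once per minute. The following is only a minimal sketch of the calculation described above, not a confirmed method for this dataset. It assumes every row has start_time < end_time, uses integer seconds for the timestamps, and runs on a small made-up table (toy); the helper name active_sum is hypothetical.

```python
import numpy as np
import pandas as pd

def active_sum(start, end, values, grid):
    """For each t in grid, sum values over rows with start < t < end.

    Assumes start < end on every row (an assumption, not checked here).
    """
    start = np.asarray(start)
    end = np.asarray(end)
    # Treat missing values as 0, which mirrors pandas .sum() skipping NaN.
    values = np.nan_to_num(np.asarray(values, dtype=float))
    s_order = np.argsort(start, kind="stable")
    e_order = np.argsort(end, kind="stable")
    # Prefix sums of values, ordered by start time and by end time.
    s_cum = np.concatenate(([0.0], np.cumsum(values[s_order])))
    e_cum = np.concatenate(([0.0], np.cumsum(values[e_order])))
    n_started = np.searchsorted(start[s_order], grid, side="left")  # rows with start < t
    n_ended = np.searchsorted(end[e_order], grid, side="right")     # rows with end <= t
    return s_cum[n_started] - e_cum[n_ended]

# Made-up rows standing in for the joined metric and spec tables.
toy = pd.DataFrame({
    "start_time": [0, 0, 120],
    "end_time": [180, 120, 300],
    "machine_cpu": [50.0, 100.0, 25.0],   # percent
    "cap_cpu": [64, 96, 64],
    "machine_gpu": [100.0, 400.0, 0.0],   # read as percent of one GPU, as in the loop above
    "cap_gpu": [2, 8, 2],
})

grid = np.arange(0, 301, 60)  # one timestamp per 60 s
s, e = toy["start_time"], toy["end_time"]

used_cpu = active_sum(s, e, toy["machine_cpu"] * toy["cap_cpu"] / 100, grid)
total_cpu = active_sum(s, e, toy["cap_cpu"], grid)
used_gpu = active_sum(s, e, toy["machine_gpu"] / 100, grid)
total_gpu = active_sum(s, e, toy["cap_gpu"], grid)

# Dividing pandas Series gives NaN where no record is active (0 / 0).
utilization_df = pd.DataFrame({
    "timestamp": grid,
    "cpu_utilization": pd.Series(used_cpu) / pd.Series(total_cpu),
    "gpu_utilization": pd.Series(used_gpu) / pd.Series(total_gpu),
})
print(utilization_df)
```

The strict inequalities match the loop above, so a record whose start_time or end_time falls exactly on a timestamp is not counted at that timestamp.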

Also, please could you clarify what the machine_load_1 variable in pai_machine_metric is reporting? Specifically, what is the load referring to?

Lastly, I am finding datapoints where the cap_gpu is less than the machine_gpu/100 value, implying that more GPUs were utilized than available on the machine. How should I interpret such datapoints?
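
To make the last question concrete, this is roughly how such datapoints can be flagged. It is a sketch on made-up values, and it assumes machine_gpu/100 can be read as a number of GPUs in use, which is exactly the interpretation in question; the machine column name and the toy values are illustrative assumptions.

```python
import pandas as pd

# Made-up rows; in practice this would be the joined metric and spec table.
toy = pd.DataFrame({
    "machine": ["m1", "m2", "m3"],
    "machine_gpu": [150.0, 250.0, 800.0],  # percent, as reported in the metric table
    "cap_gpu": [2, 2, 8],                  # GPUs on the machine, from the spec table
})

# Assumed interpretation: machine_gpu / 100 is the number of GPUs in use.
gpus_in_use = toy["machine_gpu"] / 100

# Rows where the implied GPU count exceeds the machine's capacity.
over_capacity = toy[gpus_in_use > toy["cap_gpu"]]
share_over = len(over_capacity) / len(toy)

print(over_capacity)
print(f"share of rows over capacity: {share_over:.2%}")
```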
