Usage reporting in SageMaker HyperPod EKS-orchestrated clusters provides visibility into compute resource consumption. The capability allows organizations to implement transparent cost attribution, allocating cluster costs to teams, projects, or departments based on their actual usage. By tracking metrics such as GPU/CPU hours and Neuron Core utilization over time, usage reporting complements SageMaker HyperPod's Task Governance functionality, ensuring fair cost attribution in shared multi-tenant clusters by:
- Eliminating guesswork in cost allocation
- Directly linking expenses to measurable resource consumption
- Enforcing usage-based accountability in shared infrastructure environments
Usage reporting in SageMaker HyperPod requires deploying the SageMaker HyperPod usage report infrastructure using a CloudFormation stack and installing the SageMaker HyperPod usage report Kubernetes operator using a Helm chart.
To successfully deploy and use the SageMaker HyperPod usage report, you should meet the following prerequisites.
- Have a running EKS-orchestrated SageMaker HyperPod cluster (Kubernetes version >= 1.30) with the Task Governance add-on.
- Have AWS CLI, kubectl, and Helm (package manager for Kubernetes, version >= 3.17.1) installed.
- A Python environment (version >= 3.9).
- Clone the GitHub repository sagemaker-hyperpod-usage-report.
  git clone https://github.com/awslabs/sagemaker-hyperpod-usage-report
- Set the following local environment variables in your terminal:
  Note: To install the usage report, you need an Installer IAM role with appropriate permissions. You can either create a new IAM role and leave the role policies blank for now, or reuse an existing role such as your current administrator role. Use the selected role name in the USAGE_REPORT_INSTALLER_ROLE_NAME variable. You will populate the role policies in the upcoming configuration steps.
  # Set up the environment variables
  export AWS_ACCOUNT=<account number>
  export AWS_REGION=<region>
  export HYPERPOD_CLUSTER_NAME=<hyperpod cluster name>
  export EKS_CLUSTER_NAME=<eks cluster name>
  export USAGE_REPORT_INSTALLER_ROLE_NAME=<Installer IAM role name>
  export USAGE_REPORT_OPERATOR_NAME=hyperpod-usage-report  # keep under 22 characters if custom
  export HYPERPOD_CLUSTER_ID=$(aws sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME --region $AWS_REGION | jq -r '.ClusterArn | split("/")[-1]')
  aws configure set region $AWS_REGION
  Verify the content of your variables:
  echo "AWS_ACCOUNT is $AWS_ACCOUNT"
  echo "AWS_REGION is $AWS_REGION"
  echo "HYPERPOD_CLUSTER_NAME is $HYPERPOD_CLUSTER_NAME"
  echo "EKS_CLUSTER_NAME is $EKS_CLUSTER_NAME"
  echo "USAGE_REPORT_INSTALLER_ROLE_NAME is $USAGE_REPORT_INSTALLER_ROLE_NAME"
  echo "USAGE_REPORT_OPERATOR_NAME is $USAGE_REPORT_OPERATOR_NAME"
  echo "HYPERPOD_CLUSTER_ID is $HYPERPOD_CLUSTER_ID"
- Set up kubectl authentication and context for accessing the EKS cluster:
  - Start by running the aws eks update-kubeconfig command to update your local kube config file (located at ~/.kube/config) with the credentials and configuration needed to connect to your EKS cluster using the kubectl command.
    aws eks update-kubeconfig --region $AWS_REGION --name $EKS_CLUSTER_NAME
  - You can verify that you are connected to the EKS cluster by running:
    kubectl config current-context
    The command should print the ARN of your cluster:
    arn:aws:eks:$AWS_REGION:$AWS_ACCOUNT:cluster/$EKS_CLUSTER_NAME
- Generate and attach the required IAM policies.
  - Populate the IAM policy document for your Installer role from the template provided in permissions/usage-report-installer-policy.json.template.
    INPUT_FILE="permissions/usage-report-installer-policy.json.template"
    OUTPUT_FILE="permissions/usage-report-installer-policy.json"
    sed \
    -e "s/AWS_REGION/$AWS_REGION/g" \
    -e "s/AWS_ACCOUNT/$AWS_ACCOUNT/g" \
    -e "s/USAGE_REPORT_OPERATOR_NAME/$USAGE_REPORT_OPERATOR_NAME/g" \
    -e "s/HYPERPOD_CLUSTER_ID/$HYPERPOD_CLUSTER_ID/g" \
    -e "s/EKS_CLUSTER_NAME/$EKS_CLUSTER_NAME/g" \
    -e "s/USAGE_REPORT_INSTALLER_ROLE_NAME/$USAGE_REPORT_INSTALLER_ROLE_NAME/g" \
    "$INPUT_FILE" > "$OUTPUT_FILE"
  - Attach the permissions/usage-report-installer-policy.json IAM policy to the IAM Installer role that performs AWS CLI, kubectl, and helm operations. This ensures usage report installers have the required permissions to install and manage SageMaker HyperPod usage report data capture.
    To embed the inline policy in an existing role, use the following command:
    aws iam put-role-policy \
    --role-name $USAGE_REPORT_INSTALLER_ROLE_NAME \
    --policy-name sagemaker-hyperpod-usage-report \
    --policy-document file://permissions/usage-report-installer-policy.json
    To verify that the policy has been added correctly, run:
    aws iam get-role-policy \
    --role-name $USAGE_REPORT_INSTALLER_ROLE_NAME \
    --policy-name sagemaker-hyperpod-usage-report
- Create a dedicated Kubernetes namespace for the usage report operator:
  - In sagemaker-hyperpod-usage-report, run the following command to create the namespace $USAGE_REPORT_OPERATOR_NAME:
    INPUT_FILE="permissions/usage-report-namespace.yaml.template"
    OUTPUT_FILE="permissions/usage-report-namespace.yaml"
    sed \
    -e "s/NAMESPACE/$USAGE_REPORT_OPERATOR_NAME/g" \
    "$INPUT_FILE" > "$OUTPUT_FILE"
    kubectl apply -f permissions/usage-report-namespace.yaml
- Create custom RBAC permissions for deploying the HyperPod usage report Kubernetes operator Helm chart on the cluster:
  - In sagemaker-hyperpod-usage-report, run the following command to set up the RBAC permissions in your EKS cluster.
    INPUT_FILE="permissions/usage-report-installer-cluster-policy.yaml.template"
    OUTPUT_FILE="permissions/usage-report-installer-cluster-policy.yaml"
    sed \
    -e "s/NAMESPACE/$USAGE_REPORT_OPERATOR_NAME/g" \
    -e "s/ROLE_NAME/$USAGE_REPORT_INSTALLER_ROLE_NAME/g" \
    "$INPUT_FILE" > "$OUTPUT_FILE"
    kubectl apply -f permissions/usage-report-installer-cluster-policy.yaml
  - Enable the access entry for the EKS cluster.
    aws eks update-cluster-config --name $EKS_CLUSTER_NAME --access-config authenticationMode=API_AND_CONFIG_MAP
    Note: If you receive an error message indicating Unsupported authentication mode update, no further action is necessary as the authentication mode has already been configured.
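Before moving on, it can help to confirm that the variables from the prerequisites are set and that a custom operator name stays under 22 characters. The following is a minimal sketch and not part of the repository; the function names `check_required_vars` and `check_operator_name` are hypothetical.

```shell
# Hypothetical preflight helpers; not part of sagemaker-hyperpod-usage-report.

# Report any unset or empty variables on stderr; return 1 if any are missing.
check_required_vars() {
    missing=0
    for name in "$@"; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Return 0 only if the operator name is non-empty and shorter than 22 characters.
check_operator_name() {
    [ -n "$1" ] && [ "${#1}" -lt 22 ]
}
```

For example, `check_required_vars AWS_ACCOUNT AWS_REGION HYPERPOD_CLUSTER_NAME EKS_CLUSTER_NAME USAGE_REPORT_INSTALLER_ROLE_NAME USAGE_REPORT_OPERATOR_NAME HYPERPOD_CLUSTER_ID && check_operator_name "$USAGE_REPORT_OPERATOR_NAME"` succeeds only when every variable is populated and the name length is acceptable.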
The following installation steps assume you are using the role named in USAGE_REPORT_INSTALLER_ROLE_NAME, as specified above.
You can find the CloudFormation template in the /cloudformation directory. The template provisions the following AWS resources:
- Storage infrastructure: An S3 bucket (s3://$AWS_ACCOUNT-$AWS_REGION-$HYPERPOD_CLUSTER_ID-usage-report-<random string>) to capture usage data, with an associated IAM role allowing pods to write data to the bucket.
- Query infrastructure: An Athena database for querying and aggregating usage data.
- Processing infrastructure: An AWS Lambda function triggered daily by a CloudWatch Event rule to perform automated usage data aggregation and reporting.
| Parameter | Required | Default Value | Notes |
|---|---|---|---|
| EKSClusterName | Yes | - | Name of the EKS cluster |
| HyperPodClusterId | Yes | - | Id of the HyperPod cluster |
| UsageReportInstallerRoleName | Yes | - | Name of the IAM role for usage reporting installation |
| DataRententionDays | No | 180 | Data retention days for S3 Bucket |
| InstallPodIdentityAddon | No | "true" | Whether to install the Pod Identity Addon. Allowed values: "true", "false" |
| UsageReportOperatorNameSpace | No | hyperpod-usage-report | Kubernetes cluster namespace where usage report operator is installed |
| OperatorServiceAccount | No | hyperpod-usage-report | Service account used by usage report operator pod identity for permissions to access AWS resources |
Run the following stack creation command:
cd sagemaker-hyperpod-usage-report
aws cloudformation create-stack \
--region $AWS_REGION \
--stack-name $USAGE_REPORT_OPERATOR_NAME \
--template-body file://cloudformation/usage-report.yaml \
--capabilities CAPABILITY_NAMED_IAM \
--parameters \
ParameterKey=EKSClusterName,ParameterValue=$EKS_CLUSTER_NAME \
ParameterKey=HyperPodClusterId,ParameterValue=$HYPERPOD_CLUSTER_ID \
ParameterKey=UsageReportOperatorNameSpace,ParameterValue=$USAGE_REPORT_OPERATOR_NAME \
ParameterKey=OperatorServiceAccount,ParameterValue=$USAGE_REPORT_OPERATOR_NAME \
ParameterKey=UsageReportInstallerRoleName,ParameterValue=$USAGE_REPORT_INSTALLER_ROLE_NAME
Verify the CloudFormation stack creation status:
aws cloudformation describe-stacks --stack-name $USAGE_REPORT_OPERATOR_NAME \
--region $AWS_REGION --query 'Stacks[0].StackStatus' --output text
| Output Name | Description |
|---|---|
| DatabaseName | Name of the created database |
| UsageReportBucket | Name of the created S3 Bucket |
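Stack creation is asynchronous, so the status query may need to be repeated until it settles. The sketch below shows one way to interpret the status string, wait for completion, and read a stack output by key. The function names are hypothetical; `aws cloudformation wait stack-create-complete` and `describe-stacks` are standard AWS CLI commands, and the output keys are the ones listed in the table above.

```shell
# Hypothetical helpers; not part of the repository.

# Map a CloudFormation stack status to "ok", "pending", "failed", or "unknown".
# ROLLBACK patterns are checked before IN_PROGRESS so that
# ROLLBACK_IN_PROGRESS counts as a failure.
classify_stack_status() {
    case "$1" in
        CREATE_COMPLETE|UPDATE_COMPLETE) echo ok ;;
        *ROLLBACK*|*FAILED) echo failed ;;
        *IN_PROGRESS) echo pending ;;
        *) echo unknown ;;
    esac
}

# Block until stack creation finishes; returns non-zero if creation fails.
wait_for_stack() {
    aws cloudformation wait stack-create-complete \
        --region "$AWS_REGION" \
        --stack-name "$USAGE_REPORT_OPERATOR_NAME"
}

# Print one stack output, for example DatabaseName or UsageReportBucket.
get_stack_output() {
    aws cloudformation describe-stacks \
        --region "$AWS_REGION" \
        --stack-name "$USAGE_REPORT_OPERATOR_NAME" \
        --query "Stacks[0].Outputs[?OutputKey=='$1'].OutputValue" \
        --output text
}
```

A typical sequence would be `wait_for_stack && get_stack_output UsageReportBucket`.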
- If the CloudFormation stack status indicates a ROLLBACK state, you can investigate the failure reason by using the AWS CLI command below or by checking the AWS CloudFormation console directly:
  aws cloudformation describe-stack-events \
  --stack-name $USAGE_REPORT_OPERATOR_NAME \
  --query 'StackEvents[?ResourceStatus==`CREATE_FAILED`].[LogicalResourceId,ResourceStatusReason]'
- Ensure that the eks-auth:AssumeRoleForPodIdentity permission is included in the IAM execution role for the SageMaker HyperPod cluster.
- If the stack creation fails with the error eks-pod-identity-agent already exists, recreate the stack with the additional parameters ParameterKey=InstallPodIdentityAddon,ParameterValue=false:
  aws cloudformation create-stack \
  --region $AWS_REGION \
  --stack-name $USAGE_REPORT_OPERATOR_NAME \
  --template-body file://cloudformation/usage-report.yaml \
  --capabilities CAPABILITY_NAMED_IAM \
  --parameters \
  ParameterKey=EKSClusterName,ParameterValue=$EKS_CLUSTER_NAME \
  ParameterKey=HyperPodClusterId,ParameterValue=$HYPERPOD_CLUSTER_ID \
  ParameterKey=UsageReportInstallerRoleName,ParameterValue=$USAGE_REPORT_INSTALLER_ROLE_NAME \
  ParameterKey=UsageReportOperatorNameSpace,ParameterValue=$USAGE_REPORT_OPERATOR_NAME \
  ParameterKey=OperatorServiceAccount,ParameterValue=$USAGE_REPORT_OPERATOR_NAME \
  ParameterKey=InstallPodIdentityAddon,ParameterValue=false
The values.yaml file in the /helm_chart directory configures the Helm chart for the SageMaker HyperPod usage report Kubernetes operator, which provisions and manages the following cluster resources:
- Namespace: hyperpod-usage-report (default)
- Service Account: hyperpod-usage-report (default)
- RBAC rules granting the operator cluster-scoped permissions to:
  - Monitor cluster resources (clusterqueues, workloads, namespaces, pods)
  - Retrieve node-level metadata
  - Manage leader election (if there are multiple replicas of the operator) using Kubernetes leases
- Kubernetes operator collecting and storing usage report data in S3.
You can configure the Helm chart by either updating the values.yaml file or by providing parameters directly during the helm install command. Any values passed as parameters during installation override the settings in the values.yaml file.
| Parameter | Description | Default Value | Required | Notes |
|---|---|---|---|---|
| replicaCount | Number of operator replicas to run | 2 | No | |
| namespace | Namespace where the operator will be installed | "hyperpod-usage-report" | No | Can be modified to deploy in a different namespace |
| serviceAccount.name | Name of the service account | "hyperpod-usage-report" | No | Can be modified if using custom naming |
| s3BucketName | Name of the S3 bucket that was created from the CloudFormation template | | Yes | Operator will start storing the usage report data in this bucket. |
| clusterName | Name of the EKS Cluster | | Yes | |
| region | Specify the AWS region | | Yes | Example: us-west-2 |
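As an alternative to passing several `--set` flags, the overrides can be kept in a separate values file and supplied with Helm's `-f` option. The sketch below writes such a file once USAGE_REPORT_S3_BUCKET has been retrieved. The function name and the file name `usage-report-overrides.yaml` are hypothetical, the keys are taken from the table above, and the nesting of `serviceAccount.name` is an assumption inferred from its dotted name.

```shell
# Hypothetical helper; writes a Helm values override file to the path in $1.
# Assumes AWS_REGION, HYPERPOD_CLUSTER_NAME, USAGE_REPORT_S3_BUCKET, and
# USAGE_REPORT_OPERATOR_NAME are already set in the environment.
write_values_override() {
    cat > "$1" <<EOF
region: ${AWS_REGION}
clusterName: ${HYPERPOD_CLUSTER_NAME}
s3BucketName: ${USAGE_REPORT_S3_BUCKET}
serviceAccount:
  name: ${USAGE_REPORT_OPERATOR_NAME}
EOF
}
```

After `write_values_override usage-report-overrides.yaml`, the file could be passed as `helm install $USAGE_REPORT_OPERATOR_NAME ./SageMakerHyperPodUsageReportChart -n $USAGE_REPORT_OPERATOR_NAME -f usage-report-overrides.yaml`. Values given with `--set` still take precedence over values from a file.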
To install the Helm chart, run the following command:
cd helm_chart
# retrieve s3 bucket name
USAGE_REPORT_S3_BUCKET=$(aws cloudformation describe-stack-resources \
--stack-name $USAGE_REPORT_OPERATOR_NAME \
--query 'StackResources[?ResourceType==`AWS::S3::Bucket`].PhysicalResourceId' \
--output text)
# verification
echo $USAGE_REPORT_S3_BUCKET
helm install $USAGE_REPORT_OPERATOR_NAME \
./SageMakerHyperPodUsageReportChart \
-n $USAGE_REPORT_OPERATOR_NAME \
--set region=$AWS_REGION \
--set serviceAccount.name=$USAGE_REPORT_OPERATOR_NAME \
--set clusterName=$HYPERPOD_CLUSTER_NAME \
--set s3BucketName=$USAGE_REPORT_S3_BUCKET
Verify the operator installation:
kubectl get pods -n $USAGE_REPORT_OPERATOR_NAME
You can start submitting jobs to the cluster. Raw job usage data is stored in the S3 bucket path $USAGE_REPORT_S3_BUCKET/raw/.
Notes
- Before installing the operator through the Helm chart, make sure the HyperPod usage report CloudFormation stack creation has completed.
- A pre-existing namespace $USAGE_REPORT_OPERATOR_NAME is required to install the Helm chart (check with kubectl get namespaces). If you don't have it yet, refer to the prerequisites to create the namespace.
- When uninstalling the $USAGE_REPORT_OPERATOR_NAME Helm chart, the associated namespace is automatically deleted, which invalidates the RBAC permissions. You must restore the namespace-level RBAC configurations previously set in the cluster by re-running the steps in the prerequisites section.
You can use the run.py script to extract and export usage metrics for your SageMaker HyperPod cluster.
cd sagemaker-hyperpod-usage-report/report_generation
pip install -e .
# retrieve Athena database name
USAGE_REPORT_DATABASE=$(aws cloudformation describe-stack-resources \
--stack-name $USAGE_REPORT_OPERATOR_NAME \
--query 'StackResources[?ResourceType==`AWS::Glue::Database`].PhysicalResourceId' \
--output text)
DATABASE_WORKGROUP_NAME=$(aws cloudformation describe-stack-resources \
--stack-name $USAGE_REPORT_OPERATOR_NAME \
--query 'StackResources[?ResourceType==`AWS::Athena::WorkGroup`].PhysicalResourceId' \
--output text)
# verification
echo $USAGE_REPORT_DATABASE
echo $DATABASE_WORKGROUP_NAME
To generate a usage report and export it to a specified S3 bucket, provide the following parameters to the run.py Python script:
| Parameter | Description | Example Value | Required |
|---|---|---|---|
| --start-date | Beginning date for report data | 2025-04-15 | Yes |
| --end-date | Ending date for report data | 2025-04-17 | Yes |
| --format | Output format of the report | csv or pdf | Yes |
| --database-name | Name of the database to query | usage_report | Yes |
| --database-workgroup-name | Name of Athena's workgroup | usage_report_workgroup | Yes |
| --type | Type of report to generate | detailed or summary | Yes |
| --output-report-location | Directory where report will be saved | s3://bucket-name/path | Yes |
| --cluster-name | Name of the HyperPod cluster | my-hyperpod-cluster | Yes |
Note:
- Select a date range that falls within the previous 180 days from the current date (unless you customized DataRententionDays when installing the CloudFormation stack).
- A good practice is to create a separate folder in your S3 bucket to serve as the destination for generated usage reports.
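A small guard can catch malformed or reversed dates before run.py is invoked. This is a minimal sketch with hypothetical function names; it checks only the YYYY-MM-DD shape and the ordering of the two dates, not calendar validity or the retention window.

```shell
# Hypothetical helpers; not part of the repository.

# Return 0 if the argument has the shape YYYY-MM-DD (no calendar check).
is_iso_date() {
    case "$1" in
        [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

# Return 0 if both dates are well-formed and the start is not after the end.
is_valid_range() {
    is_iso_date "$1" && is_iso_date "$2" || return 1
    # Strip the dashes so the dates compare as plain integers (YYYYMMDD).
    start=$(printf '%s' "$1" | tr -d '-')
    end=$(printf '%s' "$2" | tr -d '-')
    [ "$start" -le "$end" ]
}
```

For instance, `is_valid_range 2025-04-15 2025-04-17` succeeds, while swapping the two dates makes it fail.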
Use the following command to generate and export the report:
python run.py \
--start-date <Start date of the report, e.g. 2025-04-22> \
--end-date <End date of the report, e.g. 2025-04-22> \
--format <csv or pdf> \
--database-name $USAGE_REPORT_DATABASE \
--database-workgroup-name $DATABASE_WORKGROUP_NAME \
--type <detailed or summary> \
--output-report-location s3://$USAGE_REPORT_S3_BUCKET/<usage report output folder> \
--cluster-name $HYPERPOD_CLUSTER_NAME
Note
- Ensure that the S3 bucket specified in --output-report-location has the necessary permissions to accept the report files.
- The cluster-name should match the name of your SageMaker HyperPod cluster.
- You can find all original captured data in the raw directory of your S3 bucket ($USAGE_REPORT_S3_BUCKET/raw) or in the Athena console.
The output file follows the naming convention: <report-type>-report-<start-date>-<end-date>.<format>.
For example, a summary report for the dates April 15, 2025, to April 17, 2025, in CSV format will be named summary-report-2025-04-15-2025-04-17.csv and will be located in the specified output directory --output-report-location of your S3 bucket.
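The naming convention can be reproduced in a script, for example to locate a report after it has been generated. The sketch below is illustrative only; the function names `report_file_name` and `report_s3_uri` are hypothetical and simply follow the pattern described above.

```shell
# Hypothetical helpers; not part of the repository.

# Build <report-type>-report-<start-date>-<end-date>.<format>
# Arguments: type, start date, end date, format.
report_file_name() {
    printf '%s-report-%s-%s.%s\n' "$1" "$2" "$3" "$4"
}

# Join an S3 prefix with the file name, tolerating a trailing slash.
# Arguments: S3 prefix, then the same four arguments as report_file_name.
report_s3_uri() {
    prefix=${1%/}
    shift
    printf '%s/%s\n' "$prefix" "$(report_file_name "$@")"
}
```

With these helpers, `aws s3 ls "$(report_s3_uri s3://$USAGE_REPORT_S3_BUCKET/<usage report output folder> summary 2025-04-15 2025-04-17 csv)"` would check whether the summary report from the example exists.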
When you no longer need your SageMaker HyperPod usage reporting infrastructure, follow these steps to clean up Kubernetes and AWS resources (in that order). Proper resource deletion helps prevent unnecessary costs.
To uninstall the Helm chart, run the following command:
cd sagemaker-hyperpod-usage-report/helm_chart
helm uninstall $USAGE_REPORT_OPERATOR_NAME --namespace $USAGE_REPORT_OPERATOR_NAME
Ensure that you uninstalled the SageMaker HyperPod usage report Kubernetes operator:
kubectl get pods --namespace $USAGE_REPORT_OPERATOR_NAME
To delete the CloudFormation stack and the resources it created, run the following command:
aws cloudformation delete-stack --region $AWS_REGION --stack-name $USAGE_REPORT_OPERATOR_NAME
Ensure that the stack is properly deleted:
aws cloudformation describe-stacks --stack-name $USAGE_REPORT_OPERATOR_NAME \
--region $AWS_REGION --query 'Stacks[0].StackStatus' --output text
Note: To prevent accidental deletion, the S3 buckets created by the CloudFormation stack must be deleted manually:
$USAGE_REPORT_S3_BUCKET
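Because bucket deletion is irreversible, a guard that demands explicit confirmation may be worthwhile. The following is a minimal sketch under that assumption; the function name is hypothetical, and `aws s3 rb --force` is the standard AWS CLI command that empties and then removes a bucket.

```shell
# Hypothetical guard around bucket deletion; not part of the repository.
# Usage: delete_usage_report_bucket <bucket-name> yes
delete_usage_report_bucket() {
    bucket=$1
    confirm=${2:-no}
    if [ -z "$bucket" ]; then
        echo "bucket name is empty; nothing deleted" >&2
        return 1
    fi
    if [ "$confirm" != "yes" ]; then
        echo "pass 'yes' as the second argument to delete s3://$bucket" >&2
        return 1
    fi
    # Removes all current objects, then the bucket itself.
    aws s3 rb "s3://$bucket" --force
}
```

Calling `delete_usage_report_bucket "$USAGE_REPORT_S3_BUCKET" yes` performs the deletion; any other invocation returns without touching AWS. If versioning is enabled on the bucket, earlier object versions have to be removed separately before the bucket can be deleted.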
To run the unit tests locally:
cd report_generation
pytest
This will execute all test cases in the test directory. The test suite includes unit tests for all major components of the usage report functionality.
See ./attributions for credits.
See CONTRIBUTING for more information.
This project is licensed under the Apache-2.0 License.