[Diagnostics] Add diagnostics suite with first tool to diagnose SLURM accounting setup #7336

Open
gmarciani wants to merge 3 commits into aws:develop from gmarciani:wip/mgiacomo/diagnostics-0414-1

Conversation

gmarciani (Contributor) commented Apr 14, 2026

Description of changes

Add diagnostics suite with first tool to diagnose SLURM accounting setup.

Note
The bad-url-suffix-checker can be skipped for this PR: it flags a comment whose example contains the domain amazonaws.com, which is harmless.

User Experience

The user uploads the diagnostics suite to the head node with a single deployment script.
The script prints the SSH command that logs directly into the diagnostics folder on the head node.

➜  bash util/diagnostics/deploy.sh --cluster-name accnt-3150-11-2 --region us-east-1 --ssh-key ~/.ssh/pem_keys/mgiacomo/mgiacomo.pem
[INFO] Retrieving head node connection info for cluster 'accnt-3150-11-2' in region 'us-east-1'...
[INFO] Head node IP: 44.195.87.177
[INFO] Default user: ec2-user
[INFO] Uploading /Volumes/workplace/aws-parallelcluster-dev/aws-parallelcluster/util/diagnostics to ec2-user@44.195.87.177:~/
... OMITTED OUTPUT ...
[INFO] Done. Files uploaded to /home/ec2-user/diagnostics/
[INFO] Installing requirements on head node...
... OMITTED OUTPUT ...
[INFO] Requirements installed successfully.
[INFO] Next steps: log into the head node and run the diagnostics scripts from ~/diagnostics/
[INFO]   ssh -i /Users/mgiacomo/.ssh/pem_keys/mgiacomo/mgiacomo.pem ec2-user@44.195.87.177 -t 'cd ~/diagnostics && bash -l'

The user then logs into the head node, landing directly in the diagnostics folder:

➜  ssh -i /Users/mgiacomo/.ssh/pem_keys/mgiacomo/mgiacomo.pem ec2-user@44.195.87.177 -t 'cd ~/diagnostics && bash -l'

This is the help output of the first diagnostic tool, which covers SLURM accounting:

[ec2-user@ip-27-6-37-106 diagnostics]$ ./diagnose-slurm-accounting.py --help
Usage: diagnose-slurm-accounting.py [OPTIONS]

  Diagnose SLURM accounting setup.

Options:
  --db-endpoint TEXT  Database endpoint. If not specified, determined from the
                      cluster configuration in S3.
  --db-port INTEGER   Database port. If not specified, determined from the
                      cluster configuration in S3.
  --db-user TEXT      Database user. If not specified, determined from the
                      cluster configuration in S3.
  --secret-arn TEXT   Secret ARN for the database password. If not specified,
                      determined from the cluster configuration in S3.
  --region TEXT       AWS region. If not specified, determined from the local
                      /etc/chef/dna.json file.
  -h, --help          Show this message and exit.
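
The help text above says that any option left unspecified is derived from the cluster configuration (or, for the region, from /etc/chef/dna.json). A minimal sketch of that fallback follows. It is not the PR's code: the names `resolve_settings`, `region_from_dna`, and `load_json` are hypothetical, the configuration is assumed to be already downloaded and parsed into a dict with the ParallelCluster `Scheduling/SlurmSettings/Database` layout (`Uri` as `host:port`, `UserName`, `PasswordSecretArn`), the region is assumed to sit under the `cluster` key of dna.json, and the default port 3306 is an assumption.

```python
import json
from typing import Optional

DEFAULT_MYSQL_PORT = 3306  # assumption: fallback when the Uri carries no port
DNA_JSON_PATH = "/etc/chef/dna.json"  # path named in the --region help text


def load_json(path: str) -> dict:
    """Load a JSON file such as dna.json into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def region_from_dna(dna: dict) -> Optional[str]:
    """Read the region; the 'cluster' -> 'region' layout is an assumption."""
    return dna.get("cluster", {}).get("region")


def resolve_settings(
    config: dict,
    dna: dict,
    db_endpoint: Optional[str] = None,
    db_port: Optional[int] = None,
    db_user: Optional[str] = None,
    secret_arn: Optional[str] = None,
    region: Optional[str] = None,
) -> dict:
    """Prefer explicit option values; fall back to config and dna.json."""
    database = config.get("Scheduling", {}).get("SlurmSettings", {}).get("Database", {})
    # Split "host:port"; port_text is empty when the Uri has no port.
    host, _, port_text = database.get("Uri", "").partition(":")
    if db_port is None:
        db_port = int(port_text) if port_text else DEFAULT_MYSQL_PORT
    return {
        "db_endpoint": db_endpoint if db_endpoint is not None else (host or None),
        "db_port": db_port,
        "db_user": db_user if db_user is not None else database.get("UserName"),
        "secret_arn": secret_arn if secret_arn is not None else database.get("PasswordSecretArn"),
        "region": region if region is not None else region_from_dna(dna),
    }
```

Under these assumptions, `resolve_settings(config, load_json(DNA_JSON_PATH), db_port=3307)` would keep the explicit port and fill every other value from the two files.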

This is an example run of the SLURM accounting diagnosis:

[ec2-user@ip-27-6-37-106 diagnostics]$ ./diagnose-slurm-accounting.py
2026-04-14 21:53:25,249 INFO: Some arguments are missing. Attempting to determine values automatically...
/home/ec2-user/.local/lib/python3.9/site-packages/boto3/compat.py:89: PythonDeprecationWarning: Boto3 will no longer support Python 3.9 starting April 29, 2026. To continue receiving service updates, bug fixes, and security updates please upgrade to Python 3.10 or later. More information can be found here: https://aws.amazon.com/blogs/developer/python-support-policy-updates-for-aws-sdks-and-tools/
  warnings.warn(warning, PythonDeprecationWarning)
2026-04-14 21:53:25,270 INFO: Found credentials from IAM Role: accnt-3150-11-2-RoleHeadNode-3nZd4yHCV7QF
[✓] Downloaded cluster configuration from S3
2026-04-14 21:53:25,458 INFO: Database Endpoint: slurm-accounting-cluster-11.cluster-c1yheob1ikdf.us-east-1.rds.amazonaws.com
2026-04-14 21:53:25,458 INFO: Database Port: 3306
2026-04-14 21:53:25,458 INFO: Database User: clusteradmin
2026-04-14 21:53:25,458 INFO: Secret ARN: arn:aws:secretsmanager:us-east-1:319414405305:secret:AccountingClusterAdminSecre-mo0xsZT8XRA3-zIQDCe
2026-04-14 21:53:25,459 INFO: Region: us-east-1
[✓] Database endpoint reachability check
[✓] Database endpoint matches configuration
2026-04-14 21:53:25,537 INFO: Found credentials from IAM Role: accnt-3150-11-2-RoleHeadNode-3nZd4yHCV7QF
[✓] Secret is plain text password
[✓] Database user matches configuration
[✓] Database password matches secret
[✓] MySQL connection test
[✓] User clusteradmin has correct MySQL permissions
2026-04-14 21:53:25,889 INFO: Grants for user 'clusteradmin':
    GRANT USAGE ON *.* TO `clusteradmin`@`%`
    GRANT `rds_superuser_role`@`%` TO `clusteradmin`@`%`
[✓] No errors related to MySQL in slurmdbd logs
2026-04-14 21:53:25,954 INFO: All checks completed!
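
Three of the checks in the run above (endpoint reachability, secret format, and the slurmdbd log scan) can be approximated with the standard library alone. The sketch below is illustrative and not taken from the PR: every function name is hypothetical, the `[✗]` failure marker is an assumption (the output above only shows `[✓]`), "plain text" is assumed to mean "does not parse as a JSON object or array", and a log line is assumed to be MySQL-related when it mentions both "error" and "mysql".

```python
import json
import socket
from typing import Iterable, List


def is_endpoint_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def is_plain_text_secret(secret_string: str) -> bool:
    """Return True unless the secret value parses as a JSON object or array."""
    try:
        parsed = json.loads(secret_string)
    except ValueError:
        return True  # not JSON at all, so treat it as a plain password
    return not isinstance(parsed, (dict, list))


def find_mysql_errors(log_lines: Iterable[str]) -> List[str]:
    """Return log lines that mention both 'error' and 'mysql' (case-insensitive)."""
    matches = []
    for line in log_lines:
        lowered = line.lower()
        if "error" in lowered and "mysql" in lowered:
            matches.append(line.rstrip("\n"))
    return matches


def report(name: str, passed: bool) -> str:
    """Print and return one result line in the style of the output above."""
    line = f"[{'✓' if passed else '✗'}] {name}"
    print(line)
    return line
```

A caller could then write, for example, `report("Database endpoint reachability check", is_endpoint_reachable(endpoint, 3306))`, where `endpoint` is the resolved database endpoint.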

Tests

  • See the user experience above, which was tested on a cluster with SLURM accounting enabled.

References

  • Link to impacted open issues.
  • Link to related PRs in other packages (i.e. cookbook, node).
  • Link to documentation useful to understand the changes.

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

gmarciani changed the title from "Wip/mgiacomo/diagnostics 0414 1" to "[Diagnostics] Add diagnostics suite with first tool to diagnose SLURM accounting setup" on Apr 14, 2026
gmarciani added the skip-changelog-update (Disables the check that enforces changelog updates in PRs) and 3.x labels on Apr 14, 2026
gmarciani force-pushed the wip/mgiacomo/diagnostics-0414-1 branch from a757e50 to 36caf02 on April 14, 2026 22:10
gmarciani added the skip-bad-url-suffix-check (Skip the checks regarding the bad URL suffix) label on Apr 14, 2026
gmarciani marked this pull request as ready for review on April 14, 2026 22:12
gmarciani requested review from a team as code owners on April 14, 2026 22:12
gmarciani force-pushed the wip/mgiacomo/diagnostics-0414-1 branch from 36caf02 to d7af7de on April 14, 2026 22:14
gmarciani force-pushed the wip/mgiacomo/diagnostics-0414-1 branch from d7af7de to 3ac826a on April 14, 2026 22:19

Labels

  • 3.x
  • skip-bad-url-suffix-check: Skip the checks regarding the bad URL suffix
  • skip-changelog-update: Disables the check that enforces changelog updates in PRs


2 participants