Slurm nodes installation under Ubuntu16.04 #2
jianlianggao wants to merge 9 commits into BioMedIA:master
The files can be used for installing Slurm nodes with Ubuntu 16.04. The Slurm master node is biomedia03
jopasserat left a comment
Couple of changes, most critical are:
- use Jinja templates
- use Pillar
files/etc/slurm-llnl/slurm.conf
Outdated
-{% set rocsList = [] %}
-{% for node, values in pillar['slurm']['nodes']['batch']['cpus'].items() %} {% if node.startswith('roc') %} {% set rocsListTrash = rocsList.append(node) %} {% endif %} {% endfor %}
+NodeName=biomedia01 RealMemory=64000 CPUs=24 State=UNKNOWN
this should be part of the Jinja template
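A minimal sketch of what generating the `NodeName` lines inside the template could look like. The inline `nodes` mapping is a hypothetical stand-in for Pillar data, and the key names `mem`, `cpus` and `gres` are assumptions; only the node names and their values come from this PR.

```jinja
{# Hypothetical stand-in for Pillar data; the keys 'mem', 'cpus' and 'gres' are assumptions. #}
{% set nodes = {
    'biomedia01': {'mem': 64000, 'cpus': 24},
    'roc11': {'mem': 257869, 'cpus': 32},
    'monal01': {'mem': 80000, 'cpus': 12, 'gres': 'gpu:8'}
} %}
{# Emit one NodeName line per node, sorted by name so the rendered file is stable. #}
{% for name, spec in nodes | dictsort %}
NodeName={{ name }} RealMemory={{ spec.mem }} CPUs={{ spec.cpus }}{% if spec.gres is defined %} Gres={{ spec.gres }}{% endif %} State=UNKNOWN
{% endfor %}
```

In the formula itself the mapping would presumably be read from `pillar['slurm']` rather than set inline, so that adding a node only requires a Pillar change.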
files/etc/slurm-llnl/slurm.conf
Outdated
+NodeName=monal01 RealMemory=80000 CPUs=12 Gres=gpu:8 State=UNKNOWN
+
+# Partitions
+PartitionName=long Nodes=biomedia01,biomedia02,biomedia05,biomedia06,biomedia07,biomedia08,biomedia09,biomedia10,roc01,roc02,roc03 Default=YES MaxTime=43200
All roc machines should be in long as well. It's fine to have two overlapping partitions.
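One way overlapping partitions could be expressed in the template, as a sketch only. It reuses the Pillar key and the `rocsList` idiom from the original template, and it assumes that `pillar['slurm']['nodes']['batch']['cpus']` lists every batch node (biomedia and roc alike); `batchNodes` and `ignored` are illustrative names.

```jinja
{# Assumption: this Pillar mapping is keyed by the names of all batch nodes. #}
{% set batchNodes = pillar['slurm']['nodes']['batch']['cpus'].keys() | sort %}
{# Collect the roc subset; append() mutates the list, its return value is discarded. #}
{% set rocsList = [] %}
{% for node in batchNodes %}{% if node.startswith('roc') %}{% set ignored = rocsList.append(node) %}{% endif %}{% endfor %}
{# 'long' covers every batch node, so the roc partitions overlap with it. #}
PartitionName=long Nodes={{ batchNodes | join(',') }} Default=YES MaxTime=43200
PartitionName=rocsLong Nodes={{ rocsList | join(',') }} Default=NO MaxTime=43200
PartitionName=rocsShort Nodes={{ rocsList | join(',') }} Default=NO MaxTime=60 Priority=5000
```

Deriving both node lists from the same Pillar mapping means a new roc machine lands in `long` and in the roc partitions without editing the template.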
files/etc/slurm-llnl/slurmdbd.conf
 #AuthInfo=/var/run/munge/munge.socket.2
 AuthType=auth/munge
-DbdHost={{ pillar['slurm']['controller'] }}
+DbdHost=biomedia03
no hardcoded values please => use Pillar
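For reference, a sketch of the Pillar-based form, which is simply the lookup the formula used before this change. It assumes the states managing these files render them as Jinja templates and that the Pillar defines `slurm:controller` (presumably `biomedia03`, going by the PR description). The two lines belong to two different files and are shown together only for comparison.

```jinja
{# files/etc/slurm-llnl/slurmdbd.conf #}
DbdHost={{ pillar['slurm']['controller'] }}

{# files/etc/slurm-llnl/slurm.conf #}
ControlMachine={{ pillar['slurm']['controller'] }}
```

With the host name kept in Pillar, the same formula can provision another cluster by changing one Pillar value.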
init.sls
-{% endif %}
+install slurm packages from local repo:
+  pkg.installed:
+    - sources:
nice 👍
files/etc/security/access.conf
 # disable SSH for anybody but root
 +:root:ALL
--:ALL EXCEPT (csg) dr jpassera bglocker:ALL
+-:ALL EXCEPT (csg) (biomedia) dr jpassera bglocker jgao:ALL
regular users shouldn't have SSH access to the cluster nodes, hence the previous config
files/etc/slurm-llnl/cgroup.conf
+ConstrainSwapSpace=yes
+AllowedSwapSpace=10.0
+# Not well supported until Slurm v14.11.4 https://groups.google.com/d/msg/slurm-devel/oKAUed7AETs/Eb6thh9Lc0YJ
+#ConstrainDevices=yes
Should that be enabled and not commented out then?
files/etc/slurm-llnl/slurm.conf
Outdated
 #
 # Workaround because Slurm does not recognize full hostname...
-ControlMachine={{ pillar['slurm']['controller'] }}
+ControlMachine=biomedia03
Pillar
tools/add_users_to_slurm_groups.sh
@@ -1,3 +0,0 @@
-#!/bin/bash
This script is useful to add new members. It should stay there.
tools/install_salt-minion.sh
@@ -1,13 +0,0 @@
-#!/bin/bash
Script used to bootstrap new minions. Any reason to remove it?
README.md
- * Slurm
- * Slurm Database
- * SSH
+To install Slurm nodes, you need to copy (on Slurm mater node)
Maybe move that lower in an "Instructions" subsection instead of replacing what the formula contains?
Hi Jonathan,
Thank you very much for the comments. I will make the changes and test them on the predicts cluster.
Best wishes,
Jianliang
________________________________
From: jopasserat <notifications@github.com>
Sent: 30 April 2018 17:02:34
To: BioMedIA/slurm-formula
Cc: Gao, Jianliang; Author
Subject: Re: [BioMedIA/slurm-formula] Slurm nodes installation under Ubuntu16.04 (#2)
@jopasserat requested changes on this pull request.
Couple of changes, most critical are:
* use Jinja templates
* use Pillar
________________________________
In files/etc/slurm-llnl/slurm.conf:
-{% set rocsList = [] %}
-{% for node, values in pillar['slurm']['nodes']['batch']['cpus'].items() %} {% if node.startswith('roc') %} {% set rocsListTrash = rocsList.append(node) %} {% endif %} {% endfor %}
+NodeName=biomedia01 RealMemory=64000 CPUs=24 State=UNKNOWN
+
this should be part of the Jinja template
________________________________
In files/etc/slurm-llnl/slurm.conf:
+NodeName=roc11 RealMemory=257869 CPUs=32 State=UNKNOWN
+
+NodeName=roc12 RealMemory=257869 CPUs=32 State=UNKNOWN
+
+NodeName=roc13 RealMemory=257869 CPUs=32 State=UNKNOWN
+
+NodeName=roc14 RealMemory=257869 CPUs=32 State=UNKNOWN
+
+NodeName=roc15 RealMemory=257869 CPUs=32 State=UNKNOWN
+
+NodeName=roc16 RealMemory=257869 CPUs=32 State=UNKNOWN
+
+NodeName=monal01 RealMemory=80000 CPUs=12 Gres=gpu:8 State=UNKNOWN
+
+# Partitions
+PartitionName=long Nodes=biomedia01,biomedia02,biomedia05,biomedia06,biomedia07,biomedia08,biomedia09,biomedia10,roc01,roc02,roc03 Default=YES MaxTime=43200
all roc machines should be in long as well. it's fine to have two partitions overlapping
________________________________
In files/etc/slurm-llnl/slurmdbd.conf:
@@ -6,7 +6,7 @@ ArchiveSuspend=no
#ArchiveScript=/usr/sbin/slurm.dbd.archive
#AuthInfo=/var/run/munge/munge.socket.2
AuthType=auth/munge
-DbdHost={{ pillar['slurm']['controller'] }}
+DbdHost=biomedia03
no hardcoded values please => use Pillar
________________________________
In init.sls:
-/var/log/slurm-llnl/sched.log:
-  file.managed:
-    - group: slurm
-    - user: slurm
-    - require:
-      - user: slurm
-{% endif %}
+install slurm packages from local repo:
+  pkg.installed:
+    - sources:
nice 👍
________________________________
In files/etc/security/access.conf:
@@ -9,4 +9,4 @@
# who wants to be able to SSH in as root via public-key on Biomedia servers.
# disable SSH for anybody but root
+:root:ALL
--:ALL EXCEPT (csg) dr jpassera bglocker:ALL
+-:ALL EXCEPT (csg) (biomedia) dr jpassera bglocker jgao:ALL
regular users shouldn't have SSH access to the cluster nodes, hence the previous config
________________________________
In files/etc/slurm-llnl/cgroup.conf:
@@ -16,7 +16,8 @@ CgroupReleaseAgentDir=/var/spool/slurm-llnl/cgroup
ConstrainCores=yes
TaskAffinity=yes
#ConstrainRAMSpace=no
-### not used yet
-#ConstrainDevices=no
-#AllowedDevicesFile=/etc/slurm-llnl/cgroup_allowed_devices_file.conf
-
+ConstrainSwapSpace=yes
+AllowedSwapSpace=10.0
+# Not well supported until Slurm v14.11.4 https://groups.google.com/d/msg/slurm-devel/oKAUed7AETs/Eb6thh9Lc0YJ
+#ConstrainDevices=yes
Should that be enabled and not commented out then?
________________________________
In files/etc/slurm-llnl/slurm.conf:
@@ -3,7 +3,7 @@
# See the slurm.conf man page for more information.
#
# Workaround because Slurm does not recognize full hostname...
-ControlMachine={{ pillar['slurm']['controller'] }}
+ControlMachine=biomedia03
Pillar
________________________________
In tools/add_users_to_slurm_groups.sh:
@@ -1,3 +0,0 @@
-#!/bin/bash
This script is useful to add new members. it should stay there
________________________________
In tools/install_salt-minion.sh:
@@ -1,13 +0,0 @@
-#!/bin/bash
Script used to bootstrap new minions. Any reason to remove it?
________________________________
In README.md:
@@ -3,9 +3,9 @@
Salt formula provisioning a Slurm cluster
…-Availables states:
- * Munge
- * Screen
- * Slurm
- * Slurm Database
- * SSH
+To install Slurm nodes, you need to copy (on Slurm mater node)
Maybe move that lower in an "Instructions" subsection instead of replacing what the formula contains?
files/etc/slurm-llnl/slurm.conf
Outdated
-PartitionName=rocsShort Nodes={{ ','.join(rocsList) }} Default=NO MaxTime=60 Priority=5000
+#PartitionName=long Nodes=biomedia01,biomedia02,biomedia03,biomedia05 Default=YES MaxTime=43200
+#PartitionName=short Nodes=biomedia01,biomedia03,biomedia05 Default=NO MaxTime=60 Priority=5000
+PartitionName=gpus Nodes=monal01 Default=NO MaxTime=10080 MaxCPUsPerNode=4 MaxMemPerNode=30720
You can remove the MaxCPUsPerNode=4 MaxMemPerNode=30720 settings. Just legacy code here
Hi Jonathan,
OK, thank you very much. I haven't made the modifications yet. I'm currently testing a shell script that submits docker run jobs via Slurm.
Best wishes,
Jianliang
-------- Original Message --------
Subject: Re: [BioMedIA/slurm-formula] Slurm nodes installation under Ubuntu16.04 (#2)
From: jopasserat
To: BioMedIA/slurm-formula
CC: "Gao, Jianliang", Author
@jopasserat commented on this pull request.
________________________________
In files/etc/slurm-llnl/slurm.conf:
-PartitionName=rocsLong Nodes={{ ','.join(rocsList) }} Default=NO MaxTime=43200
-PartitionName=rocsShort Nodes={{ ','.join(rocsList) }} Default=NO MaxTime=60 Priority=5000
+#PartitionName=long Nodes=biomedia01,biomedia02,biomedia03,biomedia05 Default=YES MaxTime=43200
+#PartitionName=short Nodes=biomedia01,biomedia03,biomedia05 Default=NO MaxTime=60 Priority=5000
+PartitionName=gpus Nodes=monal01 Default=NO MaxTime=10080 MaxCPUsPerNode=4 MaxMemPerNode=30720
You can remove the MaxCPUsPerNode=4 MaxMemPerNode=30720 settings. Just legacy code here
Updated slurm.conf & pillar.example using jinja template
-StoragePass={{ pillar['slurm']['db']['password'] }}
 StorageLoc=slurmdb
 StorageUser=slurm
+StoragePass=1BUy4eVv7X
\ No newline at end of file
password in cleartext in the commit history...
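A sketch of the template lines with the password read from Pillar again, using the same lookup the removed line contained. It assumes the file is rendered as a Jinja template and that the Pillar provides `slurm:db:password`.

```jinja
{# The secret lives in Pillar data, not in the formula repository. #}
StorageLoc=slurmdb
StorageUser=slurm
StoragePass={{ pillar['slurm']['db']['password'] }}
```

Because the cleartext value remains in the commit history, restoring the lookup alone would likely not be enough; the password itself would probably need to be changed as well.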