Commit 15b97d7

job preparation sequence diagram + some fixes
1 parent f06f4da commit 15b97d7

4 files changed, +251 -22 lines changed

docs/architecture/architecture.md

+142
@@ -0,0 +1,142 @@
1+
# Architecture
2+
3+
## Current architecture - Overview
4+
5+
```mermaid
6+
flowchart LR
7+
8+
subgraph UL["User laptop"]
9+
HPCSC["HPCS Client"]
10+
ULSPA["Spire Agent"]
11+
end
12+
13+
subgraph SS["Supercomputing Site"]
14+
subgraph CN["Compute node"]
15+
CNSPA["Spire Agent"]
16+
SBATCH["Sbatch"]
17+
end
18+
LN["Login node"]
19+
end
20+
21+
subgraph UN["Utility node"]
22+
subgraph k8s["Kubernetes cluster"]
23+
SPS["Spire Server"]
24+
HPCSS["HPCS Server"]
25+
SPA["Spire Agent"]
26+
Vault
27+
end
28+
end
29+
30+
UL <--"SSH"--> LN
31+
LN <--"Scheduling"--> CN
32+
UL <--"HTTPS (HPCS), HTTPS (Vault), TCP (Spire)"--> UN
33+
CN <--"HTTPS (HPCS), HTTPS (Vault), TCP (Spire)"--> UN
34+
```
35+
36+
## Current architecture - In depth
37+
38+
```mermaid
39+
flowchart LR
40+
41+
subgraph UL["User laptop"]
42+
43+
subgraph HPCSCDP["Data preparation"]
44+
HPCSCDPB["HPCS Client"]
45+
SPADP["Spire Agent"]
46+
end
47+
48+
subgraph HPCSCCP["Container preparation"]
49+
HPCSCCPB["HPCS Client"]
50+
SPACP["Spire Agent"]
51+
end
52+
subgraph HPCSCJP["HPCS Client - Job preparation"]
53+
HPCSCJPB["HPCS Client"]
54+
end
55+
end
56+
57+
subgraph SS["Supercomputing Site"]
58+
SC["Slurm Controller"]
59+
LN["Login nodes"]
60+
subgraph PCPU["CPU Partition"]
61+
subgraph CN1["Compute node 1"]
62+
CN1SBATCH["Sbatch"]
63+
CN1SA["Spire Agent"]
64+
end
65+
subgraph CN2["Compute node 2"]
66+
CN2SBATCH["Sbatch"]
67+
CN2SA["Spire Agent"]
68+
end
69+
end
70+
subgraph PGPU["GPU Partition"]
71+
subgraph CN3["Compute node 3"]
72+
CN3SBATCH["Sbatch"]
73+
CN3SA["Spire Agent"]
74+
end
75+
subgraph CN4["Compute node 4"]
76+
CN4SBATCH["Sbatch"]
77+
CN4SA["Spire Agent"]
78+
end
79+
end
80+
end
81+
82+
subgraph UN["Utility node"]
83+
subgraph k8s["Kubernetes cluster"]
84+
subgraph HPCSP["HPCS Pod"]
85+
SPS["Spire Server"]
86+
subgraph HPCSSC["HPCS Server Container"]
87+
HPCSS["HPCS Server"]
88+
SPA["Spire Agent"]
89+
end
90+
SPO["Spire OIDC"]
91+
NI["Nginx Ingress"]
92+
end
93+
Vault
94+
end
95+
end
96+
97+
SPS <--"UNIX Socket"--> SPO
98+
SPO <--"UNIX Socket"--> NI
99+
100+
HPCSS <--"CLI + UNIX Socket"--> SPS
101+
HPCSS <--"PYSPIFFE (UNIX SOCKET)"--> SPA
102+
103+
SPA <--TCP--> SPS
104+
105+
Vault <--"HTTPS"--> NI
106+
Vault <--"HTTPS (mTLS)"--> HPCSS
107+
108+
LN <--"CLI"--> SC
109+
110+
SC <--"Scheduling"--> PCPU
111+
SC <--"Scheduling"--> PGPU
112+
113+
SPADP <--"TCP"--> SPS
114+
SPACP <--"TCP"--> SPS
115+
116+
HPCSCDPB <--"HTTPS (mTLS)"--> HPCSS
117+
HPCSCCPB <--"HTTPS (mTLS)"--> HPCSS
118+
119+
HPCSCDPB <--"HTTPS"--> Vault
120+
HPCSCCPB <--"HTTPS"--> Vault
121+
122+
HPCSCCPB <--"CLI/Lib + UNIX Socket"--> SPACP
123+
HPCSCDPB <--"CLI/Lib + UNIX Socket"--> SPADP
124+
125+
CN1SA <--"TCP"--> SPS
126+
CN2SA <--"TCP"--> SPS
127+
CN3SA <--"TCP"--> SPS
128+
CN4SA <--"TCP"--> SPS
129+
130+
CN1SBATCH <--"HTTPS"--> Vault
131+
CN2SBATCH <--"HTTPS"--> Vault
132+
CN3SBATCH <--"HTTPS"--> Vault
133+
CN4SBATCH <--"HTTPS"--> Vault
134+
135+
HPCSCDPB <--"SSH (As user - Data & Info files)"--> LN
136+
HPCSCCPB <--"SSH (As user - Container image & Info files)"--> LN
137+
138+
HPCSCJPB --"SSH (As user - SBATCH file & CLI Call to SBATCH)"--> LN
139+
LN --"SSH (As user - Info files)"--> HPCSCJPB
140+
```
141+
142+
This diagram omits the HTTPS requests that clients and compute nodes send to the HPCS Server to register their agents, since this behaviour is a practical workaround.

docs/architecture/container_preparation.md

+17-12
@@ -6,6 +6,7 @@ This step consist in using an original OCI image to prepare it, encrypt it and s
66

77
```mermaid
88
sequenceDiagram
9+
actor User
910
User -->> Container Preparation container: spawns using docker-compose
1011
Container Preparation container -->> Spire Agent: spawns using `spawn_agent.py`
1112
Spire Agent ->> Spire Server: Runs node attestation
@@ -16,35 +17,37 @@ sequenceDiagram
1617
Container Preparation container ->> Vault: Log-in using SVID
1718
Vault ->> Container Preparation container: Returns an authentication token (write only on client's path)
1819
Container Preparation container ->> Vault: Write private key using authentication token
19-
Vault ->> Container Preparation container:
20+
Vault ->> Container Preparation container:
2021
Container Preparation container ->> HPCS Server: Request creation of workloads (compute nodes, users, groups ...) authorized to access the key and using SVID to authenticate
2122
HPCS Server ->> Spire Server: Validate SVID
22-
Spire Server ->> HPCS Spire Agent:
23+
Spire Server ->> HPCS Spire Agent:
2324
HPCS Spire Agent ->> Spire Server: Validate SVID
24-
Spire Server ->> HPCS Server:
25+
Spire Server ->> HPCS Server:
2526
HPCS Server ->> Spire Server: Create workloads identities to access the key
26-
Spire Server ->> HPCS Server:
27+
Spire Server ->> HPCS Server:
2728
HPCS Server ->> Vault: Create role and policy to access the key
28-
Vault ->> HPCS Server:
29+
Vault ->> HPCS Server:
2930
HPCS Server ->> Container Preparation container: SpiffeID & role to access the container, path to the secret
3031
Container Preparation container ->> Container Preparation container: Parse info file based on previous steps
3132
Container Preparation container ->> Supercomputer: Ship encrypted container
32-
Supercomputer ->> Container Preparation container:
33+
Supercomputer ->> Container Preparation container:
3334
Container Preparation container ->> Supercomputer: Ship info file
34-
Supercomputer ->> Container Preparation container:
35+
Supercomputer ->> Container Preparation container:
3536
Container Preparation container -->> Spire Agent: Kills
36-
Spire Agent -->> Container Preparation container:
37-
Spire Agent -->> Container Preparation container: Dies
37+
Spire Agent -->> Container Preparation container:
38+
Spire Agent -->> Container Preparation container: Dies
3839
Container Preparation container -->> User: Finishes
3940
```
4041

41-
4242
## Sequence diagram of the container's preparation (without shipping)
4343

4444
### Image is prepared and then encrypted (Encryption at rest)
45+
4546
This step is currently (3/2024) used to encrypt the container. It does not require changes on LUMI to work.
47+
4648
```mermaid
4749
sequenceDiagram
50+
actor User
4851
User -->>HPCS Client: spawns using `python3 prepare_container.py [OPTIONS]`
4952
HPCS Client -->> Docker Client: spawns
5053
HPCS Client ->> HPCS Client: Create prepared Dockerfile
@@ -59,11 +62,13 @@ sequenceDiagram
5962
HPCS Client ->> HPCS Client: Encrypt image file
6063
```
6164

62-
6365
### Image is prepared and SIF encrypted
66+
6467
When HPC nodes support encrypted containers, this process can be used.
68+
6569
```mermaid
6670
sequenceDiagram
71+
actor User
6772
User -->>HPCS Client: spawns using `python3 prepare_container.py [OPTIONS]`
6873
HPCS Client -->> Docker Client: spawns
6974
HPCS Client ->> HPCS Client: Create prepared Dockerfile
@@ -75,4 +80,4 @@ sequenceDiagram
7580
Docker Client -->> Build-Env: Spawns
7681
Build-Env ->> Build-Env: Build final prepared and encrypted SIF image
7782
Build-Env ->> HPCS Client: Returns final prepared and encrypted SIF image
78-
```
83+
```

docs/architecture/data_preparation.md

+11-10
@@ -6,6 +6,7 @@ This step consists in using an input directory, encrypt it and ship it to the su
66

77
```mermaid
88
sequenceDiagram
9+
actor User
910
User -->> Data Preparation container: spawns using docker-compose
1011
Data Preparation container -->> Spire Agent: spawns using `spawn_agent.py`
1112
Spire Agent ->> Spire Server: Runs node attestation
@@ -16,24 +17,24 @@ sequenceDiagram
1617
Data Preparation container ->> Vault: Log-in using SVID
1718
Vault ->> Data Preparation container: Returns an authentication token (write only on client's path)
1819
Data Preparation container ->> Vault: Write private key using authentication token
19-
Vault ->> Data Preparation container:
20+
Vault ->> Data Preparation container:
2021
Data Preparation container ->> HPCS Server: Request creation of workloads (compute nodes, users, groups ...) authorized to access the key and using SVID to authenticate
2122
HPCS Server ->> Spire Server: Validate SVID
22-
Spire Server ->> HPCS Spire Agent:
23+
Spire Server ->> HPCS Spire Agent:
2324
HPCS Spire Agent ->> Spire Server: Validate SVID
24-
Spire Server ->> HPCS Server:
25+
Spire Server ->> HPCS Server:
2526
HPCS Server ->> Spire Server: Create workloads identities to access the key
26-
Spire Server ->> HPCS Server:
27+
Spire Server ->> HPCS Server:
2728
HPCS Server ->> Vault: Create role and policy to access the key
28-
Vault ->> HPCS Server:
29+
Vault ->> HPCS Server:
2930
HPCS Server ->> Data Preparation container: SpiffeID & role to access the container, path to the secret
3031
Data Preparation container ->> Data Preparation container: Parse info file based on previous steps
3132
Data Preparation container ->> Supercomputer: Ship encrypted container
32-
Supercomputer ->> Data Preparation container:
33+
Supercomputer ->> Data Preparation container:
3334
Data Preparation container ->> Supercomputer: Ship info file
34-
Supercomputer ->> Data Preparation container:
35+
Supercomputer ->> Data Preparation container:
3536
Data Preparation container -->> Spire Agent: Kills
36-
Spire Agent -->> Data Preparation container:
37-
Spire Agent -->> Data Preparation container: Dies
37+
Spire Agent -->> Data Preparation container:
38+
Spire Agent -->> Data Preparation container: Dies
3839
Data Preparation container -->> User: Finishes
39-
```
40+
```

docs/architecture/job_preparation.md

+81
@@ -0,0 +1,81 @@
1+
# Job preparation
2+
3+
This step consists of preparing the secure job and then executing it. It requires two info files (one for the data, one for the secured container) plus additional runtime settings (arguments, parameters for the Singularity container, etc.).
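
As a rough illustration of this step, the sketch below renders an SBATCH script from a template using values taken from the two info files, then extracts the job id from the output of `sbatch`. It is not the HPCS implementation: the template text, the info-file field name (`path`), the settings keys, and the function names `render_sbatch` and `parse_job_id` are all assumptions made for the example. Only the `Submitted batch job <id>` line is standard Slurm output.

```python
import re
from string import Template

# Hypothetical template: the real HPCS SBATCH template is not shown in this commit.
# "$$" is string.Template's escape for a literal "$" in the rendered script.
SBATCH_TEMPLATE = Template("""#!/bin/bash
#SBATCH --job-name=$job_name
#SBATCH --partition=$partition
#SBATCH --nodes=1

export CONTAINER_PATH=$container_path
export DATA_PATH=$data_path
singularity run $singularity_args "$$CONTAINER_PATH" $app_args
""")


def render_sbatch(data_info: dict, container_info: dict, settings: dict) -> str:
    """Fill the template from the parsed info files and runtime settings.

    The "path" key in each info dict is an assumed field name.
    """
    return SBATCH_TEMPLATE.substitute(
        job_name=settings["job_name"],
        partition=settings["partition"],
        container_path=container_info["path"],
        data_path=data_info["path"],
        singularity_args=settings.get("singularity_args", ""),
        app_args=settings.get("app_args", ""),
    )


def parse_job_id(sbatch_stdout: str) -> int:
    """Extract the job id from sbatch's 'Submitted batch job <id>' line."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_stdout)
    if match is None:
        raise ValueError("no job id found in sbatch output")
    return int(match.group(1))
```

Using `Template.substitute` rather than `safe_substitute` makes a missing value fail loudly before anything is copied to the login node, which seems preferable for a script that will be submitted to the scheduler.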
4+
5+
## Sequence diagram of this step
6+
7+
```mermaid
8+
sequenceDiagram
9+
actor User
10+
participant Job Preparation container
11+
participant Login Node
12+
participant Scheduler
13+
14+
User -->> Job Preparation container: spawns using docker-compose
15+
Job Preparation container ->> Login Node: Initiate SSH Connection
16+
rect rgb(191, 223, 255)
17+
note right of User: Job preparation
18+
Job Preparation container ->> Login Node: SCP Data's info file
19+
Login Node ->> Job Preparation container: Info file
20+
Job Preparation container ->> Job Preparation container: Parse info from info file
21+
Job Preparation container ->> Login Node: SCP Container image's info file
22+
Login Node ->> Job Preparation container: Info file
23+
Job Preparation container ->> Job Preparation container: Parse info from info file
24+
Job Preparation container ->> Job Preparation container: Generate SBATCH file from template based on info gathered
25+
Job Preparation container ->> Login Node: Copy SBATCH File and HPCS Configuration file
26+
Login Node ->> Job Preparation container:
27+
Job Preparation container ->> Job Preparation container: Generate keypair for output data
28+
Job Preparation container ->> Login Node: Copy encryption key
29+
Login Node ->> Job Preparation container:
30+
end
31+
32+
rect rgb(191, 223, 255)
33+
note right of User: Job runtime
34+
Job Preparation container ->> Login Node: SSH Execute "sbatch SBATCHFILE"
35+
Login Node ->>+ Scheduler: sbatch SBATCHFILE
36+
Scheduler ->> Login Node: Job created + Job id
37+
Login Node ->> Job Preparation container: Job created + Job id
38+
Job Preparation container ->> Job Preparation container: Follows job output or job status
39+
activate Job Preparation container
40+
Scheduler ->> Scheduler: Scheduling job
41+
activate Scheduler
42+
deactivate Scheduler
43+
Scheduler ->> Compute node: Elect node - Execute SBATCHFILE
44+
Compute node ->> Compute node: Clone HPCS Github / Download age and gocryptfs binaries
45+
Compute node -->> Spire Agent: spawns using `spawn_agent.py`
46+
Spire Agent ->> Spire Server: Runs node attestation
47+
Spire Server ->> Spire Agent: Attests node, provide SVIDs for linked identities
48+
Compute node ->> Spire Agent: Fetches API to get an SVID
49+
Spire Agent ->> Compute node: Provides SVID
50+
Compute node ->> Vault: Log-in using SVID
51+
Vault ->> Compute node: Returns an authentication token (read only on container key's path)
52+
Compute node ->> Vault: Read container's key using authentication token
53+
Vault ->> Compute node: Returns container's key
54+
Compute node ->> Compute node: Decrypt container image
55+
Compute node ->> Compute node: Setup secure environment for runtime (Encrypted volumes, gather flags etc)
56+
Compute node ->> Spire Agent: Fetches API to get an SVID
57+
Spire Agent ->> Compute node: Provides SVID
58+
Compute node ->> Compute node: Export SVID and data secret path in a variable
59+
Compute node -->> Application container: spawns using `singularity run`
60+
Application container ->> Vault: Log-in using SVID
61+
Vault ->> Application container: Returns an authentication token (read only on data key's path)
62+
Application container ->> Vault: Read data's key using authentication token
63+
Vault ->> Application container: Returns data's key
64+
Application container ->> Application container: Decrypt data using key
65+
Application container ->> Application container: Runs input scripts
66+
Application container ->> Application container: Application runs
67+
Application container ->> Application container: Runs output scripts
68+
Application container ->> Application container: Encrypt output directory
69+
Application container -->> Compute node: Finishes
70+
Compute node -->> Spire Agent: Kills
71+
Spire Agent -->> Compute node:
72+
Spire Agent -->> Compute node: Dies
73+
Compute node ->> Scheduler: Becomes available
74+
deactivate Job Preparation container
75+
end
76+
Job Preparation container ->> Login Node: Close SSH connection
77+
Login Node ->> Job Preparation container:
78+
Login Node ->> Job Preparation container: Close SSH connection
79+
80+
Job Preparation container -->> User: Finishes
81+
```
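
The "Log-in using SVID" and "Read container's key" exchanges with Vault in the diagram above could look roughly like the following. This is a sketch under stated assumptions, not the project's code: it assumes the SVID is a JWT-SVID presented to Vault's JWT auth method mounted at `jwt`, and that the key lives in a KV version 2 engine mounted at `secret`. Neither detail is confirmed by this commit. The helper names are hypothetical, and the functions only build the HTTP requests so that nothing is sent.

```python
import json
import urllib.request


def build_login_request(vault_addr: str, role: str, jwt_svid: str,
                        mount: str = "jwt") -> urllib.request.Request:
    """Build the login call for Vault's JWT auth method (assumed mount: 'jwt')."""
    body = json.dumps({"role": role, "jwt": jwt_svid}).encode()
    return urllib.request.Request(
        url=f"{vault_addr}/v1/auth/{mount}/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_token(login_response: dict) -> str:
    """Pull the client token out of a parsed Vault login response."""
    return login_response["auth"]["client_token"]


def build_read_request(vault_addr: str, token: str, secret_path: str,
                       mount: str = "secret") -> urllib.request.Request:
    """Build a KV v2 read for the key (assumed mount: 'secret')."""
    return urllib.request.Request(
        url=f"{vault_addr}/v1/{mount}/data/{secret_path}",
        headers={"X-Vault-Token": token},
        method="GET",
    )
```

Passing each request to `urllib.request.urlopen` would perform the call. With a KV v2 engine, the stored value is found under `data.data` in the JSON response.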
