Commit e8e4cbe

Merge pull request #324 from xylar/improve-sync-docs
Add docs for `mache sync diags`
2 parents 6cbde9f + e06f36d commit e8e4cbe

2 files changed

+180
-0
lines changed

docs/index.md

Lines changed: 1 addition & 0 deletions
    @@ -11,6 +11,7 @@
     
     users_guide/quick_start
     users_guide/spack/build
    +users_guide/sync/diags
     ```
     
     ```{toctree}

docs/users_guide/sync/diags.md

Lines changed: 179 additions & 0 deletions
@@ -0,0 +1,179 @@
# Synchronize diagnostics between machines (`mache sync diags`)

This command copies precomputed E3SM diagnostics (both public and private)
between supported HPC systems using rsync. A common use is to pull diagnostics
stored on the LCRC filesystem (Chrysalis) down to another site so local
post-processing and plotting tools can find them.

The command supports two directions:

- `from <other>`: Pull diagnostics from an LCRC machine (e.g., `chrysalis`) to
  your current machine
- `to <other>`: Push diagnostics from your current LCRC machine to another
  machine

Important constraints:

- LCRC machines are `anvil` and `chrysalis`
- You may only:
  - run `to` when you are currently on an LCRC machine; and
  - run `from` when the other machine is an LCRC machine.
- If you try to sync between two different LCRC machines, you'll be told to
  sync with the same machine instead, since the files are local/shared.
- It is *highly* recommended that you sync `from` an LCRC machine to another
  HPC system because this allows permissions to be updated after the sync.
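
The constraints above can be read as a small pre-flight check. The following is a minimal sketch in shell, not `mache`'s actual implementation; the function names `is_lcrc` and `check_sync` and the messages they print are hypothetical.

```bash
# Hypothetical helpers that mirror the documented constraints.
# They are NOT part of mache; names and messages are illustrative only.

is_lcrc() {
    # Per the list above, only anvil and chrysalis count as LCRC machines.
    case "$1" in
        anvil|chrysalis) return 0 ;;
        *) return 1 ;;
    esac
}

check_sync() {
    # usage: check_sync <to|from> <this_machine> <other_machine>
    local direction="$1" this="$2" other="$3"

    # Two different LCRC machines: nothing to sync.
    if is_lcrc "$this" && is_lcrc "$other" && [ "$this" != "$other" ]; then
        echo "error: both machines are LCRC machines; no sync is needed"
        return 1
    fi

    case "$direction" in
        to)
            if ! is_lcrc "$this"; then
                echo "error: 'to' requires running on an LCRC machine"
                return 1
            fi
            ;;
        from)
            if ! is_lcrc "$other"; then
                echo "error: 'from' requires the other machine to be an LCRC machine"
                return 1
            fi
            ;;
        *)
            echo "error: direction must be 'to' or 'from'"
            return 1
            ;;
    esac
    echo "ok"
}
```

For example, `check_sync from frontier chrysalis` prints `ok`, while `check_sync to frontier chrysalis` reports an error because `frontier` is not an LCRC machine.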

---

## Prerequisites

- A valid LCRC/CELS account and username (used below as `<cels_username>`)
- SSH key-based access configured for each machine from which you will run the
  sync
- `mache` installed and configured on the machine where you run the command

### 1) Generate an SSH key (if you don’t already have one)

On each HPC machine where you plan to run the sync:

```bash
ssh-keygen -t ed25519
```

Accept the default path (`~/.ssh/id_ed25519`) unless you have a reason to use a
different one. Don’t share your private key.
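
If you are not sure whether a key already exists, a guard like the one below avoids overwriting it. This is a sketch: the `KEY_PATH` variable and the `ensure_key` function are illustrative names, and the path assumes the default location mentioned above.

```bash
# Illustrative variable; neither ssh nor mache reads KEY_PATH.
KEY_PATH="${KEY_PATH:-$HOME/.ssh/id_ed25519}"

ensure_key() {
    # usage: ensure_key <private_key_path>
    # Generate a new ed25519 key only if no key exists at the given path.
    if [ -f "$1" ]; then
        echo "key already exists: $1"
    else
        ssh-keygen -t ed25519 -f "$1"
    fi
}

# Example (commented out so that nothing runs on paste):
# ensure_key "$KEY_PATH" && cat "$KEY_PATH.pub"
```

The commented example also prints the public key, which is the part you upload in the next step.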

### 2) Add your public key to your CELS account

- Copy the contents of your public key (typically `~/.ssh/id_ed25519.pub`).
- Visit https://accounts.cels.anl.gov/ and add it under your account’s SSH
  keys.
- Give it a descriptive name (e.g., `andes`, `frontier`, `compy`).
- Allow a few minutes for the new key to propagate.

### 3) Configure your `~/.ssh/config`

We recommend a control connection and a short host alias for Chrysalis:

```ini
Host *
    ControlMaster auto
    ControlPath ~/.ssh/connections/%r@%h:%p
    ServerAliveInterval 300
    ServerAliveCountMax 3

Host chrys
    Hostname chrysalis.lcrc.anl.gov
    User <cels_username>
    ProxyJump <cels_username>@logins.lcrc.anl.gov
```

Also create the connections directory if it doesn’t exist:

```bash
mkdir -p ~/.ssh/connections
chmod 700 ~/.ssh ~/.ssh/connections
```

#### OLCF (Andes, Frontier) extra settings

Some OLCF systems require explicit auth options. If you’re on Andes/Frontier,
extend the `chrys` host entry in your SSH config with the last three lines
shown here:

```ini
Host chrys
    Hostname chrysalis.lcrc.anl.gov
    User <cels_username>
    ProxyJump <cels_username>@logins.lcrc.anl.gov
    IdentityFile ~/.ssh/id_ed25519
    PreferredAuthentications publickey,keyboard-interactive
    PasswordAuthentication no
```
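
To confirm that the alias resolves the way you expect before connecting, OpenSSH can print the configuration it would use for a host with `ssh -G`. The sketch below filters that output down to the settings discussed in this section; `filter_ssh_settings` is a hypothetical helper name, and `-G` is only available in reasonably recent OpenSSH releases.

```bash
filter_ssh_settings() {
    # Reads `ssh -G <host>` output on stdin and keeps only the keys that
    # this guide configures. `ssh -G` prints option names in lowercase.
    grep -E '^(hostname|user|proxyjump|controlmaster|controlpath|identityfile) '
}

# Example (commented out; it needs an ssh client and your own config):
# ssh -G chrys | filter_ssh_settings
```

If `hostname` or `proxyjump` is missing or wrong in the filtered output, the `Host chrys` entry is probably not being matched.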

---

## Recommended workflow

1) Start a background control connection to Chrysalis (you’ll be prompted for
Duo):

```bash
ssh -MNf chrys
```

You should be returned to your original login shell after MFA.

2) Run the sync. For example, to pull diagnostics from Chrysalis to your
current machine:

```bash
mache sync diags from chrysalis -u <cels_username>
```

If the control connection is active, you shouldn’t be prompted for Duo again.
You’ll see `rsync` output similar to:

```
running: rsync --verbose --recursive --times --links --compress --progress --update --no-perms --omit-dir-times <cels_username>@chrysalis.lcrc.anl.gov:/lcrc/group/e3sm/public_html/diagnostics/ /path/to/local/diagnostics
receiving incremental file list
grids/ocean.RRSwISC6to18E3r5.mask.scrip.20240327.nc
633,767,353 100% 16.58MB/s 0:00:36 (xfr#1, ir-chk=1293/1488)
grids/ocean.RRSwISC6to18E3r5.nomask.scrip.20240327.nc
633,767,353 100% 26.88MB/s 0:00:22 (xfr#2, ir-chk=1292/1488)
...
```

3) When you’re done, close the control connection:

```bash
ssh -O exit chrys
```

Notes:
- When pulling data (`from`), `mache` will automatically fix permissions on
  the local destination according to machine settings.
- Destination paths are derived from your machine configuration (diagnostics
  base path), and source paths from the LCRC machine configuration.
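
The three steps can be wrapped in one shell function so that the control connection is closed even when the sync fails. This is a sketch under the assumptions of this guide: the function name `sync_from_chrysalis` is hypothetical, and the `chrys` default relies on the SSH alias configured earlier.

```bash
sync_from_chrysalis() {
    # usage: sync_from_chrysalis <cels_username> [ssh_alias]
    local user="$1"
    local alias_name="${2:-chrys}"
    local status=0

    # 1) Open the background control connection (prompts for Duo).
    ssh -MNf "$alias_name" || return 1

    # 2) Pull the diagnostics and remember the exit status.
    mache sync diags from chrysalis -u "$user" || status=$?

    # 3) Close the control connection whether or not the sync succeeded.
    ssh -O exit "$alias_name"
    return "$status"
}
```

Calling `sync_from_chrysalis <cels_username>` then runs the whole workflow and returns the exit status of the `mache` command.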

---

## Command reference

Basic usage:

```text
mache sync diags to <other> [-u <username>] [-m <this_machine>] [-f <config_file>]
mache sync diags from <other> [-u <username>] [-m <this_machine>] [-f <config_file>]
```

- `to | from`: direction of sync
- `<other>`: the other machine name (e.g., `chrysalis`)
- `-u, --username`: the username to use on the other machine (required in
  practice)
- `-m, --machine`: explicitly set the name of the current machine
  (auto-detected if omitted)
- `-f, --config_file`: path to a config file that overrides defaults for the
  current machine

Constraints enforced by the command:

- Only `anvil`/`chrysalis` are considered LCRC machines
- `to` is only allowed when you are on an LCRC machine
- `from` is only allowed when the other machine is an LCRC machine
- Do not attempt to sync between two different LCRC machines (there is no need
  and this wastes bandwidth)
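
To illustrate how the optional flags combine, the sketch below assembles a command line from whichever settings are provided. The helper name `build_sync_cmd` is hypothetical, and it only prints the command rather than running it; values containing spaces would need extra quoting.

```bash
build_sync_cmd() {
    # usage: build_sync_cmd <to|from> <other> [username] [this_machine] [config_file]
    local direction="$1" other="$2"
    local user="${3:-}" machine="${4:-}" config="${5:-}"
    local cmd="mache sync diags $direction $other"

    # Append each optional flag only when a value was supplied.
    if [ -n "$user" ]; then cmd="$cmd -u $user"; fi
    if [ -n "$machine" ]; then cmd="$cmd -m $machine"; fi
    if [ -n "$config" ]; then cmd="$cmd -f $config"; fi

    echo "$cmd"
}
```

For instance, `build_sync_cmd from chrysalis alice` prints `mache sync diags from chrysalis -u alice`, where `alice` stands in for your own username.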

---

## Troubleshooting

- You get Duo prompts during rsync:
  - Ensure the control connection is active (`ssh -MNf chrys`) and your
    `Host chrys` alias matches the command you used to connect.
- Permission errors on the destination:
  - Verify your local diagnostics base path exists and that you have write
    access; `mache` adjusts group/world permissions on pull, but can’t create
    parent paths that don’t exist.
- Connection fails through the login proxy:
  - Double-check `ProxyJump <cels_username>@logins.lcrc.anl.gov` and that your
    public key is present at https://accounts.cels.anl.gov/.
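
The first two checks above can be scripted. The following is a sketch with hypothetical function names; `ssh -O check` asks an existing control master whether it is still running, and the destination path is whatever diagnostics base path your machine configuration uses.

```bash
check_control() {
    # usage: check_control [ssh_alias]
    # Reports whether a control connection for the alias is active.
    local alias_name="${1:-chrys}"
    if ssh -O check "$alias_name" 2>/dev/null; then
        echo "control connection active"
    else
        echo "no control connection; run: ssh -MNf $alias_name"
    fi
}

check_destination() {
    # usage: check_destination <local_diagnostics_base_path>
    # Reports whether the path exists and is writable by the current user.
    if [ ! -d "$1" ]; then
        echo "missing: $1"
    elif [ ! -w "$1" ]; then
        echo "not writable: $1"
    else
        echo "ok: $1"
    fi
}
```

Run `check_control` before the sync to see whether you will be prompted for Duo, and `check_destination /path/to/local/diagnostics` if rsync reports permission errors.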
