| 1 | +# Synchronize diagnostics between machines (`mache sync diags`) |
| 2 | + |
| 3 | +This command copies precomputed E3SM diagnostics (both public and private) |
| 4 | +between supported HPC systems using rsync. A common use is to pull diagnostics |
| 5 | +stored on the LCRC filesystem (Chrysalis) down to another site so local |
| 6 | +post-processing and plotting tools can find them. |
| 7 | + |
| 8 | +The command supports two directions: |
| 9 | + |
| 10 | +- `from <other>`: Pull diagnostics from an LCRC machine (e.g., `chrysalis`) to |
| 11 | + your current machine |
| 12 | +- `to <other>`: Push diagnostics from your current LCRC machine to another |
| 13 | + machine |
| 14 | + |
| 15 | +Important constraints: |
| 16 | + |
| 17 | +- LCRC machines are `anvil` and `chrysalis` |
| 18 | +- You may only: |
| 19 | + - run `to` when you are currently on an LCRC machine; and |
| 20 | + - run `from` when the other machine is an LCRC machine. |
| 21 | +- If you try to sync between two different LCRC machines, the command tells |
| 22 | + you to sync with the same machine instead, since LCRC machines share files. |
| 23 | +- It is *highly* recommended that you pull (`from`) on the destination machine |
| 24 | + rather than push (`to`), because a pull lets `mache` fix permissions afterward. |
| 25 | + |
| 26 | +--- |
| 27 | + |
| 28 | +## Prerequisites |
| 29 | + |
| 30 | +- A valid LCRC/CELS account and username (used below as `<cels_username>`) |
| 31 | +- SSH key-based access configured for each machine from which you will run the |
| 32 | + sync |
| 33 | +- `mache` installed and configured on the machine where you run the command |
| 34 | + |
| 35 | +### 1) Generate an SSH key (if you don’t already have one) |
| 36 | + |
| 37 | +On each HPC machine where you plan to run the sync: |
| 38 | + |
| 39 | +```bash |
| 40 | +ssh-keygen -t ed25519 |
| 41 | +``` |
| 42 | + |
| 43 | +Accept the default path (`~/.ssh/id_ed25519`) unless you have a reason to use a |
| 44 | +different one. Don’t share your private key. |
| 45 | + |
| 46 | +### 2) Add your public key to your CELS account |
| 47 | + |
| 48 | +- Copy the content of your public key (typically `~/.ssh/id_ed25519.pub`). |
| 49 | +- Visit https://accounts.cels.anl.gov/ and add it under your account’s SSH |
| 50 | + keys. |
| 51 | +- Give it a descriptive name (e.g., `andes`, `frontier`, `compy`). |
| 52 | +- Allow a few minutes for the new key to propagate. |
| 53 | + |
| 54 | +### 3) Configure your `~/.ssh/config` |
| 55 | + |
| 56 | +We recommend a shared SSH control connection (one authenticated login that later sessions reuse) and a short host alias for Chrysalis: |
| 57 | + |
| 58 | +```ini |
| 59 | +Host * |
| 60 | + ControlMaster auto |
| 61 | + ControlPath ~/.ssh/connections/%r@%h:%p |
| 62 | + ServerAliveInterval 300 |
| 63 | + ServerAliveCountMax 3 |
| 64 | + |
| 65 | +Host chrys |
| 66 | + Hostname chrysalis.lcrc.anl.gov |
| 67 | + User <cels_username> |
| 68 | + ProxyJump <cels_username>@logins.lcrc.anl.gov |
| 69 | +``` |
| 70 | + |
| 71 | +Also create the connections directory if it doesn’t exist: |
| 72 | + |
| 73 | +```bash |
| 74 | +mkdir -p ~/.ssh/connections |
| 75 | +chmod 700 ~/.ssh ~/.ssh/connections |
| 76 | +``` |
| 77 | + |
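As a quick check of the step above, the following sketch creates the directory and confirms that both modes are `700`. The function name `prepare_control_dir` and its optional directory argument are hypothetical, added only so the check can be tried on a scratch path, and it assumes GNU `stat` (Linux). Calling it with no argument operates on `~/.ssh`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: create the ControlPath directory and verify its mode.
# The optional argument defaults to ~/.ssh; it exists only so the function
# can be exercised against a scratch directory.
prepare_control_dir() {
    local ssh_dir="${1:-$HOME/.ssh}"
    local conn_dir="$ssh_dir/connections"
    local d mode

    mkdir -p "$conn_dir" || return 1
    chmod 700 "$ssh_dir" "$conn_dir" || return 1

    # GNU stat prints the octal permission bits with -c '%a'.
    for d in "$ssh_dir" "$conn_dir"; do
        mode=$(stat -c '%a' "$d") || return 1
        if [ "$mode" != "700" ]; then
            echo "unexpected mode $mode on $d" >&2
            return 1
        fi
    done
    echo "ok: $conn_dir"
}
```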
| 78 | +#### OLCF (Andes, Frontier) extra settings |
| 79 | + |
| 80 | +Some OLCF systems require explicit auth options. If you’re on Andes/Frontier, |
| 81 | +add the last three lines below to the `chrys` host in your SSH config: |
| 82 | + |
| 83 | +```ini |
| 84 | +Host chrys |
| 85 | + Hostname chrysalis.lcrc.anl.gov |
| 86 | + User <cels_username> |
| 87 | + ProxyJump <cels_username>@logins.lcrc.anl.gov |
| 88 | + IdentityFile ~/.ssh/id_ed25519 |
| 89 | + PreferredAuthentications publickey,keyboard-interactive |
| 90 | + PasswordAuthentication no |
| 91 | +``` |
| 92 | + |
| 93 | +--- |
| 94 | + |
| 95 | +## Recommended workflow |
| 96 | + |
| 97 | +1) Start a background control connection to Chrysalis (you’ll be prompted for |
| 98 | + Duo): |
| 99 | + |
| 100 | +```bash |
| 101 | +ssh -MNf chrys |
| 102 | +``` |
| 103 | + |
| 104 | +You should be returned to your original login shell after MFA. |
| 105 | + |
| 106 | +2) Run the sync. For example, to pull diagnostics from Chrysalis to your |
| 107 | + current machine: |
| 108 | + |
| 109 | +```bash |
| 110 | +mache sync diags from chrysalis -u <cels_username> |
| 111 | +``` |
| 112 | + |
| 113 | +If the control connection is active, you shouldn’t be prompted for Duo again. |
| 114 | +You’ll see `rsync` output similar to: |
| 115 | + |
| 116 | +```text |
| 117 | +running: rsync --verbose --recursive --times --links --compress --progress --update --no-perms --omit-dir-times <cels_username>@chrysalis.lcrc.anl.gov:/lcrc/group/e3sm/public_html/diagnostics/ /path/to/local/diagnostics |
| 118 | +receiving incremental file list |
| 119 | +grids/ocean.RRSwISC6to18E3r5.mask.scrip.20240327.nc |
| 120 | + 633,767,353 100% 16.58MB/s 0:00:36 (xfr#1, ir-chk=1293/1488) |
| 121 | +grids/ocean.RRSwISC6to18E3r5.nomask.scrip.20240327.nc |
| 122 | + 633,767,353 100% 26.88MB/s 0:00:22 (xfr#2, ir-chk=1292/1488) |
| 123 | +... |
| 124 | +``` |
| 125 | + |
| 126 | +3) When you’re done, close the control connection: |
| 127 | + |
| 128 | +```bash |
| 129 | +ssh -O exit chrys |
| 130 | +``` |
| 131 | + |
| 132 | +Notes: |
| 133 | +- When pulling data (`from`), `mache` will automatically fix permissions on |
| 134 | + the local destination according to machine settings. |
| 135 | +- Destination paths are derived from your machine configuration (diagnostics |
| 136 | + base path), and source paths from the LCRC machine configuration. |
| 137 | + |
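The three workflow steps can be wrapped in one shell function so that a control connection opened for the sync is closed again even when the sync fails. This is only a sketch: `sync_diags_from` is a hypothetical helper, not part of `mache`, and it assumes an SSH alias such as the `chrys` entry configured earlier. It uses `ssh -O check` to see whether a control connection already exists and only closes a connection that it opened itself. A call would look like `sync_diags_from chrys chrysalis <cels_username>`.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around the documented workflow (not part of mache).
# Arguments: <ssh_alias> <lcrc_machine> <cels_username>
sync_diags_from() {
    local host_alias="$1" machine="$2" user="$3"
    local opened=0 status=0

    if [ "$#" -ne 3 ]; then
        echo "usage: sync_diags_from <ssh_alias> <lcrc_machine> <cels_username>" >&2
        return 2
    fi

    # Step 1: reuse a live control connection, or start one (prompts for Duo).
    if ! ssh -O check "$host_alias" >/dev/null 2>&1; then
        ssh -MNf "$host_alias" || return 1
        opened=1
    fi

    # Step 2: run the sync and remember its exit status.
    mache sync diags from "$machine" -u "$user" || status=$?

    # Step 3: close the control connection only if this function opened it.
    if [ "$opened" -eq 1 ]; then
        ssh -O exit "$host_alias" || true
    fi

    return "$status"
}
```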
| 138 | +--- |
| 139 | + |
| 140 | +## Command reference |
| 141 | + |
| 142 | +Basic usage: |
| 143 | + |
| 144 | +```text |
| 145 | +mache sync diags to <other> [-u <username>] [-m <this_machine>] [-f <config_file>] |
| 146 | +mache sync diags from <other> [-u <username>] [-m <this_machine>] [-f <config_file>] |
| 147 | +``` |
| 148 | + |
| 149 | +- `to | from` — direction of sync |
| 150 | +- `<other>` — the other machine name (e.g., `chrysalis`) |
| 151 | +- `-u, --username` — the username to use on the other machine (required in |
| 152 | + practice) |
| 153 | +- `-m, --machine` — explicitly set the name of the current machine |
| 154 | + (auto-detected if omitted) |
| 155 | +- `-f, --config_file` — path to a config file that overrides defaults for the |
| 156 | + current machine |
| 157 | + |
| 158 | +Constraints enforced by the command: |
| 159 | + |
| 160 | +- Only `anvil`/`chrysalis` are considered LCRC machines |
| 161 | +- `to` is only allowed when you are on an LCRC machine |
| 162 | +- `from` is only allowed when the other machine is an LCRC machine |
| 163 | +- Syncing between two different LCRC machines is rejected (there is no need |
| 164 | + and it would waste bandwidth) |
| 165 | + |
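The rules above can be expressed as a small check. This is a sketch of the documented behavior only, not `mache`'s implementation: `is_lcrc` and `check_sync_direction` are hypothetical names, and letting a machine sync with itself pass is an assumption. For instance, `check_sync_direction from frontier chrysalis` prints `ok`, while `check_sync_direction to frontier chrysalis` prints a reason and returns non-zero.

```bash
#!/usr/bin/env bash
# Sketch of the documented direction rules; not mache's actual code.
is_lcrc() {
    case "$1" in
        anvil|chrysalis) return 0 ;;
        *) return 1 ;;
    esac
}

# Arguments: <direction> <this_machine> <other_machine>
# Prints "ok", or a reason and returns 1 if the combination is not allowed.
check_sync_direction() {
    local direction="$1" this="$2" other="$3"

    case "$direction" in
        to)
            if ! is_lcrc "$this"; then
                echo "'to' requires the current machine to be an LCRC machine"
                return 1
            fi
            ;;
        from)
            if ! is_lcrc "$other"; then
                echo "'from' requires the other machine to be an LCRC machine"
                return 1
            fi
            ;;
        *)
            echo "direction must be 'to' or 'from'"
            return 1
            ;;
    esac

    # Two different LCRC machines: nothing to transfer.
    if is_lcrc "$this" && is_lcrc "$other" && [ "$this" != "$other" ]; then
        echo "both machines are LCRC machines; sync with '$this' instead"
        return 1
    fi

    echo "ok"
}
```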
| 166 | +--- |
| 167 | + |
| 168 | +## Troubleshooting |
| 169 | + |
| 170 | +- You get Duo prompts during rsync: |
| 171 | + - Check whether the control connection is active (`ssh -O check chrys`); if not, start it |
| 172 | + with `ssh -MNf chrys`. Also confirm that your `Host chrys` alias matches the command you used to connect. |
| 173 | +- Permission errors on the destination: |
| 174 | + - Verify your local diagnostics base path exists and that you have write |
| 175 | + access; `mache` adjusts group/world permissions on pull, but can’t create |
| 176 | + parent paths that don’t exist. |
| 177 | +- Connection fails through the login proxy: |
| 178 | + - Double-check `ProxyJump <cels_username>@logins.lcrc.anl.gov` and that your |
| 179 | + public key is present at https://accounts.cels.anl.gov/. |