---
title: Upgrade methods
---

# Upgrade methods

Use the upgrade method that matches the kind of change you are making. In most cases, a rolling upgrade is the safest option because it lets you replace nodes gradually while keeping the cluster available. Use backup and restore when you are building a fresh cluster, migrating to new infrastructure, or when the release notes require a full rebuild instead of mixed-version operation.

## Before you start

Before upgrading any node:

1. Read the release notes for the target version and check whether mixed-version clusters are supported during the transition.
2. Make sure the current cluster is healthy and has quorum.
3. Export the current jobs so you have a recovery point:

   ```bash
   curl -fsS http://localhost:8080/v1/jobs > backup.json
   ```

4. Inspect the current Raft peers so you know which server is the leader and which peer IDs are registered:

   ```bash
   dkron raft list-peers
   ```
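Before relying on the export, it is worth a quick sanity check that the file is non-empty and looks like a JSON array. A minimal sketch using only standard shell tools; the sample `backup.json` contents below are hypothetical:

```shell
# Hypothetical sample of an exported jobs file; a real export is the
# JSON array returned by the /v1/jobs endpoint.
cat > backup.json <<'EOF'
[{"name":"job_a","schedule":"@every 1m"},{"name":"job_b","schedule":"@daily"}]
EOF

# Fail fast if the backup is empty or does not start as a JSON array.
[ -s backup.json ] || { echo "backup is empty" >&2; exit 1; }
head -c 1 backup.json | grep -q '\[' || { echo "not a JSON array" >&2; exit 1; }

# Rough job count: number of "name" keys in the file.
grep -o '"name"' backup.json | wc -l
```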

:::tip
When upgrading server nodes, it is usually best to leave the current leader for last. That reduces unnecessary leader elections while you rotate the rest of the cluster.
:::

## Rolling upgrade

Use a rolling upgrade when you want to keep the cluster online and the target version supports a gradual transition.

### Recommended order

1. Upgrade agent-only nodes first.
2. Upgrade follower server nodes one at a time.
3. Upgrade the leader last.

### Server rotation procedure

Use the following procedure to replace server nodes one at a time:

1. Add a new server running the target version and configure it to join the existing cluster.
2. Wait until the new server has joined successfully and the cluster is healthy.
3. Stop Dkron on one old server.
4. If that server was the leader, wait until a new leader is elected before continuing.
5. List the current peers and identify the old server's peer ID:

   ```bash
   dkron raft list-peers
   ```

6. Remove the old server from the Raft configuration:

   ```bash
   dkron raft remove-peer --peer-id <peer-id>
   ```

7. Confirm the cluster is healthy again.
8. Repeat the process until every old server has been replaced.

:::warning
Do not remove multiple server nodes at once. Dkron needs a healthy Raft quorum to continue scheduling jobs.
:::
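To avoid copy-paste mistakes when removing a peer, the peer ID can be extracted from the listing with standard tools. The column layout in this sample is an assumption for illustration only; check the actual output of `dkron raft list-peers` on your version:

```shell
# Hypothetical capture of `dkron raft list-peers` output; the real
# column layout may differ between Dkron versions.
cat > peers.txt <<'EOF'
Node       ID                                    Address        State     Voter
server-1   a1b2c3d4-0000-0000-0000-000000000001  10.0.0.1:6868  leader    true
server-2   a1b2c3d4-0000-0000-0000-000000000002  10.0.0.2:6868  follower  true
server-3   a1b2c3d4-0000-0000-0000-000000000003  10.0.0.3:6868  follower  true
EOF

# Extract the peer ID of the node being removed (server-2 here).
peer_id=$(awk '$1 == "server-2" { print $2 }' peers.txt)
echo "$peer_id"
```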

## Backup and restore

Use backup and restore when you need to recreate the cluster on new infrastructure or when a rolling upgrade is not appropriate.

### Export jobs from the existing cluster

```bash
curl -fsS http://localhost:8080/v1/jobs > backup.json
```

### Restore jobs into the new cluster

After the new cluster is running and has elected a leader, restore the exported jobs file:

```bash
curl -fsS http://localhost:8080/v1/restore \
  --form 'file=@backup.json'
```

The restore endpoint expects a multipart form field named `file`, and `--form` already makes curl send a POST request. If a job in the file already exists in the target cluster, it is overwritten with the definition from the backup.
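One way to verify a restore is to compare job names from the backup against a fresh export of the upgraded cluster. A minimal sketch with hypothetical job name lists; in practice, extract the names from the two `/v1/jobs` payloads:

```shell
# Hypothetical job name lists: one from the backup, one from a fresh
# export of the restored cluster.
printf 'job_a\njob_b\n' | sort > expected.txt
printf 'job_b\njob_a\n' | sort > actual.txt

# comm -3 prints lines unique to either file; empty output means the
# restored cluster has exactly the jobs that were backed up.
missing=$(comm -3 expected.txt actual.txt)
if [ -z "$missing" ]; then
  echo "restore verified"
else
  echo "job mismatch: $missing"
fi
```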

:::warning
This export and restore flow only covers job definitions from the `/v1/jobs` payload. It should not be treated as a full cluster snapshot: it does not recreate Raft state or execution history.
:::

## After the upgrade

After either method completes:

1. Run `dkron raft list-peers` and confirm the expected server set is present.
2. Verify that one node is the leader and the cluster remains stable.
3. Check the UI or API and confirm the expected jobs are present.
4. Watch the next scheduled executions to ensure jobs are still running as expected.
5. Keep the exported `backup.json` until you are confident the upgrade is complete.
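The first two checks can be scripted against a captured peer listing. The column layout in this sketch is hypothetical, so adapt the `awk` field numbers to the real `dkron raft list-peers` output:

```shell
# Hypothetical capture of `dkron raft list-peers`; real columns may differ.
cat > peers.txt <<'EOF'
Node       ID      Address        State     Voter
server-4   id-004  10.0.1.1:6868  leader    true
server-5   id-005  10.0.1.2:6868  follower  true
server-6   id-006  10.0.1.3:6868  follower  true
EOF

expected_servers=3

# Count data rows (skipping the header) and rows in the leader state.
servers=$(awk 'NR > 1' peers.txt | wc -l)
leaders=$(awk '$4 == "leader"' peers.txt | wc -l)

if [ "$servers" -eq "$expected_servers" ] && [ "$leaders" -eq 1 ]; then
  echo "cluster looks healthy"
else
  echo "unexpected state: $servers servers, $leaders leaders"
fi
```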