84 changes: 84 additions & 0 deletions .github/ISSUE_TEMPLATE/cut-release-and-deploy.md
@@ -0,0 +1,84 @@
---
name: Cut release and deploy
about: Cut release and upgrade DevNet nodes
title: Cut and deploy release 0.x.y
labels: ""
assignees: ""
---

## Cut release

Note: Some commands assume you are using the [fish](https://fishshell.com/) shell.
If you are using other shells, you may need to adjust the commands accordingly.
For example, `foo (bar)` in `fish` is equivalent to `foo $(bar)` in `bash`.
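
For instance, the `git merge-base` invocation used later in this checklist reads as follows in each shell (using `origin/main` as the ancestor branch):

```sh
# fish: command substitution uses plain parentheses
git diff (git merge-base origin/release-line-0.x.y origin/main) origin/release-line-0.x.y

# bash: the same command with $(...)
git diff $(git merge-base origin/release-line-0.x.y origin/main) origin/release-line-0.x.y
```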

The `VERSION` file specifies the release version to build, here referred to as `0.x.z`.
The previous release is referred to as `0.x.y` in these instructions.

Regular releases are started from `origin/main`, while bugfix / patch releases are started from a previous release line.
For the rest of this checklist, this will be called the _ancestor branch_.

Release versions can only be published from a _release line branch_. The _release line branch_ is `release-line-0.x.z` for release `0.x.z` in these instructions.
A _release line branch_ is branched from the _ancestor branch_.

- [ ] Choose the _ancestor branch_. This can be `origin/main` for all regular releases or `origin/release-line-0.x.y` for bugfix releases.
- [ ] Wait for everything to be merged in the _ancestor branch_ that we want in `0.x.z`.
  - [ ] ...
- [ ] Ensure all changes to the previous release branch `origin/release-line-0.x.y` are also included in both the _ancestor branch_ and `origin/main`.
  This should be the case, but sometimes a change gets missed.
  - Use one of the following approaches to find changes applied to release line `0.x.y` after it was branched off its ancestor branch (which may be different from the ancestor branch of the new release):
    - Run `git diff (git merge-base origin/release-line-0.x.y ANCESTOR_BRANCH) origin/release-line-0.x.y` and compare it to the checked-out source code of the release line you're upgrading to.
    - Run `git log (git merge-base origin/release-line-0.x.y ANCESTOR_BRANCH)..origin/release-line-0.x.y` and compare it to the log of the release line you're upgrading to.
    - Open https://github.com/DACH-NY/canton-network-node/compare/BRANCH_COMMIT...release-line-0.x.y to see the changes in the GitHub UI, where `BRANCH_COMMIT` is the commit that the release line was branched off from.
- [ ] Merge a PR into the _ancestor branch_ with the following changes:
  - [ ] Update the release notes (`docs/src/release_notes.rst`):
    - Replace `Upcoming` with the target version.
    - Fix any spelling mistakes and make sure the RST rendering is not broken.
    - Check whether any important changes are missing, for example by briefly comparing the release notes with `git log 0.x.y..` (replace `0.x.y` with the previous version).
- [ ] Create a release branch called `release-line-0.x.z` from the merged commit.
  - Note: release branches are subject to branch protection rules. Once you push the branch, you need to open PRs to make further changes.
- [ ] Merge a PR into the release branch (`origin/release-line-0.x.z`) with the following changes:
  - [ ] Create an empty commit with `[release]` in the commit message so it gets published as a non-snapshot version (see the sketch at the end of this checklist). You may have to edit the commit message when pressing the merge button in the GitHub UI.
- [ ] Trigger a CircleCI pipeline from the DA-internal repo (on `main`) with `run-job: publish-release-artifacts` and `splice-git-ref: release-line-0.x.z`
- [ ] If the _ancestor branch_ is not `origin/main`, forward-port all changes made to the _ancestor branch_ as part of this release to `origin/main`
- [ ] Update the open source repos, see https://github.com/DACH-NY/canton-network-node/blob/main/OPEN_SOURCE.md
  - [ ] Merge the auto-generated PR in https://github.com/digital-asset/decentralized-canton-sync
  - [ ] Merge the auto-generated PR in https://github.com/hyperledger-labs/splice
- [ ] After merging the PR on the DA OSS repo, go to Releases in that repo (https://github.com/digital-asset/decentralized-canton-sync/releases), find the draft release for the release you just created, and publish it (click the edit pencil icon). Do this after merging the PR, because publishing also automatically bundles the sources from the release-line branch.
- [ ] Merge a PR into the _ancestor branch_ with the following changes:
  - Update `VERSION` and `LATEST_RELEASE`: `VERSION` should be the next planned release (typically bumping the minor version), and `LATEST_RELEASE` should be the version of the newly created release line.
- [ ] Communicate to partners that a new version is available
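
As referenced above, a minimal sketch of the release-branch creation and the `[release]` empty-commit step (the commit message wording is an assumption; per this checklist only the `[release]` marker matters):

```sh
# create the release line branch from the merged commit and push it
git checkout -b release-line-0.x.z MERGED_COMMIT_SHA
git push origin release-line-0.x.z

# on a PR branch targeting release-line-0.x.z (branch protection blocks direct pushes),
# create an empty commit whose message carries the [release] marker
git commit --allow-empty -m "[release] cut release 0.x.z"
```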

## Upgrade our own nodes on DevNet

- [ ] If significant time has passed since cutting the release, ensure that there are no changes that need to be backported to the release branch.
  In particular, check for changes to the `cluster/configs` and `cluster/configs-private` submodules.
- [ ] Merge a PR into the release branch (`origin/release-line-0.x.z`) with the following changes:
  - [ ] Update the cluster `config.yaml` file, setting the new reference under `synchronizerMigration.active.releaseReference` and updating `synchronizerMigration.active.version` to `0.x.z` (see the sketch after this checklist).
  - [ ] Update `cluster/deployment/devnet/.envrc.vars`, bumping the release version.
    - Currently, the affected env vars are `OVERRIDE_VERSION`, `CHARTS_VERSION`, and `MULTI_VALIDATOR_IMAGE_VERSION`.
  - [ ] Before merging, open the `preview_pulumi_changes` CircleCI workflow and approve the jobs to generate `deployment` and `devnet` previews.
    Review the changes together with someone else, paying particular attention to deleted or newly created resources.
- [ ] Warn our partners on [#supervalidator-operations](https://daholdings.slack.com/archives/C085C3ESYCT): "We'll be upgrading the DA-2 and DA-Eng nodes on DevNet to test a new version. Some turbulence might be expected."
- [ ] Forward-port the changes to `config.yaml` and `cluster/deployment/devnet/.envrc.vars` to `main`. The `deployment` stack, which watches `main`, should pick that up and upgrade the other Pulumi stacks.
- [ ] Wait for [the operator](https://github.com/DACH-NY/canton-network-node/tree/main/cluster#the-operator) to apply your changes.
  - A good check: `kubectl get stack -n operator -o json | jq '.items | .[] | {name: .metadata.name, status: .status}'` should show all stacks as successful and on the right commit.
    Remember to check that the `lastSuccessfulCommit` field points to the release line that you expect.
- [ ] Confirm that we didn't break anything; for example:
  - [ ] The [SV Status Report Dashboard](https://grafana.dev.global.canton.network.digitalasset.com/d/caffa6f7-c421-4579-a839-b026d3b76826/sv-status-reports?orgId=1) looks green
  - [ ] There are no (unexpected) open alerts
  - [ ] The docs are reachable at both https://dev.network.canton.global/ and https://dev.global.canton.network.digitalasset.com/
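
A hedged sketch of the `config.yaml` fields named above; the nesting follows the dotted paths in this checklist, but the exact value formats are assumptions:

```yaml
synchronizerMigration:
  active:
    releaseReference: release-line-0.x.z # assumed to take a reference to the release line
    version: 0.x.z
```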

## Upgrade our own nodes on TestNet and MainNet

- [ ] One week after DevNet: TestNet
- [ ] One week after TestNet: MainNet

## Follow up

- [ ] If you cut a release, remind the next person in the [rotation](https://docs.google.com/document/d/1f0nVeRnnxKQxwPi5nI2TiMq6qtHPwgiOjtUPUVJMKIk/edit?tab=t.0) that they are up for cutting a release next week.
- [ ] Persist any lessons learned and fix any (documentation) bugs you hit
62 changes: 62 additions & 0 deletions .github/ISSUE_TEMPLATE/hard-migration---disaster-recovery.md
@@ -0,0 +1,62 @@
---
name: 'Hard migration / Disaster recovery'
about: 'Perform a hard migration or disaster recovery on a production cluster'
title: 'NETWORK [Hard Synchronizer Migration|Disaster Recovery] DATE'
labels: ''
assignees: ''

---

Agenda [here](https://docs.google.com/document/d/1AEh9ZMLPxmc9tKn0L7I5S48xHOR4GN__VpP2_IHnL0A/edit#heading=h.9pjnt72egfzq). *PLEASE UPDATE LINK*

Tracking sheet [here](https://docs.google.com/spreadsheets/d/1AKAVhGqxFkhe7kBnbLf9L-nfnpr1H0QjZ5PHMaVkvc8/edit?gid=128511196). *PLEASE UPDATE LINK*

Internal runbook [here](https://github.com/DACH-NY/canton-network-node/blob/main/cluster/README.md#via-the-pulumi-operator).

## Checklist

### Prepare

- [ ] If you are upgrading to a new Canton major version, manually test compatibility of dev/test/mainnet snapshots as described in `cluster/README.md`
- [ ] open or create corresponding agenda and tracking sheet in [this folder](https://drive.google.com/drive/folders/1-HZPAiZ7wVei4nlp-AOyQ5TrZ-S5FVSB)
- [ ] prepare our staging nodes and tell partners
- [ ] (later) forward-port to branches that may serve as potential future release sources (e.g., for 0.2.8: `0.2` and `main`; `main` is always included)
- [ ] a sufficient number of partners have reported that they are ready / prepared (or look as if they are); check once and escalate if the check fails
- [ ] (only if hard migration) vote on scheduled downtime
- [ ] disable periodic CI jobs (including sv and validator runbook resets) on `main`
- [ ] (only if DevNet) take down multi-validator stack (does not handle hard domain migrations in its current form): `cncluster pulumi multi-validator down` from release branch
- [ ] (only if disaster recovery) test the `cncluster take_disaster_recovery_dumps` step
- [ ] take backups with `cncluster backup_nodes` (all nodes in parallel!) as you would during the meeting - to confirm that the commands work for you and to have them ready
- [ ] (shortly before the call) request PAM in case you'll need it later

### Call with all SVs (hard migrations version; remove me if DR)

- [ ] wait for current synchronizer to pause and dumps to be taken (`Wrote domain migration dump` in SV app / validator app logs)
- [ ] ensure that apps are sufficiently caught up
- [ ] take backups with `cncluster backup_nodes`
- [ ] merge PRs for deployment branch & `main` to migrate to higher migration ID
- [ ] check: domain is healthy

### Call with all SVs (DR version; remove me if hard migration)

- [ ] everyone scales down their CometBFT nodes with `kubectl scale deployment --replicas=0 -n <namespace> global-domain-<old-migration-id>-cometbft`
- [ ] take backups with `cncluster backup_nodes`
- [ ] agree on a timestamp based on logs (e.g., ask everyone for the `toInclusive` value of their latest `Commitment correct for sender and period CommitmentPeriod(fromExclusive = X, toInclusive = THIS)` log entry on the participant and use the min of that; see the sketch after this list)
- [ ] get the dumps with `cncluster take_disaster_recovery_dumps`
- [ ] copy the dumps into our PVCs with `cncluster copy_disaster_recovery_dumps`
- [ ] merge PRs for deployment branch & `main` to migrate to higher migration ID
- [ ] check: domain is healthy
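
A hedged sketch for pulling the latest matching commitment log entry off a participant (pod name, namespace, and log access are assumptions; adapt to your setup):

```sh
kubectl logs PARTICIPANT_POD -n NAMESPACE \
  | grep 'Commitment correct for sender and period' \
  | tail -n 1
# read the toInclusive value out of the CommitmentPeriod(...) in that line,
# then take the minimum across all SVs
```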

### Cleanup

- [ ] unset `synchronizerMigration.active.migratingFrom` on the release branch so that future redeploys don't attempt to migrate
- [ ] (later) forward-port to branches that may serve as potential future release sources
- [ ] trigger periodic CI jobs manually once to make sure the updates worked
- [ ] re-enable periodic CI jobs on `main`
- [ ] recheck above forward-port items (e.g. versions, migration IDs)
- [ ] take down old synchronizer nodes (once we're allowed to based on agreement with other SVs)

### Follow-up

- [ ] make sure that the [next planned production network operation](https://docs.google.com/document/d/14gZQNdXLPUCfqxN4vLK_yGlsptfcHMJZR8e1oOKgqLc/edit) has assignees and will get done; escalate if this is not the case
- [ ] improve docs (collect ideas here) and other things
31 changes: 31 additions & 0 deletions .github/ISSUE_TEMPLATE/reset-production-cluster.md
@@ -0,0 +1,31 @@
---
name: Reset production cluster
about: (Planned) reset of DevNet or TestNet
title: Reset production cluster
labels: ""
assignees: ""
---

Scheduled for: *date + time*

- [ ] (a few days before the scheduled time) Remind people on [#supervalidator-operations](https://daholdings.slack.com/archives/C085C3ESYCT) and [#validator-operations](https://daholdings.slack.com/archives/C08AP9QR7K4) that the reset is planned.
- [ ] Merge a PR to the correct release branch that unprotects the databases so they can be deleted. Wait for this change to be applied (you can check [grafana](https://grafana.test.global.canton.network.digitalasset.com/d/QP_wDqDnz/pulumi-operator-stacks-dashboard?orgId=1) for example). Example PR: #16323
- [ ] Prepare (don't merge yet!) a PR against the release branch for bootstrapping the new cluster (example: #16324). This includes but might not be limited to:
  - [ ] Set migration ID to 0
  - [ ] Increment `COMETBFT_CHAIN_ID_SUFFIX` by 1 (see the sketch at the end of this checklist)
  - [ ] Remove any `legacy` or `archive` migrations that might still be around.
  - [ ] Make sure that the current config of the running cluster matches what you will bootstrap (see `INITIAL_PACKAGE_CONFIG_JSON`)
- [ ] (before starting the actual reset) Send another update to SVs and validators as well as internal channels that could be relevant.
- [ ] (while pairing with someone) Uninstall the pulumi operator with `cncluster pulumi operator down`. Note that this does not take down the actual components, only the operator stack resources, so that the operator does not accidentally kick in at the wrong time.
- [ ] (while pairing with someone) Reset all the sv-canton stacks for **archived** migrations one day before the actual reset. You can also delete PVC snapshots and CloudSQL backups for the active migration. Expect this to be slow, which is why it's done one day in advance.
- [ ] (while pairing with someone) Reset all stacks **except** the `infra` stack manually. `cncluster reset` could work, `CI=1 cncluster pulumi XYZ down --yes --skip-preview` will certainly work (check `kubectl get stacks -A` for stacks you should down and don't forget to also down the `deployment` stack). Expect some slowness/timeouts/rate-limiting from GCP.
- [ ] Merge the PR you prepared above
- [ ] Forward-port the PR you prepared above to `main`.
- [ ] Once merged to `main`, redeploy the operator through CircleCI: on `main`, trigger `run-job: deploy-operator`, `cluster: ...`.
- [ ] Wait for the network to deploy and confirm that the `AmuletRules` `packageConfig` contains the expected DAR versions.
- [ ] Prepare and merge a second PR to the release branch that configures wallet sweeps to the DA-Wallet party (`SV1_SWEEP`; you need their wallet to be onboarded, which you can ask for in [#da-wallet](https://daholdings.slack.com/archives/C073K97TL3U)) and (unless you already did this earlier) sets `cloudSql.protect` back to `true` (example PR: https://github.com/DACH-NY/canton-network-node/pull/16329).
- [ ] Tell SVs: "You are welcome to join now with migration ID 0 and chain ID X. Please reset your existing nodes completely, clearing out all databases and PVCs, and then onboard afresh." (example text, tweak as needed)
- [ ] Tell validators: "Please wait until bootstrapping has completed and join 2h from now, using migration ID 0. Please reset your existing nodes completely, clearing out all databases, and then onboard afresh." (example text, tweak as needed)
- [ ] Forward-port the final state on the release branch to `main`
- [ ] Fix anything in this template that you didn't like
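
A hedged sketch of the bootstrap-PR edits to `.envrc.vars` referenced in the checklist above; `MIGRATION_ID` is a hypothetical variable name, and the suffix value is only an example:

```sh
export MIGRATION_ID=0               # hypothetical name; reset the migration ID to 0
export COMETBFT_CHAIN_ID_SUFFIX=5   # example: previous value (4) incremented by 1
```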
18 changes: 18 additions & 0 deletions .github/ISSUE_TEMPLATE/tech-debt---support-issue.md
@@ -0,0 +1,18 @@
---
name: Tech debt / Support issue
about: Create a tech debt or support issue (with some structure)
title: ''
labels: ''
assignees: ''

---

## What is this about?

*(your description here)*

*Remove this line once you have selected the correct milestone.*

## How important is this and why?

*(your estimate and thoughts here)*
10 changes: 10 additions & 0 deletions .github/actionlint.yml
@@ -0,0 +1,10 @@
self-hosted-runner:
  labels:
    - self-hosted-docker-tiny
    - self-hosted-docker-medium
    - self-hosted-docker-large
    - self-hosted-k8s-x-small
    - self-hosted-k8s-small
    - self-hosted-k8s-medium
    - self-hosted-k8s-large
    - self-hosted-k8s-x-large
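
The labels above declare the self-hosted runners so that `actionlint` accepts them in `runs-on`; a hypothetical workflow job using one might look like:

```yaml
jobs:
  build:
    # one of the labels declared in .github/actionlint.yml
    runs-on: self-hosted-docker-medium
    steps:
      - uses: actions/checkout@v4
```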
34 changes: 34 additions & 0 deletions .github/actions/cache/daml_artifacts/restore/action.yml
@@ -0,0 +1,34 @@
name: "Restore Daml artifacts"
description: "Restore the Daml artifacts cache"
inputs:
cache_version:
description: "Version of the cache"
required: true
outputs:
cache_hit:
description: "Cache hit"
value: ${{ steps.restore.outputs.cache-hit }}

runs:
using: "composite"
steps:
- name: Restore Daml artifacts cache
id: restore
uses: actions/cache/restore@v4
with:
path: |
/tmp/daml
apps/common/frontend/daml.js
key: daml-artifacts-${{ inputs.cache_version }} branch:${{ github.ref_name }} dependencies:${{ hashFiles('project/build.properties', 'project/BuildCommon.scala', 'project/DamlPlugin.scala', 'build.sbt', 'daml/dars.lock', 'nix/canton-sources.json') }} rev:${{ github.sha }}
restore-keys: |
daml-artifacts-${{ inputs.cache_version }} branch:${{ github.ref_name }} dependencies:${{ hashFiles('project/build.properties', 'project/BuildCommon.scala', 'project/DamlPlugin.scala', 'build.sbt', 'daml/dars.lock', 'nix/canton-sources.json') }}
daml-artifacts-${{ inputs.cache_version }} branch:main dependencies:${{ hashFiles('project/build.properties', 'project/BuildCommon.scala', 'project/DamlPlugin.scala', 'build.sbt', 'daml/dars.lock', 'nix/canton-sources.json') }}
- name: Extract Daml artifacts
shell: bash
run: |
if [[ -e /tmp/daml/daml.tar.gz ]]; then
tar --use-compress-program=pigz -xf /tmp/daml/daml.tar.gz
else
echo "No cached daml artifacts files found. Skipping..."
fi

32 changes: 32 additions & 0 deletions .github/actions/cache/daml_artifacts/save/action.yml
@@ -0,0 +1,32 @@
name: "Save Daml artifacts"
description: "Saves the Daml artifacts to the cache"
inputs:
cache_version:
description: "Version of the cache"
required: true
load_cache_hit:
description: "Cache hit from the restore Daml artifacts job (should be the cache_hit output from the restore Daml artifacts job)"
required: true

runs:
using: "composite"
steps:
- name: Archive Daml artifacts
if: ${{ ! fromJson(inputs.load_cache_hit) }}
shell: bash
run: |
mkdir -p /tmp/daml
find . -type d -name ".daml" | tar --use-compress-program=pigz -cf /tmp/daml/daml.tar.gz -T -
- name: Not archiving Daml artifacts
if: ${{ fromJson(inputs.load_cache_hit) }}
shell: bash
run: |
echo "Skipping Daml artifacts cache, as there was a cache hit"
- name: Cache precompiled classes
if: ${{ ! fromJson(inputs.load_cache_hit) }}
uses: actions/cache/save@v4
with:
path: |
/tmp/daml
apps/common/frontend/daml.js
key: daml-artifacts-${{ inputs.cache_version }} branch:${{ github.ref_name }} dependencies:${{ hashFiles('project/build.properties', 'project/BuildCommon.scala', 'project/DamlPlugin.scala', 'build.sbt', 'daml/dars.lock', 'nix/canton-sources.json') }} rev:${{ github.sha }}
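
A hypothetical workflow snippet wiring the restore and save actions together (step names and the build command are assumptions; the input/output names come from the action definitions above):

```yaml
steps:
  - name: Restore Daml artifacts
    id: restore_daml
    uses: ./.github/actions/cache/daml_artifacts/restore
    with:
      cache_version: v1
  - name: Build Daml artifacts
    if: ${{ ! fromJson(steps.restore_daml.outputs.cache_hit) }}
    run: sbt damlBuild # assumed build command
  - name: Save Daml artifacts
    uses: ./.github/actions/cache/daml_artifacts/save
    with:
      cache_version: v1
      load_cache_hit: ${{ steps.restore_daml.outputs.cache_hit }}
```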
32 changes: 32 additions & 0 deletions .github/actions/cache/frontend_node_modules/restore/action.yml
