docs(validate): operator runbook pages (MTN-116)#340

Draft
JohnnyWyles wants to merge 6 commits into main from chore/mtn-116-validate-operator-pages

JohnnyWyles (Collaborator) commented Jun 8, 2026

Validate section: node/validator operational runbooks (MTN-116, under the MTN-111 content umbrella).

Operator review required. These are safety-critical (downtime and double-signing are slashable). I drafted the content that can be grounded in canonical sources (Cosmovisor, pruning/config, Prometheus instrumentation) and left explicit TODO(operator) markers + :::caution/:::danger admonitions for the validator-specific bits that must be confirmed before publishing. Search the diff for TODO(operator) to find every spot needing your input.
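For reviewers who prefer a checklist over searching the diff by hand, the marker scan can be sketched in a few lines of Python. This is a minimal illustration, not part of the PR: the `find_markers` helper is hypothetical, and it assumes the runbooks are Markdown files (`*.md`) under a directory you pass in.

```python
from pathlib import Path

MARKER = "TODO(operator)"


def find_markers(root, marker=MARKER):
    """Return (path, line_number, stripped_line) for every marker found
    in Markdown files under root, in sorted path order."""
    hits = []
    for path in sorted(Path(root).rglob("*.md")):
        text = path.read_text(encoding="utf-8")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if marker in line:
                hits.append((str(path), lineno, line.strip()))
    return hits
```

Calling `find_markers` on the docs directory of a checkout (the exact path depends on the repo layout) and printing each tuple gives one line per outstanding item. An empty result would be a reasonable precondition for publishing.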

Review links point at this PR's Vercel preview build.

New pages

  • Chain Upgrades and Cosmovisor — how governance upgrades work, Cosmovisor layout/env, pre-staging binaries. TODO: exact per-upgrade procedure + missed-upgrade recovery.
  • Sync Options — snapshot vs state sync vs genesis, pruning vs archive, choosing by use case. TODO: current snapshot providers / state-sync RPCs.
  • Node Configuration and Maintenance — app.toml / config.toml tuning. TODO: current seeds/peers.
  • Monitoring and Alerting — Prometheus instrumentation, what to alert on. TODO: team Grafana dashboards / alert rules.
  • Validator Security and Recovery — key security, sentry architecture, backup/DR. The double-signing hazard is stated unambiguously; the sentry topology and the exact backup/DR runbook are TODO (must not be guessed).

The Validate sidebar was reordered into a logical operator flow (install -> join -> upgrades -> sync -> config -> monitoring -> performance -> security -> tmkms -> validating -> relayer).

Safety-critical items to confirm before merge

  • Per-upgrade procedure and missed-upgrade recovery (upgrades page)
  • Current snapshot providers and state-sync servers (sync-options)
  • Current seeds / persistent peers (node-configuration)
  • Sentry topology + firewall/DDoS posture (security)
  • Backup contents + failover runbook + priv_validator_state.json handling (security)

Cross-branch note

node-configuration references the network base fee but links the Integrate section generally rather than /integrate/fees (that page is on the unmerged MTN-114 branch, #338). Can be pointed at /integrate/fees once #338 merges.


Accuracy sweep (added after initial review)

A full multi-agent, adversarially-verified audit of the Validate section followed the initial runbooks (live-chain checks where relevant). 8 files changed.

Verified fixes:

  • Blocker (financial): node-configuration.md minimum-gas-prices example was 0.025uosmo, below the 0.03 base fee and contradicting its own instruction → 0.03uosmo with a fee-market query pointer.
  • validating-mainnet/testnet: create-validator narration corrected (example uses 400 OSMO, bullets said 500); invalid wosmongton@osmosis.labs contact domain → osmosis.team; gas-prices described as price-per-gas; malformed query staking validators flag order; KEY_NAME consistency; H1 casing; --chain-id dropped from read-only signing-info queries; testnet chain-id note.
  • joining-mainnet/testnet: added --chain-id to init; reconciled RAM (32 GB minimum, 64 GB recommended for validators); added a caution that testnet seeds/peers rotate and must be confirmed against the chain registry.
  • performance: fixed the benchstat example (step 2 checked out master instead of the feature branch); sanitized a leaked host/home path; dropped the invalid heap ?seconds param; converted en-dash glyphs to markdown bullets; removed a duplicate line.
  • relayer-guide: matched the hermes sample output (1.13.3) to the build tag; replaced filler memo_prefix.
  • index: extended the landing prose to cover the operations guides.
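The minimum-gas-prices blocker in the list above comes down to one comparison: the configured price must not fall below the current base fee. A minimal Python sketch of that check follows. The helper names are hypothetical, it assumes a single-denom value such as `0.03uosmo`, and it takes the base fee as an input rather than fetching it, because the base fee is dynamic and should come from the fee-market query.

```python
import re
from decimal import Decimal


def parse_gas_price(value):
    """Split a single-denom gas price such as '0.03uosmo' into (amount, denom)."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([a-zA-Z][a-zA-Z0-9/]*)", value.strip())
    if match is None:
        raise ValueError(f"not a single-denom gas price: {value!r}")
    return Decimal(match.group(1)), match.group(2)


def meets_base_fee(min_gas_prices, base_fee, denom="uosmo"):
    """Return True when the configured price is at or above the base fee.

    base_fee is passed as a string (for example "0.03") so that Decimal
    compares exact values instead of binary floats.
    """
    amount, found_denom = parse_gas_price(min_gas_prices)
    if found_denom != denom:
        raise ValueError(f"expected denom {denom!r}, got {found_denom!r}")
    return amount >= Decimal(base_fee)
```

With the base fee quoted in this PR (0.03), `meets_base_fee("0.025uosmo", "0.03")` is false and `meets_base_fee("0.03uosmo", "0.03")` is true, which matches the fix described above. The live value may differ, so the comparison should use whatever the base-fee query returns at the time.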

Still needs operator input: 6 TODO(operator) markers remain (snapshot providers, seeds/peers, sentry topology, backup/DR runbook, monitoring dashboards, missed-upgrade recovery).

Build passes.


Note

Low Risk
Documentation-only changes; no runtime or application code. Residual risk is publishing unverified operator guidance until TODO(operator) items are filled in.

Overview
Adds five new Validate operator runbooks (MTN-116): Chain Upgrades and Cosmovisor, Sync Options, Node Configuration and Maintenance, Monitoring and Alerting, and Validator Security and Recovery. Together they cover governance upgrades, sync/pruning choices, app.toml/config.toml tuning, Prometheus alerting, and double-signing / sentry / DR guidance, with cross-links into existing install, TMKMS, and performance pages.

Sidebar order on existing Validate docs is updated (sidebar_position only on performance, TMKMS, validating mainnet/testnet, relayer) so the section reads as an operator flow: install → join → upgrades → sync → config → monitoring → performance → security → TMKMS → validating → relayer.

Several spots are explicitly draft / operator-verify: TODO(operator) for live seeds, snapshot URLs, Grafana rules, sentry topology, and backup runbooks; :::caution / :::danger on values that must not be guessed before publish.

Reviewed by Cursor Bugbot for commit ad87df8.

vercel bot commented Jun 8, 2026

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| docs | Ready | Preview, Comment | Jun 18, 2026 10:48am |

The Osmosis gas_price of 0.0026uosmo predated the dynamic fee market
and would get the relayer's txs rejected; bump to 0.03 and note that it
must track the base fee (query with `osmosisd query txfees base-fee`).
Update the stale `hermes 1.0.0` version example.

Apply verified findings from a full audit of the Validate section:

- node-configuration: correct the minimum-gas-prices example to 0.03uosmo (the old
  0.025uosmo was below the dynamic base fee and contradicted its own instruction);
  point operators at the fee-market base-fee query.
- validating-mainnet/testnet: fix the create-validator narration (the example uses
  400 OSMO, the bullets said 500); correct the invalid wosmongton@osmosis.labs
  contact domain; describe gas-prices as price-per-gas not gas amount; fix the
  malformed `query staking validators` flag order; align KEY_NAME usage; lowercase
  the H1; drop --chain-id from read-only signing-info queries; fix the testnet
  chain-id note.
- joining-mainnet/testnet: add --chain-id to init; reconcile the RAM recommendation
  (32 GB minimum, 64 GB recommended for validators); add a caution that testnet
  seeds/peers rotate and must be confirmed against the chain registry.
- performance: fix the benchstat example (step 2 checked out master instead of the
  feature branch); sanitize a leaked host/home path; drop the invalid heap
  ?seconds param; convert en-dash glyphs to markdown bullets; remove a duplicate.
- relayer-guide: match the hermes sample output (1.13.3) to the build tag; replace
  filler memo_prefix with a self-documenting placeholder.
- index: extend the landing prose to cover the operations guides (sync options,
  node configuration, upgrades, monitoring, security).

Intentional TODO(operator) markers (snapshot providers, seeds, sentry topology,
backup/DR) are left in place pending operator input.

Fills the two operator stubs that can be sourced authoritatively.

- sync-options: list the official snapshots.osmosis.zone and Polkachu snapshot
  providers; narrow the remaining caution to state-sync RPC servers only.
- node-configuration: document the official seeds that `osmosisd init` writes
  (seed.osmosis.zone, seeds.polkachu.com) and point at the Cosmos chain
  registry for the current full seed/peer set.

The deployment-specific runbooks (monitoring dashboards, sentry topology,
backup/failover, upgrade recovery, state-sync RPC servers) remain marked
TODO(operator); they are safety-critical and await confirmation from ops.