GitHub - Leg1onary/galera_orchestrator_v2: Self-hosted ops panel for MariaDB Galera clusters — real-time monitoring, recovery wizard, rolling restart, SSH diagnostics, smart advisor. Single Docker container.

 ██████╗  █████╗ ██╗     ███████╗██████╗  █████╗
██╔════╝ ██╔══██╗██║     ██╔════╝██╔══██╗██╔══██╗
██║  ███╗███████║██║     █████╗  ██████╔╝███████║
██║   ██║██╔══██║██║     ██╔══╝  ██╔══██╗██╔══██║
╚██████╔╝██║  ██║███████╗███████╗██║  ██║██║  ██║
 ╚═════╝ ╚═╝  ╚═╝╚══════╝╚══════╝╚═╝  ╚═╝╚═╝  ╚═╝

 ██████╗ ██████╗  ██████╗██╗  ██╗███████╗███████╗████████╗██████╗  █████╗ ████████╗ ██████╗ ██████╗
██╔═══██╗██╔══██╗██╔════╝██║  ██║██╔════╝██╔════╝╚══██╔══╝██╔══██╗██╔══██╗╚══██╔══╝██╔═══██╗██╔══██╗
██║   ██║██████╔╝██║     ███████║█████╗  ███████╗   ██║   ██████╔╝███████║   ██║   ██║   ██║██████╔╝
██║   ██║██╔══██╗██║     ██╔══██║██╔══╝  ╚════██║   ██║   ██╔══██╗██╔══██║   ██║   ██║   ██║██╔══██╗
╚██████╔╝██║  ██║╚██████╗██║  ██║███████╗███████║   ██║   ██║  ██║██║  ██║   ██║   ╚██████╔╝██║  ██║
 ╚═════╝ ╚═╝  ╚═╝ ╚═════╝╚═╝  ╚═╝╚══════╝╚══════╝   ╚═╝   ╚═╝  ╚═╝╚═╝  ╚═╝   ╚═╝    ╚═════╝ ╚═╝  ╚═╝

Self-hosted control panel for MariaDB Galera clusters

Real-time monitoring · Full cluster recovery · Split-brain resolution · SSH diagnostics · Smart Advisor
All from a single Docker container. No agents on nodes. No plugins. No magic.

What is this?

Galera Orchestrator v2 connects to your nodes directly via SSH and MariaDB — no agents, no plugins, no sidecar containers. You get a full ops panel that covers daily monitoring, incident diagnostics, and worst-case cluster recovery, all from a browser.

  Your Browser  ──────►  Docker Container (:8000)
                               │
                    ┌──────────┴──────────┐
                    │  Vue 3 SPA          │  dark UI, PrimeVue 4
                    │  FastAPI REST API   │  /api/clusters/{id}/...
                    │  WebSocket          │  /ws/clusters/{id}
                    │  Background Poller  │  asyncio · per cluster · every 5s
                    └──────────┬──────────┘
                               │  SSH + MariaDB  (no agents)
              ┌────────────────┼────────────────┐
              ▼                ▼                ▼
           node-1           node-2           node-3
        SSH+MariaDB      SSH+MariaDB      SSH+MariaDB

Screenshots

Overview — healthy cluster with Smart Advisor

Overview — degraded cluster, availability alert

Topology — visual cluster map with node detail panel

Diagnostics — Smart Advisor with critical findings

Diagnostics — Connection Check (SSH + DB latency)

Diagnostics — Flow Control Monitor

Recovery — grastate.dat Inspector (dirty crash scenario)

Maintenance — Rolling Restart Wizard

What's New

Recovery & Diagnostics v2 — commit f80c074

This release adds 9 new features across monitoring and incident recovery. All backends are real — no mock data.

New on Overview

Feature	Description
Quorum Health Score	Live widget showing primary / non-primary / offline node counts. Colour-coded severity: `healthy` → `degraded` → `critical`. Direct link to Recovery Tools.

New in Diagnostics — Galera Health group

A dedicated tab group with three live Galera-specific panels:

Panel	Metric	Alert condition
Flow Control Monitor	`wsrep_flow_control_paused` — fraction of time cluster was flow-controlled	> 10% warn, > 30% critical
Cert Conflict Rate	`wsrep_local_cert_failures` delta per minute	rising trend — write-set conflicts between nodes
Disk Sentinel	gcache actual size vs `galera.cache` configured limit + `ibdata1` growth	> 90% of gcache limit → SST risk

New on Recovery page — standalone Recovery Tools

Five independent tools available at any cluster state (no need to be in a failure scenario):

Tool	What it does
grastate.dat Inspector	SSH-reads `grastate.dat` from every node. Compares `seqno`, `safe_to_bootstrap`, cluster `uuid`. Identifies the correct bootstrap candidate.
Node State Snapshot	One-shot pre-flight dump — collects `wsrep_*` status, disk, process list, active transactions, InnoDB status from all nodes in parallel. Useful for incident documentation before taking action.
IST vs SST Helper	Compares donor gcache size against the joiner's `seqno` gap. Tells you whether IST (fast, incremental) or SST (full copy) will happen before you rejoin — avoiding surprise full transfers.
Split-Brain Recovery Wizard	Resolves a split-brain cluster: select the trusted node, set `pc.bootstrap=YES` via `wsrep_provider_options`, verify primary component forms. Progress streamed live via WebSocket.
Full Cluster Recovery	Fully automatic: reads `grastate.dat`, picks the bootstrap candidate, bootstraps it, rejoins all remaining nodes in seqno order. Live terminal log. Requires explicit checkbox confirmation — it's destructive.

New API endpoints

GET  /{cluster_id}/diagnostics/flow-control      # wsrep_flow_control_paused live
GET  /{cluster_id}/diagnostics/cert-conflicts    # wsrep_local_cert_failures rate
GET  /{cluster_id}/diagnostics/disk-sentinel     # gcache vs ibdata1 sentinel
GET  /{cluster_id}/diagnostics/quorum-status     # quorum health score

GET  /{cluster_id}/recovery/grastate             # grastate.dat inspector
POST /{cluster_id}/recovery/snapshot             # pre-flight node snapshot
GET  /{cluster_id}/recovery/ist-sst-info         # IST vs SST recommendation
POST /{cluster_id}/recovery/split-brain          # split-brain recovery (202 async)
POST /{cluster_id}/recovery/full-cluster         # full cluster recovery (202 async)

Quick Start

Prerequisites

Requirement	Notes
Docker 24+	With Compose v2 plugin
SSH key	RSA or Ed25519, no passphrase, access to all Galera nodes
Python 3.8+	Installer only — bcrypt hash + Fernet key generation

Option A — One-line installer (recommended)

curl -fsSL https://raw.githubusercontent.com/Leg1onary/galera_orchestrator_v2/master/install.sh | bash

The installer:

Checks Docker, Docker Compose, Python 3
Creates ~/galera-orchestrator/, downloads compose file
Asks: login, password, SSH key path, port, HTTPS mode
Hashes password via bcrypt — plaintext never written
Auto-generates JWT_SECRET_KEY and FERNET_SECRET_KEY
Writes .env (chmod 600), pulls image, starts container

  Panel:  http://<your-server>:8000
  Login:  admin  (or whatever you entered)

Set COOKIE_SECURE=false if not behind TLS.

Option B — Docker Compose manually

curl -fsSL https://raw.githubusercontent.com/Leg1onary/galera_orchestrator_v2/master/docker-compose.ghcr.yml -o docker-compose.ghcr.yml
curl -fsSL https://raw.githubusercontent.com/Leg1onary/galera_orchestrator_v2/master/.env.example -o .env
nano .env
docker compose -f docker-compose.ghcr.yml up -d

Option C — Build from source

git clone https://github.com/Leg1onary/galera_orchestrator_v2.git
cd galera_orchestrator_v2
cp .env.example .env && nano .env
docker compose up -d

Update

bash ~/galera-orchestrator/update.sh

Backs up the DB, pulls the new image from GHCR, restarts. No data loss.

First-run setup

Settings → Clusters — create a cluster
Settings → Datacenters — add datacenter(s)
Settings → Nodes — add nodes with SSH + DB credentials
Select the cluster in the top bar → Overview lights up

Configuration

All config via .env. Required fields marked ✓.

Variable	Default	Req	Description
`ADMIN_USERNAME`	`admin`		Login username
`ADMIN_PASSWORD_HASH`		✓	bcrypt hash. `python3 -c "import bcrypt; print(bcrypt.hashpw(b'pass', bcrypt.gensalt(12)).decode())"`
`JWT_SECRET_KEY`		✓	Min 32 chars. `openssl rand -hex 32`
`FERNET_SECRET_KEY`		✓	Encrypts node passwords. `python3 -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"`
`SSH_KEY_PATH`	`~/.ssh/id_rsa`	✓	Host path. Mounted `:ro` into container.
`HOST_PORT`	`8000`		Exposed port
`COOKIE_SECURE`	`true`		`false` for plain HTTP dev
`DOCS_ENABLED`	`false`		`true` enables `/docs`. Dev only.
`DATABASE_URL`	`sqlite:////data/orchestrator.db`		Don't change — `/data` is a named volume
`SSH_CONNECT_TIMEOUT`	`5`		Seconds
`SSH_COMMAND_TIMEOUT`	`10`		Seconds
`DB_CONNECT_TIMEOUT`	`3`		Seconds

JWT_SECRET_KEY and FERNET_SECRET_KEY must differ. Server refuses to start if they match or contain change-me-*.

SSH Key Setup

One global key, all nodes. Bind-mounted read-only — never stored in the DB.

# Test before starting
ssh -i ~/.ssh/id_rsa -p 22 user@<node-host> "hostname && mysql -e 'SELECT 1'"

# Generate a new key if needed
ssh-keygen -t ed25519 -N "" -f ~/.ssh/galera_key
ssh-copy-id -i ~/.ssh/galera_key.pub user@<node-host>

No passphrase. The container cannot modify or exfiltrate the key (ro mount).

Pages & Features

Overview

Command center. Everything important at a glance.

Cluster Summary Bar — global status, node count, wsrep overview, flow control indicator
Quorum Health Widget (new) — live primary / non-primary / offline breakdown with severity badge. Turns red when quorum is at risk. One click → Recovery Tools.
NodeCards — per-node state with 30-point sparklines for flow_control_paused and recv_queue, read_only flag, maintenance badge
Replication Lag Alert — auto-shown banner when wsrep_local_recv_queue_avg > 0, per-node detail, wsrep_slave_threads recommendation
Advisor Widget — critical/warn count from Smart Advisor, click-through to Diagnostics
Event Log — real-time stream via WebSocket with severity badges

Nodes

Day-to-day node operations table.

Filter/sort by name, state, datacenter, contour
NodeDetailDrawer — full wsrep variable dump, SSH/DB latency, InnoDB status
Per-node: toggle read_only, enter/exit maintenance, restart MariaDB, one-click Rejoin
Clone node — duplicate config with optional credential override
Connection test — on-demand SSH + DB reachability with latency

State badges: SYNCED · JOINED · DONOR · DESYNCED · OFFLINE · DEGRADED

Topology

SVG canvas — visual map of the cluster.

Nodes and arbitrators grouped by datacenter zones
Connection lines with state colour coding
Hover tooltip with full status; click → NodeDetailDrawer
Live updates via WebSocket

Recovery

Two independent sections on the same page.

Recovery Wizard

Step-by-step flow for when the entire cluster is down.

Step	What happens
1 — Scan	SSH into each node, read cluster status + wsrep state, detect non-primary / offline nodes
2 — Bootstrap	Shows `seqno` from `grastate.dat`. Highlights the safe-to-bootstrap candidate. Manual override with explicit confirmation.
3 — Rejoin	Sequential rejoin with per-node progress tracking
4 — Done	Confirms cluster is healthy

Recovery Tools (new)

Five standalone tools available at any time, regardless of cluster health. Accessible via the tab rail below the wizard.

grastate.dat Inspector SSH-reads grastate.dat from every node. Compares seqno, safe_to_bootstrap, cluster uuid across all nodes. Highlights the best bootstrap candidate with a visual diff.

Node State Snapshot One-shot pre-flight dump. Collects wsrep_* variables, disk usage, process list, active transactions, InnoDB status from all nodes simultaneously. Returns structured JSON — use it to document the incident state before touching anything.

IST vs SST Helper Reads donor gcache size and joiner's seqno gap. Recommends IST (fast, incremental — gcache covers the gap) or SST (full state transfer — donor doesn't have enough gcache) before you rejoin. Saves you from surprise full transfers on production clusters.

Split-Brain Recovery Wizard Select the trusted primary-component node → sets pc.bootstrap=YES via wsrep_provider_options → restarts MariaDB on non-primary nodes → verifies the cluster reforms as a single primary component. All steps streamed live via WebSocket.

Full Cluster Recovery Automatic end-to-end sequence when all nodes are down. Reads grastate.dat, picks the bootstrap candidate (highest seqno, preferring safe_to_bootstrap=1), bootstraps it, then rejoins all remaining nodes in seqno-descending order. Live terminal log. Requires explicit checkbox confirmation — this operation is destructive.

Maintenance

Rolling Restart — one-by-one restart, waits for SYNCED before next node
Desync / Resync — toggle wsrep_desync without restart (safe for DDL / dumps)
Read-only toggle — per-node or cluster-wide
Backup Center — scan backup server, browse files, manage retention
Cluster-level operation lock — concurrent recovery + maintenance returns 409 Conflict
All destructive actions require explicit confirmation dialog

Settings

Clusters — CRUD, polling interval
Datacenters — logical groupings for Topology view
Contours — environment labels (prod / staging / etc.)
Nodes — SSH + DB credentials; passwords encrypted at rest via Fernet
Arbitrators — garbd nodes with SSH connectivity monitoring
System — global SSH/DB timeouts

Diagnostics Suite

Six tab groups in the Diagnostics page.

Analysis

Tab	Description
Advisor	Smart Advisor findings — prioritised list with severity badges and action links
Config Health	Automated checks: buffer pool sizing, `max_connections`, `wsrep_slave_threads`, `innodb_flush_log_at_trx_commit`, `wsrep_sync_wait`

Galera Health (new)

Tab	Metric	What it tells you
Flow Control	`wsrep_flow_control_paused`	Fraction of time the cluster was flow-controlled. > 10% = congestion, > 30% = degraded. Usually means a slow node or under-provisioned `wsrep_slave_threads`.
Cert Conflicts	`wsrep_local_cert_failures` rate/min	Write-set certification failures between nodes. Rising rate = hot-row conflicts or poor transaction isolation.
Disk Sentinel	gcache vs `galera.cache` + `ibdata1`	Warns when gcache is near its configured limit. Overflow forces SST instead of IST on the next rejoin — often 10–100× slower.

Config

Tab	Description
Connections	SSH + DB reachability and latency per node and arbitrator
Config Diff	Side-by-side `SHOW GLOBAL VARIABLES` comparison across nodes — highlights diverging values
Variables	Full variable dump per node with search/filter

Engine

Tab	Description
System Resources	CPU, RAM, disk via SSH. Warn 80%, critical 90%. Top-10 tables by size.
InnoDB Status	`SHOW ENGINE INNODB STATUS` with structured deadlock parser — victim, lock type, query snippet
SST Status	Detects stuck donor/joiner. One-click restart-SST.

Activity

Tab	Description
Process List	Live `INFORMATION_SCHEMA.PROCESSLIST` with per-PID kill and bulk-kill by state/user
Transactions	`INNODB_TRX` — transactions older than N seconds with lock and row info
Slow Queries	Live slow query list. Enable/disable slow query log per node at runtime.

Logs & Ops

Tab	Description
Error Log	MariaDB error log tail via SSH with colour-coded severity lines
Arbitrator Log	garbd log tail
Purge Binlogs	`PURGE BINARY LOGS BEFORE` with size preview
Flush	`FLUSH LOGS` · `FLUSH TABLES WITH READ LOCK` · `UNLOCK TABLES`

Smart Advisor

GET /api/clusters/{cluster_id}/advisor

Aggregates data from all diagnostic sources into a prioritised, actionable list. Shown as:

Full panel in Diagnostics → Advisor
Compact widget on Overview (critical/warn count)

Category	Source	Example finding
`config`	Config Health	InnoDB buffer pool at 25% of RAM — recommended 60–70%
`performance`	Config + lag	`wsrep_slave_threads` mismatch vs CPU cores
`replication`	Live state	`wsrep_local_recv_queue_avg > 0` on 2 nodes
`availability`	SST status	Node stuck in JOINING > 5 min — restart SST
`storage`	Disk	Disk at 87% on node-2 — top table `db.orders` 4.2 GB
`locking`	Transactions	3 transactions running > 5 min, oldest 14 min
`locking`	InnoDB	Deadlock detected — victim query and tables identified
`security`	Config	`innodb_flush_log_at_trx_commit ≠ 1` — durability risk

Severity: critical 🔴 · warn 🟡 · info 🔵

Each finding has an action that maps to a UI interaction: open panel, pre-fill node action, or open Recovery wizard.

API Reference

All endpoints cluster-scoped under /api/clusters/{cluster_id}/....
Auth: httpOnly JWT cookie. Set via POST /api/auth/login, validated via GET /api/auth/me.

Auth

Method	Path	Description
`POST`	`/api/auth/login`	Login — sets `access_token` cookie
`GET`	`/api/auth/me`	Validate session
`POST`	`/api/auth/logout`	Clear cookie

Clusters

Method	Path	Description
`GET/POST`	`/api/clusters`	List / create
`GET/PATCH/DELETE`	`/api/clusters/{id}`	Read / update / delete
`GET`	`/api/clusters/{id}/status`	Live status: nodes + arbitrators

Nodes

Method	Path	Description
`GET/POST`	`/{id}/nodes`	List / create
`GET/PATCH/DELETE`	`/{id}/nodes/{nid}`	CRUD
`POST`	`/{id}/nodes/{nid}/test-connection`	SSH + DB reachability
`POST`	`/{id}/nodes/{nid}/set-read-only`	Toggle `read_only`
`POST`	`/{id}/nodes/{nid}/desync`	`wsrep_desync ON`
`POST`	`/{id}/nodes/{nid}/resync`	`wsrep_desync OFF`
`POST`	`/{id}/nodes/{nid}/restart`	`systemctl restart mariadb`
`POST`	`/{id}/nodes/{nid}/rejoin`	Rejoin single offline node
`POST`	`/{id}/nodes/{nid}/kill-process/{pid}`	Kill by PID
`POST`	`/{id}/nodes/{nid}/kill-processes`	Bulk kill `{ state?, user? }`
`GET`	`/{id}/nodes/{nid}/error-log`	Tail error log
`POST`	`/{id}/nodes/{nid}/set-slow-query-log`	`{ enabled: bool }`

Diagnostics

Method	Path	Description
`POST`	`/{id}/diagnostics/check-all`	Full SSH + DB connectivity check
`GET`	`/{id}/diagnostics/config-diff`	Variable diff across nodes
`GET`	`/{id}/diagnostics/variables`	wsrep variables per node
`POST`	`/{id}/diagnostics/resources`	CPU / RAM / disk via SSH
`GET`	`/{id}/diagnostics/process-list`	`INFORMATION_SCHEMA.PROCESSLIST`
`GET`	`/{id}/diagnostics/slow-queries`	Slow query list
`POST`	`/{id}/diagnostics/disk-usage`	Detailed disk usage
`GET`	`/{id}/diagnostics/galera-status`	`wsrep` STATUS per node
`GET`	`/{id}/diagnostics/flow-control`	new `wsrep_flow_control_paused` live
`GET`	`/{id}/diagnostics/cert-conflicts`	new cert failure rate
`GET`	`/{id}/diagnostics/disk-sentinel`	new gcache + ibdata1 sentinel
`GET`	`/{id}/diagnostics/quorum-status`	new quorum health score

Recovery

Method	Path	Description
`POST`	`/{id}/recovery/scan`	Scan all nodes — grastate + wsrep
`POST`	`/{id}/recovery/bootstrap`	Bootstrap selected node
`POST`	`/{id}/recovery/rejoin/{nid}`	Rejoin single node
`DELETE`	`/{id}/recovery/cancel`	Cancel active recovery
`GET`	`/{id}/recovery/grastate`	new grastate.dat inspector
`POST`	`/{id}/recovery/snapshot`	new pre-flight node snapshot
`GET`	`/{id}/recovery/ist-sst-info`	new IST vs SST recommendation
`POST`	`/{id}/recovery/split-brain`	new split-brain recovery `202`
`POST`	`/{id}/recovery/full-cluster`	new full cluster recovery `202`

Async 202 endpoints return { operation_id, status: "started" }. Progress streamed via WebSocket operation_progress / operation_finished events.

Maintenance

Method	Path	Description
`POST`	`/{id}/maintenance/rolling-restart`	Rolling restart `202`
`DELETE`	`/{id}/maintenance/rolling-restart`	Cancel

WebSocket

WS /ws/clusters/{cluster_id}

Same httpOnly JWT cookie. Events:

Event	Key payload fields
`node_state_changed`	`node_id`, `state`, `wsrep_*`
`arbitrator_state_changed`	`arb_id`, `reachable`
`operation_started`	`operation_id`, `type`
`operation_progress`	`operation_id`, `message`, `level`
`operation_finished`	`operation_id`, `success`, `error?`
`log_entry`	`severity`, `message`

Reconnects automatically with exponential backoff. Falls back to 5s HTTP polling if WebSocket is unavailable.

Architecture

┌──────────────────────────────────────────────────┐
│                 Docker Container                 │
│                                                  │
│  ┌─────────────┐    ┌──────────────────────────┐ │
│  │  Vue 3 SPA  │◄───│  FastAPI                 │ │
│  │  (static)   │    │  REST + WebSocket         │ │
│  └─────────────┘    │  Background Poller        │ │
│                     └──────────┬────────────────┘ │
│                                │                  │
│                     ┌──────────▼────────────────┐ │
│                     │  SQLite /data/             │ │
│                     │  orchestrator.db           │ │
│                     └───────────────────────────┘ │
│                                                  │
│  /root/.ssh/id_rsa  (bind-mount :ro from host)   │
└──────────────────────────────────────────────────┘
            │  SSH + MariaDB
            ▼
      Galera Cluster Nodes

Layer	Technology
Backend	FastAPI 0.110+, SQLAlchemy 2 (async), Pydantic v2, slowapi
SSH	paramiko
DB driver	PyMySQL
Auth	python-jose JWT, bcrypt, cryptography Fernet
Frontend	Vue 3, Vite 5, Pinia, Vue Router, TanStack Vue Query, TypeScript
UI	PrimeVue 4, dark-only custom design system
Realtime	WebSocket + 5s polling fallback
Storage	SQLite, single file, named Docker volume

Project Structure

galera_orchestrator_v2/
├── backend/
│   ├── main.py                    # app bootstrap, middleware, lifespan
│   ├── auth.py                    # JWT + bcrypt, httpOnly cookie
│   ├── config.py                  # pydantic-settings
│   ├── models.py                  # SQLAlchemy ORM
│   ├── models_live.py             # in-memory live cluster state
│   ├── services/
│   │   ├── ssh_client.py          # paramiko wrapper (context manager)
│   │   ├── db_client.py           # PyMySQL wrapper
│   │   ├── poller.py              # asyncio background poller
│   │   ├── recovery.py            # scan / bootstrap / rejoin primitives
│   │   ├── maintenance.py         # rolling restart, desync
│   │   ├── operations.py          # cluster-level operation lock
│   │   ├── ws_manager.py          # WebSocket connection manager
│   │   ├── event_log.py           # ring buffer + broadcast
│   │   └── crypto.py              # Fernet encrypt/decrypt
│   └── routers/
│       ├── auth.py
│       ├── clusters.py
│       ├── nodes.py               # CRUD + all node actions
│       ├── recovery.py            # wizard: scan, bootstrap, rejoin
│       ├── recovery_advanced.py   # grastate, snapshot, ist-sst,
│       │                          # split-brain, full-cluster  ← new
│       ├── maintenance.py
│       ├── diagnostics.py         # 20+ endpoints incl. flow-control,
│       │                          # cert-conflicts, disk-sentinel,
│       │                          # quorum-status               ← new
│       ├── advisor.py
│       ├── backup.py
│       ├── contours.py
│       ├── settings.py
│       ├── version.py
│       └── ws.py
├── frontend/src/
│   ├── pages/
│   │   ├── OverviewPage.vue       # + QuorumHealthWidget         ← new
│   │   ├── DiagnosticsPage.vue    # + Galera Health group        ← new
│   │   ├── RecoveryPage.vue       # + Recovery Tools section     ← new
│   │   ├── NodesPage.vue
│   │   ├── TopologyPage.vue
│   │   ├── MaintenancePage.vue
│   │   ├── BackupsPage.vue
│   │   └── SettingsPage.vue
│   ├── components/
│   │   ├── overview/
│   │   │   ├── QuorumHealthWidget.vue   ← new
│   │   │   ├── NodeCard.vue
│   │   │   ├── AdvisorWidget.vue
│   │   │   ├── ReplicationLagAlert.vue
│   │   │   ├── ClusterSummaryBar.vue
│   │   │   └── EventLog.vue
│   │   ├── diagnostics/
│   │   │   ├── FlowControlPanel.vue     ← new
│   │   │   ├── CertConflictPanel.vue    ← new
│   │   │   ├── DiskSentinelPanel.vue    ← new
│   │   │   └── … (15 existing panels)
│   │   └── recovery/
│   │       ├── GrastateInspectorPanel.vue  ← new
│   │       ├── SnapshotPanel.vue            ← new
│   │       ├── IstSstHelper.vue             ← new
│   │       ├── SplitBrainWizard.vue         ← new
│   │       ├── FullClusterRecovery.vue      ← new
│   │       └── Step1–4 (wizard steps)
│   ├── api/
│   │   ├── recovery-advanced.ts     ← new
│   │   └── … (10 existing modules)
│   └── stores/                      # Pinia stores
├── tests/e2e/
├── Dockerfile
├── docker-compose.yml               # dev (build from source)
├── docker-compose.ghcr.yml          # prod (pull from GHCR)
├── .env.example
├── install.sh
└── update.sh

Data & Backup

All data: single SQLite file on a named Docker volume.

# Backup
docker exec galera-orchestrator sqlite3 /data/orchestrator.db ".backup /data/backup.db"
docker cp galera-orchestrator:/data/backup.db ./orchestrator-$(date +%Y%m%d).db

# Restore
docker cp ./orchestrator-YYYYMMDD.db galera-orchestrator:/data/orchestrator.db
docker compose restart orchestrator

Changing FERNET_SECRET_KEY makes all encrypted node passwords unreadable. Always back up before key rotation. Re-enter passwords in Settings → Nodes after rotation.

Development

git clone https://github.com/Leg1onary/galera_orchestrator_v2.git
cd galera_orchestrator_v2
cp .env.example .env
# Set: COOKIE_SECURE=false, DOCS_ENABLED=true, fill in secrets

docker compose up -d
# Frontend HMR:  http://localhost:5173
# Backend API:   http://localhost:8000
# Swagger UI:    http://localhost:8000/docs  (DOCS_ENABLED=true required)

Generate a bcrypt hash:

python3 -c "import bcrypt; print(bcrypt.hashpw(b'admin', bcrypt.gensalt(12)).decode())"

Security

Concern	How it's handled
Admin password	bcrypt 12-round hash in `.env`. Plaintext never stored.
JWT	Min 32 chars. Server refuses to start with default.
Node DB passwords	Fernet-encrypted in SQLite. Changing the key invalidates all.
SSH key	Mounted `:ro` — container cannot modify or exfiltrate it.
JWT in browser	`httpOnly` cookie — invisible to JavaScript, XSS-safe.
Dual secrets	`JWT_SECRET_KEY ≠ FERNET_SECRET_KEY` enforced at startup.
HTTP headers	`X-Content-Type-Options`, `X-Frame-Options`, `X-XSS-Protection`, `Referrer-Policy`, `Permissions-Policy`, `Content-Security-Policy` on every response.
API docs	Disabled by default. `/docs` → 404 in production.
No mock mode	All SSH/SQL operations are real. Use a test cluster for experiments.

Pre-production checklist:

[ ] ADMIN_PASSWORD_HASH is a bcrypt hash, not plaintext
[ ] JWT_SECRET_KEY ≥ 32 unique random chars
[ ] FERNET_SECRET_KEY is a valid Fernet key, differs from JWT
[ ] COOKIE_SECURE=true  (behind TLS / reverse-proxy)
[ ] DOCS_ENABLED=false
[ ] .env is chmod 600
[ ] SSH key is chmod 600, no passphrase

Troubleshooting

Panel unreachable after docker compose up

docker compose logs orchestrator
docker compose ps

401 Unauthorized after login

Access the panel on the same host:port as HOST_PORT. httpOnly cookies are not sent cross-origin.

Set COOKIE_SECURE=false for plain HTTP.

Node immediately shows OFFLINE

ssh -i ~/.ssh/id_rsa -p <ssh_port> <user>@<node-host> "hostname"
mysql -h <node-host> -P <db_port> -u <db_user> -p -e "SELECT 1"

Node DB passwords broken after .env change

FERNET_SECRET_KEY was changed — all ciphertexts are unreadable. Go to Settings → Nodes and re-enter each db_password.

WebSocket shows Disconnected in footer

docker compose ps
curl -i -N \
  -H "Upgrade: websocket" -H "Connection: Upgrade" \
  -H "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==" \
  -H "Sec-WebSocket-Version: 13" \
  http://localhost:8000/ws/clusters/1

Rolling Restart stuck mid-way

Affected node stays in maintenance (read_only=ON). Go to Maintenance → Node Maintenance State → Exit on that node.

IST vs SST Helper recommends SST unexpectedly

Donor's gcache doesn't cover the joiner's seqno gap. Increase gcache.size in wsrep_provider_options (e.g. gcache.size=2G) to make IST available for future rejoins. The current rejoin will proceed via SST — it's safe, just slower.

Split-Brain Wizard: cluster doesn't reform after pc.bootstrap

The non-primary nodes must be restarted after pc.bootstrap=YES is set on the trusted node. The wizard does this automatically, but if it fails mid-step: manually run systemctl restart mariadb on each non-primary node one by one and watch wsrep_cluster_status on each.

Full Cluster Recovery: "no safe_to_bootstrap node found"

All nodes have safe_to_bootstrap: 0 — this happens after a crash where Galera couldn't mark a node safe. The tool will still proceed using the highest seqno node, but shows a warning. Verify the seqno values in the grastate.dat Inspector before confirming.

How to change the admin password

python3 -c "import bcrypt; print(bcrypt.hashpw(b'newpassword', bcrypt.gensalt(12)).decode())"
# Paste the hash into ADMIN_PASSWORD_HASH in .env, then:
docker compose -f ~/galera-orchestrator/docker-compose.ghcr.yml restart

"Check updates" shows "registry unavailable"

Server can't reach ghcr.io — expected on air-gapped networks. Use update.sh manually.

Built for ops teams who need real control, not pretty dashboards.

⬆ Back to top

Name		Name	Last commit message	Last commit date
Latest commit History 468 Commits
.github		.github
backend		backend
frontend		frontend
tests/e2e		tests/e2e
.env.example		.env.example
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
docker-compose.ghcr.yml		docker-compose.ghcr.yml
docker-compose.yml		docker-compose.yml
install.sh		install.sh
update.sh		update.sh

Folders and files

Latest commit

History

Repository files navigation

What is this?

Screenshots

What's New

New on Overview

New in Diagnostics — Galera Health group

New on Recovery page — standalone Recovery Tools

New API endpoints

Contents

Quick Start

Prerequisites

Option A — One-line installer (recommended)

Option B — Docker Compose manually

Option C — Build from source

Update

First-run setup

Configuration

SSH Key Setup

Pages & Features

Overview

Nodes

Topology

Recovery

Recovery Wizard

Recovery Tools (new)

Maintenance

Settings

Diagnostics Suite

Analysis

Galera Health (new)

Config

Engine

Activity

Logs & Ops

Smart Advisor

API Reference

Auth

Clusters

Nodes

Diagnostics

Recovery

Maintenance

WebSocket

Architecture

Project Structure

Data & Backup

Development

Security

Troubleshooting

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages