Conversation

@TheGoldenPlatypus (Contributor) commented Oct 5, 2025

https://issues.hibernatingrhinos.com/issue/RavenDB-24939/Ansible-add-health-checks-when-version-upgrades-are-being-done


Test matrix:

# ======================================
# scenario: rolling-upgrade-node-alive-ok
# gates used: node_alive
#
# flow:
# 1. setup 3 independent nodes with ravendb 5.4
# 2. try to rolling upgrade to 6.2 serially:
#   - roll A -> B -> C
#   - pre-gate node_alive
#   - upgrade node to 6.2
#   - post-gate node_alive
# 
# checkpoints:
#  - each node must pass node_alive gate before upgrade
#  - each node must pass node_alive gate after upgrade
#
# outcome:
# - happy path. PASS. All nodes upgraded to 6.2
# ======================================
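For the happy-path scenario above, a minimal sketch of one rolling pass gated by node_alive, assuming the ravendb.ravendb.healthcheck invocation shown later in this PR; the play layout, inventory group, apt-based upgrade step and ravendb_target_version variable are illustrative assumptions, not the role's actual implementation:

- name: Rolling upgrade gated by node_alive (illustrative sketch)
  hosts: ravendb_cluster              # assumed inventory group containing nodes A, B, C
  serial: 1                           # roll one node at a time: A -> B -> C
  become: true
  tasks:
    - name: Pre-gate, node must be alive before we touch it
      ravendb.ravendb.healthcheck:
        url: "http://{{ ansible_hostname }}:8080"
        checks: ['node_alive']
      delegate_to: localhost

    - name: Upgrade RavenDB to 6.2 (assumed package-based install)
      ansible.builtin.apt:
        name: "ravendb={{ ravendb_target_version }}"   # assumed variable, e.g. a 6.2.x build
        state: present
        update_cache: yes

    - name: Post-gate, node must come back alive after the upgrade
      ravendb.ravendb.healthcheck:
        url: "http://{{ ansible_hostname }}:8080"
        checks: ['node_alive']
      delegate_to: localhost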
# ======================================
# scenario: rolling-upgrade-node-and-cluster-ok
# gates used: node_alive, cluster_connectivity
#
# flow:
# 1. setup 3 independent nodes with ravendb 5.4
# 2. form a cluster out of them
# 3. try to rolling upgrade to 6.2 serially:
#    - roll A -> B -> C
#    - pre-gate node_alive + cluster_connectivity
#    - upgrade node to 6.2
#    - post-gate node_alive + cluster_connectivity
#
# checkpoints:
# - each node must pass node_alive gate before upgrade
# - each node must pass cluster_connectivity (peer ping) gate before upgrade
# - each node must pass node_alive gate after upgrade
# - each node must pass cluster_connectivity (peer ping) gate after upgrade
#
# outcome:
# - PASS. All nodes upgraded to 6.2; cluster healthy at each step
# ======================================
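Step 2 above (forming the cluster) can be driven through the admin REST API; a sketch using ansible.builtin.uri, assuming node A is the seed node and that PUT /admin/cluster/node?url=... adds a member (node addresses are illustrative):

- name: Form the 3-node cluster by joining B and C to A (illustrative sketch)
  hosts: localhost
  gather_facts: no
  vars:
    leader_url: "http://ubuntu-bionic-node-a:8080"      # assumed node addresses
    member_urls:
      - "http://ubuntu-bionic-node-b:8080"
      - "http://ubuntu-bionic-node-c:8080"
  tasks:
    - name: Add each remaining node to the cluster via the seed node
      ansible.builtin.uri:
        url: "{{ leader_url }}/admin/cluster/node?url={{ item | urlencode }}"
        method: PUT
        status_code: [200, 201]
      loop: "{{ member_urls }}"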
# ======================================
# scenario: rolling-upgrade-all-gates-rf3-ok
# gates used: node_alive, cluster_connectivity, node_databases_online
#
# flow:
# 1. setup 3 independent nodes with ravendb 5.4
# 2. form a cluster out of them
# 3. Create RF=3 database replicated on A/B/C
# 4. try to rolling upgrade to 6.2 serially:
#    - roll A -> B -> C
#    - pre-gate node_alive + cluster_connectivity + node_databases_online
#    - upgrade node to 6.2
#    - post-gate node_alive + cluster_connectivity + node_databases_online
#
# checkpoints:
# - pre-upgrade: 'node_alive','cluster_connectivity','node_databases_online'
# - post-upgrade: 'node_alive','cluster_connectivity','node_databases_online'
#
# outcome:
# - PASS. All nodes at 6.2; DB remains available throughout
# =========================================
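Step 3 above (RF=3 database) is a single admin call; a sketch with ansible.builtin.uri, assuming the PUT /admin/databases endpoint that backs CreateDatabaseOperation; the node address, database name and accepted status codes are assumptions:

- name: Create the RF=3 database used by this scenario (illustrative sketch)
  hosts: localhost
  gather_facts: no
  tasks:
    - name: PUT a database record with replication factor 3
      ansible.builtin.uri:
        url: "http://ubuntu-bionic-node-a:8080/admin/databases?name=test_db&replicationFactor=3"
        method: PUT
        body_format: json
        body:
          DatabaseName: test_db
        status_code: [200, 201]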
# ======================================
# scenario: db-online-ignores-disabled-dbs
# gates used: node_alive, cluster_connectivity, node_databases_online
#
# flow:
# 1. setup 3 independent nodes with ravendb 5.4
# 2. form a cluster out of them
# 3. Create RF=3 "ok_db" replicated on A/B/C
# 4. Create RF=3 "bad_db" replicated on A/B/C
# 5. corrupt "bad_db" on A,C
# 6. Disable "bad_db"
# 7. try to rolling upgrade to 6.2 serially:
#    - roll A -> B -> C
#    - pre-gate node_alive + cluster_connectivity + node_databases_online
#    - upgrade node to 6.2
#    - post-gate node_alive + cluster_connectivity + node_databases_online
#
# checkpoints:
# - Gate passes because disabled DBs are skipped
#
# outcome:
# - PASS. All nodes upgraded to 6.2
# ======================================
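Step 6 above (disabling bad_db so the gate skips it) could look like this; the sketch assumes the admin endpoint behind ToggleDatabasesStateOperation, POST /admin/databases/disable with a DatabaseNames list, so treat the endpoint and payload shape as assumptions:

- name: Disable bad_db so the node_databases_online gate skips it (illustrative sketch)
  hosts: localhost
  gather_facts: no
  tasks:
    - name: POST the disable toggle for bad_db
      ansible.builtin.uri:
        url: "http://ubuntu-bionic-node-a:8080/admin/databases/disable"
        method: POST
        body_format: json
        body:
          DatabaseNames: ["bad_db"]
        status_code: 200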
# ======================================
# scenario: db-online-timeout-continue-soft-pass
# gates used: node_alive, cluster_connectivity, node_databases_online (policy=continue on timeout)
#
# flow:
# 1. setup 3 independent nodes with ravendb 5.4
# 2. form a cluster out of them
# 3. Create RF=3 database replicated on A/B/C
# 4. stop ravendb service on A,C
# 5. try to rolling upgrade to 6.2 serially:
#    - roll A -> B -> C
#    - pre-gate node_alive + cluster_connectivity + node_databases_online
#    - upgrade node to 6.2
#    - post-gate node_alive + cluster_connectivity + node_databases_online
#
# checkpoints:
# - timeout on pre-gate node_databases_online on B; with the continue-on-timeout policy this is not an explicit failure, so the run proceeds despite the lack of confirmation
#
# outcome:
# - PASS (soft pass). Assert that the timeout occurred; all nodes upgraded.
# ===============================
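A sketch of how the continue-on-timeout policy might be expressed when calling the gate; the timeout_seconds and on_timeout options are hypothetical names used only to illustrate the soft-pass idea, not confirmed module parameters:

- name: Pre-gate databases online without hard-failing on a timeout (hypothetical options)
  ravendb.ravendb.healthcheck:
    url: "http://{{ ansible_hostname }}:8080"
    checks: ['node_databases_online']
    timeout_seconds: 60        # hypothetical parameter name
    on_timeout: continue       # hypothetical parameter name; soft pass instead of failing the run
  delegate_to: localhost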
# ======================================
# scenario: rolling-upgrade-node-alive-pre-gate-fails-b
# gates used: node_alive
#
# flow:
# 1. setup 3 independent nodes with ravendb 5.4
# 2. Deliberately sabotage node B (stop service)
# 3. try to rolling upgrade to 6.2 serially:
#   - roll A -> B -> C
#   - pre-gate node_alive
#   - upgrade the node to 6.2
#   - post-gate node_alive
#
# checkpoints:
# - A pre-gate passes
# - A post-gate passes
# - B pre-gate fails
# - C pre-gate is never reached
#
# outcome:
# - FAIL before upgrading B. A upgraded to 6.2; B/C remain 5.4
# - we catch the failure of pre-gate on B and stop the run
# ======================================
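Step 2 above (sabotaging B) only needs the service stopped; a sketch assuming the systemd unit is called ravendb and the host alias matches the ones used elsewhere in these tests:

- name: Sabotage node B by stopping RavenDB (illustrative sketch)
  hosts: ubuntu-bionic-node-b
  become: true
  gather_facts: no
  tasks:
    - name: Stop the RavenDB service so the node_alive pre-gate fails
      ansible.builtin.systemd:
        name: ravendb          # assumed unit name
        state: stopped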
# ======================================
# scenario: rolling-upgrade-cluster-connectivity-fails-on-b
# gates used: node_alive, cluster_connectivity
#
# flow:
# 1. setup 3 independent nodes with ravendb 5.4
# 2. form a cluster out of them
# 3. break cluster connectivity on B (break only TCP; HTTP must keep working so the node_alive gate still passes)
# 4. try to rolling upgrade to 6.2 serially:
#    - roll A -> B -> C
#    - pre-gate node_alive + cluster_connectivity
#    - upgrade node to 6.2
#    - post-gate node_alive + cluster_connectivity
#
# checkpoints:
# - A pre-gate node_alive passes
# - A pre-gate cluster_connectivity fails on detecting B is unreachable
#
# outcome:
# - FAIL before upgrading A. A/B/C remain 5.4
# - we catch the failure of pre-gate on A and stop the run
# ======================================
# ======================================
# scenario: degraded-db-placement-on-a-c
# gates used: node_alive, cluster_connectivity, node_databases_online
#
# flow:
# 1. setup 3 independent nodes with ravendb 5.4
# 2. form a cluster out of them
# 3. Create RF=3 database replicated on A/B/C
# 4. corrupt DB on A,C
# 5. try to rolling upgrade to 6.2 serially:
#    - roll A -> B -> C
#    - pre-gate node_alive + cluster_connectivity + node_databases_online
#    - upgrade node to 6.2
#    - post-gate node_alive + cluster_connectivity + node_databases_online
#
# checkpoints:
# - A pre-gate 'node_alive','cluster_connectivity','node_databases_online' passes
# - B pre-gate 'node_alive','cluster_connectivity' passes;
# - B pre-gate 'node_databases_online' fails (only B can serve the DB)
#
# outcome:
# - FAIL before upgrading B. A=6.2, B=5.4, C=5.4
# - we catch the failure of pre-gate on B and stop the run
# ======================================
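One way step 4 above (corrupting the database on A and C) could be scripted; the data directory, the Raven.voron file name and the stop/overwrite/start sequence are assumptions about a package-based install, shown only to make the scenario concrete:

- name: Corrupt the test database on A and C (illustrative sketch)
  hosts: ubuntu-bionic-node-a,ubuntu-bionic-node-c
  become: true
  gather_facts: no
  vars:
    db_data_dir: /var/lib/ravendb/data/Databases/test_db   # assumed data path
  tasks:
    - name: Stop RavenDB before touching its files
      ansible.builtin.systemd:
        name: ravendb          # assumed unit name
        state: stopped

    - name: Overwrite the start of the main Voron file so the DB fails to load
      ansible.builtin.command: "dd if=/dev/urandom of={{ db_data_dir }}/Raven.voron bs=4096 count=1 conv=notrunc"

    - name: Start RavenDB again; the database should now error on load
      ansible.builtin.systemd:
        name: ravendb
        state: started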
# ======================================
# scenario: db-online-fail-fast-on-load-error
# gates used: node_alive, cluster_connectivity, node_databases_online
#
# flow:
# 1. setup 3 independent nodes with ravendb 5.4
# 2. form a cluster out of them
# 3. Create RF=3 database replicated on A/B/C
# 4. corrupt DB C
# 5. try to rolling upgrade to 6.2 serially:
#    - roll A -> B -> C
#    - pre-gate node_alive + cluster_connectivity + node_databases_online
#    - upgrade node to 6.2
#    - post-gate node_alive + cluster_connectivity + node_databases_online
#
# checkpoints:
# - A pre-gate 'node_alive','cluster_connectivity' passes
# - A pre-gate should fail fast on node_databases_online 
#
# outcome:
# - FAIL before upgrading A. A=5.4, B=5.4, C=5.4
# ======================================

@TheGoldenPlatypus (Contributor, Author)

Reviewers, please note that CI is currently red because of a broken Python client:
ModuleNotFoundError: No module named 'ravendb.documents.operations.ai'

@gregolsky self-requested a review October 14, 2025 08:47
delegate_to: localhost
ravendb.ravendb.healthcheck:
  url: "http://{{ ansible_hostname }}:8080"
  checks: ['node_alive', 'cluster_connectivity', 'node_databases_online']
@gregolsky (Member) commented Oct 16, 2025

the node_databases_online gate automatically excludes the current node by resolving its tag from topology.

the node that we would need to exclude is actually the next one that we're about to update, not the current node...

But actually, since you're doing node_databases_online pre-upgrade as well, then e.g. before putting C down, we're checking whether A and B have DBs online! Is that right?
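A purely hypothetical sketch of the exclusion being discussed: if the gate grew an option to skip a given node tag, the pre-gate could ignore the node that is about to be upgraded rather than the current one (exclude_node_tags and next_node_tag are invented for illustration and are not existing module options):

- name: Pre-gate databases online, ignoring the node about to be taken down (hypothetical)
  ravendb.ravendb.healthcheck:
    url: "http://{{ ansible_hostname }}:8080"
    checks: ['node_databases_online']
    exclude_node_tags: ["{{ next_node_tag }}"]   # invented option and variable, not part of the module today
  delegate_to: localhost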

cc @hiddenshadow21: that's something we could think about with regard to our healthchecks, doing that DBs check beforehand; it could spare us a couple of if statements

Comment on lines +68 to +81
- name: Sabotage B TCP connectivity
  hosts: ubuntu-bionic-node-b
  become: true
  gather_facts: no
  tasks:
    - name: Ensure iptables present
      become: true
      apt:
        name: iptables
        state: present
        update_cache: yes

    - name: Reject inbound cluster traffic on the cluster TCP port
      ansible.builtin.command: iptables -I INPUT -p tcp --dport 38888 -j REJECT

cool testing
