Conversation

@TheGoldenPlatypus (Contributor) commented Oct 5, 2025

https://issues.hibernatingrhinos.com/issue/RavenDB-24939/Ansible-add-health-checks-when-version-upgrades-are-being-done


Test matrix:

# ======================================
# scenario: rolling-upgrade-node-alive-ok
# gates used: node_alive
#
# flow:
# 1. setup 3 independent nodes with ravendb 5.4
# 2. try to rolling upgrade to 6.2 serially:
#   - roll A -> B -> C
#   - pre-gate node_alive
#   - upgrade node to 6.2
#   - post-gate node_alive
# 
# checkpoints:
#  - each node must pass node_alive gate before upgrade
#  - each node must pass node_alive gate after upgrade
#
# outcome:
# - happy path. PASS. All nodes upgraded to 6.2
# ======================================
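For the happy-path scenario above, a minimal sketch of one rolling pass gated by node_alive, assuming the ravendb.ravendb.healthcheck invocation shown later in this PR; the play layout, inventory group, apt-based upgrade step and ravendb_target_version variable are illustrative assumptions, not the role's actual implementation:

- name: Rolling upgrade gated by node_alive (illustrative sketch)
  hosts: ravendb_cluster              # assumed inventory group containing nodes A, B, C
  serial: 1                           # roll one node at a time: A -> B -> C
  become: true
  tasks:
    - name: Pre-gate, node must be alive before we touch it
      ravendb.ravendb.healthcheck:
        url: "http://{{ ansible_hostname }}:8080"
        checks: ['node_alive']
      delegate_to: localhost

    - name: Upgrade RavenDB to 6.2 (assumed package-based install)
      ansible.builtin.apt:
        name: "ravendb={{ ravendb_target_version }}"   # assumed variable, e.g. a 6.2.x build
        state: present
        update_cache: yes

    - name: Post-gate, node must come back alive after the upgrade
      ravendb.ravendb.healthcheck:
        url: "http://{{ ansible_hostname }}:8080"
        checks: ['node_alive']
      delegate_to: localhost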
# ======================================
# scenario: rolling-upgrade-node-and-cluster-ok
# gates used: node_alive, cluster_connectivity
#
# flow:
# 1. setup 3 independent nodes with ravendb 5.4
# 2. form a cluster out of them
# 3. try to rolling upgrade to 6.2 serially:
#    - roll A -> B -> C
#    - pre-gate node_alive + cluster_connectivity
#    - upgrade node to 6.2
#    - post-gate node_alive + cluster_connectivity
#
# checkpoints:
# - each node must pass node_alive gate before upgrade
# - each node must pass cluster_connectivity (peer ping) gate before upgrade
# - each node must pass node_alive gate after upgrade
# - each node must pass cluster_connectivity (peer ping) gate after upgrade
#
# outcome:
# - PASS. All nodes upgraded to 6.2; cluster healthy at each step
# ======================================
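Step 2 above (forming the cluster) can be driven through the admin REST API; a sketch using ansible.builtin.uri, assuming node A is the seed node and that PUT /admin/cluster/node?url=... adds a member (node addresses are illustrative):

- name: Form the 3-node cluster by joining B and C to A (illustrative sketch)
  hosts: localhost
  gather_facts: no
  vars:
    leader_url: "http://ubuntu-bionic-node-a:8080"      # assumed node addresses
    member_urls:
      - "http://ubuntu-bionic-node-b:8080"
      - "http://ubuntu-bionic-node-c:8080"
  tasks:
    - name: Add each remaining node to the cluster via the seed node
      ansible.builtin.uri:
        url: "{{ leader_url }}/admin/cluster/node?url={{ item | urlencode }}"
        method: PUT
        status_code: [200, 201]
      loop: "{{ member_urls }}"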
# ======================================
# scenario: rolling-upgrade-all-gates-rf3-ok
# gates used: node_alive, cluster_connectivity, node_databases_online
#
# flow:
# 1. setup 3 independent nodes with ravendb 5.4
# 2. form a cluster out of them
# 3. Create RF=3 database replicated on A/B/C
# 4. try to rolling upgrade to 6.2 serially:
#    - roll A -> B -> C
#    - pre-gate node_alive + cluster_connectivity + node_databases_online
#    - upgrade node to 6.2
#    - post-gate node_alive + cluster_connectivity + node_databases_online
#
# checkpoints:
# - pre-upgrade: 'node_alive','cluster_connectivity','node_databases_online'
# - post-upgrade: 'node_alive','cluster_connectivity','node_databases_online'
#
# outcome:
# - PASS. All nodes at 6.2; DB remains available throughout
# =========================================
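Step 3 above (RF=3 database) is a single admin call; a sketch with ansible.builtin.uri, assuming the PUT /admin/databases endpoint that backs CreateDatabaseOperation; the node address, database name and accepted status codes are assumptions:

- name: Create the RF=3 database used by this scenario (illustrative sketch)
  hosts: localhost
  gather_facts: no
  tasks:
    - name: PUT a database record with replication factor 3
      ansible.builtin.uri:
        url: "http://ubuntu-bionic-node-a:8080/admin/databases?name=test_db&replicationFactor=3"
        method: PUT
        body_format: json
        body:
          DatabaseName: test_db
        status_code: [200, 201]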
# ======================================
# scenario: db-online-ignores-disabled-dbs
# gates used: node_alive, cluster_connectivity, node_databases_online
#
# flow:
# 1. setup 3 independent nodes with ravendb 5.4
# 2. form a cluster out of them
# 3. Create RF=3 "ok_db" replicated on A/B/C
# 4. Create RF=3 "bad_db" replicated on A/B/C
# 5. corrupt "bad_db" on A,C
# 6. Disable "bad_db"
# 7. try to rolling upgrade to 6.2 serially:
#    - roll A -> B -> C
#    - pre-gate node_alive + cluster_connectivity + node_databases_online
#    - upgrade node to 6.2
#    - post-gate node_alive + cluster_connectivity + node_databases_online
#
# checkpoints:
# - Gate passes because disabled DBs are skipped
#
# outcome:
# - PASS. All nodes upgraded to 6.2
# ======================================
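Step 6 above (disabling bad_db so the gate skips it) could look like this; the sketch assumes the admin endpoint behind ToggleDatabasesStateOperation, POST /admin/databases/disable with a DatabaseNames list, so treat the endpoint and payload shape as assumptions:

- name: Disable bad_db so the node_databases_online gate skips it (illustrative sketch)
  hosts: localhost
  gather_facts: no
  tasks:
    - name: POST the disable toggle for bad_db
      ansible.builtin.uri:
        url: "http://ubuntu-bionic-node-a:8080/admin/databases/disable"
        method: POST
        body_format: json
        body:
          DatabaseNames: ["bad_db"]
        status_code: 200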
# ======================================
# scenario: db-online-timeout-continue-soft-pass
# gates used: node_alive, cluster_connectivity, node_databases_online (policy=continue on timeout)
#
# flow:
# 1. setup 3 independent nodes with ravendb 5.4
# 2. form a cluster out of them
# 3. Create RF=3 database replicated on A/B/C
# 4. stop ravendb service on A,C
# 5. try to rolling upgrade to 6.2 serially:
#    - roll A -> B -> C
#    - pre-gate node_alive + cluster_connectivity + node_databases_online
#    - upgrade node to 6.2
#    - post-gate node_alive + cluster_connectivity + node_databases_online
#
# checkpoints:
# - timeout on pre-gate node_databases_online on B; with the continue-on-timeout policy this is not an explicit failure, so the run proceeds despite the lack of confirmation
#
# outcome:
# - PASS (soft pass). Assert that the timeout occurred; all nodes upgraded.
# ===============================
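A sketch of how the continue-on-timeout policy might be expressed when calling the gate; the timeout_seconds and on_timeout options are hypothetical names used only to illustrate the soft-pass idea, not confirmed module parameters:

- name: Pre-gate databases online without hard-failing on a timeout (hypothetical options)
  ravendb.ravendb.healthcheck:
    url: "http://{{ ansible_hostname }}:8080"
    checks: ['node_databases_online']
    timeout_seconds: 60        # hypothetical parameter name
    on_timeout: continue       # hypothetical parameter name; soft pass instead of failing the run
  delegate_to: localhost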
# ======================================
# scenario: rolling-upgrade-node-alive-pre-gate-fails-b
# gates used: node_alive
#
# flow:
# 1. setup 3 independent nodes with ravendb 5.4
# 2. Deliberately sabotage node B (stop service)
# 3. try to rolling upgrade to 6.2 serially:
#   - roll A -> B -> C
#   - pre-gate node_alive
#   - upgrade the node to 6.2
#   - post-gate node_alive
#
# checkpoints:
# - A pre-gate passes
# - A post-gate passes
# - B pre-gate fails
# - C pre-gate is never reached
#
# outcome:
# - FAIL before upgrading B. A upgraded to 6.2; B/C remain 5.4
# - we catch the failure of pre-gate on B and stop the run
# ======================================
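Step 2 above (sabotaging B) only needs the service stopped; a sketch assuming the systemd unit is called ravendb and the host alias matches the ones used elsewhere in these tests:

- name: Sabotage node B by stopping RavenDB (illustrative sketch)
  hosts: ubuntu-bionic-node-b
  become: true
  gather_facts: no
  tasks:
    - name: Stop the RavenDB service so the node_alive pre-gate fails
      ansible.builtin.systemd:
        name: ravendb          # assumed unit name
        state: stopped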
# ======================================
# scenario: rolling-upgrade-cluster-connectivity-fails-on-b
# gates used: node_alive, cluster_connectivity
#
# flow:
# 1. setup 3 independent nodes with ravendb 5.4
# 2. form a cluster out of them
# 3. break cluster connectivity on B (break only TCP; HTTP must keep working so the node_alive gate still passes)
# 4. try to rolling upgrade to 6.2 serially:
#    - roll A -> B -> C
#    - pre-gate node_alive + cluster_connectivity
#    - upgrade node to 6.2
#    - post-gate node_alive + cluster_connectivity
#
# checkpoints:
# - A pre-gate node_alive passes
# - A pre-gate cluster_connectivity fails on detecting B is unreachable
#
# outcome:
# - FAIL before upgrading A. A/B/C remain 5.4
# - we catch the failure of pre-gate on A and stop the run
# ======================================
# ======================================
# scenario: degraded-db-placement-on-a-c
# gates used: node_alive, cluster_connectivity, node_databases_online
#
# flow:
# 1. setup 3 independent nodes with ravendb 5.4
# 2. form a cluster out of them
# 3. Create RF=3 database replicated on A/B/C
# 4. corrupt DB on A,C
# 5. try to rolling upgrade to 6.2 serially:
#    - roll A -> B -> C
#    - pre-gate node_alive + cluster_connectivity + node_databases_online
#    - upgrade node to 6.2
#    - post-gate node_alive + cluster_connectivity + node_databases_online
#
# checkpoints:
# - A pre-gate 'node_alive','cluster_connectivity','node_databases_online' passes
# - B pre-gate 'node_alive','cluster_connectivity' passes;
# - B pre-gate 'node_databases_online' fails (only B can serve the DB)
#
# outcome:
# - FAIL before upgrading B. A=6.2, B=5.4, C=5.4
# - we catch the failure of pre-gate on B and stop the run
# ======================================
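One way step 4 above (corrupting the database on A and C) could be scripted; the data directory, the Raven.voron file name and the stop/overwrite/start sequence are assumptions about a package-based install, shown only to make the scenario concrete:

- name: Corrupt the test database on A and C (illustrative sketch)
  hosts: ubuntu-bionic-node-a,ubuntu-bionic-node-c
  become: true
  gather_facts: no
  vars:
    db_data_dir: /var/lib/ravendb/data/Databases/test_db   # assumed data path
  tasks:
    - name: Stop RavenDB before touching its files
      ansible.builtin.systemd:
        name: ravendb          # assumed unit name
        state: stopped

    - name: Overwrite the start of the main Voron file so the DB fails to load
      ansible.builtin.command: "dd if=/dev/urandom of={{ db_data_dir }}/Raven.voron bs=4096 count=1 conv=notrunc"

    - name: Start RavenDB again; the database should now error on load
      ansible.builtin.systemd:
        name: ravendb
        state: started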
# ======================================
# scenario: db-online-fail-fast-on-load-error
# gates used: node_alive, cluster_connectivity, node_databases_online
#
# flow:
# 1. setup 3 independent nodes with ravendb 5.4
# 2. form a cluster out of them
# 3. Create RF=3 database replicated on A/B/C
# 4. corrupt DB C
# 5. try to rolling upgrade to 6.2 serially:
#    - roll A -> B -> C
#    - pre-gate node_alive + cluster_connectivity + node_databases_online
#    - upgrade node to 6.2
#    - post-gate node_alive + cluster_connectivity + node_databases_online
#
# checkpoints:
# - A pre-gate 'node_alive','cluster_connectivity' passes
# - A pre-gate should fail fast on node_databases_online 
#
# outcome:
# - FAIL before upgrading A. A=5.4, B=5.4, C=5.4
# ======================================

@TheGoldenPlatypus (Contributor, Author)

Reviewers, please note that CI is currently red because of a broken Python client:
ModuleNotFoundError: No module named 'ravendb.documents.operations.ai'

@gregolsky self-requested a review October 14, 2025 08:47
delegate_to: localhost
ravendb.ravendb.healthcheck:
  url: "http://{{ ansible_hostname }}:8080"
  checks: ['node_alive', 'cluster_connectivity', 'node_databases_online']
@gregolsky (Member) commented Oct 16, 2025

the node_databases_online gate automatically excludes the current node by resolving its tag from topology.

the node that we would need to exclude is actually the next one that we're about to update, not the current node...

But actually, since you're doing node_databases_online pre-upgrade as well, then e.g. before putting C down, we're checking whether A and B have DBs online! Is that right?
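A purely hypothetical sketch of the exclusion being discussed: if the gate grew an option to skip a given node tag, the pre-gate could ignore the node that is about to be upgraded rather than the current one (exclude_node_tags and next_node_tag are invented for illustration and are not existing module options):

- name: Pre-gate databases online, ignoring the node about to be taken down (hypothetical)
  ravendb.ravendb.healthcheck:
    url: "http://{{ ansible_hostname }}:8080"
    checks: ['node_databases_online']
    exclude_node_tags: ["{{ next_node_tag }}"]   # invented option and variable, not part of the module today
  delegate_to: localhost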

cc @hiddenshadow21: that's something we could think about with regard to our healthchecks, doing that DBs check beforehand; it could spare us a couple of if statements

Comment on lines +68 to +81
- name: Sabotage B TCP connectivity
  hosts: ubuntu-bionic-node-b
  become: true
  gather_facts: no
  tasks:
    - name: Ensure iptables present
      become: true
      apt:
        name: iptables
        state: present
        update_cache: yes

    - name: Reject inbound cluster traffic on the cluster TCP port
      ansible.builtin.command: iptables -I INPUT -p tcp --dport 38888 -j REJECT

cool testing
