Skip to content

fix: prevent monitor crash when CLUSTER SLOTS times out#46

Merged
hjianbo merged 2 commits into
branch-0.7.xfrom
fix/monitor-refresh-timeout-crash-0.7
Mar 30, 2026
Merged

fix: prevent monitor crash when CLUSTER SLOTS times out#46
hjianbo merged 2 commits into
branch-0.7.xfrom
fix/monitor-refresh-timeout-crash-0.7

Conversation

@hjianbo

@hjianbo hjianbo commented Mar 30, 2026

Copy link
Copy Markdown
Member

Background

This PR addresses a Redis Cluster refresh-timeout crash in eredis_cluster monitor process on the branch-0.7.x line.

Problems Fixed

  1. Monitor crash during slot refresh timeout

    • eredis_cluster_monitor:get_cluster_slots/3 called eredis:q(Connection, ["CLUSTER", "SLOTS"]) directly in a gen_server call path.
    • When eredis:q/2 exits on timeout, the exception could bubble up and terminate the monitor process.
  2. Potential unsafe connection stop in fallback branches

    • In some branches (unknown command / cluster disabled), stop used plain eredis:stop/1.
    • This PR unifies those branches to safe_eredis_stop/1.
  3. Regression test cleanup instability

    • The new timeout regression test unloaded meck before monitor shutdown, producing noisy test-time noproc termination reports.
    • This PR stops monitor explicitly before unload.

Typical Error Logs Covered

These are the key log patterns related to this issue:

Failed to get cluster slots from redis node ...: {error,{cluster_slots_exit,timeout}}
Failed to get cluster slots: {<<"ERR all nodes are down">>, ...}

Before this fix, timeout could additionally lead to monitor termination logs (for example Generic server monitor_* terminating due to refresh path exceptions).

Changes

  • src/eredis_cluster_monitor.erl
    • Add safe_get_cluster_slots/1 to catch exit from eredis:q/2 and convert it into normal {error, ...} flow.
    • Use safe_eredis_stop/1 consistently in all CLUSTER SLOTS branches.
  • test/eredis_cluster_tests.erl
    • Add/keep timeout regression case and stop monitor cleanly in test teardown path.
  • src/eredis_cluster.appup.src
    • Bump appup target to 0.7.9.
    • Add upgrade/downgrade instructions for 0.7.8.

Verification

  • /home/moo/Worksapce/emqx-enterprise/rebar3 as test eunit -g eredis_cluster_tests:monitor_reload_slots_timeout_test_ -v

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens eredis_cluster_monitor against eredis:q(Connection, ["CLUSTER","SLOTS"]) exiting (e.g., timeout) during slot refresh, preventing the monitor gen_server from crashing and adding a regression test for the behavior.

Changes:

  • Wrap CLUSTER SLOTS query in safe_get_cluster_slots/1 to convert exit into an error tuple.
  • Introduce safe_eredis_stop/1 and use it on some slot-fetch paths to reduce crash risk during cleanup.
  • Add an EUnit regression test covering the timeout/exit case during refresh_mapping/2.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
src/eredis_cluster_monitor.erl Adds safe wrapper for CLUSTER SLOTS and safe stop helper; updates slot refresh path to use them.
test/eredis_cluster_tests.erl Adds regression test to ensure monitor survives CLUSTER SLOTS timeout/exit during refresh.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/eredis_cluster_monitor.erl
Comment thread test/eredis_cluster_tests.erl

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@hjianbo hjianbo merged commit e6de290 into branch-0.7.x Mar 30, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants