Skip to content

feat: add self-healing mechanism for the shard state#32886

Closed
janikdotzel wants to merge 1 commit intomainfrom
sharding-self-healing
Closed

feat: add self-healing mechanism for the shard state#32886
janikdotzel wants to merge 1 commit intomainfrom
sharding-self-healing

Conversation

@janikdotzel
Copy link
Member

Description

This PR introduces a self-healing mechanism to Akka Cluster Sharding that automatically detects and removes stale shard state when cluster coordination is impaired. This feature allows shards hosted on unreachable regions to be recovered and reallocated without waiting for the node to be fully downed or removed from the cluster.

Problem

When a node hosting shard regions becomes unreachable (but not yet downed), shards remain assigned to that region. Messages destined for those shards are buffered indefinitely until the node is manually downed or handled by a Split Brain Resolver. This latency significantly degrades system availability during network partitions or partial failures.

Changes

  • Reachability Tracking: The ShardCoordinator now tracks when regions become unreachable via UnreachableMember events.
  • Periodic Cleanup: A periodic check (SelfHealingTick) identifies regions that have been unreachable for longer than a configured threshold.
  • Safe Deallocation: Shards on stale regions are proactively deallocated (triggering ShardHomeDeallocated), allowing them to be reallocated to reachable nodes.
  • Configuration: Added akka.cluster.sharding.self-healing settings to control the behavior.
akka.cluster.sharding.self-healing {
  enabled = off              # Opt-in
  stale-region-timeout = 30s # Threshold for stale regions
  check-interval = 5s        # Frequency of checks
  startup-grace-period = 60s # Grace period on coordinator startup
  dry-run = off              # Log-only mode for safety testing
}

@johanandren
Copy link
Contributor

johanandren commented Feb 10, 2026

Why I this needed when akka.cluster.split-brain-resolver.stable-after/down-removal-margin already exists (default 20s), alternatively, how is this more "self healing" than that?

Is it to deal with temporary unreachability where the node comes back rather than be dropped by SBR?

If not, can be we look at improving stable-after/down-removal-margin instead of tacking on another separate similar feature?

Edit: Ok I see that it is more like an eager deallocation when unreachable, possibly expecting the node to be downed later. Won't this lead to a lot of shard moves back and forth between nodes in an unstable cluster with temporary unreachability?

@janikdotzel
Copy link
Member Author

@johanandren
Your edit is correct. In cases where the unreachable nodes are downed later, we want the shards to be available again sooner.
In scenarios with an unstable cluster one can tune the new stale-region-timeout setting to a higher value or simply deactivate the self-healing feature.

The related issue #32789 was created by @JustinPihony. Maybe he can say more about why this feature was requested.

@johanandren
Copy link
Contributor

In scenarios with an unstable cluster one can tune the new stale-region-timeout setting to a higher value or simply deactivate the self-healing feature.

Isn't this feature specifically for such scenarios? If there is no instability, no nodes will be unreachable. If it is for scenarios where a node is taken out of service by SBR, shouldn't speeding that up be by speeding up SBR removal rather?

I thinks this idea requires some more design discussions with the team rather than just implementing it right away.

Ping @patriknw


# Self-healing configuration for automatic shard state cleanup when cluster coordination is impaired.
# When enabled, shards allocated to regions that have been unreachable beyond a configurable threshold
# will be automatically deallocated, enabling graceful degradation and recovery.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this is adding another layer of split brain resolver for shards. We don't want to have multiple ways of handling split brain situations. There is a high risk that the shard is still running on the other side of the network partition. I can't see that we want to add this feature.

@JustinPihony
Copy link
Contributor

In scenarios with an unstable cluster one can tune the new stale-region-timeout setting to a higher value or simply deactivate the self-healing feature.

Isn't this feature specifically for such scenarios? If there is no instability, no nodes will be unreachable. If it is for scenarios where a node is taken out of service by SBR, shouldn't speeding that up be by speeding up SBR removal rather?

I thinks this idea requires some more design discussions with the team rather than just implementing it right away.

Ping @patriknw

@leviramsey Can you speak to this please - as you have more history on the background of this (noting that this is a public-facing PR)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants