feat: add self-healing mechanism for the shard state by janikdotzel · Pull Request #32886 · akka/akka-core

janikdotzel · 2026-02-10T12:46:56Z

Description

This PR introduces a self-healing mechanism to Akka Cluster Sharding that automatically detects and removes stale shard state when cluster coordination is impaired. This feature allows shards hosted on unreachable regions to be recovered and reallocated without waiting for the node to be fully downed or removed from the cluster.

Problem

When a node hosting shard regions becomes unreachable (but not yet downed), shards remain assigned to that region. Messages destined for those shards are buffered indefinitely until the node is manually downed or handled by a Split Brain Resolver. This latency significantly degrades system availability during network partitions or partial failures.

Changes

Reachability Tracking: The ShardCoordinator now tracks when regions become unreachable via UnreachableMember events.
Periodic Cleanup: A periodic check (SelfHealingTick) identifies regions that have been unreachable for longer than a configured threshold.
Safe Deallocation: Shards on stale regions are proactively deallocated (triggering ShardHomeDeallocated), allowing them to be reallocated to reachable nodes.
Configuration: Added akka.cluster.sharding.self-healing settings to control the behavior.

akka.cluster.sharding.self-healing {
  enabled = off              # Opt-in
  stale-region-timeout = 30s # Threshold for stale regions
  check-interval = 5s        # Frequency of checks
  startup-grace-period = 60s # Grace period on coordinator startup
  dry-run = off              # Log-only mode for safety testing
}
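
For illustration only, the interplay of these settings can be modeled with a minimal, Akka-free sketch. It is not the PR's implementation: the class and method names (StaleRegionTracker, allocate, onUnreachable, onReachable, tick, shardsOn) are hypothetical, and the sketch assumes that a region counts as stale once it has been unreachable for strictly longer than stale-region-timeout and that no cleanup runs during startup-grace-period.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, Akka-free model of the stale-region check (not the PR's code).
class StaleRegionTracker {
    private final Duration staleRegionTimeout; // models stale-region-timeout
    private final Duration startupGracePeriod; // models startup-grace-period
    private final boolean dryRun;              // models dry-run
    private final Instant startedAt;           // coordinator start time

    // Region id -> instant at which it was first reported unreachable.
    private final Map<String, Instant> unreachableSince = new HashMap<>();
    // Region id -> shards currently allocated to it.
    private final Map<String, List<String>> shardsByRegion = new HashMap<>();

    StaleRegionTracker(Duration staleRegionTimeout, Duration startupGracePeriod,
                       boolean dryRun, Instant startedAt) {
        this.staleRegionTimeout = staleRegionTimeout;
        this.startupGracePeriod = startupGracePeriod;
        this.dryRun = dryRun;
        this.startedAt = startedAt;
    }

    void allocate(String region, String shard) {
        shardsByRegion.computeIfAbsent(region, r -> new ArrayList<>()).add(shard);
    }

    // Keep the earliest timestamp if the region is reported more than once.
    void onUnreachable(String region, Instant now) {
        unreachableSince.putIfAbsent(region, now);
    }

    // A region that becomes reachable again is no longer a cleanup candidate.
    void onReachable(String region) {
        unreachableSince.remove(region);
    }

    List<String> shardsOn(String region) {
        return new ArrayList<>(shardsByRegion.getOrDefault(region, List.of()));
    }

    // One periodic check. Returns the shards found on stale regions;
    // they are actually deallocated only when dryRun is false.
    List<String> tick(Instant now) {
        List<String> staleShards = new ArrayList<>();
        if (Duration.between(startedAt, now).compareTo(startupGracePeriod) < 0) {
            return staleShards; // still inside the startup grace period
        }
        for (Map.Entry<String, Instant> entry : unreachableSince.entrySet()) {
            Duration unreachableFor = Duration.between(entry.getValue(), now);
            if (unreachableFor.compareTo(staleRegionTimeout) > 0) {
                staleShards.addAll(shardsByRegion.getOrDefault(entry.getKey(), List.of()));
                if (!dryRun) {
                    shardsByRegion.remove(entry.getKey());
                }
            }
        }
        return staleShards;
    }
}
```

In this model, a region that flips back to reachable before the timeout elapses is never touched, because onReachable clears its timestamp. With dryRun set to true, tick only reports the affected shards and leaves the allocations as they are.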

johanandren · 2026-02-10T13:26:01Z

Why is this needed when akka.cluster.split-brain-resolver.stable-after/down-removal-margin already exists (default 20s)? Alternatively, how is this more "self healing" than that?

Is it to deal with temporary unreachability where the node comes back rather than be dropped by SBR?

If not, can we look at improving stable-after/down-removal-margin instead of tacking on another separate, similar feature?

Edit: Ok I see that it is more like an eager deallocation when unreachable, possibly expecting the node to be downed later. Won't this lead to a lot of shard moves back and forth between nodes in an unstable cluster with temporary unreachability?

janikdotzel · 2026-02-10T14:34:32Z

@johanandren
Your edit is correct. In cases where the unreachable nodes are downed later, we want the shards to be available again sooner.
In scenarios with an unstable cluster one can tune the new stale-region-timeout setting to a higher value or simply deactivate the self-healing feature.

The related issue #32789 was created by @JustinPihony. Maybe he can say more about why this feature was requested.

johanandren · 2026-02-10T14:41:12Z

> In scenarios with an unstable cluster one can tune the new stale-region-timeout setting to a higher value or simply deactivate the self-healing feature.

Isn't this feature specifically for such scenarios? If there is no instability, no nodes will be unreachable. If it is for scenarios where a node is taken out of service by SBR, shouldn't that be sped up by speeding up SBR removal instead?

I think this idea requires more design discussion with the team rather than just implementing it right away.

Ping @patriknw

patriknw · 2026-02-11T10:23:34Z

akka-cluster-sharding/src/main/resources/reference.conf

+
+  # Self-healing configuration for automatic shard state cleanup when cluster coordination is impaired.
+  # When enabled, shards allocated to regions that have been unreachable beyond a configurable threshold
+  # will be automatically deallocated, enabling graceful degradation and recovery.


Looks like this is adding another layer of split brain resolver for shards. We don't want to have multiple ways of handling split brain situations. There is a high risk that the shard is still running on the other side of the network partition. I can't see that we want to add this feature.

JustinPihony · 2026-02-11T23:28:23Z

> > In scenarios with an unstable cluster one can tune the new stale-region-timeout setting to a higher value or simply deactivate the self-healing feature.
>
> Isn't this feature specifically for such scenarios? If there is no instability, no nodes will be unreachable. If it is for scenarios where a node is taken out of service by SBR, shouldn't speeding that up be by speeding up SBR removal rather?
>
> I thinks this idea requires some more design discussions with the team rather than just implementing it right away.
>
> Ping @patriknw

@leviramsey Could you speak to this, please? You have more history on the background here (noting that this is a public-facing PR).

Commit becc385: feat: add self-healing mechanism for the shard state

janikdotzel mentioned this pull request on Feb 10, 2026: feat: Add self-healing capabilities around shard state consistency #32789 (Open)

patriknw reviewed Feb 11, 2026

janikdotzel closed this Feb 17, 2026

janikdotzel wants to merge 1 commit into main from sharding-self-healing

4 participants
