Description
First of all, thank you for this project. It's a nice piece of technology.
I note that in the docs for CCR,
You can’t resume replication after it’s been paused for more than 12 hours. You must stop replication, delete the follower index, and restart replication of the leader.
FWIU, this limit is related to index.soft_deletes.retention_lease.period, which ensures that details of deleted docs are retained so that the follower cluster can replay them from the translog (ref). However, I also note that there is a retention_lease_max_failure_duration setting, which is 1h by default and has a maximum value of 12h. I'm wondering which of these two settings is the reason replication can't be paused for more than 12 hours, and whether this limit can be increased. My team's indices are 100 TB, so restarting CCR from scratch is really not ideal.
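To make the comparison concrete, here is a minimal Python sketch of the check I have in mind. It assumes both values have already been read from the cluster as time strings such as 12h or 1h. The function names and the 13h pause are hypothetical, and I am assuming the retention lease period matches the documented 12-hour limit. It does not show which setting actually enforces the limit, only which ones a given pause would exceed.

```python
import re
from typing import Dict, List

# Multipliers for the time-value suffixes used in the settings quoted above.
_UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_time_value(value: str) -> float:
    """Convert a time value such as '12h' or '90m' to seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d)", value.strip())
    if match is None:
        raise ValueError(f"unsupported time value: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_SECONDS[unit]


def exceeded_limits(pause: str, limits: Dict[str, str]) -> List[str]:
    """Return the names of the limits that a pause of this length exceeds."""
    pause_seconds = parse_time_value(pause)
    return [name for name, limit in limits.items()
            if pause_seconds > parse_time_value(limit)]


# Values as quoted in this issue. The 12h retention lease period is an
# assumption on my side; which limit really applies is the open question.
limits = {
    "index.soft_deletes.retention_lease.period": "12h",
    "retention_lease_max_failure_duration": "1h",
}

# Hypothetical 13-hour pause: report every limit it would exceed.
print(exceeded_limits("13h", limits))
```

If only the 1h default were in play, a 2h pause would already be flagged here, which is part of why I'm unsure that setting is the one behind the documented 12-hour limit.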
Separately, we have also encountered cases where, after we reconfigured and restarted nodes, replication failed without explanation and could not be resumed, even though the time between the failure and our attempt to resume was within the 12h window:
_plugins/_replication/<index>/_status?pretty
{
"status" : "FAILED",
"reason" : "",
"leader_alias" : "repl_conn",
"leader_index" : "<index>",
"follower_index" : "<index>"
}
We suspect this may be because some _state directories were lost, but we're not sure whether CCR relies on them. Could I get any indication of what this issue might be related to? Some of the logs we found mentioned:
[2025-05-13T20:09:24,763][WARN ][o.o.p.PersistentTasksClusterService] [master-us2-1] trying to update state on task replication:[<index>][540] with unexpected allocation id 11421
[2025-05-13T20:09:25,386][WARN ][o.o.p.PersistentTasksClusterService] [master-us2-1] trying to update state on task replication:[<index>][913] with unexpected allocation id 11426
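In case it helps with triage, below is a rough Python sketch of how we group these warnings to see which replication tasks are affected. The regular expression is written only against the two lines above. I am assuming the second bracketed number in the task name is the shard ID, and the function name is hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Matches the PersistentTasksClusterService warnings quoted above.
# Assumption: the second bracketed number after "replication:" is the shard ID.
WARNING_RE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]\[WARN\s*\]\[[^\]]+\]\s+\[(?P<node>[^\]]+)\]\s+"
    r"trying to update state on task replication:"
    r"\[(?P<index>[^\]]+)\]\[(?P<shard>\d+)\] "
    r"with unexpected allocation id (?P<allocation_id>\d+)$"
)


def group_allocation_warnings(
    lines: Iterable[str],
) -> Dict[str, List[Tuple[int, int]]]:
    """Map each index name to (shard, allocation id) pairs seen in warnings."""
    grouped: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for line in lines:
        match = WARNING_RE.match(line.strip())
        if match is None:
            continue  # Not one of the warnings we are looking for.
        grouped[match["index"]].append(
            (int(match["shard"]), int(match["allocation_id"]))
        )
    return dict(grouped)


sample = [
    "[2025-05-13T20:09:24,763][WARN ][o.o.p.PersistentTasksClusterService] "
    "[master-us2-1] trying to update state on task replication:[<index>][540] "
    "with unexpected allocation id 11421",
    "[2025-05-13T20:09:25,386][WARN ][o.o.p.PersistentTasksClusterService] "
    "[master-us2-1] trying to update state on task replication:[<index>][913] "
    "with unexpected allocation id 11426",
]

for index, pairs in group_allocation_warnings(sample).items():
    print(index, pairs)
```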