Commit 10fcea0

add alternative strategies for disaster scenarios
This PR adds strategies to handle two typical disaster scenarios: outdated commitments and unhandled exceptions. Default strategies may be the best choice for smaller, loosely administered nodes, while alternative strategies may avoid unnecessary mass force-closes (but are reserved for advanced users who closely monitor the node).

Strategies for outdated commitments:
- request the counterparty to close the channel (default)
- if the node was restarted less than 10 min ago, log an error message and stop the node

Strategies for unhandled exceptions:
- local force-close of the channel (default)
- log an error message and stop the node

Default settings maintain the same behavior as before.
1 parent 2c0c24e commit 10fcea0

8 files changed

+161
-6
lines changed

docs/Advanced.md

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
1+
# Advanced usage
2+
3+
## Avoid mass force-close of channels
4+
5+
In order to minimize force-closes of channels (especially for larger nodes), it is possible to customize the way eclair handles certain situations, such as outdated commitments and internal errors.
6+
7+
:warning: There is no magic: non-default strategies are a trade-off where it is assumed that the node is closely monitored. Instead of automatically reacting to some events, eclair will stop and await manual intervention. It is therefore reserved for advanced or professional node operators. Default strategies are best suited for smaller loosely administered nodes.
8+
9+
### Outdated commitments
10+
11+
The default behavior, when our peer tells us (or proves to us) that our channel commitment is outdated, is to request a remote force-close of the channel (a.k.a. recovery).
12+
13+
It may happen that due to a misconfiguration, the node was accidentally restarted using e.g. an old backup, and the data wasn't really lost. In that case, simply fixing the configuration and restarting eclair would prevent a mass force-close of channels.
14+
15+
This is why an alternative behavior is to simply log an error and stop the node. However, because our peer may be lying when it tells us that our channel commitment data is outdated, there is a 10 min window after restart when this strategy applies. After that, the node reverts to the default strategy.
16+
17+
During the 10 min window, the operator should closely monitor the node and assess, if the node stops, whether this is really a case of using outdated data or whether a peer is just lying. If it turns out that the data is really outdated due to a misconfiguration, the operator has an opportunity to fix it and restart the node. If the data is really outdated because it was simply lost, then the operator should change the strategy to the default and restart the node: this will cause the force-close of outdated channels, but there is no way to avoid that.
18+
19+
Here is a decision tree:
20+
```
21+
if (node stops after restart)
22+
if (false positive)
23+
configure eclair to use default strategy and restart node (will force close channels to malicious peers)
24+
else
25+
if (more up-to-date data available)
26+
configure eclair to point to proper database and restart node
27+
else
28+
configure eclair to use default strategy and restart node (will force close all outdated channels)
29+
```
30+
31+
The alternate strategy can be configured by setting `eclair.outdated-commitment-strategy=stop` (see [`reference.conf`](https://github.com/ACINQ/eclair/blob/master/eclair-core/src/main/resources/reference.conf)).
32+
33+
### Unhandled exceptions
34+
35+
The default behavior, when we encounter an unhandled exception or internal error, is to locally force-close the channel.
36+
37+
Not only is there a delay before the channel balance gets refunded, but if the exception was due to some misconfiguration or bug in eclair that affects all channels, we risk force-closing all channels.
38+
39+
This is why an alternative behavior is to simply log an error and stop the node. Note that if you don't closely monitor your node, there is a risk that your peers take advantage of the downtime to try and cheat by publishing a revoked commitment. Additionally, while there is no known way of triggering an internal error in eclair from the outside, there may very well be a bug that allows just that, which could be used as a way to remotely stop the node (with the default behavior, it would "only" cause a local force-close of the channel).
40+
41+
The alternate strategy can be configured by setting `eclair.unhandled-exception-strategy=stop` (see [`reference.conf`](https://github.com/ACINQ/eclair/blob/master/eclair-core/src/main/resources/reference.conf)).
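
The decision tree in the "Outdated commitments" section of `docs/Advanced.md` can be read as a small pure function. The following is only an illustrative sketch for operators, not code from this commit: the object `OutdatedCommitmentRunbook`, the `OperatorAction` type and the two boolean inputs are hypothetical names introduced here, and the function assumes the node has already stopped after a restart.

```scala
// Illustrative sketch only: none of these names exist in eclair.
object OutdatedCommitmentRunbook {

  /** What the operator does once the node has stopped after a restart. */
  sealed trait OperatorAction
  /** Switch back to the default strategy and restart (channels will be force-closed). */
  case object UseDefaultStrategyAndRestart extends OperatorAction
  /** Point eclair to the up-to-date database and restart (no force-close needed). */
  case object PointToProperDatabaseAndRestart extends OperatorAction

  /**
   * @param falsePositive      true if our data is actually current and the peer was lying
   * @param newerDataAvailable true if a more up-to-date database can be restored
   */
  def nextAction(falsePositive: Boolean, newerDataAvailable: Boolean): OperatorAction =
    if (falsePositive) UseDefaultStrategyAndRestart // force-closes channels to malicious peers
    else if (newerDataAvailable) PointToProperDatabaseAndRestart
    else UseDefaultStrategyAndRestart // force-closes all outdated channels
}
```

Two of the three branches end with the default strategy; only restoring newer data avoids force-closes entirely, which is why the alternative strategy pays off mainly for misconfiguration cases.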

docs/release-notes/eclair-vnext.md

Lines changed: 8 additions & 0 deletions
@@ -4,6 +4,14 @@
44

55
## Major changes
66

7+
### Advanced strategies to avoid mass force-close of channels
8+
9+
In order to minimize force-closes of channels (especially for larger nodes), it is possible to customize the way eclair handles certain situations, such as outdated commitments and internal errors.
10+
11+
:warning: There is no magic: non-default strategies are a trade-off where it is assumed that the node is closely monitored. Instead of automatically reacting to some events, eclair will stop and await manual intervention. It is therefore reserved for advanced or professional node operators. Default strategies are best suited for smaller loosely administered nodes.
12+
13+
This feature is documented [here](../Advanced.md).
14+
715
### Separate log for important notifications
816

917
Eclair added a new log file (`notifications.log`) for important notifications that require an action from the node operator.

eclair-core/src/main/resources/reference.conf

Lines changed: 4 additions & 0 deletions
@@ -95,6 +95,10 @@ eclair {
9595
max-block-processing-delay = 30 seconds // we add a random delay before processing blocks, capped at this value, to prevent herd effect
9696
max-tx-publish-retry-delay = 60 seconds // we add a random delay before retrying failed transaction publication
9797

98+
// see docs/Advanced.md for more information on the strategies
99+
outdated-commitment-strategy = "remote-close" // remote-close or stop (NB: the app will only stop if it was recently restarted)
100+
unhandled-exception-strategy = "local-close" // local-close or stop
101+
98102
relay {
99103
fees {
100104
// Fees for public channels

eclair-core/src/main/scala/fr/acinq/eclair/Logs.scala

Lines changed: 2 additions & 0 deletions
@@ -167,6 +167,8 @@ object NotificationsLogger {
167167
*/
168168
def logFatalError(message: String, t: Throwable): Unit = log.error(message, t)
169169

170+
def logFatalError(message: String): Unit = log.error(message)
171+
170172
def apply(): Behavior[NotifyNodeOperator] =
171173
Behaviors.setup { context =>
172174
context.system.eventStream ! EventStream.Subscribe(context.self)

eclair-core/src/main/scala/fr/acinq/eclair/NodeParams.scala

Lines changed: 18 additions & 0 deletions
@@ -22,6 +22,7 @@ import fr.acinq.bitcoin.{Block, ByteVector32, Crypto, Satoshi}
2222
import fr.acinq.eclair.Setup.Seeds
2323
import fr.acinq.eclair.blockchain.fee._
2424
import fr.acinq.eclair.channel.Channel
25+
import fr.acinq.eclair.channel.Channel.{OutdatedCommitmentStrategy, UnhandledExceptionStrategy}
2526
import fr.acinq.eclair.crypto.Noise.KeyPair
2627
import fr.acinq.eclair.crypto.keymanager.{ChannelKeyManager, NodeKeyManager}
2728
import fr.acinq.eclair.db._
@@ -76,6 +77,8 @@ case class NodeParams(nodeKeyManager: NodeKeyManager,
7677
relayParams: RelayParams,
7778
reserveToFundingRatio: Double,
7879
maxReserveToFundingRatio: Double,
80+
outdatedCommitmentStrategy: OutdatedCommitmentStrategy,
81+
unhandledExceptionStrategy: UnhandledExceptionStrategy,
7982
db: Databases,
8083
revocationTimeout: FiniteDuration,
8184
autoReconnect: Boolean,
@@ -95,6 +98,9 @@ case class NodeParams(nodeKeyManager: NodeKeyManager,
9598
enableTrampolinePayment: Boolean,
9699
balanceCheckInterval: FiniteDuration,
97100
blockchainWatchdogSources: Seq[String]) {
101+
102+
val startTime: TimestampSecond = TimestampSecond.now()
103+
98104
val privateKey: Crypto.PrivateKey = nodeKeyManager.nodeKey.privateKey
99105

100106
val nodeId: PublicKey = nodeKeyManager.nodeId
@@ -357,6 +363,16 @@ object NodeParams extends Logging {
357363
PathFindingExperimentConf(experiments.toMap)
358364
}
359365

366+
val outdatedCommitmentStrategy = config.getString("outdated-commitment-strategy") match {
367+
case "remote-close" => OutdatedCommitmentStrategy.RemoteClose
368+
case "stop" => OutdatedCommitmentStrategy.Stop
369+
}
370+
371+
val unhandledExceptionStrategy = config.getString("unhandled-exception-strategy") match {
372+
case "local-close" => UnhandledExceptionStrategy.LocalClose
373+
case "stop" => UnhandledExceptionStrategy.Stop
374+
}
375+
360376
val routerSyncEncodingType = config.getString("router.sync.encoding-type") match {
361377
case "uncompressed" => EncodingType.UNCOMPRESSED
362378
case "zlib" => EncodingType.COMPRESSED_ZLIB
@@ -423,6 +439,8 @@ object NodeParams extends Logging {
423439
),
424440
reserveToFundingRatio = config.getDouble("reserve-to-funding-ratio"),
425441
maxReserveToFundingRatio = config.getDouble("max-reserve-to-funding-ratio"),
442+
outdatedCommitmentStrategy = outdatedCommitmentStrategy,
443+
unhandledExceptionStrategy = unhandledExceptionStrategy,
426444
db = database,
427445
revocationTimeout = FiniteDuration(config.getDuration("revocation-timeout").getSeconds, TimeUnit.SECONDS),
428446
autoReconnect = config.getBoolean("auto-reconnect"),
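
Review note: the two `match` expressions added in `NodeParams.scala` have no fallback case, so a typo in the configured value would surface as a `scala.MatchError` at startup. A possible way to report such a value more explicitly is sketched below. This is an assumption-laden illustration, not part of the commit: `StrategyParsing` and `parseOutdatedCommitmentStrategy` are hypothetical names, and the strategy types are redefined locally so the snippet is self-contained.

```scala
// Illustrative sketch only: a local copy of the strategy type plus a hypothetical parser.
object StrategyParsing {

  sealed trait OutdatedCommitmentStrategy
  object OutdatedCommitmentStrategy {
    case object RemoteClose extends OutdatedCommitmentStrategy
    case object Stop extends OutdatedCommitmentStrategy
  }

  /** Returns the parsed strategy, or a readable error message for unknown values. */
  def parseOutdatedCommitmentStrategy(value: String): Either[String, OutdatedCommitmentStrategy] =
    value match {
      case "remote-close" => Right(OutdatedCommitmentStrategy.RemoteClose)
      case "stop" => Right(OutdatedCommitmentStrategy.Stop)
      case other => Left(s"invalid outdated-commitment-strategy '$other' (expected remote-close or stop)")
    }
}
```

The same shape would apply to `unhandled-exception-strategy` with `local-close` and `stop`. Whether to fail with a descriptive message or keep the plain `MatchError` is a design choice left to the maintainers.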

eclair-core/src/main/scala/fr/acinq/eclair/channel/Channel.scala

Lines changed: 82 additions & 5 deletions
@@ -127,6 +127,38 @@ object Channel {
127127
/** We don't immediately process [[CurrentBlockCount]] to avoid herd effects */
128128
case class ProcessCurrentBlockCount(c: CurrentBlockCount)
129129

130+
// @formatter:off
131+
/** What do we do if we detect that our local commitment is outdated. */
132+
sealed trait OutdatedCommitmentStrategy
133+
object OutdatedCommitmentStrategy {
134+
/**
135+
* Ask our counterparty to close the channel, whenever our peer proves to us *or* simply tells us (could be lying)
136+
* that we are using an outdated commitment.
137+
* This may be the best choice for smaller loosely administered nodes.
138+
*/
139+
case object RemoteClose extends OutdatedCommitmentStrategy
140+
/**
141+
* If the node was just restarted, just log an error and stop the app. The goal is to prevent unwanted mass
142+
* force-close of channels if we accidentally restarted the node with an outdated backup. After a few minutes, we
143+
* revert to the default behavior of requesting our peer to force close, otherwise this opens a huge attack vector
144+
* where any peer can remotely stop our node.
145+
* This strategy may be better suited for larger nodes, closely administered.
146+
*/
147+
case object Stop extends OutdatedCommitmentStrategy
148+
}
149+
// @formatter:on
150+
151+
// @formatter:off
152+
/** What do we do if we have a local unhandled exception. */
153+
sealed trait UnhandledExceptionStrategy
154+
object UnhandledExceptionStrategy {
155+
/** Force-close the channel using our local commitment. This may be the best choice for smaller loosely administered nodes.*/
156+
case object LocalClose extends UnhandledExceptionStrategy
157+
/** Just log an error and stop the node. May be better for larger nodes, to prevent unwanted mass force-close.*/
158+
case object Stop extends UnhandledExceptionStrategy
159+
}
160+
// @formatter:on
161+
130162
}
131163

132164
class Channel(val nodeParams: NodeParams, val wallet: OnChainChannelFunder, remoteNodeId: PublicKey, blockchain: typed.ActorRef[ZmqWatcher.Command], relayer: ActorRef, txPublisherFactory: Channel.TxPublisherFactory, origin_opt: Option[ActorRef] = None)(implicit ec: ExecutionContext = ExecutionContext.Implicits.global) extends FSM[ChannelState, ChannelData] with FSMDiagnosticActorLogging[ChannelState, ChannelData] {
@@ -1669,6 +1701,7 @@ class Channel(val nodeParams: NodeParams, val wallet: OnChainChannelFunder, remo
16691701
case syncSuccess: SyncResult.Success =>
16701702
var sendQueue = Queue.empty[LightningMessage]
16711703
// normal case, our data is up-to-date
1704+
16721705
if (channelReestablish.nextLocalCommitmentNumber == 1 && d.commitments.localCommit.index == 0) {
16731706
// If next_local_commitment_number is 1 in both the channel_reestablish it sent and received, then the node MUST retransmit funding_locked, otherwise it MUST NOT
16741707
log.debug("re-sending fundingLocked")
@@ -2288,7 +2321,8 @@ class Channel(val nodeParams: NodeParams, val wallet: OnChainChannelFunder, remo
22882321

22892322
private def handleLocalError(cause: Throwable, d: ChannelData, msg: Option[Any]) = {
22902323
cause match {
2291-
case _: ForcedLocalCommit => log.warning(s"force-closing channel at user request")
2324+
case _: ForcedLocalCommit =>
2325+
log.warning(s"force-closing channel at user request")
22922326
case _ if msg.exists(_.isInstanceOf[OpenChannel]) || msg.exists(_.isInstanceOf[AcceptChannel]) =>
22932327
// invalid remote channel parameters are logged as warning
22942328
log.warning(s"${cause.getMessage} while processing msg=${msg.getOrElse("n/a").getClass.getSimpleName} in state=$stateName")
@@ -2308,7 +2342,31 @@ class Channel(val nodeParams: NodeParams, val wallet: OnChainChannelFunder, remo
23082342
log.info(s"we have a valid closing tx, publishing it instead of our commitment: closingTxId=${bestUnpublishedClosingTx.tx.txid}")
23092343
// if we were in the process of closing and already received a closing sig from the counterparty, it's always better to use that
23102344
handleMutualClose(bestUnpublishedClosingTx, Left(negotiating))
2311-
case dd: HasCommitments => spendLocalCurrent(dd) sending error // otherwise we use our current commitment
2345+
case dd: HasCommitments =>
2346+
cause match {
2347+
case _: ChannelException =>
2348+
// known channel exception: we force close using our current commitment
2349+
spendLocalCurrent(dd) sending error
2350+
case _ =>
2351+
// unhandled exception: we apply the configured strategy
2352+
nodeParams.unhandledExceptionStrategy match {
2353+
case UnhandledExceptionStrategy.LocalClose =>
2354+
spendLocalCurrent(dd) sending error
2355+
case UnhandledExceptionStrategy.Stop =>
2356+
log.error("unhandled exception: standard procedure would be to force-close the channel, but eclair has been configured to halt instead.")
2357+
NotificationsLogger.logFatalError(
2358+
s"""stopping node as configured strategy to unhandled exceptions for nodeId=$remoteNodeId channelId=${d.channelId}
2359+
|
2360+
|Eclair has been configured to shut down when an unhandled exception happens, instead of force-closing
2361+
|the channel locally. This gives the operator a chance of avoiding an unnecessary mass force-close
2362+
|of channels that may be caused by a bug in Eclair, or issues like running out of disk space, etc.
2363+
|
2364+
|You should get in touch with Eclair developers and provide logs of your node for analysis.
2365+
|""".stripMargin)
2366+
System.exit(1)
2367+
stop(FSM.Shutdown)
2368+
}
2369+
}
23122370
case _ => goto(CLOSED) sending error // when there is no commitment yet, we just send an error to our peer and go to CLOSED state
23132371
}
23142372
}
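
Review note: the new branch in `handleLocalError` distinguishes known channel exceptions (always force-close locally) from everything else (apply the configured strategy). The sketch below isolates that decision as a pure function so it can be reasoned about without the FSM. It is a simplified illustration under stated assumptions: the `ChannelException` class here is a local stand-in for eclair's type, and `Outcome`, `ForceCloseLocally`, `StopNode` and `decide` are hypothetical names.

```scala
// Illustrative sketch only: simplified stand-ins, not eclair's actual classes.
object UnhandledExceptionDispatch {

  /** Local stand-in for eclair's ChannelException (known, expected channel errors). */
  class ChannelException(message: String) extends RuntimeException(message)

  sealed trait UnhandledExceptionStrategy
  object UnhandledExceptionStrategy {
    case object LocalClose extends UnhandledExceptionStrategy
    case object Stop extends UnhandledExceptionStrategy
  }

  sealed trait Outcome
  case object ForceCloseLocally extends Outcome
  case object StopNode extends Outcome

  def decide(cause: Throwable, strategy: UnhandledExceptionStrategy): Outcome = cause match {
    // known channel exception: the strategy is not consulted
    case _: ChannelException => ForceCloseLocally
    // anything else is treated as unhandled: apply the configured strategy
    case _ => strategy match {
      case UnhandledExceptionStrategy.LocalClose => ForceCloseLocally
      case UnhandledExceptionStrategy.Stop => StopNode
    }
  }
}
```

Note that, as in the diff, the `stop` strategy never affects known channel exceptions, so a peer sending invalid protocol messages still leads to a local force-close rather than a node shutdown.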
@@ -2561,9 +2619,28 @@ class Channel(val nodeParams: NodeParams, val wallet: OnChainChannelFunder, remo
25612619
}
25622620

25632621
private def handleOutdatedCommitment(channelReestablish: ChannelReestablish, d: HasCommitments) = {
2564-
val exc = PleasePublishYourCommitment(d.channelId)
2565-
val error = Error(d.channelId, exc.getMessage)
2566-
goto(WAIT_FOR_REMOTE_PUBLISH_FUTURE_COMMITMENT) using DATA_WAIT_FOR_REMOTE_PUBLISH_FUTURE_COMMITMENT(d.commitments, channelReestablish) storing() sending error
2622+
nodeParams.outdatedCommitmentStrategy match {
2623+
case OutdatedCommitmentStrategy.Stop if (TimestampSecond.now() - nodeParams.startTime) < 10.minutes =>
2624+
log.error("we just restarted and may have an outdated commitment: standard procedure would be to request our peer to force-close, but eclair has been configured to halt instead. Please ensure your database is up-to-date and restart eclair.")
2625+
NotificationsLogger.logFatalError(
2626+
s"""stopping node as configured strategy to outdated commitment for nodeId=$remoteNodeId channelId=${d.channelId}
2627+
|
2628+
|Eclair has been configured to shut down if a sync error is detected at restart, instead of requesting a
2629+
|force-close from the peer. This gives the operator a chance of avoiding an unnecessary mass force-close
2630+
|of channels.
2631+
|
2632+
|You should investigate why Eclair appears to be using outdated data. If it turns out that this is due to a
2633+
|misconfiguration, just fix it and restart the node. If however the data was really lost, then you should
2634+
|change the outdated commitment strategy to the default and restart the node: this will cause a force
2635+
|close of outdated channels, but there is no way to avoid that.
2636+
|""".stripMargin)
2637+
System.exit(1)
2638+
stop(FSM.Shutdown)
2639+
case _ =>
2640+
val exc = PleasePublishYourCommitment(d.channelId)
2641+
val error = Error(d.channelId, exc.getMessage)
2642+
goto(WAIT_FOR_REMOTE_PUBLISH_FUTURE_COMMITMENT) using DATA_WAIT_FOR_REMOTE_PUBLISH_FUTURE_COMMITMENT(d.commitments, channelReestablish) storing() sending error
2643+
}
25672644
}
25682645

25692646
/**
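
Review note: the guard in `handleOutdatedCommitment` only stops the node when the `stop` strategy is configured and the node started less than 10 minutes ago; otherwise it falls through to the default remote-close path. The following sketch reproduces that time-window check in isolation. It is an approximation for illustration: it uses `java.time` instead of eclair's `TimestampSecond`, and `RestartWindow`, `window` and `shouldStop` are hypothetical names.

```scala
import java.time.{Duration, Instant}

// Illustrative sketch only: java.time is used as a stand-in for eclair's TimestampSecond.
object RestartWindow {

  /** Period after startup during which the `stop` strategy applies (10 minutes in the diff). */
  val window: Duration = Duration.ofMinutes(10)

  /**
   * @param stopConfigured true if outdated-commitment-strategy is set to "stop"
   * @param startTime      when the node was started
   * @param now            current time, passed explicitly to keep the function testable
   */
  def shouldStop(stopConfigured: Boolean, startTime: Instant, now: Instant): Boolean =
    stopConfigured && Duration.between(startTime, now).compareTo(window) < 0
}
```

Passing `now` as a parameter, rather than reading the clock inside the function, makes the boundary behavior easy to check: exactly 10 minutes after startup the function already returns `false`, matching the strict `<` comparison in the diff.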

eclair-core/src/main/scala/fr/acinq/eclair/channel/Helpers.scala

Lines changed: 1 addition & 1 deletion
@@ -324,7 +324,7 @@ object Helpers {
324324
case class LocalLateUnproven(ourRemoteCommitmentNumber: Long, theirLocalCommitmentNumber: Long) extends Failure
325325
case class RemoteLying(ourLocalCommitmentNumber: Long, theirRemoteCommitmentNumber: Long, invalidPerCommitmentSecret: PrivateKey) extends Failure
326326
case object RemoteLate extends Failure
327-
}
327+
}
328328
// @formatter:on
329329

330330
/**

eclair-core/src/test/scala/fr/acinq/eclair/TestConstants.scala

Lines changed: 5 additions & 0 deletions
@@ -20,6 +20,7 @@ import fr.acinq.bitcoin.{Block, ByteVector32, Satoshi, SatoshiLong, Script}
2020
import fr.acinq.eclair.FeatureSupport.{Mandatory, Optional}
2121
import fr.acinq.eclair.Features._
2222
import fr.acinq.eclair.blockchain.fee._
23+
import fr.acinq.eclair.channel.Channel.{OutdatedCommitmentStrategy, UnhandledExceptionStrategy}
2324
import fr.acinq.eclair.channel.LocalParams
2425
import fr.acinq.eclair.crypto.keymanager.{LocalChannelKeyManager, LocalNodeKeyManager}
2526
import fr.acinq.eclair.io.{Peer, PeerConnection}
@@ -143,6 +144,8 @@ object TestConstants {
143144
feeProportionalMillionths = 30)),
144145
reserveToFundingRatio = 0.01, // note: not used (overridden below)
145146
maxReserveToFundingRatio = 0.05,
147+
outdatedCommitmentStrategy = OutdatedCommitmentStrategy.RemoteClose,
148+
unhandledExceptionStrategy = UnhandledExceptionStrategy.LocalClose,
146149
db = TestDatabases.inMemoryDb(),
147150
revocationTimeout = 20 seconds,
148151
autoReconnect = false,
@@ -269,6 +272,8 @@ object TestConstants {
269272
feeProportionalMillionths = 30)),
270273
reserveToFundingRatio = 0.01, // note: not used (overridden below)
271274
maxReserveToFundingRatio = 0.05,
275+
outdatedCommitmentStrategy = OutdatedCommitmentStrategy.RemoteClose,
276+
unhandledExceptionStrategy = UnhandledExceptionStrategy.LocalClose,
272277
db = TestDatabases.inMemoryDb(),
273278
revocationTimeout = 20 seconds,
274279
autoReconnect = false,
