add alternative strategies for disaster scenarios

pm47 · pm47 · commit 10fcea0e207d · 2021-10-27T11:51:35.000+02:00
This PRs adds strategies to handle two typical disaster scenarios: outdated commitments and unhandled exceptions.

Default strategies may be the best choice for smaller loosely administered nodes, while alternative strategies may avoid unnecessary mass force-close (but are reserved for advanced users who closely monitor the node).

Strategies for outdated commitments:
- request the counterparty to close the channel (default).
- if the node was restarted less than 10 min ago, log an error message and stop the node

Strategies for unhandled exceptions:
- local force close of the channel (default)
- log an error message and stop the node

Default settings maintain the same behavior as before.
diff --git a/docs/Advanced.md b/docs/Advanced.md
@@ -0,0 +1,41 @@
+# Advanced usage
+
+## Avoid mass force-close of channels
+
+In order to minimize force-closes of channels (especially for larger nodes), it is possible to customize the way eclair handles certain situations, like outdated commitment and internal errors.
+
+:warning: There is no magic: non-default strategies are a trade-off where it is assumed that the node is closely monitored. Instead of automatically reacting to some events, eclair will stop and await manual intervention. It is therefore reserved for advanced or professional node operators. Default strategies are best suited for smaller loosely administered nodes.
+
+### Outdated commitments
+
+The default behavior, when our peer tells us (or proves to us) that our channel commitment is outdated, is to request a remote force-close of the channel (a.k.a. recovery).
+
+It may happen that due to a misconfiguration, the node was accidentally restarted using e.g. an old backup, and the data wasn't really lost. In that case, simply fixing the configuration and restarting eclair would prevent a mass force-close of channels.
+
+This is why an alternative behavior is to simply log an error and stop the node. However, because our peer may be lying when it tells us that our channel commitment data is outdated, there is a 10 min window after restart when this strategy applies. After that, the node reverts to the default strategy.
+
+During the 10 min window, the operator should closely monitor the node and assess, if the peer stops, whether this is really a case of using outdated data, or a peer is just lying. If it turns out that the data is really outdated due to a misconfiguration, the operator has an opportunity to fix it and restart the node. If the data is really outdated because it was simply lost, then the operator should change the strategy to the default and restart the node: this will cause the force close of outdated channels, but there is no way to avoid that.
+
+Here is a decision tree:
+```
+if (node stops after restart)
+  if (false positive)
+    configure eclair to use default strategy and restart node (will force close channels to malicious peers)
+  else
+    if (more up-to-date data available)
+      configure eclair to point to proper database and restart node
+    else
+      configure eclair to use default strategy and restart node (will force close all outdated channels)
+```
+
+The alternate strategy can be configured by setting `eclair.outdated-commitment-strategy=stop`  (see [`reference.conf`](https://github.com/ACINQ/eclair/blob/master/eclair-core/src/main/resources/reference.conf)).
+
+### Unhandled exceptions
+
+The default behavior, when we encounter an unhandled exception or internal error, is to locally force-close the channel.
+
+Not only is there a delay before the channel balance gets refunded, but if the exception was due to some misconfiguration or bug in eclair that affects all channels, we risk force-closing all channels.
+
+This is why an alternative behavior is to simply log an error and stop the node. Note that if you don't closely monitor your node, there is a risk that your peers take advantage of the downtime to try and cheat by publishing a revoked commitment. Additionally, while there is no known way of triggering an internal error in eclair from the outside, there may very well be a bug that allows just that, which could be used as a way to remotely stop the node (with the default behavior, it would "only" cause a local force-close of the channel).
+
+The alternate strategy can be configured by setting `eclair.unhandled-exception-strategy=stop`  (see [`reference.conf`](https://github.com/ACINQ/eclair/blob/master/eclair-core/src/main/resources/reference.conf)).
diff --git a/docs/release-notes/eclair-vnext.md b/docs/release-notes/eclair-vnext.md
@@ -4,6 +4,14 @@
 
 ## Major changes
 
+### Advanced strategies to avoid mass force-close of channels
+
+In order to minimize force-closes of channels (especially for larger nodes), it is possible to customize the way eclair handles certain situations, like outdated commitment and internal errors.
+
+:warning: There is no magic: non-default strategies are a trade-off where it is assumed that the node is closely monitored. Instead of automatically reacting to some events, eclair will stop and await manual intervention. It is therefore reserved for advanced or professional node operators. Default strategies are best suited for smaller loosely administered nodes.
+
+This feature is documented [here](../Advanced.md).
+
 ### Separate log for important notifications
 
 Eclair added a new log file (`notifications.log`) for important notifications that require an action from the node operator.
diff --git a/eclair-core/src/main/resources/reference.conf b/eclair-core/src/main/resources/reference.conf
@@ -95,6 +95,10 @@ eclair {
   max-block-processing-delay = 30 seconds // we add a random delay before processing blocks, capped at this value, to prevent herd effect
   max-tx-publish-retry-delay = 60 seconds // we add a random delay before retrying failed transaction publication
 
+  // see docs/Advanced.md for more information on the strategies
+  outdated-commitment-strategy = "remote-close" // remote-close or stop (NB: the app will only stop if it was recently restarted)
+  unhandled-exception-strategy = "local-close" // local-close or stop
+
   relay {
     fees {
       // Fees for public channels
diff --git a/eclair-core/src/main/scala/fr/acinq/eclair/Logs.scala b/eclair-core/src/main/scala/fr/acinq/eclair/Logs.scala
@@ -167,6 +167,8 @@ object NotificationsLogger {
    */
   def logFatalError(message: String, t: Throwable): Unit = log.error(message, t)
 
+  def logFatalError(message: String): Unit = log.error(message)
+
   def apply(): Behavior[NotifyNodeOperator] =
     Behaviors.setup { context =>
       context.system.eventStream ! EventStream.Subscribe(context.self)
diff --git a/eclair-core/src/main/scala/fr/acinq/eclair/NodeParams.scala b/eclair-core/src/main/scala/fr/acinq/eclair/NodeParams.scala
@@ -22,6 +22,7 @@ import fr.acinq.bitcoin.{Block, ByteVector32, Crypto, Satoshi}
 import fr.acinq.eclair.Setup.Seeds
 import fr.acinq.eclair.blockchain.fee._
 import fr.acinq.eclair.channel.Channel
+import fr.acinq.eclair.channel.Channel.{OutdatedCommitmentStrategy, UnhandledExceptionStrategy}
 import fr.acinq.eclair.crypto.Noise.KeyPair
 import fr.acinq.eclair.crypto.keymanager.{ChannelKeyManager, NodeKeyManager}
 import fr.acinq.eclair.db._
@@ -76,6 +77,8 @@ case class NodeParams(nodeKeyManager: NodeKeyManager,
                       relayParams: RelayParams,
                       reserveToFundingRatio: Double,
                       maxReserveToFundingRatio: Double,
+                      outdatedCommitmentStrategy: OutdatedCommitmentStrategy,
+                      unhandledExceptionStrategy: UnhandledExceptionStrategy,
                       db: Databases,
                       revocationTimeout: FiniteDuration,
                       autoReconnect: Boolean,
@@ -95,6 +98,9 @@ case class NodeParams(nodeKeyManager: NodeKeyManager,
                       enableTrampolinePayment: Boolean,
                       balanceCheckInterval: FiniteDuration,
                       blockchainWatchdogSources: Seq[String]) {
+
+  val startTime: TimestampSecond = TimestampSecond.now()
+
   val privateKey: Crypto.PrivateKey = nodeKeyManager.nodeKey.privateKey
 
   val nodeId: PublicKey = nodeKeyManager.nodeId
@@ -357,6 +363,16 @@ object NodeParams extends Logging {
       PathFindingExperimentConf(experiments.toMap)
     }
 
+    val outdatedCommitmentStrategy = config.getString("outdated-commitment-strategy") match {
+      case "remote-close" => OutdatedCommitmentStrategy.RemoteClose
+      case "stop" => OutdatedCommitmentStrategy.Stop
+    }
+
+    val unhandledExceptionStrategy = config.getString("unhandled-exception-strategy") match {
+      case "local-close" => UnhandledExceptionStrategy.LocalClose
+      case "stop" => UnhandledExceptionStrategy.Stop
+    }
+
     val routerSyncEncodingType = config.getString("router.sync.encoding-type") match {
       case "uncompressed" => EncodingType.UNCOMPRESSED
       case "zlib" => EncodingType.COMPRESSED_ZLIB
@@ -423,6 +439,8 @@ object NodeParams extends Logging {
       ),
       reserveToFundingRatio = config.getDouble("reserve-to-funding-ratio"),
       maxReserveToFundingRatio = config.getDouble("max-reserve-to-funding-ratio"),
+      outdatedCommitmentStrategy = outdatedCommitmentStrategy,
+      unhandledExceptionStrategy = unhandledExceptionStrategy,
       db = database,
       revocationTimeout = FiniteDuration(config.getDuration("revocation-timeout").getSeconds, TimeUnit.SECONDS),
       autoReconnect = config.getBoolean("auto-reconnect"),
diff --git a/eclair-core/src/main/scala/fr/acinq/eclair/channel/Channel.scala b/eclair-core/src/main/scala/fr/acinq/eclair/channel/Channel.scala
@@ -127,6 +127,38 @@ object Channel {
   /** We don't immediately process [[CurrentBlockCount]] to avoid herd effects */
   case class ProcessCurrentBlockCount(c: CurrentBlockCount)
 
+  // @formatter:off
+  /** What do we do if we detect that our local commitment is outdated. */
+  sealed trait OutdatedCommitmentStrategy
+  object OutdatedCommitmentStrategy {
+    /**
+     * Ask our counterparty to close the channel, whenever our peer proves to us *or* simply tells us (could be lying)
+     * that we are using an outdated commitment.
+     * This may be the best choice for smaller loosely administered nodes.
+     */
+    case object RemoteClose extends OutdatedCommitmentStrategy
+    /**
+     * If the node was just restarted, just log an error and stop the app. The goal is to prevent unwanted mass
+     * force-close of channels if we accidentally restarted the node with an outdated backup. After a few minutes, we
+     * revert to the default behavior of requesting our peer to force close, otherwise this opens a huge attack vector
+     * where any peer can remotely stop our node.
+     * This strategy may be better suited for larger nodes, closely administered.
+     */
+    case object Stop extends OutdatedCommitmentStrategy
+  }
+  // @formatter:on
+
+  // @formatter:off
+  /** What do we do if we have a local unhandled exception. */
+  sealed trait UnhandledExceptionStrategy
+  object UnhandledExceptionStrategy {
+    /** Ask our counterparty to close the channel. This may be the best choice for smaller loosely administered nodes.*/
+    case object LocalClose extends UnhandledExceptionStrategy
+    /** Just log an error and stop the node. May be better for larger nodes, to prevent unwanted mass force-close.*/
+    case object Stop extends UnhandledExceptionStrategy
+  }
+  // @formatter:on
+
 }
 
 class Channel(val nodeParams: NodeParams, val wallet: OnChainChannelFunder, remoteNodeId: PublicKey, blockchain: typed.ActorRef[ZmqWatcher.Command], relayer: ActorRef, txPublisherFactory: Channel.TxPublisherFactory, origin_opt: Option[ActorRef] = None)(implicit ec: ExecutionContext = ExecutionContext.Implicits.global) extends FSM[ChannelState, ChannelData] with FSMDiagnosticActorLogging[ChannelState, ChannelData] {
@@ -1669,6 +1701,7 @@ class Channel(val nodeParams: NodeParams, val wallet: OnChainChannelFunder, remo
         case syncSuccess: SyncResult.Success =>
           var sendQueue = Queue.empty[LightningMessage]
           // normal case, our data is up-to-date
+
           if (channelReestablish.nextLocalCommitmentNumber == 1 && d.commitments.localCommit.index == 0) {
             // If next_local_commitment_number is 1 in both the channel_reestablish it sent and received, then the node MUST retransmit funding_locked, otherwise it MUST NOT
             log.debug("re-sending fundingLocked")
@@ -2288,7 +2321,8 @@ class Channel(val nodeParams: NodeParams, val wallet: OnChainChannelFunder, remo
 
   private def handleLocalError(cause: Throwable, d: ChannelData, msg: Option[Any]) = {
     cause match {
-      case _: ForcedLocalCommit => log.warning(s"force-closing channel at user request")
+      case _: ForcedLocalCommit =>
+        log.warning(s"force-closing channel at user request")
       case _ if msg.exists(_.isInstanceOf[OpenChannel]) || msg.exists(_.isInstanceOf[AcceptChannel]) =>
         // invalid remote channel parameters are logged as warning
         log.warning(s"${cause.getMessage} while processing msg=${msg.getOrElse("n/a").getClass.getSimpleName} in state=$stateName")
@@ -2308,7 +2342,31 @@ class Channel(val nodeParams: NodeParams, val wallet: OnChainChannelFunder, remo
         log.info(s"we have a valid closing tx, publishing it instead of our commitment: closingTxId=${bestUnpublishedClosingTx.tx.txid}")
         // if we were in the process of closing and already received a closing sig from the counterparty, it's always better to use that
         handleMutualClose(bestUnpublishedClosingTx, Left(negotiating))
-      case dd: HasCommitments => spendLocalCurrent(dd) sending error // otherwise we use our current commitment
+      case dd: HasCommitments =>
+        cause match {
+          case _: ChannelException =>
+            // known channel exception: we force close using our current commitment
+            spendLocalCurrent(dd) sending error
+          case _ =>
+            // unhandled exception: we apply the configured strategy
+            nodeParams.unhandledExceptionStrategy match {
+              case UnhandledExceptionStrategy.LocalClose =>
+                spendLocalCurrent(dd) sending error
+              case UnhandledExceptionStrategy.Stop =>
+                log.error("unhandled exception: standard procedure would be to force-close the channel, but eclair has been configured to halt instead.")
+                NotificationsLogger.logFatalError(
+                  s"""stopping node as configured strategy to unhandled exceptions for nodeId=$remoteNodeId channelId=${d.channelId}
+                     |
+                     |Eclair has been configured to shut down when an unhandled exception happens, instead of requesting a
+                     |force-close from the peer. This gives the operator a chance of avoiding an unnecessary mass force-close
+                     |of channels that may be caused by a bug in Eclair, or issues like running out of disk space, etc.
+                     |
+                     |You should get in touch with Eclair developers and provide logs of your node for analysis.
+                     |""".stripMargin)
+                System.exit(1)
+                stop(FSM.Shutdown)
+            }
+        }
       case _ => goto(CLOSED) sending error // when there is no commitment yet, we just send an error to our peer and go to CLOSED state
     }
   }
@@ -2561,9 +2619,28 @@ class Channel(val nodeParams: NodeParams, val wallet: OnChainChannelFunder, remo
   }
 
   private def handleOutdatedCommitment(channelReestablish: ChannelReestablish, d: HasCommitments) = {
-    val exc = PleasePublishYourCommitment(d.channelId)
-    val error = Error(d.channelId, exc.getMessage)
-    goto(WAIT_FOR_REMOTE_PUBLISH_FUTURE_COMMITMENT) using DATA_WAIT_FOR_REMOTE_PUBLISH_FUTURE_COMMITMENT(d.commitments, channelReestablish) storing() sending error
+    nodeParams.outdatedCommitmentStrategy match {
+      case OutdatedCommitmentStrategy.Stop if (TimestampSecond.now() - nodeParams.startTime) < 10.minutes =>
+        log.error("we just restarted and may have an outdated commitment: standard procedure would be to request our peer to force-close, but eclair has been configured to halt instead. Please ensure your database is up-to-date and restart eclair.")
+        NotificationsLogger.logFatalError(
+          s"""stopping node as configured strategy to outdated commitment for nodeId=$remoteNodeId channelId=${d.channelId}
+             |
+             |Eclair has been configured to shut down if a sync error is detected at restart, instead of requesting a
+             |force-close from the peer. This gives the operator a chance of avoiding an unnecessary mass force-close
+             |of channels.
+             |
+             |You should investigate why Eclair appears to be using outdated data. If it turns out that this is due to a
+             |misconfiguration, just fix it and restart the node. If however the data was really lost, then you should
+             |change the outdated commitment strategy to the default and restart the node: this will cause a force
+             |close of outdated channels, but there is no way to avoid that.
+             |""".stripMargin)
+        System.exit(1)
+        stop(FSM.Shutdown)
+      case _ =>
+        val exc = PleasePublishYourCommitment(d.channelId)
+        val error = Error(d.channelId, exc.getMessage)
+        goto(WAIT_FOR_REMOTE_PUBLISH_FUTURE_COMMITMENT) using DATA_WAIT_FOR_REMOTE_PUBLISH_FUTURE_COMMITMENT(d.commitments, channelReestablish) storing() sending error
+    }
   }
 
   /**
diff --git a/eclair-core/src/main/scala/fr/acinq/eclair/channel/Helpers.scala b/eclair-core/src/main/scala/fr/acinq/eclair/channel/Helpers.scala
@@ -324,7 +324,7 @@ object Helpers {
       case class LocalLateUnproven(ourRemoteCommitmentNumber: Long, theirLocalCommitmentNumber: Long) extends Failure
       case class RemoteLying(ourLocalCommitmentNumber: Long, theirRemoteCommitmentNumber: Long, invalidPerCommitmentSecret: PrivateKey) extends Failure
       case object RemoteLate extends Failure
-  }
+    }
     // @formatter:on
 
     /**
diff --git a/eclair-core/src/test/scala/fr/acinq/eclair/TestConstants.scala b/eclair-core/src/test/scala/fr/acinq/eclair/TestConstants.scala
@@ -20,6 +20,7 @@ import fr.acinq.bitcoin.{Block, ByteVector32, Satoshi, SatoshiLong, Script}
 import fr.acinq.eclair.FeatureSupport.{Mandatory, Optional}
 import fr.acinq.eclair.Features._
 import fr.acinq.eclair.blockchain.fee._
+import fr.acinq.eclair.channel.Channel.{OutdatedCommitmentStrategy, UnhandledExceptionStrategy}
 import fr.acinq.eclair.channel.LocalParams
 import fr.acinq.eclair.crypto.keymanager.{LocalChannelKeyManager, LocalNodeKeyManager}
 import fr.acinq.eclair.io.{Peer, PeerConnection}
@@ -143,6 +144,8 @@ object TestConstants {
           feeProportionalMillionths = 30)),
       reserveToFundingRatio = 0.01, // note: not used (overridden below)
       maxReserveToFundingRatio = 0.05,
+      outdatedCommitmentStrategy = OutdatedCommitmentStrategy.RemoteClose,
+      unhandledExceptionStrategy = UnhandledExceptionStrategy.LocalClose,
       db = TestDatabases.inMemoryDb(),
       revocationTimeout = 20 seconds,
       autoReconnect = false,
@@ -269,6 +272,8 @@ object TestConstants {
           feeProportionalMillionths = 30)),
       reserveToFundingRatio = 0.01, // note: not used (overridden below)
       maxReserveToFundingRatio = 0.05,
+      outdatedCommitmentStrategy = OutdatedCommitmentStrategy.RemoteClose,
+      unhandledExceptionStrategy = UnhandledExceptionStrategy.LocalClose,
       db = TestDatabases.inMemoryDb(),
       revocationTimeout = 20 seconds,
       autoReconnect = false,