
Commit e6966ce

Do not broadcast commitment txn on Permanent mon update failure
See the doc updates for more info on the edge case this prevents; there isn't really a strong reason why we would need to broadcast the latest state immediately. Specifically, in the case of HTLC claims (the most important reason to ensure we have state on chain if it cannot be persisted), we will still force-close if there are HTLCs which need claiming and are going to expire.

Surprisingly, no existing tests failed as a result of this change, so a new one has been added.
1 parent 7aa2cac commit e6966ce
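
For context, a minimal sketch (not part of this commit) of what an application observes after this change: the channel closes with the same ClosureReason the new test checks for, but nothing is handed to the broadcaster. The event-handler shape and module paths are assumptions for this version of the crate.

use lightning::util::events::{ClosureReason, Event};

// Hypothetical event handler: a monitor-storage failure now surfaces only as
// a ChannelClosed event; the commitment transaction is NOT broadcast. If the
// monitor data is unrecoverable, broadcast manually via
// ChannelMonitor::get_latest_holder_commitment_txn (documented below).
fn handle_event(event: &Event) {
    if let Event::ChannelClosed { reason: ClosureReason::ProcessingError { err }, .. } = event {
        eprintln!("channel force-closed without broadcast: {}", err);
    }
}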

File tree

3 files changed: +71 -52 lines changed

lightning/src/chain/channelmonitor.rs

+59 -42
@@ -123,56 +123,67 @@ pub enum ChannelMonitorUpdateErr {
     ///
     /// Such a failure will "freeze" a channel, preventing us from revoking old states or
     /// submitting new commitment transactions to the counterparty. Once the update(s) which failed
-    /// have been successfully applied, ChannelManager::channel_monitor_updated can be used to
+    /// have been successfully applied, [`ChannelManager::channel_monitor_updated`] can be used to
     /// restore the channel to an operational state.
     ///
-    /// Note that a given ChannelManager will *never* re-generate a given ChannelMonitorUpdate. If
-    /// you return a TemporaryFailure you must ensure that it is written to disk safely before
-    /// writing out the latest ChannelManager state.
+    /// Note that a given [`ChannelManager`] will *never* re-generate a [`ChannelMonitorUpdate`].
+    /// If you return this error you must ensure that it is written to disk safely before writing
+    /// the latest [`ChannelManager`] state, or you should return [`PermanentFailure`] instead.
     ///
-    /// Even when a channel has been "frozen" updates to the ChannelMonitor can continue to occur
-    /// (eg if an inbound HTLC which we forwarded was claimed upstream resulting in us attempting
-    /// to claim it on this channel) and those updates must be applied wherever they can be. At
-    /// least one such updated ChannelMonitor must be persisted otherwise PermanentFailure should
-    /// be returned to get things on-chain ASAP using only the in-memory copy. Obviously updates to
-    /// the channel which would invalidate previous ChannelMonitors are not made when a channel has
-    /// been "frozen".
+    /// Even when a channel has been "frozen" updates to the [`ChannelMonitor`] can continue to
+    /// occur (e.g. if an inbound HTLC which we forwarded was claimed upstream resulting in us
+    /// attempting to claim it on this channel) and those updates must still be persisted.
     ///
-    /// Note that even if updates made after TemporaryFailure succeed you must still call
-    /// channel_monitor_updated to ensure you have the latest monitor and re-enable normal channel
-    /// operation.
-    ///
-    /// Note that the update being processed here will not be replayed for you when you call
-    /// ChannelManager::channel_monitor_updated, so you must store the update itself along
-    /// with the persisted ChannelMonitor on your own local disk prior to returning a
-    /// TemporaryFailure. You may, of course, employ a journaling approach, storing only the
-    /// ChannelMonitorUpdate on disk without updating the monitor itself, replaying the journal at
-    /// reload-time.
+    /// No updates to the channel will be made which could invalidate other [`ChannelMonitor`]s
+    /// until [`ChannelManager::channel_monitor_updated`] is called, even if you return no error on
+    /// a later monitor update for the same channel.
     ///
     /// For deployments where a copy of ChannelMonitors and other local state are backed up in a
     /// remote location (with local copies persisted immediately), it is anticipated that all
     /// updates will return TemporaryFailure until the remote copies could be updated.
+    ///
+    /// [`PermanentFailure`]: ChannelMonitorUpdateErr::PermanentFailure
+    /// [`ChannelManager`]: crate::ln::channelmanager::ChannelManager
+    /// [`ChannelManager::channel_monitor_updated`]: crate::ln::channelmanager::ChannelManager::channel_monitor_updated
     TemporaryFailure,
-    /// Used to indicate no further channel monitor updates will be allowed (eg we've moved on to a
-    /// different watchtower and cannot update with all watchtowers that were previously informed
-    /// of this channel).
+    /// Used to indicate no further channel monitor updates will be allowed (likely a disk failure
+    /// or, e.g. we've moved on to a different watchtower and cannot update with all watchtowers
+    /// that were previously informed of this channel).
+    ///
+    /// When this is returned [`ChannelManager`] will force-close the channel but *not* broadcast
+    /// our current commitment transaction. This avoids a dangerous case where a local disk failure
+    /// (e.g. the Linux-default remounting of the disk as read-only) causes [`PermanentFailure`]s
+    /// for all monitor updates. If we were to broadcast our latest commitment transaction and then
+    /// restart, we could end up reading a previous [`ChannelMonitor`] and [`ChannelManager`],
+    /// revoking our now-broadcasted state before seeing it confirm and losing all our funds.
+    ///
+    /// Note that this is somewhat of a tradeoff - if the disk is really gone and we may have lost
+    /// the data permanently, we really should broadcast immediately. If the data can be recovered
+    /// with manual intervention, we'd rather close the channel, rejecting future updates to it,
+    /// and broadcast the latest state only if we have HTLCs to claim which are timing out (which
+    /// we do as long as blocks are connected).
     ///
-    /// At reception of this error, ChannelManager will force-close the channel and return at
-    /// least a final ChannelMonitorUpdate::ChannelForceClosed which must be delivered to at
-    /// least one ChannelMonitor copy. Revocation secret MUST NOT be released and offchain channel
-    /// update must be rejected.
+    /// In order to broadcast the latest local commitment transaction, you'll need to call
+    /// [`ChannelMonitor::get_latest_holder_commitment_txn`] once you've safely ensured no further
+    /// off-chain updates to the channel can occur.
     ///
-    /// This failure may also signal a failure to update the local persisted copy of one of
-    /// the channel monitor instance.
+    /// Note that at least one final [`ChannelMonitorUpdate`] may still be provided, which must
+    /// still be processed by a running [`ChannelMonitor`]. This final update will mark the
+    /// [`ChannelMonitor`] as finalized, ensuring no further updates (e.g. revocation of the latest
+    /// commitment transaction) are allowed.
     ///
-    /// Note that even when you fail a holder commitment transaction update, you must store the
-    /// update to ensure you can claim from it in case of a duplicate copy of this ChannelMonitor
-    /// broadcasts it (e.g distributed channel-monitor deployment)
+    /// Note that even if you return a [`PermanentFailure`] due to unavailability of secondary
+    /// [`ChannelMonitor`] copies, you should still make an attempt to store the update where
+    /// possible to ensure you can claim HTLC outputs on the latest commitment transaction
+    /// broadcasted later.
     ///
     /// In case of distributed watchtowers deployment, the new version must be written to disk, as
     /// state may have been stored but rejected due to a block forcing a commitment broadcast. This
     /// storage is used to claim outputs of rejected state confirmed onchain by another watchtower,
     /// lagging behind on block processing.
+    ///
+    /// [`PermanentFailure`]: ChannelMonitorUpdateErr::PermanentFailure
+    /// [`ChannelManager`]: crate::ln::channelmanager::ChannelManager
     PermanentFailure,
 }
 
@@ -1203,14 +1214,18 @@ impl<Signer: Sign> ChannelMonitor<Signer> {
     }
 
     /// Used by ChannelManager deserialization to broadcast the latest holder state if its copy of
-    /// the Channel was out-of-date. You may use it to get a broadcastable holder toxic tx in case of
-    /// fallen-behind, i.e when receiving a channel_reestablish with a proof that our counterparty side knows
-    /// a higher revocation secret than the holder commitment number we are aware of. Broadcasting these
-    /// transactions are UNSAFE, as they allow counterparty side to punish you. Nevertheless you may want to
-    /// broadcast them if counterparty don't close channel with his higher commitment transaction after a
-    /// substantial amount of time (a month or even a year) to get back funds. Best may be to contact
-    /// out-of-band the other node operator to coordinate with him if option is available to you.
-    /// In any-case, choice is up to the user.
+    /// the Channel was out-of-date.
+    ///
+    /// You may also use this to broadcast the latest local commitment transaction, either because
+    /// a monitor update failed with [`ChannelMonitorUpdateErr::PermanentFailure`] or because we've
+    /// fallen behind (i.e. we've received proof that our counterparty knows a revocation
+    /// secret we gave them that they shouldn't know).
+    ///
+    /// Broadcasting these transactions in the second case is UNSAFE, as they allow the
+    /// counterparty to punish you. Nevertheless you may want to broadcast them if the
+    /// counterparty doesn't close the channel with their commitment transaction after a
+    /// substantial amount of time. It may be best to contact the other node operator
+    /// out-of-band to coordinate other options available to you. In any case, the choice is up to you.
     pub fn get_latest_holder_commitment_txn<L: Deref>(&self, logger: &L) -> Vec<Transaction>
     where L::Target: Logger {
         self.inner.lock().unwrap().get_latest_holder_commitment_txn(logger)
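
A minimal sketch of the manual broadcast path these docs point to, assuming the crate's module layout at this version and that you have safely ensured no further off-chain updates to the channel can occur; the helper name is hypothetical.

use std::ops::Deref;
use lightning::chain::chaininterface::BroadcasterInterface;
use lightning::chain::channelmonitor::ChannelMonitor;
use lightning::chain::keysinterface::Sign;
use lightning::util::logger::Logger;

// Fetch the latest holder commitment transaction (plus related claim
// transactions) and hand them to the broadcaster. After a PermanentFailure
// this is the operator's job; the ChannelManager no longer does it
// automatically.
fn broadcast_latest_state<S: Sign, B: BroadcasterInterface, L: Deref>(
    monitor: &ChannelMonitor<S>, broadcaster: &B, logger: &L,
) where L::Target: Logger {
    for tx in monitor.get_latest_holder_commitment_txn(logger) {
        broadcaster.broadcast_transaction(&tx);
    }
}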
@@ -1833,7 +1848,9 @@ impl<Signer: Sign> ChannelMonitorImpl<Signer> {
                 if *should_broadcast {
                     self.broadcast_latest_holder_commitment_txn(broadcaster, logger);
                 } else if !self.holder_tx_signed {
-                    log_error!(logger, "You have a toxic holder commitment transaction avaible in channel monitor, read comment in ChannelMonitor::get_latest_holder_commitment_txn to be informed of manual action to take");
+                    log_error!(logger, "WARNING: You have a potentially-toxic holder commitment transaction available to broadcast");
+                    log_error!(logger, " in channel monitor for channel {}!", log_bytes!(self.funding_info.0.to_channel_id()));
+                    log_error!(logger, " Read the docs for ChannelMonitor::get_latest_holder_commitment_txn and take manual action!");
                 } else {
                     // If we generated a MonitorEvent::CommitmentTxConfirmed, the ChannelManager
                     // will still give us a ChannelForceClosed event with !should_broadcast, but we

lightning/src/ln/chanmon_update_fail_tests.rs

+10 -8
@@ -77,13 +77,23 @@ fn do_test_simple_monitor_permanent_update_fail(persister_fail: bool) {
         _ => panic!("Unexpected event"),
     };
 
+    assert!(nodes[0].tx_broadcaster.txn_broadcasted.lock().unwrap().is_empty());
+
     // TODO: Once we hit the chain with the failure transaction we should check that we get a
     // PaymentPathFailed event
 
     assert_eq!(nodes[0].node.list_channels().len(), 0);
     check_closed_event!(nodes[0], 1, ClosureReason::ProcessingError { err: "ChannelMonitor storage failure".to_string() });
 }
 
+#[test]
+fn test_simple_monitor_permanent_update_fail() {
+    do_test_simple_monitor_permanent_update_fail(false);
+
+    // Test behavior when the persister returns a PermanentFailure.
+    do_test_simple_monitor_permanent_update_fail(true);
+}
+
 #[test]
 fn test_monitor_and_persister_update_fail() {
     // Test that if both updating the `ChannelMonitor` and persisting the updated
@@ -158,14 +168,6 @@ fn test_monitor_and_persister_update_fail() {
     assert_eq!(events.len(), 1);
 }
 
-#[test]
-fn test_simple_monitor_permanent_update_fail() {
-    do_test_simple_monitor_permanent_update_fail(false);
-
-    // Test behavior when the persister returns a PermanentFailure.
-    do_test_simple_monitor_permanent_update_fail(true);
-}
-
 // If persister_fail is true, we have the persister return a TemporaryFailure instead of the
 // higher-level ChainMonitor.
 fn do_test_simple_monitor_temporary_update_fail(disconnect: bool, persister_fail: bool) {

lightning/src/ln/channelmanager.rs

+2 -2
@@ -1014,7 +1014,7 @@ macro_rules! handle_monitor_err {
                 // given up the preimage yet, so might as well just wait until the payment is
                 // retried, avoiding the on-chain fees.
                 let res: Result<(), _> = Err(MsgHandleErrInternal::from_finish_shutdown("ChannelMonitor storage failure".to_owned(), *$chan_id,
-                        $chan.force_shutdown(true), $self.get_channel_update_for_broadcast(&$chan).ok() ));
+                        $chan.force_shutdown(false), $self.get_channel_update_for_broadcast(&$chan).ok() ));
                 (res, true)
             },
             ChannelMonitorUpdateErr::TemporaryFailure => {
@@ -3499,7 +3499,7 @@ impl<Signer: Sign, M: Deref, T: Deref, K: Deref, F: Deref, L: Deref> ChannelMana
                     // We do not do a force-close here as that would generate a monitor update for
                     // a monitor that we didn't manage to store (and that we don't care about - we
                     // don't respond with the funding_signed so the channel can never go on chain).
-                    let (_monitor_update, failed_htlcs) = chan.force_shutdown(true);
+                    let (_monitor_update, failed_htlcs) = chan.force_shutdown(false);
                     assert!(failed_htlcs.is_empty());
                     return Err(MsgHandleErrInternal::send_err_msg_no_close("ChannelMonitor storage failure".to_owned(), funding_msg.channel_id));
                 },
