Pubsub flood due to same message propagated multiple times

### Checklist

- [X] This is a bug report, not a question. Ask questions on [discuss.ipfs.io](https://discuss.ipfs.io).
- [X] I have searched on the [issue tracker](https://github.com/ipfs/kubo/issues?q=is%3Aissue) for my bug.
- [X] I am running the latest [kubo version](https://dist.ipfs.tech/#kubo) or have an issue updating.

### Installation method

ipfs-update or dist.ipfs.tech

### Version

```Text
It's a spread. Different nodes on the network are running different versions, and we cannot control when node operators upgrade their nodes.
```


### Config

```json
n/a
```


### Description

<h1>Incident report from 3Box Labs (Ceramic) Team </h1>

<h2>Incident summary</h2>
The Ceramic pubsub topic has been flooded with pubsub messages well beyond our usual load for the last several days. We log every pubsub message received by the nodes we run, and analysis of those logs with LogInsights shows that we are receiving messages with the exact same `seqno` multiple times: a single message can show up 15 or more times in an hour. During normal operation we do not see the same `seqno` delivered repeatedly like this. The dramatic increase in the number of messages that need processing puts excess load on our nodes and is causing major performance problems, even with as much caching and de-duplication as we can do at our layer.

<h2>Evidence of the issue</h2>

Graph of our incoming pubsub message activity showing how the number of messages spiked way up a few days ago.  The rate before 2/20 was our normal, expected amount of traffic:
<img width="1358" alt="Screen Shot 2023-02-25 at 1 47 14 PM" src="https://user-images.githubusercontent.com/870767/221379056-d1f8d222-c69c-403a-9a84-ef9beeb9b791.png">

AWS LogInsights Query demonstrating how the majority of this increased traffic is due to seeing the same message (with the same seqno) re-delivered multiple times.  Before the spike we never saw a `msg_count` greater than 2.

<img width="1026" alt="Screen Shot 2023-02-25 at 1 49 09 PM" src="https://user-images.githubusercontent.com/870767/221379083-607e8122-9625-499c-a499-c99034c9a130.png">
<img width="617" alt="Screen Shot 2023-02-25 at 1 49 27 PM" src="https://user-images.githubusercontent.com/870767/221379089-612e1702-ee89-4d7c-ab66-b1056cacf323.png">


<h2>Steps to reproduce</h2>
Connect to the gossipsub topic `/ceramic/mainnet`. Observe the incoming messages and keep track of how many times you see a message with each `seqno`. Over the span of an hour, you will see the same message with the same `seqno` delivered multiple times.
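The tally described above can be sketched roughly as follows. This is a minimal illustration, not the tooling we actually use: it assumes each logged message is available as a record with `from` and `seqno` fields (a hypothetical layout), and the function names `count_deliveries` and `duplicated` are made up for this example. It keys on the publisher as well as the `seqno`, since sequence numbers are assigned per publisher. The threshold of 2 mirrors the pre-spike observation that `msg_count` never exceeded 2.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping, Tuple

Key = Tuple[str, str]  # (publishing peer, seqno)


def count_deliveries(messages: Iterable[Mapping[str, str]]) -> Counter:
    """Count how many times each (from, seqno) pair was delivered."""
    counts: Counter = Counter()
    for msg in messages:
        # Hypothetical record layout: "from" is the publishing peer ID,
        # "seqno" is the pubsub sequence number as it appears in the log.
        counts[(msg["from"], msg["seqno"])] += 1
    return counts


def duplicated(counts: Counter, threshold: int = 2) -> Dict[Key, int]:
    """Keep only the messages delivered more than `threshold` times."""
    return {key: n for key, n in counts.items() if n > threshold}


# Illustrative records only, not real Ceramic traffic.
sample = [
    {"from": "peerA", "seqno": "01"},
    {"from": "peerA", "seqno": "01"},
    {"from": "peerA", "seqno": "01"},
    {"from": "peerB", "seqno": "01"},
    {"from": "peerA", "seqno": "02"},
]
counts = count_deliveries(sample)
print(duplicated(counts))
```

Feeding an hour of logged messages through `count_deliveries` and then `duplicated` should surface the repeated deliveries described in this report.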


<h2>Historical context</h2>
We have seen this happen before. In fact, it has happened to us several times over the last year, and we've reported it to PL multiple times. You can see our original report here (at the time we were still using js-ipfs): https://github.com/libp2p/js-libp2p/issues/1043.  When this happened again after we had migrated to go-ipfs, we reported it again, this time on Slack: https://filecoinproject.slack.com/archives/C025ZN5LNV8/p1661459082059149?thread_ts=1661459082.059149&cid=C025ZN5LNV8

We have since discovered a bug in how go-libp2p-pubsub maintained the seenMessage cache and worked to get a fix into kubo 0.18.1: https://github.com/libp2p/go-libp2p-pubsub/issues/502
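To illustrate in general terms why the expiry policy of a seen-message cache matters for duplicate delivery, here is a toy sketch. It is not the go-libp2p-pubsub implementation and does not claim to reproduce the exact bug in the linked issue; the `SeenCache` class, the `refresh_on_sighting` flag, and the timing values are all hypothetical and chosen only for the example.

```python
from typing import Dict, Iterable


class SeenCache:
    """Toy time-based seen-message cache (illustrative only)."""

    def __init__(self, ttl: float, refresh_on_sighting: bool) -> None:
        self.ttl = ttl
        self.refresh_on_sighting = refresh_on_sighting
        self._expires: Dict[str, float] = {}  # message ID -> expiry time

    def first_time_seen(self, msg_id: str, now: float) -> bool:
        """Return True if msg_id is treated as new at time `now`."""
        expiry = self._expires.get(msg_id)
        if expiry is not None and now < expiry:
            # Still cached: treat as a duplicate and drop it.
            if self.refresh_on_sighting:
                self._expires[msg_id] = now + self.ttl
            return False
        # Unknown or expired: accept it (and it would be forwarded again).
        self._expires[msg_id] = now + self.ttl
        return True


def count_accepted(cache: SeenCache, sightings: Iterable[float]) -> int:
    """Number of times one message is accepted as new across sightings."""
    return sum(cache.first_time_seen("m1", t) for t in sightings)


# One message arriving repeatedly (times in seconds, arbitrary values).
sightings = [0, 60, 110, 130, 170, 250]
fixed = count_accepted(SeenCache(ttl=120, refresh_on_sighting=False), sightings)
sliding = count_accepted(SeenCache(ttl=120, refresh_on_sighting=True), sightings)
print(fixed, sliding)
```

In this toy model, a cache whose entries expire a fixed time after the first sighting accepts the same message again once the entry lapses, while a cache that extends the entry on every sighting keeps rejecting it for as long as copies keep arriving.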

We have updated our nodes to 0.18.1, but of course we have no direct control over what versions of ipfs/kubo the rest of the nodes on the Ceramic network are running. Even if the above bugfix would resolve the issue once every single node on the network upgraded to it, we have no real way to enforce that and no idea how long it will be (if ever) before there are no older ipfs nodes participating in our pubsub topic. There is also the possibility of a malicious node connecting to our pubsub topic and publishing a large volume of bogus messages (or re-broadcasting valid messages). So no matter what, we need a way to respond to incidents like this that goes beyond "get your users to upgrade to the newest kubo and pray that that makes the problem go away", which is what we've been told every time we've reported this issue so far.

<h2>Our request from Protocol Labs</h2>

This is an extremely severe incident that has affected us multiple times over the last year.  It strikes without warning and leaves our network crippled.  Every previous time this happened it cleared up on its own within a day or so, but this one has been going on for 5 days now without letting up.  We need **some way** to respond to incidents like this, and to potential malicious attacks in the future where someone intentionally floods our network with pubsub traffic.

So our questions for PL are:
* What short term options can we take on the nodes that we operate, or that we can tell our community of node operators to take, to resolve this immediate issue that is currently affecting our production network?
* Do you have any tools for inspecting the p2p network that would let us identify which node(s) are the source of the issue?  If we knew that the issue was because of one problematic node that was, for instance, running a very old version of ipfs or running on very underpowered hardware, we could potentially reach out to them directly and get them to upgrade or take down their node.  Or perhaps we could tell existing node operators to block connections from that problematic peer.
* What is the recommended way in general to respond to nodes that (either intentionally through malice or accidentally through a bug) spam a pubsub topic with bogus or re-broadcast messages?
* What additional steps can we take going forward to prepare our network to be more resilient to issues like this in the future?


Thank you for your time and attention to this important issue!

-Spencer, Ceramic Protocol Engineer

Pubsub flood due to same message propagated multiple times #9665
