Skip to content

Conversation

pippellia-btc
Copy link
Contributor

I am running go-nostr v0.49.7 and recently got multiple times this panic:

May 04 22:47:20 vertex crawler[13698]: panic: runtime error: invalid memory address or nil pointer dereference
May 04 22:47:20 vertex crawler[13698]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x6c08d4]
May 04 22:47:20 vertex crawler[13698]: goroutine 2272862 [running]:
May 04 22:47:20 vertex crawler[13698]: github.com/nbd-wtf/go-nostr.(*Connection).WriteMessage(0x4002499f38?, {0xa9cbf0?, 0x4000164d20?}, {0x4000c89200?, 0x88d280?, 0x40013a8101?})
May 04 22:47:20 vertex crawler[13698]:         /home/admin/repos/working/go-nostr/connection.go:33 +0x24
May 04 22:47:20 vertex crawler[13698]: github.com/nbd-wtf/go-nostr.(*Relay).ConnectWithTLS.func2()
May 04 22:47:20 vertex crawler[13698]:         /home/admin/repos/working/go-nostr/relay.go:200 +0x168
May 04 22:47:20 vertex crawler[13698]: created by github.com/nbd-wtf/go-nostr.(*Relay).ConnectWithTLS in goroutine 27
May 04 22:47:20 vertex crawler[13698]:         /home/admin/repos/working/go-nostr/relay.go:185 +0x1d4

The function ConnectWithTLS is basically equivalent to the one in the go-nostr v0.51.10, and that's why I am opening this PR. I have to say I was NOT able to reproduce the issue, but looking at the code there is a clear candidate.

// to be used when the connection is closed
	go func() {
		<-r.connectionContext.Done()

		// stop the ticker
		ticker.Stop()

		// nil the connection
		r.Connection = nil

		// close all subscriptions
		for _, sub := range r.Subscriptions.Range {
			sub.unsub(fmt.Errorf("relay connection closed: %w / %w", context.Cause(r.connectionContext), r.ConnectionError))
		}
	}()

	// queue all write operations here so we don't do mutex spaghetti
	go func() {
		for {
			select {
			case <-ticker.C:
				if r.Connection != nil {
					err := r.Connection.Ping(r.connectionContext)
					if err != nil && !strings.Contains(err.Error(), "failed to wait for pong") {
						InfoLogger.Printf("{%s} error writing ping: %v; closing websocket", r.URL, err)
						r.Close() // this should trigger a context cancelation
						return
					}
				}
			case writeRequest := <-r.writeQueue:
				// all write requests will go through this to prevent races
				debugLogf("{%s} sending %v\n", r.URL, string(writeRequest.msg))
				if err := r.Connection.WriteMessage(r.connectionContext, writeRequest.msg); err != nil {
					writeRequest.answer <- err
				}
				close(writeRequest.answer)
			case <-r.connectionContext.Done():
				// stop here
				return
			}
		}
	}()

It's clear that here we have a race condition. Imagine r.connectionContext gets cancelled at the same time writeRequest is received from the channel.

Then, (at least) two conditions are true in the select statement, one of which will randomly fire. Let's say it fires the channel, then we have a race condition.

The first goroutine will set to nil the Connection the relay is using with WriteMessage.
This PR should solve the issue.

I also removed the if in the

	case <-ticker.C:
		if r.Connection != nil {
			//...

because it doesn't solve the problem: r.Connection might be non-nil when evaluated in the if, and immediately after could be set to nil by the first go-routine.

@fiatjaf fiatjaf merged commit 48bc94a into nbd-wtf:master May 7, 2025
1 check failed
@fiatjaf
Copy link
Collaborator

fiatjaf commented May 7, 2025

Thank you, I've tried to fix this many times but this solution had never occurred to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants