CASSGO-41 Deadlock in refreshDebouncer when reconnection fails #1752

@kevinkyyro

Description

What version of Cassandra are you using?

astra-classic

What version of Gocql are you using?

v1.6.0

What version of Go are you using?

1.21

What did you do?

Connection errors, which I believe were caused by overload, led to frequent reconnection attempts and failures.

What did you expect to see?

The client should keep retrying until the connection succeeds.

What did you see instead?

Deadlock

goroutine 1324045437 [chan send, 113 minutes]:
github.com/gocql/gocql.(*refreshDebouncer).stop(0xc0b826a7c0)
        /go/pkg/mod/github.com/gocql/gocql@v1.6.0/host_source.go:848 +0x8c
github.com/gocql/gocql.(*Session).Close(0xc03efb0c00)
        /go/pkg/mod/github.com/gocql/gocql@v1.6.0/session.go:494 +0x105
github.com/gocql/gocql.NewSession({{0xc24be58930, 0x3, 0x3}, {0x2ef55cf, 0x5}, 0x4, 0x12a05f200, 0x12a05f200, 0x0, 0x755a, ...})
        /go/pkg/mod/github.com/gocql/gocql@v1.6.0/session.go:180 +0x98d
github.com/gocql/gocql.(*ClusterConfig).CreateSession(...)
        /go/pkg/mod/github.com/gocql/gocql@v1.6.0/cluster.go:289

It looks like a race condition between (*refreshDebouncer).stop() and (*refreshDebouncer).flusher():

  1. stop() acquires d.mu and sets d.stopped to true
  2. flusher() exits the select at the top of its loop and blocks waiting to acquire d.mu
  3. stop() releases d.mu and tries to send on d.quit
  4. flusher() acquires d.mu and returns because d.stopped is true
  5. stop() blocks forever because d.quit is unbuffered and its only receiver has exited
