
Conversation


@nikagra nikagra commented Nov 12, 2025

Handle Dynamic Shard Topology Changes in ScyllaDB Connection Picker

Addresses #605.

Problem

The ScyllaDB connection picker (scyllaConnPicker) had two critical issues when handling dynamic shard topology changes during cluster operations:

  1. Panics on shard count changes: When a ScyllaDB node reported a shard count different from the cached value, the driver panicked with "invalid number of shards" during connection pooling
  2. Connection leaks during topology changes: The panic happened before any connection cleanup, so existing connections stayed in memory indefinitely, causing resource leaks. Recovery mechanisms then had to establish entirely new connection pools while the old connections remained unclosed.

Impact

  • Production crashes: Applications using gocql would panic and crash during cluster topology changes
  • Resource exhaustion: Connection leaks could eventually exhaust system resources
  • Performance degradation: Unnecessary connection cycling during topology changes may cause latency spikes

Solution

Implemented comprehensive shard topology change handling in scyllaConnPicker.Put().
Original code:

if nrShards != len(p.conns) {
    if nrShards != p.nrShards {
        panic(fmt.Sprintf("scylla: %s invalid number of shards", p.address))  // Immediate panic
    }
    // Array resize logic (only for non-topology changes)
    conns := p.conns
    p.conns = make([]*Conn, nrShards)
    copy(p.conns, conns)
}

New implementation:

if nrShards != p.nrShards {
    // Handle topology change gracefully with connection migration
    p.handleShardTopologyChange(conn, nrShards)
} else if nrShards != len(p.conns) {
    // Handle array size mismatch without topology change
    conns := p.conns
    p.conns = make([]*Conn, nrShards)
    copy(p.conns, conns)
}

Connection migration (sketch):

func (p *scyllaConnPicker) handleShardTopologyChange(newConn *Conn, newShardCount int) {
    oldConns := p.conns
    newConns := make([]*Conn, newShardCount)
    toClose := make([]*Conn, 0)
    migratedCount := 0

    // Preserve existing connections during topology changes
    for i, conn := range oldConns {
        if conn != nil && i < newShardCount {
            newConns[i] = conn  // Migrate valid connections
            migratedCount++
        } else if conn != nil {
            toClose = append(toClose, conn)  // Track excess for proper cleanup
        }
    }

    // Adopt the new topology; newConn itself is stored into its shard slot by the
    // regular Put path (elided here), and migratedCount is available for logging
    p.conns = newConns
    p.nrShards = newShardCount

    // Close excess connections in background to prevent blocking
    if len(toClose) > 0 {
        go closeConns(toClose...)
    }
}

Key Changes:

  1. handleShardTopologyChange() intelligently migrates connections during topology transitions
  2. Separate logic for topology changes vs array size mismatches
  3. Existing connections are migrated rather than lost during topology changes
  4. Excess connections are closed in background goroutines when the shard count decreases, so Put() is never blocked on cleanup (see the closeConns sketch below)
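
For reference, a minimal sketch of what such a background close helper could look like; the actual closeConns helper in the driver may differ:

func closeConns(conns ...*Conn) {
    // Close each connection; callers run this in a goroutine so that
    // Put() never blocks on network teardown.
    for _, conn := range conns {
        if conn != nil {
            conn.Close()
        }
    }
}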

Testing

  • All existing tests pass
  • New comprehensive unit test suite covers topology change scenarios (an illustrative sketch follows)
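
An example of the kind of scenario such a test might exercise, written against the handleShardTopologyChange sketch above; the test name, setup, and the nil connection are illustrative assumptions, not the actual test code in this PR:

func TestHandleShardTopologyChangeShrinksPool(t *testing.T) {
    p := &scyllaConnPicker{
        address:  "127.0.0.1:9042",
        nrShards: 4,
        conns:    make([]*Conn, 4), // empty shard slots are enough for this check
    }

    // Simulate the node reporting fewer shards than cached.
    p.handleShardTopologyChange(nil, 2)

    if p.nrShards != 2 || len(p.conns) != 2 {
        t.Fatalf("expected picker resized to 2 shards, got nrShards=%d len(conns)=%d",
            p.nrShards, len(p.conns))
    }
}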

@nikagra nikagra requested a review from dkropachev November 12, 2025 15:17
@nikagra nikagra self-assigned this Nov 12, 2025

mykaul commented Nov 12, 2025

Addressing #605:

  • refactor: update ConnPicker interface to return error from Put method

It'd be great to explain the problem, the impact and the solution.


dkropachev commented Nov 12, 2025

I don't think you need to resort to only returning an error in such a case.
We need to make sure that the pool gets adjusted to the new number of shards.
One of the ways to do that is to kill the node pool and create a new one.


nikagra commented Nov 13, 2025

I don't think you need to resort to only returning an error in such a case. We need to make sure that the pool gets adjusted to the new number of shards. One of the ways to do that is to kill the node pool and create a new one.

I wanted to discuss this in more detail with you. A few options were discussed on Monday. Returning errors instead of panics was mentioned as a necessary first step, as far as I remember.


nikagra commented Nov 13, 2025

Addressing #605:

  • refactor: update ConnPicker interface to return error from Put method

It'd be great to explain the problem, the impact and the solution.

@mykaul I will update the PR with this information before moving it out of draft

@nikagra nikagra force-pushed the invalid-number-of-shards branch 2 times, most recently from baaf1a8 to ffd3eb7 on November 15, 2025 00:09
- Implement connection migration during shard count changes
- Preserve existing connections during shard count increases
- Close excess connections when shard count decreases
- Add comprehensive unit tests for topology change scenarios

Fixes: "invalid number of shards" panic during connection pooling

Co-authored-by: Dmitry Kropachev <[email protected]>
@nikagra nikagra force-pushed the invalid-number-of-shards branch from ffd3eb7 to 26a1b0e on November 15, 2025 00:17

nikagra commented Nov 15, 2025

It'd be great to explain the problem, the impact and the solution.

@mykaul Please check now.

@nikagra nikagra marked this pull request as ready for review November 15, 2025 00:51
@nikagra nikagra requested a review from dkropachev November 15, 2025 00:52

@dkropachev dkropachev left a comment


Looks great, a couple of minor changes.

pool.initConnPicker(conn)
pool.connPicker.Put(conn)
if err := pool.connPicker.Put(conn); err != nil {
    return err

@dkropachev dkropachev Nov 16, 2025


You need to close the connection if Put has failed. Look at the Put code; the connection is probably being closed there too, so you need to extract all connection-closing code out to here, if possible.
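
A minimal sketch of the caller-side handling being suggested here; the surrounding pool code and error path are assumptions based on the diff context above:

if err := pool.connPicker.Put(conn); err != nil {
    // Close here rather than inside Put, so all connection cleanup
    // lives with the caller.
    conn.Close()
    return err
}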

    p.logger.Printf("scylla: %s shard count changed from %d to %d, rebuilding connection pool",
        p.address, p.nrShards, nrShards)
}
p.handleShardTopologyChange(conn, nrShards)


Suggested change
p.handleShardTopologyChange(conn, nrShards)
p.handleShardCountChange(conn, nrShards)



Also, before doing so, please check that the new shard count is not zero; if it is zero you need to drop the connection.

We also need to consider that it could persist, which may lead to a connection storm.
Let's address that here: #614
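
A sketch of the zero-shard guard being requested; its placement inside Put and the exact error handling are assumptions:

if nrShards == 0 {
    // A node should never report zero shards; drop the connection
    // instead of rebuilding the pool around an impossible topology.
    conn.Close()
    return fmt.Errorf("scylla: %s reported zero shards", p.address)
}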

Comment on lines +532 to +539
for i, conn := range oldConns {
    if conn != nil && i < newShardCount {
        newConns[i] = conn
        migratedCount++
    } else if conn != nil {
        toClose = append(toClose, conn)
    }
}


Suggested change
for i, conn := range oldConns {
    if conn != nil && i < newShardCount {
        newConns[i] = conn
        migratedCount++
    } else if conn != nil {
        toClose = append(toClose, conn)
    }
}

for i, conn := range oldConns {
    if conn == nil {
        continue
    }
    if i < newShardCount {
        newConns[i] = conn
        migratedCount++
    } else {
        toClose = append(toClose, conn)
    }
}

@dkropachev

Closed in favor of #619

@dkropachev dkropachev closed this Nov 24, 2025