Skip to content

client_routes: simplify concurrency, fix stale entries, add sticky routing#850

Draft
sylwiaszunejko wants to merge 3 commits intoscylladb:masterfrom
sylwiaszunejko:client-routes-rv-lock
Draft

client_routes: simplify concurrency, fix stale entries, add sticky routing#850
sylwiaszunejko wants to merge 3 commits intoscylladb:masterfrom
sylwiaszunejko:client-routes-rv-lock

Conversation

@sylwiaszunejko
Copy link
Copy Markdown
Collaborator

@sylwiaszunejko sylwiaszunejko commented Apr 16, 2026

Summary

Replaces the complex resolved-route caching and background DNS resolution system in ClientRoutesHandler with a simpler design: store unresolved routes and resolve DNS on demand during TranslateHost. Adds scoped pruning to automatically clean up stale entries, and sticky route tracking to avoid unnecessary connection churn.

Motivation

The previous implementation maintained a ResolvedClientRoute type hierarchy with cached IP addresses, a ClientRoutesResolver interface, a DNS cache with configurable TTL, concurrent background resolution workers, and an atomic.Pointer with CAS-loop updates. This was complex, hard to reason about, and prone to stale-data bugs (e.g. decommissioned hosts lingering in the route list).

Fixes: #835

Comment thread client_routes.go Outdated
Comment thread client_routes.go Outdated
Comment thread session.go Outdated
@sylwiaszunejko sylwiaszunejko changed the title client_routes: simplify concurrency model and fix stale entry pruning client_routes: simplify concurrency, fix stale entries, add sticky routing Apr 16, 2026
@sylwiaszunejko sylwiaszunejko force-pushed the client-routes-rv-lock branch 3 times, most recently from 0a5f713 to 179bf31 Compare April 16, 2026 18:36
@sylwiaszunejko sylwiaszunejko force-pushed the client-routes-rv-lock branch from 179bf31 to 73467e8 Compare April 17, 2026 06:36
@sylwiaszunejko sylwiaszunejko self-assigned this Apr 17, 2026
@sylwiaszunejko sylwiaszunejko force-pushed the client-routes-rv-lock branch from 73467e8 to 8de13e0 Compare April 17, 2026 06:39
Remove the entire ResolvedClientRoute type hierarchy, ClientRoutesResolver
interface, simpleClientRoutesResolver with DNS cache, and resolveAndUpdateInPlace.
Replace atomic.Pointer with sync.RWMutex for route storage.

Routes are now stored as unexported clientRouteList with Merge and FindByHostID
methods. TranslateHost resolves DNS on every call via DNSResolver.LookupIP
rather than maintaining cached resolved addresses.

Rename UnresolvedClientRoute/UnresolvedClientRouteList to unexported
clientRoute/clientRouteList since the resolved counterpart no longer exists.

Also deprecates ResolveHealthyEndpointPeriod, MaxResolverConcurrency, and
ResolverCacheDuration config fields and their WithXxx option functions.
@sylwiaszunejko sylwiaszunejko force-pushed the client-routes-rv-lock branch from 8de13e0 to dd49949 Compare April 17, 2026 08:05
Comment thread client_routes.go
Comment on lines 109 to 125
@@ -119,169 +115,54 @@ type UnresolvedClientRoute struct {
}

// Similar returns true if both records targets same host and connection id
func (r UnresolvedClientRoute) Similar(o UnresolvedClientRoute) bool {
func (r clientRoute) Similar(o clientRoute) bool {
return r.ConnectionID == o.ConnectionID && r.HostID == o.HostID
}

// Equal returns true if both records are exactly the same
func (r UnresolvedClientRoute) Equal(o UnresolvedClientRoute) bool {
func (r clientRoute) Equal(o clientRoute) bool {
return r == o
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓ Is it acceptable/common in Go to have types names in camelCase instead of CapitalizedCamelCase?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coming from Rust - where types are CapitalizedCamelCase - this is confusing.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in Go camelCase is for internal types and CapitalizedCamelCase is for public ones, I think UnresolvedClientRoute and ResolvedClientRoute were capitalized by mistake, I don't think anyone should use them externaly, but probably to not break the API I should left it as it is. Still I wanted to propose different naming and making it internal, maybe we could agree to break the API like that

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if not in this release then maybe it's something to do in v2

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. If it's API-breaking, then we technically mustn't do this without releasing v2.
  2. Did any public API reference those types? If not, we may consider them "exported by mistake", without any practical implications - if no one had any reason to reference them, then no one should suffer from deleting them.
  3. I vote for renaming then. @Lorak-mmk WDYT?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's API-breaking, then we technically mustn't do this without releasing v2.

Is that the case? I remember @dkropachev saying that in gocql API breaks are fine in minor versions.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did any public API reference those types?

nope

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://kn100.me/making-breaking-changes-in-go/
I don't know how up-to-date and correct this article is, but it makes me feel uneasy...

Comment thread client_routes.go
Comment thread client_routes.go
Add scopeConnectionIDs and scopeHostIDs parameters to Merge so it can
prune stale entries before upserting:
- Both non-empty (partial update): prune entries matching BOTH lists
- Only scopeConnectionIDs (full refresh): prune all entries for those connections
- Both empty: additive-only merge, no pruning

updateHostPortMapping now passes connection/host IDs through to Merge,
enabling automatic cleanup of decommissioned hosts.
When a host has multiple routes (one per connection), remember which
connectionID was used on the first successful TranslateHost call and
prefer it on subsequent calls via findPreferredRoute. If the preferred
route is removed (e.g. connection pruned), fall back to FindByHostID.

This avoids unnecessary connection churn when multiple PrivateLink
endpoints serve the same host.
Comment thread client_routes.go
}

type UnresolvedClientRouteList []UnresolvedClientRoute
type clientRouteList []clientRoute
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💭 Why do we operate on slice instead of map?

  1. Slice makes lookup quadratic, which is acceptable only because client routes-related operations only take place on connecting, not on executing statements.
  2. Map is conceptually better to represent the mapping of host ids to client routes entries.
  3. See the Rust Driver's impl for how use of a map improves the logic.

Comment thread client_routes.go
Comment on lines +140 to +151
func (l *clientRouteList) Len() int {
return len(*l)
}

type ResolvedClientRoute struct {
updateTime time.Time
UnresolvedClientRoute
allKnownIPs []net.IP
currentIP net.IP
forcedResolve bool
}

func (r ResolvedClientRoute) String() string {
var ip string
if r.currentIP == nil {
ip = "<nil>"
} else {
ip = r.currentIP.String()
}

return fmt.Sprintf(
"ResolvedClientRoute{ConnectionID=%s, HostID=%s, Address=%s, CQLPort=%d, SecureCQLPort=%d, CurrentIP=%s}",
r.ConnectionID,
r.HostID,
r.Address,
r.CQLPort,
r.SecureCQLPort,
ip,
)
}

func (r ResolvedClientRoute) Clone() ResolvedClientRoute {
res := r
if res.allKnownIPs != nil {
res.allKnownIPs = make([]net.IP, 0, len(r.allKnownIPs))
for _, ip := range r.allKnownIPs {
res.allKnownIPs = append(res.allKnownIPs, slices.Clone(ip))
}
}
if len(res.currentIP) != 0 {
copy(res.currentIP, r.currentIP)
res.currentIP = slices.Clone(res.currentIP)
}
return res
}

// Newer returns true if o is newer than r
func (r ResolvedClientRoute) Newer(o ResolvedClientRoute) bool {
if len(r.currentIP) == 0 && len(o.currentIP) != 0 {
return true
}
if len(r.allKnownIPs) == 0 && len(o.allKnownIPs) != 0 {
return true
// Merge upserts entries from incoming into the list.
// Before upserting it prunes stale entries within the query scope defined by
// scopeConnectionIDs and scopeHostIDs:
// - Both non-empty (partial update): prune entries matching BOTH lists.
// - Only scopeConnectionIDs (full refresh): prune all entries for those connections.
//
// scopeConnectionIDs must not be empty.
func (l *clientRouteList) Merge(incoming clientRouteList, scopeConnectionIDs, scopeHostIDs []string) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As clientRouteList is now a private type, shouldn't the methods be private, too? (camelCase, not CapitalizedCamelCase).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Client routes: preserve explicit stickiness semantics and prune stale routes on partial CLIENT_ROUTES_CHANGE updates

3 participants