Skip to content

[rpc] Add Tracking to Subscriptions Made via HTTP for Unused Filter Eviction#18591

Open
onelapahead wants to merge 5 commits intoerigontech:mainfrom
kaleido-io:evictable-rpc-filters
Open

[rpc] Add Tracking to Subscriptions Made via HTTP for Unused Filter Eviction#18591
onelapahead wants to merge 5 commits intoerigontech:mainfrom
kaleido-io:evictable-rpc-filters

Conversation

@onelapahead
Copy link
Copy Markdown
Contributor

@onelapahead onelapahead commented Jan 8, 2026

This adds a missing feature from Geth: evicting stale filters via a background "timeout" loop. Similar to what was done in okx#777 but hopefully more efficient.

Meant to help avoid accumulating Go heap for unused filters which can further hurt performance over time.

Also adds metrics for keeping track of the number of filters we have so we can confirm they do not grow indefinitely.

…viction

Signed-off-by: hfuss <hayden.fuss@kaleido.io>
@onelapahead
Copy link
Copy Markdown
Contributor Author

onelapahead commented Jan 11, 2026

We've been soaking this in our nodes with success on ensuring unused filters are evicted overtime. This PR is ready for review now.

@onelapahead onelapahead marked this pull request as ready for review January 11, 2026 14:18
@yperbasis yperbasis added the RPC label Jan 13, 2026
Copy link
Copy Markdown
Member

@canepat canepat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this, it looks like a solid contribution.

LGTM, please rebase on main and avoid changing execution-spec-tests commit

Signed-off-by: hfuss <hayden.fuss@kaleido.io>
@onelapahead onelapahead requested a review from lupin012 as a code owner March 8, 2026 22:38
@onelapahead
Copy link
Copy Markdown
Contributor Author

Thanks @canepat - should be ready for review now. We'll start soaking this on a cherry picked version of v3.3.8 including this change.

@onelapahead onelapahead requested a review from canepat March 9, 2026 12:11
Copy link
Copy Markdown
Member

@yperbasis yperbasis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Log store leak on remote filter update failure

Severity: High

In unsubscribeLogsInternal, if the remote logs filter update fails, the function returns false without calling deleteLogStore(id). The filter has already been removed from logsFilters (via removeLogsFilter),
but its associated log store in ff.logsStores is never cleaned up — it will accumulate indefinitely.

The old code always reached deleteLogStore(id) regardless of the remote update outcome. Fix:

// In unsubscribeLogsInternal, before return false in the error path:
ff.deleteLogStore(id)
return false

Also, the return value change from true → false when the remote update fails is defensible (the operation partially failed), but callers should be aware that the local filter was removed even though false is
returned.


Race condition: TOCTOU in eviction loop

Severity: Medium

The eviction uses a two-phase approach:

  1. Phase 1: Range over the SyncMap, calling ShouldEvict() and collecting IDs
  2. Phase 2: Iterate collected IDs, call unsubscribe*Internal()

Between phases, a client can call TouchSubscription() (resetting lastAccess via atomic store), but phase 2 does not re-check ShouldEvict(). An actively-polled subscription could be evicted.

The SyncMap uses sync.RWMutex — Range holds RLock, but releases it before phase 2. The atomic lastAccess update in Touch is invisible to the already-evaluated ShouldEvict.

Fix: Re-check ShouldEvict inside the phase-2 loop before unsubscribing:

for _, id := range headsToEvict {
if sub, ok := ff.headsSubs.Get(id); ok {
if !sub.ShouldEvict(timeout) {
continue // Touched since we collected it
}
// ... proceed with eviction
}
}

This doesn't eliminate the race entirely (another TOCTOU window between the re-check and unsubscribe), but it shrinks the window from O(full Range iteration) to O(single unsubscribe call), which is practical
given that the timeout is typically minutes.


Race condition: Double-close on concurrent unsubscribe

Severity: Low (mitigated by existing guard)

The eviction loop and a client-initiated UnsubscribeHeads can race on the same ID:

  1. Eviction: Get(id) → gets sub reference
  2. Client: Get(id) → gets same sub reference
  3. Eviction: Delete(id) + sub.Close()
  4. Client: sub.Close() again

The chan_sub.Close() has a if s.closed { return } guard under a mutex, which prevents a panic from double-closing the channel. However, both goroutines call decrementMetrics, leading to a double-decrement of
the gauge — the active subscription count would go negative.

Fix: The UnsubscribeHeads public method should handle the case where unsubscribeHeadsInternal returns false (already evicted) by skipping the decrementMetrics call — which it already does. So this is safe as
written for metrics. The Close() double-call is protected by the mutex guard. Low severity, but worth a comment.


Metrics: subscriptions renamed to subscriptions_active

Severity: Low (operational)

The gauge metric name changes from subscriptions to subscriptions_active, and gains a protocol label dimension. This is a breaking change for any existing Prometheus dashboards or alerting rules referencing
subscriptions{filter=...}.

Investigation shows the old activeSubscriptionsGauge was defined but never actually used in the codebase — so this is effectively creating a new metric, not breaking an existing one. The rename is fine.

The getSubscriptionCounter and getReapedCounter functions use embedded labels in metric names (e.g., subscriptions_created_total{filter="logs",protocol="http"}). This pattern is correct — the Erigon metrics
library (diagnostics/metrics/parsing.go) explicitly parses embedded {label="value"} syntax from metric names, and this pattern is used widely (e.g., in execution/commitment/commitment.go).


Summary

  │                        Issue                         │ Severity │               Type                │                                                                                                           
  ├──────────────────────────────────────────────────────┼──────────┼───────────────────────────────────┤
  │ Log store leak on remote update failure              │ High     │ Bug                               │                                                                                                           
  ├──────────────────────────────────────────────────────┼──────────┼───────────────────────────────────┤
  │ TOCTOU: eviction after Touch                         │ Medium   │ Race condition                    │                                                                                                           
  ├──────────────────────────────────────────────────────┼──────────┼───────────────────────────────────┤
  │ Double-close / double-decrement potential            │ Low      │ Race condition (mostly mitigated) │                                                                                                           
  ├──────────────────────────────────────────────────────┼──────────┼───────────────────────────────────┤                                                                                                           
  │ SetProtocol not thread-safe                          │ Low      │ Data race                         │
  ├──────────────────────────────────────────────────────┼──────────┼───────────────────────────────────┤                                                                                                           
  │ Metric rename (subscriptions → subscriptions_active) │ Low      │ Operational / breaking dashboards │
  ├──────────────────────────────────────────────────────┼──────────┼───────────────────────────────────┤                                                                                                           
  │ Exported DecrementMetrics unused                     │ Nit      │ Dead code                         │
  ├──────────────────────────────────────────────────────┼──────────┼───────────────────────────────────┤                                                                                                           
  │ 0 * time.Minute                                      │ Nit      │ Style                             │
  ├──────────────────────────────────────────────────────┼──────────┼───────────────────────────────────┤                                                                                                           
  │ Verbose per-touch Trace logging                      │ Nit      │ Performance                       │
  └──────────────────────────────────────────────────────┴──────────┴───────────────────────────────────┘                                                                                                           

The log store leak is the only must-fix before merge. The TOCTOU race should ideally be addressed (re-check before evicting), but given the timeout is on the order of minutes, it's unlikely to cause real issues
in practice. The rest are minor.

@yperbasis yperbasis added this to the 3.5.0 milestone Mar 21, 2026
@yperbasis yperbasis modified the milestones: 3.5.0, 3.6.0 May 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants