Skip to content

Conversation

@ofekshenawa
Copy link
Collaborator

@ofekshenawa ofekshenawa commented Dec 3, 2025

This PR implements Observability (Metrics) for go-redis, adding comprehensive OpenTelemetry metrics support following the OpenTelemetry Database Client Semantic Conventions.

The implementation uses a Bridge Pattern to keep the core library dependency-free while providing optional, metrics instrumentation through the new extra/redisotel-native package.

What's Included:
Metric groups covering aspects of Redis operations:

  • Command metrics: Operation duration with retry tracking
  • Connection basic: Connection count and creation time
  • Resiliency: Errors, handoffs, timeout relaxation
  • Connection advanced: Wait time and use time
  • Pubsub metrics: Published and received messages
  • Stream metrics: Processing duration and maintenance notifications

@ofekshenawa ofekshenawa marked this pull request as ready for review December 3, 2025 15:06
Copy link
Member

@ndyakov ndyakov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

submitting partial review

Comment on lines 35 to 71
// Global callback for connection state changes (set by otel package)
connectionStateChangeCallback func(ctx context.Context, cn *Conn, fromState, toState string)

// Global callback for connection creation time (set by otel package)
connectionCreateTimeCallback func(ctx context.Context, duration time.Duration, cn *Conn)

// Global callback for connection relaxed timeout changes (set by otel package)
// Parameters: ctx, delta (+1/-1), cn, poolName, notificationType
connectionRelaxedTimeoutCallback func(ctx context.Context, delta int, cn *Conn, poolName, notificationType string)

// Global callback for connection handoff (set by otel package)
// Parameters: ctx, cn, poolName
connectionHandoffCallback func(ctx context.Context, cn *Conn, poolName string)

// Global callback for error tracking (set by otel package)
// Parameters: ctx, errorType, cn, statusCode, isInternal, retryAttempts
errorCallback func(ctx context.Context, errorType string, cn *Conn, statusCode string, isInternal bool, retryAttempts int)

// Global callback for maintenance notifications (set by otel package)
// Parameters: ctx, cn, notificationType
maintenanceNotificationCallback func(ctx context.Context, cn *Conn, notificationType string)

// Global callback for connection wait time (set by otel package)
// Parameters: ctx, duration, cn
connectionWaitTimeCallback func(ctx context.Context, duration time.Duration, cn *Conn)

// Global callback for connection use time (set by otel package)
// Parameters: ctx, duration, cn
connectionUseTimeCallback func(ctx context.Context, duration time.Duration, cn *Conn)

// Global callback for connection timeouts (set by otel package)
// Parameters: ctx, cn, timeoutType
connectionTimeoutCallback func(ctx context.Context, cn *Conn, timeoutType string)

// Global callback for connection closed (set by otel package)
// Parameters: ctx, cn, reason, err
connectionClosedCallback func(ctx context.Context, cn *Conn, reason string, err error)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This global callbacks are not thread safe. Someone may set (or set nil - remove) the callback while a check on its existence is happening. In the best case scenario - we will end up having undefined behaviour, worst case - nil pointer panics. We should fix this with either a global callback mutex or atomic values here. I do think a RWMutex will be an easier and cleaner approach.

if err := p.waitTurn(ctx); err != nil {
// Record timeout if applicable
if err == ErrPoolTimeout && connectionTimeoutCallback != nil {
connectionTimeoutCallback(ctx, nil, "pool")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also record general error metric here?


// Global callback for error tracking (set by otel package)
// Parameters: ctx, errorType, cn, statusCode, isInternal, retryAttempts
errorCallback func(ctx context.Context, errorType string, cn *Conn, statusCode string, isInternal bool, retryAttempts int)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that there's only two places where this metric is exposed. The metric should expose ALL errors in a client including command execution failure.

otel.go Outdated
Comment on lines 109 to 115
// toConnInfo converts internal pool.Conn to public ConnInfo interface
func toConnInfo(cn *pool.Conn) ConnInfo {
if cn != nil {
return cn
}
return nil
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not needed.

Copy link
Member

@ndyakov ndyakov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left two comments that can be considered for all get/set operations. other than that the thread safety looks ok if we are willing to accept that there is a time window for which getters may have returned an old value. can we make sure if we remove the telemetry, this won't break the app? In the case when the old value is nil I assume we can accept a single event not being tracked.

Comment on lines 98 to 103
func getConnectionStateChangeCallback() func(ctx context.Context, cn *Conn, fromState, toState string) {
callbackMu.RLock()
cb := connectionStateChangeCallback
callbackMu.RUnlock()
return cb
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a slight chance that connectionStateChangeCallback can be changed after the runlick but before the use. If we agree that this timeframe is something we are willing to tolerate, this is fine.

Comment on lines 207 to 211
func SetConnectionTimeoutCallback(fn func(ctx context.Context, cn *Conn, timeoutType string)) {
callbackMu.Lock()
connectionTimeoutCallback = fn
callbackMu.Unlock()
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we are setting multiple callbacks (as we should in most cases) wouldn't it be better to have a single lock -> set all callbacks -> unlock block for them?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants