
Syscall Timeout for evtNext in Windows Event Log Receiver #47576

@pankaj101A

Description


Component(s)

receiver/windowseventlog

Is your feature request related to a problem? Please describe.

The evtNext function in api.go wraps the Windows EvtNext syscall. The current caller in subscription.go invokes it with timeout=0.

Timeout=0 tells the Windows API "don't wait for new events, return immediately." However, this parameter only controls the event-wait phase. It does not bound the time spent in the execution phase — including RPC transport, kernel I/O, storage driver operations, and security descriptor checks. These layers can hang indefinitely regardless of the timeout value.

This issue becomes significantly more severe in enterprise deployments where the OpenTelemetry Collector is configured to auto-discover and collect from Active Directory Domain Controllers (DCs).

Domain Controllers are especially problematic here because of the high number of evtNext calls they generate.
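As a rough illustration, assume (purely for the sake of argument; this issue does not measure it) that each evtNext call hangs independently with some small probability p. The chance that at least one of N calls hangs is then:

$$
P_{\text{hang}}(N) = 1 - (1 - p)^{N} \approx 1 - e^{-pN} \quad \text{(for small } p\text{)}
$$

For a fixed p the risk grows with N, so a host that issues far more calls, such as a busy Domain Controller, reaches a hang sooner. With hypothetical values p = 10^-6 and N = 10^6, the probability is about 1 - e^-1 ≈ 0.63.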

Describe the solution you'd like

Add a Go-level context-aware timeout wrapper at the Subscription level in subscription.go, leaving api.go unchanged:

Something like the following:

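```go
import (
	"context"
	"errors"
	"time"
	"unsafe"

	"go.uber.org/zap"
)

var (
	errSyscallTimeout           = errors.New("evtNext syscall safety timeout exceeded")
	defaultSyscallSafetyTimeout = 10 * time.Second
)

// evtNextResult holds the outcome of one evtNext call made on private buffers.
type evtNextResult struct {
	handles []uintptr
	n       uint32
	err     error
}

// evtNextWithTimeout bounds how long the caller waits for evtNext. A blocked syscall cannot
// be interrupted from Go, so on timeout or cancellation the worker goroutine is abandoned
// (orphaned), not stopped. Assumes new Subscription fields syscallSafetyTimeout (time.Duration),
// orphanedGoroutines (atomic.Int64), logger (*zap.Logger) and channel (string); eventsSize > 0.
func (s *Subscription) evtNextWithTimeout(
	ctx context.Context,
	resultSet uintptr,
	eventsSize uint32,
	events *uintptr,
	timeout, flags uint32,
	returned *uint32,
) error {
	done := make(chan evtNextResult, 1) // buffered: an orphaned worker can still send and exit
	go func() {
		// Private buffers: a late return must not clobber the caller's reused events slice.
		handles := make([]uintptr, eventsSize)
		var n uint32
		err := evtNext(resultSet, eventsSize, &handles[0], timeout, flags, &n)
		done <- evtNextResult{handles: handles, n: n, err: err}
	}()

	safetyTimeout := s.syscallSafetyTimeout
	if safetyTimeout == 0 {
		safetyTimeout = defaultSyscallSafetyTimeout
	}
	// Windows-side wait (timeout is in milliseconds) plus a margin for the execution phase.
	totalTimeout := time.Duration(timeout)*time.Millisecond + safetyTimeout
	timer := time.NewTimer(totalTimeout)
	defer timer.Stop()

	select {
	case r := <-done:
		copy(unsafe.Slice(events, eventsSize), r.handles)
		*returned = r.n
		return r.err
	case <-ctx.Done():
		s.abandon(done)
		return ctx.Err()
	case <-timer.C:
		s.abandon(done)
		s.logger.Warn("evtNext did not return within safety timeout; goroutine orphaned",
			zap.Duration("safety_timeout", totalTimeout),
			zap.Int64("orphaned_goroutines", s.orphanedGoroutines.Load()),
			zap.String("channel", s.channel),
		)
		return errSyscallTimeout
	}
}

// abandon counts an orphaned call and, if it ever returns, closes the event handles it
// produced so they do not leak. Assumes the existing evtClose helper in api.go.
func (s *Subscription) abandon(done <-chan evtNextResult) {
	s.orphanedGoroutines.Add(1)
	go func() {
		r := <-done
		for _, h := range r.handles[:r.n] {
			_ = evtClose(h)
		}
		s.orphanedGoroutines.Add(-1)
	}()
}
```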

Describe alternatives you've considered

No response

Additional context

No response
