
Conversation

@RagingRedRiot (Contributor) commented Nov 21, 2025

Rewrite Mimecast adapter with multi-API support

  • Add OAuth 2.0 token caching with automatic refresh
  • Implement concurrent fetching across 5 API endpoints (audit events, attachment, impersonation, URL, DLP)
  • Add configurable base URL, initial lookback, and worker concurrency
  • Improve rate limiting with Retry-After header support
  • Add graceful shutdown with proper context cancellation
  • Implement hash-based deduplication for logs without IDs
  • Handle nested log structures (e.g., attachmentLogs arrays)
  • Add per-API enable/disable on 403 Forbidden responses
  • Replace simple loop with semaphore-controlled fetch cycles

Type of change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Note

Replaces the Mimecast adapter with an OAuth-backed, concurrent multi-API fetcher, adds new config, robust retry/rate-limit handling, deduping, and graceful shutdown.

  • Adapter rewrite:
    • Switch to OAuth 2.0 with cached tokens and singleflight refresh; remove getAuthToken and baseURL const.
    • Concurrent fetching via errgroup across APIs: auditEvents, attachment, impersonation, url, dlp.
    • New fetch loop (queryInterval) with per-API state (since, active, per-API dedupe); a rough sketch of this state appears after the note.
  • Config:
    • Add base_url, initial_lookback, max_concurrent_workers; validate and set sane defaults.
  • HTTP/transport:
    • New tuned http.Client and header handling; page size increased to 500.
  • Resilience & rate limiting:
    • Handle 401 (token reset/retry), 403 (disable API), 429 with Retry-After, and 5xx with capped backoff; 1h retry deadline.
  • Data handling:
    • ApiResponse.Data now []map[string]interface{}; support nested log arrays (e.g., attachmentLogs, urlLogs).
    • Hash-based deduplication when no ID field; per-API timestamped dedupe culling.
  • Shutdown & shipping:
    • Context-driven graceful shutdown with closeOnce/fetchOnce, chFetchLoop; improved Close() waiting.
    • submitEvents streams with backpressure handling and cancellation on prolonged buffer full.

Written by Cursor Bugbot for commit 7407d8a. This will update automatically on new commits.
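For reference, a purely illustrative sketch of the per-API state described in the note (field names are assumptions, not the adapter's actual definitions; IsActive mirrors the accessor mentioned later in the thread):

import (
    "sync"
    "time"
)

// apiState sketches the per-API bookkeeping described above.
type apiState struct {
    mu     sync.Mutex
    name   string               // e.g. "auditEvents", "attachment", "dlp"
    since  time.Time            // high-water mark, advanced only after a successful cycle
    active bool                 // flipped to false when the API returns 403 Forbidden
    seen   map[string]time.Time // dedupe hashes, timestamped so stale entries can be culled
}

// IsActive reads the flag under the lock so fetch loops and the shutdown
// check don't race on it.
func (s *apiState) IsActive() bool {
    s.mu.Lock()
    defer s.mu.Unlock()
    return s.active
}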

@maximelb (Contributor) commented:

Code Review Findings

Critical Issues

1. Token Refresh Race Condition (client.go:198-215)

Multiple goroutines can simultaneously call refreshOAuthToken when a token expires. The code unlocks before calling refresh (line 211), allowing all waiting goroutines to proceed with token refresh, causing unnecessary API calls and potential rate limiting.

// Multiple goroutines pass this check simultaneously
if a.oauthToken != "" && time.Now().Before(a.tokenExpiry) {
    token := a.oauthToken
    a.tokenMu.Unlock()  // <-- Unlocked here
    return map[string]string{...}, nil
}
a.tokenMu.Unlock()  // <-- All goroutines unlock and call refresh
return a.refreshOAuthToken(ctx)  // <-- Multiple simultaneous calls

Fix: Use double-checked locking or sync.Once per refresh cycle to ensure only one goroutine refreshes the token.
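As a minimal sketch of the double-checked-locking variant (the adapter type, currentToken, and refreshOAuthToken here are illustrative stand-ins, not the real code):

import (
    "context"
    "errors"
    "sync"
    "time"
)

type adapter struct {
    tokenMu     sync.Mutex // guards oauthToken / tokenExpiry
    refreshMu   sync.Mutex // serializes refreshes so only one goroutine hits the token endpoint
    oauthToken  string
    tokenExpiry time.Time
}

// currentToken returns the cached token if it is still valid.
func (a *adapter) currentToken() (string, bool) {
    a.tokenMu.Lock()
    defer a.tokenMu.Unlock()
    if a.oauthToken != "" && time.Now().Before(a.tokenExpiry) {
        return a.oauthToken, true
    }
    return "", false
}

func (a *adapter) getToken(ctx context.Context) (string, error) {
    if tok, ok := a.currentToken(); ok {
        return tok, nil // fast path: cached token still valid
    }
    a.refreshMu.Lock()
    defer a.refreshMu.Unlock()
    // Double check: another goroutine may have refreshed while we waited.
    if tok, ok := a.currentToken(); ok {
        return tok, nil
    }
    return a.refreshOAuthToken(ctx)
}

// refreshOAuthToken stands in for the real method, which performs the OAuth
// exchange and stores the new token and expiry under tokenMu.
func (a *adapter) refreshOAuthToken(ctx context.Context) (string, error) {
    return "", errors.New("not implemented in this sketch")
}

A later commit in this PR takes the singleflight route instead; both approaches collapse concurrent refreshes into a single call.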


2. Negative Token Expiry Duration (client.go:261)

If ExpiresIn is less than 60, this produces a negative duration, so the token is treated as already expired:

a.tokenExpiry = time.Now().Add(time.Duration(tokenResp.ExpiresIn-60) * time.Second)

Fix: Use max(tokenResp.ExpiresIn-60, 0) or handle short-lived tokens differently.
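A sketch of one way to do that (tokenDeadline is a hypothetical helper): apply the 60-second skew, but never produce a deadline in the past.

import "time"

// tokenDeadline refreshes a little early (60s skew) but never yields a
// deadline in the past, even for very short-lived tokens.
func tokenDeadline(now time.Time, expiresInSec int64) time.Time {
    lifetime := time.Duration(expiresInSec) * time.Second
    const skew = 60 * time.Second
    switch {
    case lifetime > skew:
        lifetime -= skew
    case lifetime > 0:
        lifetime /= 2 // short-lived token: keep at least half of its lifetime
    default:
        lifetime = 0 // defensive: treat non-positive ExpiresIn as already expired
    }
    return now.Add(lifetime)
}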


3. Unstable Hash-Based Deduplication (client.go:890-900)

json.Marshal(logMap) produces non-deterministic output due to Go map iteration order being random. The same event can generate different hashes, causing duplicates to be sent:

jsonBytes, err := json.Marshal(logMap)  // Map order is random in Go
hash := sha256.Sum256(jsonBytes)

Fix: Sort keys before marshaling or use a deterministic serialization method.
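A sketch of explicit key-sorted hashing (hashLog is a hypothetical helper; the adapter's own function may differ):

import (
    "crypto/sha256"
    "encoding/hex"
    "encoding/json"
    "sort"
)

// hashLog fingerprints a log by hashing its fields in sorted key order.
// Nested values are JSON-encoded; encoding/json itself sorts map keys, so
// the nested encoding is deterministic as well.
func hashLog(logMap map[string]interface{}) (string, error) {
    keys := make([]string, 0, len(logMap))
    for k := range logMap {
        keys = append(keys, k)
    }
    sort.Strings(keys)

    h := sha256.New()
    for _, k := range keys {
        v, err := json.Marshal(logMap[k])
        if err != nil {
            return "", err
        }
        h.Write([]byte(k))
        h.Write([]byte{0}) // separator to avoid key/value ambiguity
        h.Write(v)
        h.Write([]byte{0})
    }
    return hex.EncodeToString(h.Sum(nil)), nil
}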


4. Context Mismatch (client.go:150)

The USP client is created with the parent context while the rest of the adapter runs on the derived child context; cancelling the child (e.g., during adapter shutdown) therefore does not stop the USP client:

ctxChild, cancel := context.WithCancel(ctx)
a.uspClient, err = uspclient.NewClient(ctx, conf.ClientOptions)  // Should use ctxChild

Fix: Use ctxChild when creating the USP client.


5. Negative Retry-After Duration (client.go:652)

If retryAfterTime is in the past, time.Until() returns a negative duration, so the sleep is effectively skipped and the code can spin in a tight retry loop:

retryUntilTime := time.Until(retryAfterTime).Seconds()  // Can be negative
if err := sleepContext(a.ctx, time.Duration(retryUntilTime)*time.Second)

Fix: Add validation: if retryAfterTime.Before(time.Now()) { continue }
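Sketched as a small helper (retryAfterDelay is a made-up name), clamping to a minimum one-second wait so a stale header cannot cause a tight loop:

import "time"

// retryAfterDelay converts a parsed Retry-After target time into a sleep
// duration, never returning less than one second even if the header points
// to the past.
func retryAfterDelay(retryAfterTime time.Time) time.Duration {
    d := time.Until(retryAfterTime)
    if d < time.Second {
        return time.Second
    }
    return d
}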


High Priority Issues

6. No Validation for MaxConcurrentWorkers (client.go:130-132)

Accepts any positive value. Setting to 100,000 creates 100,000 goroutines:

if c.MaxConcurrentWorkers == 0 {
    c.MaxConcurrentWorkers = 10
}
// No upper bound check

Fix: Add reasonable upper limit (e.g., 100).


7. Retryable Errors Not Retried (client.go:732-740)

The Mimecast API returns Retryable: true for errors that should be retried, but the code treats all API errors as fatal:

errorMessages = append(errorMessages, fmt.Sprintf("%s: %s (retryable: %v)", 
    errDetail.Code, errDetail.Message, errDetail.Retryable))
}
// Returns error without checking Retryable field
return nil, fmt.Errorf("mimecast api errors: %v", errorMessages)

Fix: Check Retryable field and retry with backoff for retryable errors.
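A sketch of checking the flag before giving up (the apiError shape is an assumption based on the excerpt, not Mimecast's documented schema):

// apiError mirrors the fields referenced above; illustrative only.
type apiError struct {
    Code      string
    Message   string
    Retryable bool
}

// allRetryable reports whether every returned error is marked retryable,
// in which case the caller can back off and reissue the request instead of
// treating the response as fatal.
func allRetryable(errs []apiError) bool {
    if len(errs) == 0 {
        return false
    }
    for _, e := range errs {
        if !e.Retryable {
            return false
        }
    }
    return true
}

Responses where every error is retryable could then reuse the same capped-backoff path already used for 429 and 5xx.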


8. Inconsistent 5XX Error Handling (client.go:704-712 vs 714-720)

5XX errors return nil, nil (no error), while other non-200 statuses return an error. This prevents 5XX errors from being logged via the error channel in fetchApi:483:

if status >= 500 && status < 600 {
    return nil, nil  // Silent failure, no error propagated
}
if status != http.StatusOK {
    return allItems, err  // Error propagated
}

Fix: Return error for 5XX to ensure proper error tracking, or document this design choice.


Medium Priority Issues

11. Semaphore Hardcoded (client.go:370)

shipperSem is fixed at 2, ignoring the MaxConcurrentWorkers config:

shipperSem := make(chan struct{}, 2)  // Always 2, regardless of config

Fix: Make this configurable or document why it's separate from worker concurrency.


12. Unused Variable (client.go:418, 438-439)

count is incremented but never used; it is dead code.

count := 0
// ...
count += len(events)

Fix: Remove or use for metrics/logging.


13. Redundant querySucceeded Flag (client.go:531, 823, 828)

The flag is only set to true at line 823 and checked at line 828, so it is always true at the check point:

var querySucceeded bool
// ... lots of code ...
querySucceeded = true  // Line 823
if querySucceeded {    // Line 828 - always true here

Fix: Remove flag and simplify logic.

@maximelb (Contributor) left a comment:

Also getting the robot to post some relevant findings from its review as comments to the PR.

@maximelb (Contributor):

@RagingRedRiot Let me know if you prefer we pick up the PR from here and make mods vs you doing it.

@RagingRedRiot (Contributor, Author):

@maximelb I'm capable of making the updates, but I'm not protective of being the one to do them.

@RagingRedRiot (Contributor, Author):

This isn't actually true.

  1. No Validation for MaxConcurrentWorkers (client.go:130-132)
    Accepts any positive value. Setting to 100,000 creates 100,000 goroutines:

if c.MaxConcurrentWorkers == 0 {
    c.MaxConcurrentWorkers = 10
}
// No upper bound check

Fix: Add reasonable upper limit (e.g., 100).

MaxConcurrentWorkers is only an upper bound on concurrent goroutines, enforced with a semaphore; it doesn't actually spawn that many goroutines. The only impact of a large value is a small amount of memory consumed by the larger channel backing the semaphore.
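A generic sketch of that semaphore pattern (not the adapter's actual loop): the buffered channel's capacity is the concurrency ceiling, and only goroutines that have acquired a slot exist at any moment.

import "sync"

// runWithLimit runs tasks with at most "limit" executing concurrently.
// The semaphore is just a buffered channel; a large capacity costs a bit of
// memory for the channel, not "limit" live goroutines.
func runWithLimit(limit int, tasks []func()) {
    sem := make(chan struct{}, limit)
    var wg sync.WaitGroup
    for _, task := range tasks {
        task := task
        sem <- struct{}{} // acquire: blocks once "limit" tasks are in flight
        wg.Add(1)
        go func() {
            defer wg.Done()
            defer func() { <-sem }() // release the slot
            task()
        }()
    }
    wg.Wait()
}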

@RagingRedRiot (Contributor, Author):

  1. Inconsistent 5XX Error Handling (client.go:704-712 vs 714-720)
    5XX errors return nil, nil (no error), other non-200 return error. This prevents 5XX errors from being logged via the error channel in fetchApi:483:

if status >= 500 && status < 600 {
    return nil, nil  // Silent failure, no error propagated
}
if status != http.StatusOK {
    return allItems, err  // Error propagated
}

Fix: Return error for 5XX to ensure proper error tracking, or document this design choice.

The code does log 5XX errors via a.conf.ClientOptions.OnError(err). However, it intentionally returns nil error to the caller to prevent treating transient server errors as fatal:

if status >= 500 && status < 600 {
    err := fmt.Errorf("mimecast server error: %d\nRESPONSE: %s", status, string(respBody))
    a.conf.ClientOptions.OnError(err)
    // We don't want this to be handled like an error
    // The hope is these errors are temporary
    if len(allItems) > 0 {
        return allItems, nil
    }
    return nil, nil
}

  • 5XX errors are typically transient (server overload, temporary outage)
  • Logging via OnError ensures visibility and monitoring
  • Returning nil error prevents the adapter from shutting down or treating the API as permanently failed
  • The next fetch cycle (30 seconds later) will retry automatically
  • If partial data was collected (allItems), it's still returned and shipped

@RagingRedRiot (Contributor, Author):

@maximelb I believe I have addressed all code review findings.

@maximelb (Contributor):

/gcbrun

- Fix error variable shadowing bug where err from io.ReadAll was being
  shadowed by err from strconv.Atoi/http.ParseTime
- Fix mutex contention by not holding tokenMu during HTTP calls in
  refreshOAuthToken
- Fix silent error ignore in submitEvents for non-ErrorBufferFull errors
- Fix potential deadlock by using context cancellation instead of
  calling Close() from within fetch loop goroutines
- Fix tight loop when Retry-After time has already passed by adding
  minimum 1 second sleep
- Fix 5xx errors being swallowed - now properly returns error so
  api.since won't be updated and data won't be lost
- Fix struct tag alignment inconsistencies in MimecastConfig
- Fix generateLogHash to use JSON marshaling for deterministic hashing
  of complex/nested values
- Add shutdown check in submitEvents loop

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@cursor bot left a comment:

This PR is being reviewed by Cursor Bugbot.

When there was no Retry-After header or it couldn't be parsed,
retryAfterTime remained the zero value. The condition
retryAfterTime.Before(time.Now()) was always true for the zero value
(year 0001 is before current time), causing the code to incorrectly
enter the "time already passed" branch (1s wait) instead of the
"no header" branch (60s wait).

Fix by checking !retryAfterTime.IsZero() before the Before check
and restructure the conditions for clarity.

Also added comment documenting that InitialLookback defaults to zero.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
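Roughly the restructured decision described in that commit, as a sketch (retryWait is a hypothetical helper):

import "time"

// retryWait picks the sleep before retrying a 429 based on the parsed
// Retry-After time; a zero value means the header was missing or unparsable.
func retryWait(retryAfterTime time.Time) time.Duration {
    switch {
    case retryAfterTime.IsZero():
        return 60 * time.Second // no usable Retry-After header
    case !retryAfterTime.After(time.Now()):
        return time.Second // header points to the past: minimal wait, no tight loop
    default:
        return time.Until(retryAfterTime)
    }
}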
@maximelb (Contributor):

/gcbrun

- Replace complex nested goroutine structure with errgroup for cleaner
  concurrency control and automatic cancellation propagation
- Fix data race in shouldShutdown() by using api.IsActive() instead of
  direct field access
- Fix token refresh race condition with double-checked locking
- Fix Retry-After duration truncation by using time.Duration directly

The refactored RunFetchLoop is ~75 lines shorter and eliminates:
- 3 levels of nested goroutines
- 4 coordination channels (cycleSem, shipperSem, shipCh, shipDone)
- Multiple early exit paths that could leak goroutines

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
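Roughly the errgroup shape described in the commit above, with illustrative names:

import (
    "context"

    "golang.org/x/sync/errgroup"
)

// runFetchCycle launches one fetch per API; the derived context is cancelled
// as soon as any fetch fails, so the remaining goroutines stop promptly and
// Wait returns the first error.
func runFetchCycle(ctx context.Context, apis []string, fetch func(context.Context, string) error) error {
    g, gctx := errgroup.WithContext(ctx)
    for _, api := range apis {
        api := api // capture loop variable (pre-Go 1.22 loop semantics)
        g.Go(func() error {
            return fetch(gctx, api)
        })
    }
    return g.Wait()
}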
- Remove unused MaxConcurrentShippers config field
- Remove unused AuditLog type
- Add 1-hour retry deadline for 429 rate limiting
- Add 5XX retry with exponential backoff (30s-5m), 1h max
- Use singleflight for token refresh to prevent thundering herd
- Extend dedupe cleanup window from 60s to 1 hour
- Fix minor style issues (indentation, blank lines)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
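A sketch of the singleflight approach mentioned above, reusing the illustrative adapter, currentToken, and refreshOAuthToken names from the double-checked-locking sketch earlier in the thread:

import (
    "context"

    "golang.org/x/sync/singleflight"
)

// refreshGroup coalesces concurrent token refreshes: callers that all miss
// the cached token share a single in-flight refresh per key. (In real code
// this would more likely live as a field on the adapter.)
var refreshGroup singleflight.Group

func (a *adapter) getTokenSF(ctx context.Context) (string, error) {
    if tok, ok := a.currentToken(); ok {
        return tok, nil
    }
    v, err, _ := refreshGroup.Do("oauth-token", func() (interface{}, error) {
        tok, err := a.refreshOAuthToken(ctx)
        return tok, err
    })
    if err != nil {
        return "", err
    }
    return v.(string), nil
}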
- Increase Close() timeout from 10s to 2min to allow in-flight HTTP
  requests (60s timeout) and Ship() calls to complete gracefully
- Reset retry counters after each successful page fetch so each page
  gets a fresh retry budget

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Inline review comment on the following snippet from submitEvents:

    }
    // Handle non-ErrorBufferFull errors
    a.conf.ClientOptions.OnError(fmt.Errorf("Ship(): %v", err))
}

Bug: Non-buffer-full shipping errors don't trigger shutdown

When uspClient.Ship returns an error that isn't ErrorBufferFull, the error is logged on line 926 but the loop continues processing subsequent events. Other adapters in the codebase stop and signal shutdown when any non-recoverable ship error occurs. This allows the adapter to silently drop messages that fail to ship due to unexpected errors, continuing operation in a potentially broken state instead of failing cleanly.
