
[AAP-73135] Fix Segment event loss by enabling sync_mode#383

Merged
cshiels-ie merged 9 commits into ansible:devel from cshiels-ie:AAP-73135-sync-mode-fix
Apr 30, 2026

Conversation

cshiels-ie (Contributor) commented Apr 28, 2026

Root cause

Segment silently drops events from batch POSTs that exceed 500 KB and returns HTTP 200, making the loss invisible — no error callback fires. The SDK queues all track() calls and flush() sends them as a single batch POST. With 15 chunks at ~25 KB data each, the actual HTTP body (including ~2–3 KB of per-event SDK metadata — anonymousId, timestamp, context, messageId, integrations) pushes the batch well over 500 KB.
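The undercount is easiest to see by serializing a wrapped event next to its raw payload. The sketch below is illustrative: all field values are placeholders, and the exact envelope the SDK builds varies by version.

```python
import json

# Illustration of why estimating batch size from raw data alone
# undercounts: the SDK wraps each track() payload in an envelope
# (anonymousId, timestamp, context, messageId, integrations) before
# batching. All field values below are placeholders.

raw_data = {"rows": ["x" * 100 for _ in range(10)]}  # stand-in for one chunk

enriched_event = {
    "event": "analytics_chunk",  # illustrative event name
    "properties": raw_data,
    "anonymousId": "00000000-0000-0000-0000-000000000000",
    "timestamp": "2026-04-28T00:00:00+00:00",
    "context": {"library": {"name": "analytics-python", "version": "2.x"}},
    "messageId": "00000000-0000-0000-0000-000000000000",
    "integrations": {},
}

raw_size = len(json.dumps(raw_data).encode("utf-8"))
wire_size = len(json.dumps(enriched_event).encode("utf-8"))
print(raw_size, wire_size)  # the wire size is always larger than the raw data
```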

Fix

Set analytics.sync_mode = True on the client before sending. This makes every track() call a separate blocking HTTP request (~25 KB each) rather than queuing to a background thread for batching. Each request is well within Segment's 32 KB per-request limit.
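A minimal sketch of the shape of the fix. The real client is Segment's analytics-python module, which this PR configures via sync_mode, track(), and flush(); the stand-in object, event name, and property keys here are illustrative, not the repo's actual code.

```python
from types import SimpleNamespace

# Stand-in for Segment's module-level analytics client; only
# sync_mode, track(), and flush() mirror the real API surface.
sent = []
analytics = SimpleNamespace(
    sync_mode=False,
    track=lambda user_id, event, properties: sent.append(properties),
    flush=lambda: None,
)

def put(chunks):
    analytics.sync_mode = True  # every track() now blocks on its own HTTP POST
    for i, chunk in enumerate(chunks):
        analytics.track("metrics", "analytics_chunk", {"index": i, "data": chunk})
    analytics.flush()  # final safety net; sync_mode already delivered each event

put(["chunk-a", "chunk-b", "chunk-c"])
print(analytics.sync_mode, len(sent))  # True 3
```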

What was tried first

A batch-size tracking approach was implemented that called flush() before the accumulated data size exceeded 450 KB. End-to-end testing showed it still dropped events:

  • 17 chunks in one batch (~425 KB of data): only 12 arrived
  • 26 chunks split into two batches: only 17 arrived

The estimated 450 KB threshold did not account for the full SDK metadata overhead (~2–3 KB per event), so actual batch bodies still exceeded 500 KB. sync_mode=True eliminates batching entirely and is the only approach that delivered all chunks reliably in testing.
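The failure mode of the heuristic can be simulated. The per-event overhead below is illustrative, chosen so the simulation reproduces the observed 17-chunk drop (the PR's 2–3 KB estimate was itself an undercount); the point is the shape of the failure, not the exact figures.

```python
# Simulation of the abandoned heuristic: flush() whenever adding a chunk
# would push the accumulated *data* size past 450 KB. A batch the
# heuristic considers safe can still exceed the 500 KB wire limit.

DATA_LIMIT = 450_000  # heuristic's threshold on raw data
WIRE_LIMIT = 500_000  # Segment's batch POST limit
CHUNK_DATA = 25_000   # raw data per chunk, as in the PR
OVERHEAD = 4_500      # assumed per-event SDK metadata (illustrative)

batch_data = batch_wire = 0
overflowed = False
for _ in range(17):   # the 17-chunk case that dropped events
    if batch_data + CHUNK_DATA > DATA_LIMIT:
        batch_data = batch_wire = 0  # heuristic calls flush()
    batch_data += CHUNK_DATA
    batch_wire += CHUNK_DATA + OVERHEAD
    if batch_wire > WIRE_LIMIT:
        overflowed = True            # events past this point are dropped

print(overflowed)  # True: the heuristic never flushed, yet the wire size broke 500 KB
```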

End-to-end validation

Tested against a live Segment source using the exact payload shape produced by flatten_json_report / anonymize_rollups (102 job-type rows, 81 installed-collection rows, ~112 KB total JSON):

Mode                           | Chunks sent | Chunks received
sync_mode=False (async batch)  | 15          | 11–14 (flaky)
sync_mode=True                 | 15          | 15 ✓

Large payload stress test (26 chunks, ~650 KB data):

Mode                          | Chunks sent | Chunks received
Batch approach (450 KB limit) | 26          | 17
sync_mode=True                | 26          | 26 ✓

Tests

  • test_put_sync_mode_enabled — verifies analytics.sync_mode is True, all chunks are tracked, and flush() is called exactly once (final flush only — sync_mode handles per-track delivery)
  • Existing test_put_sends_multiple_chunks_for_large_data — confirms all chunks are tracked and a single flush() fires at the end
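A sketch of what those assertions look like against a mocked client. The real tests exercise StorageSegment.put() from metrics_utility/library/storage/segment.py; the stand-in put() here only mirrors the behavior under test.

```python
from unittest import mock

# Stand-in for StorageSegment.put(); illustrative, not the repo's code.
def put(analytics, chunks):
    analytics.sync_mode = True
    for chunk in chunks:
        analytics.track("metrics", "analytics_chunk", {"data": chunk})
    analytics.flush()

fake_analytics = mock.MagicMock()
put(fake_analytics, ["c1", "c2", "c3"])

assert fake_analytics.sync_mode is True      # sync_mode enabled before sending
assert fake_analytics.track.call_count == 3  # one track() per chunk
fake_analytics.flush.assert_called_once()    # single final flush
print("ok")
```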


Note

Medium Risk
Changes the delivery semantics of analytics emission from async/batched to synchronous per-event HTTP requests, which may impact performance and request timing while improving reliability.

Overview
Ensures Segment analytics uploads no longer rely on the SDK’s async batching by enabling analytics.sync_mode = True before emitting chunked track() events, preventing silent drops when batched payloads exceed Segment’s 500 KB limit.

Updates StorageSegment tests to assert sync_mode is enabled, that all chunks are tracked, and that flush() is still called exactly once; adds a new regression test covering the large-payload batching drop scenario.

Reviewed by Cursor Bugbot for commit 9f3eac0.

coderabbitai (Bot) commented Apr 28, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 4ce6e993-3c4f-492a-906b-e7df1667a987

📥 Commits

Reviewing files that changed from the base of the PR and between 9906c7b and 9f3eac0.

📒 Files selected for processing (2)
  • metrics_utility/library/storage/segment.py
  • metrics_utility/test/library/test_storage_segment.py

📝 Walkthrough

Summary by CodeRabbit

  • Improvements

    • Analytics events are now transmitted immediately rather than batched, ensuring faster data delivery to analytics services.
  • Tests

    • Enhanced test coverage with stricter validation of event transmission behavior and improved handling of large data sets.

Walkthrough

The pull request modifies Segment integration in the put method to enable sync_mode, causing each analytics.track() call to execute as a blocking HTTP request instead of being queued asynchronously. Corresponding test updates verify the synchronous behavior and payload correctness.

Changes

Cohort / File(s) Summary
Segment Synchronous Mode Implementation
metrics_utility/library/storage/segment.py
Enables sync_mode on the analytics instance within the put method, changing event tracking from asynchronous batching to synchronous HTTP requests.
Segment Tests
metrics_utility/test/library/test_storage_segment.py
Strengthened existing test assertions to verify sync_mode is enabled and uses call_count checks. Added new test to confirm large inputs return chunks with sync_mode=True ensuring one track() call per chunk and a single flush() at completion.

Sequence Diagram(s)

sequenceDiagram
    participant Client as StorageSegment.put()
    participant SDK as Segment SDK<br/>(sync_mode=True)
    participant HTTP as HTTP Layer

    rect rgba(100, 200, 100, 0.5)
    Note over Client,HTTP: New Behavior (Sync Mode)
    end

    loop For each chunk
        Client->>SDK: analytics.track(event, properties)
        activate SDK
        SDK->>HTTP: POST (blocking request)
        HTTP-->>SDK: Response
        deactivate SDK
        SDK-->>Client: Return (waits for completion)
    end

    Client->>SDK: analytics.flush()
    activate SDK
    SDK->>HTTP: Final flush
    HTTP-->>SDK: Response
    deactivate SDK

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Title check (✅ Passed): The title clearly and concisely describes the main change: enabling sync_mode in Segment to fix event loss, with the Jira ticket reference.
  • Description check (✅ Passed): The description covers root cause, fix, testing details, and end-to-end validation results, though it lacks the formal template sections and testing prerequisites/steps.
  • Docstring Coverage (✅ Passed): Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.
  • Linked Issues check (✅ Passed): Skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): Skipped because no linked issues were found for this pull request.


Replace time.sleep workaround with analytics.sync_mode = True so each
track() call sends synchronously rather than queuing to a background
thread, eliminating the race condition where the process exits before
the background thread finishes flushing.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@cshiels-ie cshiels-ie force-pushed the AAP-73135-sync-mode-fix branch from a7393bc to 3ecfaf2 on April 28, 2026 at 11:21
cshiels-ie and others added 2 commits April 28, 2026 12:54
Segment silently drops events from batch POSTs that exceed 500KB and
returns HTTP 200, making the loss invisible to on_error callbacks. Fix
by tracking accumulated batch size and calling flush() before adding a
chunk that would push the batch over 450KB (leaving headroom for the
per-event metadata the SDK appends).

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Cover two new cases:
- all chunks fit in one batch (flush called once, final only)
- chunks exceed BATCH_SIZE_LIMIT, triggering mid-loop flush (flush
  called more than once, all chunks still tracked)

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@cshiels-ie cshiels-ie changed the title from "Aap 73135 sync mode fix" to "[AAP-73135] Fix Segment event loss by flushing before 500KB batch limit" on Apr 28, 2026
…ufficient

Testing showed the batch-limit heuristic still dropped events:
- 17 chunks in one batch (~425 KB tracked): only 12 arrived
- 26 chunks split into two batches: only 17 arrived

Root cause: the SDK adds ~2-3 KB of per-event metadata (context,
timestamps, messageId, integrations) that our data-size estimate did
not account for, pushing actual batch bodies over Segment's 500 KB
limit despite our 450 KB threshold.

sync_mode=True sends each track() as a separate blocking HTTP request
(~25 KB each) instead of batching, which eliminates the batch-size
problem entirely. Local end-to-end testing confirmed all 15 chunks
arrive reliably with sync_mode=True.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@cshiels-ie cshiels-ie changed the title from "[AAP-73135] Fix Segment event loss by flushing before 500KB batch limit" to "[AAP-73135] Fix Segment event loss by enabling sync_mode" on Apr 28, 2026
cshiels-ie and others added 3 commits April 28, 2026 13:16
Assert both sync_mode=True and flush.call_count==1, confirming that
sync_mode handles per-track delivery and no mid-loop batch flushing
is running alongside it.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Remove stale top comment
- Add sync_mode assertion to test_put_sends_data_to_segment so every
  put() test verifies the mode is set
- Rename test_put_sync_mode_enabled -> test_put_sync_mode_no_batch_drops
  and document the confirmed end-to-end result (15/15 chunks received
  vs 11-14 without sync_mode)

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Each chunk is ~25 KB of JSON but compresses to ~3 KB (87% reduction)
due to repeated keys across items. With sync_mode sending one HTTP
request per chunk, gzip significantly reduces per-request transfer
time and overall upload duration.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@cshiels-ie cshiels-ie marked this pull request as ready for review April 28, 2026 15:50
Segment's tracking API returns HTTP 200 but discards events when the
request body is gzip-encoded, resulting in 0 events received despite
the SDK reporting success. gzip=True is not a viable optimisation.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@cshiels-ie cshiels-ie requested a review from himdel April 29, 2026 16:34
@sonarqubecloud

@cshiels-ie cshiels-ie merged commit 4474a6c into ansible:devel Apr 30, 2026
5 checks passed
@cshiels-ie cshiels-ie deleted the AAP-73135-sync-mode-fix branch April 30, 2026 13:55