
[AAP-73135] Fix Segment event loss by enabling sync_mode#383

Merged
cshiels-ie merged 9 commits into ansible:devel from cshiels-ie:AAP-73135-sync-mode-fix
Apr 30, 2026

Conversation

cshiels-ie (Contributor) commented Apr 28, 2026

Root cause

Segment silently drops events from batch POSTs that exceed 500 KB and returns HTTP 200, making the loss invisible — no error callback fires. The SDK queues all track() calls and flush() sends them as a single batch POST. With 15 chunks at ~25 KB data each, the actual HTTP body (including ~2–3 KB of per-event SDK metadata — anonymousId, timestamp, context, messageId, integrations) pushes the batch well over 500 KB.
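The undercount is easiest to see by serializing a wrapped event next to its raw payload. The sketch below is illustrative: all field values are placeholders, and the exact envelope the SDK builds varies by version.

```python
import json

# Illustration of why estimating batch size from raw data alone
# undercounts: the SDK wraps each track() payload in an envelope
# (anonymousId, timestamp, context, messageId, integrations) before
# batching. All field values below are placeholders.

raw_data = {"rows": ["x" * 100 for _ in range(10)]}  # stand-in for one chunk

enriched_event = {
    "event": "analytics_chunk",  # illustrative event name
    "properties": raw_data,
    "anonymousId": "00000000-0000-0000-0000-000000000000",
    "timestamp": "2026-04-28T00:00:00+00:00",
    "context": {"library": {"name": "analytics-python", "version": "2.x"}},
    "messageId": "00000000-0000-0000-0000-000000000000",
    "integrations": {},
}

raw_size = len(json.dumps(raw_data).encode("utf-8"))
wire_size = len(json.dumps(enriched_event).encode("utf-8"))
print(raw_size, wire_size)  # the wire size is always larger than the raw data
```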

Fix

Set analytics.sync_mode = True on the client before sending. This makes every track() call a separate blocking HTTP request (~25 KB each) rather than queuing to a background thread for batching. Each request is well within Segment's 32 KB per-request limit.
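A minimal sketch of the shape of the fix. The real client is Segment's analytics-python module, which this PR configures via sync_mode, track(), and flush(); the stand-in object, event name, and property keys here are illustrative, not the repo's actual code.

```python
from types import SimpleNamespace

# Stand-in for Segment's module-level analytics client; only
# sync_mode, track(), and flush() mirror the real API surface.
sent = []
analytics = SimpleNamespace(
    sync_mode=False,
    track=lambda user_id, event, properties: sent.append(properties),
    flush=lambda: None,
)

def put(chunks):
    analytics.sync_mode = True  # every track() now blocks on its own HTTP POST
    for i, chunk in enumerate(chunks):
        analytics.track("metrics", "analytics_chunk", {"index": i, "data": chunk})
    analytics.flush()  # final safety net; sync_mode already delivered each event

put(["chunk-a", "chunk-b", "chunk-c"])
print(analytics.sync_mode, len(sent))  # True 3
```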

What was tried first

A batch-size tracking approach was implemented that called flush() before the accumulated data size exceeded 450 KB. End-to-end testing showed it still dropped events:

  • 17 chunks in one batch (~425 KB of data): only 12 arrived
  • 26 chunks split into two batches: only 17 arrived

The estimated 450 KB threshold did not account for the full SDK metadata overhead (~2–3 KB per event), so actual batch bodies still exceeded 500 KB. sync_mode=True eliminates batching entirely and is the only approach that delivered all chunks reliably in testing.
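The failure mode of the heuristic can be simulated. The per-event overhead below is illustrative, chosen so the simulation reproduces the observed 17-chunk drop (the PR's 2–3 KB estimate was itself an undercount); the point is the shape of the failure, not the exact figures.

```python
# Simulation of the abandoned heuristic: flush() whenever adding a chunk
# would push the accumulated *data* size past 450 KB. A batch the
# heuristic considers safe can still exceed the 500 KB wire limit.

DATA_LIMIT = 450_000  # heuristic's threshold on raw data
WIRE_LIMIT = 500_000  # Segment's batch POST limit
CHUNK_DATA = 25_000   # raw data per chunk, as in the PR
OVERHEAD = 4_500      # assumed per-event SDK metadata (illustrative)

batch_data = batch_wire = 0
overflowed = False
for _ in range(17):   # the 17-chunk case that dropped events
    if batch_data + CHUNK_DATA > DATA_LIMIT:
        batch_data = batch_wire = 0  # heuristic calls flush()
    batch_data += CHUNK_DATA
    batch_wire += CHUNK_DATA + OVERHEAD
    if batch_wire > WIRE_LIMIT:
        overflowed = True            # events past this point are dropped

print(overflowed)  # True: the heuristic never flushed, yet the wire size broke 500 KB
```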

End-to-end validation

Tested against a live Segment source using the exact payload shape produced by flatten_json_report / anonymize_rollups (102 job-type rows, 81 installed-collection rows, ~112 KB total JSON):

Mode                           | Chunks sent | Chunks received
sync_mode=False (async batch)  | 15          | 11–14 (flaky)
sync_mode=True                 | 15          | 15 ✓

Large payload stress test (26 chunks, ~650 KB data):

Mode                          | Chunks sent | Chunks received
Batch approach (450 KB limit) | 26          | 17
sync_mode=True                | 26          | 26 ✓

Tests

  • test_put_sync_mode_enabled — verifies analytics.sync_mode is True, all chunks are tracked, and flush() is called exactly once (final flush only — sync_mode handles per-track delivery)
  • Existing test_put_sends_multiple_chunks_for_large_data — confirms all chunks are tracked and a single flush() fires at the end
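A sketch of what those assertions look like against a mocked client. The real tests exercise StorageSegment.put() from metrics_utility/library/storage/segment.py; the stand-in put() here only mirrors the behavior under test.

```python
from unittest import mock

# Stand-in for StorageSegment.put(); illustrative, not the repo's code.
def put(analytics, chunks):
    analytics.sync_mode = True
    for chunk in chunks:
        analytics.track("metrics", "analytics_chunk", {"data": chunk})
    analytics.flush()

fake_analytics = mock.MagicMock()
put(fake_analytics, ["c1", "c2", "c3"])

assert fake_analytics.sync_mode is True      # sync_mode enabled before sending
assert fake_analytics.track.call_count == 3  # one track() per chunk
fake_analytics.flush.assert_called_once()    # single final flush
print("ok")
```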


Note

Medium Risk
Changes the delivery semantics of analytics emission from async/batched to synchronous per-event HTTP requests, which may impact performance and request timing while improving reliability.

Overview
Ensures Segment analytics uploads no longer rely on the SDK’s async batching by enabling analytics.sync_mode = True before emitting chunked track() events, preventing silent drops when batched payloads exceed Segment’s 500 KB limit.

Updates StorageSegment tests to assert sync_mode is enabled, that all chunks are tracked, and that flush() is still called exactly once; adds a new regression test covering the large-payload batching drop scenario.

Reviewed by Cursor Bugbot for commit 9f3eac0.

coderabbitai (Bot) commented Apr 28, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 4ce6e993-3c4f-492a-906b-e7df1667a987

📥 Commits

Reviewing files that changed from the base of the PR and between 9906c7b and 9f3eac0.

📒 Files selected for processing (2)
  • metrics_utility/library/storage/segment.py
  • metrics_utility/test/library/test_storage_segment.py

📝 Walkthrough

Summary by CodeRabbit

  • Improvements

    • Analytics events are now transmitted immediately rather than batched, ensuring faster data delivery to analytics services.
  • Tests

    • Enhanced test coverage with stricter validation of event transmission behavior and improved handling of large data sets.

Walkthrough

The pull request modifies Segment integration in the put method to enable sync_mode, causing each analytics.track() call to execute as a blocking HTTP request instead of being queued asynchronously. Corresponding test updates verify the synchronous behavior and payload correctness.

Changes

Cohort / File(s) Summary
Segment Synchronous Mode Implementation
metrics_utility/library/storage/segment.py
Enables sync_mode on the analytics instance within the put method, changing event tracking from asynchronous batching to synchronous HTTP requests.
Segment Tests
metrics_utility/test/library/test_storage_segment.py
Strengthened existing test assertions to verify sync_mode is enabled and uses call_count checks. Added new test to confirm large inputs return chunks with sync_mode=True ensuring one track() call per chunk and a single flush() at completion.

Sequence Diagram(s)

sequenceDiagram
    participant Client as StorageSegment.put()
    participant SDK as Segment SDK<br/>(sync_mode=True)
    participant HTTP as HTTP Layer

    rect rgba(100, 200, 100, 0.5)
    Note over Client,HTTP: New Behavior (Sync Mode)
    end

    loop For each chunk
        Client->>SDK: analytics.track(event, properties)
        activate SDK
        SDK->>HTTP: POST (blocking request)
        HTTP-->>SDK: Response
        deactivate SDK
        SDK-->>Client: Return (waits for completion)
    end

    Client->>SDK: analytics.flush()
    activate SDK
    SDK->>HTTP: Final flush
    HTTP-->>SDK: Response
    deactivate SDK

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Title check (✅ Passed): The title clearly and concisely describes the main change: enabling sync_mode in Segment to fix event loss, with the Jira ticket reference.
  • Description check (✅ Passed): The description covers root cause, fix, testing details, and end-to-end validation results, though it lacks the formal template sections and testing prerequisites/steps.
  • Docstring Coverage (✅ Passed): Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.
  • Linked Issues check (✅ Passed): Skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): Skipped because no linked issues were found for this pull request.


Replace time.sleep workaround with analytics.sync_mode = True so each
track() call sends synchronously rather than queuing to a background
thread, eliminating the race condition where the process exits before
the background thread finishes flushing.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@cshiels-ie cshiels-ie force-pushed the AAP-73135-sync-mode-fix branch from a7393bc to 3ecfaf2 on April 28, 2026 at 11:21
cshiels-ie and others added 2 commits April 28, 2026 12:54
Segment silently drops events from batch POSTs that exceed 500KB and
returns HTTP 200, making the loss invisible to on_error callbacks. Fix
by tracking accumulated batch size and calling flush() before adding a
chunk that would push the batch over 450KB (leaving headroom for the
per-event metadata the SDK appends).

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Cover two new cases:
- all chunks fit in one batch (flush called once, final only)
- chunks exceed BATCH_SIZE_LIMIT, triggering mid-loop flush (flush
  called more than once, all chunks still tracked)

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@cshiels-ie cshiels-ie changed the title from "Aap 73135 sync mode fix" to "[AAP-73135] Fix Segment event loss by flushing before 500KB batch limit" on Apr 28, 2026
…ufficient

Testing showed the batch-limit heuristic still dropped events:
- 17 chunks in one batch (~425 KB tracked): only 12 arrived
- 26 chunks split into two batches: only 17 arrived

Root cause: the SDK adds ~2-3 KB of per-event metadata (context,
timestamps, messageId, integrations) that our data-size estimate did
not account for, pushing actual batch bodies over Segment's 500 KB
limit despite our 450 KB threshold.

sync_mode=True sends each track() as a separate blocking HTTP request
(~25 KB each) instead of batching, which eliminates the batch-size
problem entirely. Local end-to-end testing confirmed all 15 chunks
arrive reliably with sync_mode=True.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@cshiels-ie cshiels-ie changed the title from "[AAP-73135] Fix Segment event loss by flushing before 500KB batch limit" to "[AAP-73135] Fix Segment event loss by enabling sync_mode" on Apr 28, 2026
cshiels-ie and others added 3 commits April 28, 2026 13:16
Assert both sync_mode=True and flush.call_count==1, confirming that
sync_mode handles per-track delivery and no mid-loop batch flushing
is running alongside it.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Remove stale top comment
- Add sync_mode assertion to test_put_sends_data_to_segment so every
  put() test verifies the mode is set
- Rename test_put_sync_mode_enabled -> test_put_sync_mode_no_batch_drops
  and document the confirmed end-to-end result (15/15 chunks received
  vs 11-14 without sync_mode)

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Each chunk is ~25 KB of JSON but compresses to ~3 KB (87% reduction)
due to repeated keys across items. With sync_mode sending one HTTP
request per chunk, gzip significantly reduces per-request transfer
time and overall upload duration.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@cshiels-ie cshiels-ie marked this pull request as ready for review April 28, 2026 15:50
Segment's tracking API returns HTTP 200 but discards events when the
request body is gzip-encoded, resulting in 0 events received despite
the SDK reporting success. gzip=True is not a viable optimisation.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@cshiels-ie cshiels-ie requested a review from himdel April 29, 2026 16:34
@sonarqubecloud

@cshiels-ie cshiels-ie merged commit 4474a6c into ansible:devel Apr 30, 2026
5 checks passed
@cshiels-ie cshiels-ie deleted the AAP-73135-sync-mode-fix branch April 30, 2026 13:55