Session replay loses most in-app webview (Instagram/Facebook) sessions — recorder flush is a per-event debounce, not a periodic interval #4320

@rakurtz

Describe the Bug

Area: Session replay / recorder

Summary

A large share of our traffic arrives through in-app browsers (Instagram/Facebook WebViews on Android). For these visitors, session replays are almost always unusable: they show up with high action counts but 0:00 duration, and playback only shows the initial page render, not the rest of the session. Real desktop/mobile-Chrome sessions record fine.

After reading the recorder source, this looks like a consequence of how events are flushed, combined with WebViews not firing page-lifecycle events reliably.

Root cause

In src/recorder/index.js the time-based flush is implemented as a debounce that is reset on every emitted rrweb event:

const FLUSH_INTERVAL = 10000;

const scheduleFlush = () => {
  if (flushTimer) clearTimeout(flushTimer);   // reset on every event
  flushTimer = setTimeout(flush, FLUSH_INTERVAL);
};

// inside record({ emit(event) { ... } }):
eventBuffer.push(event);
if (eventBuffer.length >= FLUSH_EVENT_COUNT) {  // 100
  flush();
}
scheduleFlush();   // called for every single event

So the buffer is only sent when one of these happens:

  • the buffer reaches FLUSH_EVENT_COUNT (100 events), or
  • 10s pass with no events at all (the debounce can only settle during idle), or
  • visibilitychange → hidden / beforeunload fires (keepalive flush), or
  • maxDuration is reached.

For an actively interacting user, rrweb emits events continuously, so the debounce never settles and the only mid-session delivery is the coarse 100-event threshold. The final partial chunk (<100 events) depends entirely on the unload handlers in beginRecording():

document.addEventListener('visibilitychange', () => {
  if (document.visibilityState === 'hidden') flush(true);
});
window.addEventListener('beforeunload', () => flush(true));

On Android in-app WebViews, beforeunload/visibilitychange are frequently not fired when the host app tears down or backgrounds the WebView, so that final flush never happens.

The most damaging case: the first chunk contains rrweb's full snapshot (type 2). If a WebView visitor bounces before any flush lands and no unload event fires, the entire recording — snapshot included — is lost, which is why so many sessions are unplayable / 0:00.
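The starvation described above can be reproduced without a browser. The following is a minimal sketch using a virtual clock; simulate() and its option names are hypothetical and not part of the recorder. It only models the two timer strategies and the event-count threshold, and assumes the session ends with no unload event firing.

```javascript
// Virtual-clock model of the two flush strategies. Not the recorder's code:
// simulate() and all option names are illustrative assumptions.
function simulate({ mode, eventTimesMs, flushIntervalMs, flushEventCount, sessionEndMs }) {
  const flushes = []; // one { at, count } entry per simulated flush
  let buffer = 0;     // number of events currently buffered
  // 'interval' arms the timer once at start; 'debounce' arms it on each event.
  let deadline = mode === 'interval' ? flushIntervalMs : Infinity;

  const flush = (at) => {
    if (buffer === 0) return; // nothing to send
    flushes.push({ at, count: buffer });
    buffer = 0;
  };

  // Fire every timer that is due at or before time t.
  const runTimersUntil = (t) => {
    while (deadline <= t) {
      flush(deadline);
      deadline = mode === 'interval' ? deadline + flushIntervalMs : Infinity;
    }
  };

  for (const t of eventTimesMs) {
    runTimersUntil(t);
    buffer += 1;
    if (buffer >= flushEventCount) flush(t);                  // count threshold
    if (mode === 'debounce') deadline = t + flushIntervalMs;  // reset on every event
  }

  // The WebView is torn down at sessionEndMs and no unload handler runs.
  runTimersUntil(sessionEndMs);
  return { flushes, lost: buffer };
}

// One event every 500 ms for just under 30 s (60 events, below the 100 threshold).
const eventTimesMs = Array.from({ length: 60 }, (_, i) => i * 500);
const shared = { eventTimesMs, flushEventCount: 100, sessionEndMs: 29900 };

const debounced = simulate({ ...shared, mode: 'debounce', flushIntervalMs: 10000 });
const periodic = simulate({ ...shared, mode: 'interval', flushIntervalMs: 2000 });

console.log('debounce:', debounced.flushes.length, 'flushes,', debounced.lost, 'events lost');
console.log('interval:', periodic.flushes.length, 'flushes,', periodic.lost, 'events lost');
```

In this synthetic scenario the debounced timer never fires, so every buffered event is lost, while the periodic timer loses only the events emitted after its last tick.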

Impact

Session replay is effectively non-functional for in-app WebView traffic (a major share of paid-social/mobile audiences).
Even when a recording is stored, it often contains only the load-time mutation burst (clustered timestamps → 0:00 duration), giving a misleading picture in the replay list.

Suggested fix

Replace the per-event debounce with a true periodic interval started once when recording begins, and lower the interval. This sends buffered events while the page is still alive, so capture no longer depends on unload events firing:

const FLUSH_INTERVAL = 2000; // periodic, not debounced

// remove scheduleFlush() and its per-event call in emit()

// in beginRecording(), after record({...}):
flushTimer = setInterval(() => flush(), FLUSH_INTERVAL);

// in stop():
if (flushTimer) clearInterval(flushTimer);

As a complementary improvement, I suggest adding a pagehide listener (more reliably fired than beforeunload on mobile WebKit/Chromium):

window.addEventListener('pagehide', () => flush(true));
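A possible way to wire the three exit signals together is sketched below. registerFinalFlush() is a hypothetical helper, not an existing recorder function; it takes the window, document, and flush function as parameters so it can be exercised outside a browser, and it assumes flush(true) is safe to call more than once (for example because it returns early on an empty buffer).

```javascript
// Hypothetical helper: attaches all exit-time flush triggers in one place
// and returns a function that detaches them again (e.g. for stop()).
function registerFinalFlush(win, doc, flush) {
  const onVisibility = () => {
    // Only flush when the page is actually being hidden.
    if (doc.visibilityState === 'hidden') flush(true);
  };
  const onExit = () => flush(true);

  doc.addEventListener('visibilitychange', onVisibility);
  win.addEventListener('pagehide', onExit);
  win.addEventListener('beforeunload', onExit);

  return () => {
    doc.removeEventListener('visibilitychange', onVisibility);
    win.removeEventListener('pagehide', onExit);
    win.removeEventListener('beforeunload', onExit);
  };
}

// In a browser this would be called as:
//   const unregister = registerFinalFlush(window, document, flush);
```

Because several of these events can fire for the same navigation, this only helps if repeated flush calls are cheap no-ops once the buffer is empty.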

Optionally also make the flush interval configurable via a data-flush-interval attribute (default could stay conservative) so high-traffic sites can tune request volume vs. capture fidelity.
With a periodic interval, at most the last ~2s of a session is ever lost, and the initial full snapshot reaches the server within the first interval, making WebView recordings reliably playable.
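If the interval were made configurable, the attribute value would need validating before it is passed to setInterval. A minimal sketch follows, assuming the attribute carries a millisecond value; resolveFlushInterval() and the minimum/maximum bounds are hypothetical choices for illustration, not existing recorder behavior.

```javascript
const DEFAULT_FLUSH_INTERVAL = 2000; // ms, the value proposed above
const MIN_FLUSH_INTERVAL = 1000;     // ms, hypothetical lower bound
const MAX_FLUSH_INTERVAL = 10000;    // ms, hypothetical upper bound (the current debounce delay)

// Turns a raw attribute value (string, null, or undefined) into a safe interval.
function resolveFlushInterval(rawValue) {
  const parsed = Number.parseInt(rawValue, 10);
  // Missing, non-numeric, zero, or negative values fall back to the default.
  if (!Number.isFinite(parsed) || parsed <= 0) return DEFAULT_FLUSH_INTERVAL;
  // Clamp so a typo cannot produce a request flood or an effectively disabled timer.
  return Math.min(MAX_FLUSH_INTERVAL, Math.max(MIN_FLUSH_INTERVAL, parsed));
}

// In the browser the raw value might be read like this (how the tracker
// script actually locates its own attributes is an assumption here):
//   const raw = document.currentScript?.getAttribute('data-flush-interval');
//   flushTimer = setInterval(() => flush(), resolveFlushInterval(raw));
```

Clamping rather than rejecting out-of-range values keeps recording working even when the attribute is misconfigured.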

Notes / tradeoffs

A shorter interval increases request count (e.g. up to ~150 small POSTs over a 5-minute session at a 2s interval). Keeping it at 2–3s and/or gating behind the existing sample rate keeps this reasonable; a configurable attribute would let operators decide.
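The request-count estimate is simple arithmetic and can be checked for other intervals. The helper below is a hypothetical illustration; it gives an upper bound, assuming one POST per timer tick and that ticks with an empty buffer send nothing.

```javascript
// Upper bound on timer-driven flush requests for one recorded session.
function estimateFlushRequests(sessionMs, flushIntervalMs) {
  return Math.floor(sessionMs / flushIntervalMs);
}

const FIVE_MINUTES_MS = 5 * 60 * 1000;
for (const intervalMs of [2000, 3000, 10000]) {
  console.log(`${intervalMs} ms interval: at most ${estimateFlushRequests(FIVE_MINUTES_MS, intervalMs)} POSTs`);
}
```

Sessions with idle periods would send fewer requests than this bound, since there is nothing to flush while no events are emitted.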

We have not opened a PR because we haven't load-tested the change in our environment, but we're happy to if the team agrees with the direction.

Database

PostgreSQL

Relevant log output

Which Umami version are you using?

3.1.0 (observed in src/recorder/index.js, ref c78ff36)

How are you deploying your application?

self-hosted, PostgreSQL

Which browser are you using?

No response
