Separate kafka cluster for recordings/monitoring ingestion #15832


Current infra for recordings

```mermaid
flowchart TB
    clickhouse_replay_summary_table[(Session replay summary table)]
    received_snapshot_event[clients send snapshot data]
    kafka_recordings_topic>Kafka recordings topic]
    kafka_clickhouse_replay_summary_topic>Kafka clickhouse replay summary topic]

    subgraph recordings ingestion workload
        recordings_consumer-->kafka_clickhouse_replay_summary_topic
        kafka_clickhouse_replay_summary_topic-->clickhouse_replay_summary_table
    end
    subgraph recordings blob ingestion workload
        recordings_blob_consumer-->|1. statefully batch|disk
        disk-->recordings_blob_consumer
        recordings_blob_consumer-->|2. periodic flush|blob_storage
    end
    received_snapshot_event --> capture
    capture -->|compress and chunk to fit into kafka messages| kafka_recordings_topic
    kafka_recordings_topic --> recordings_consumer
    kafka_recordings_topic --> recordings_blob_consumer
```

Let's assume the first change is to teach the consumers to listen to more than one topic so we can cut over without needing to orchestrate multiple releases

I'm also going to assume that we can't avoid some messages being written to both topics during a cutover
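As a rough illustration of what "listen to more than one topic" could look like, here's a kafkajs-flavoured TypeScript sketch. The broker addresses, topic names and handler are made up, and the real consumers don't necessarily use kafkajs; the point is that if the new topic lives on a separate MSK cluster, each cluster needs its own client, with both subscriptions feeding the same handler:

```typescript
import { Kafka, EachMessagePayload } from 'kafkajs'

// Hypothetical clusters/topics, just to show the shape of the change.
const oldCluster = new Kafka({ clientId: 'recordings-consumer', brokers: ['main-msk:9092'] })
const newCluster = new Kafka({ clientId: 'recordings-consumer', brokers: ['recordings-msk:9092'] })

// One handler shared by both subscriptions, so processing is identical
// regardless of which topic/cluster a message arrived on.
const handleSnapshotMessage = async ({ message }: EachMessagePayload) => {
    // ... existing parse / summarise / write-to-ClickHouse logic ...
}

export async function startConsumers() {
    const oldConsumer = oldCluster.consumer({ groupId: 'session-recordings' })
    const newConsumer = newCluster.consumer({ groupId: 'session-recordings' })

    await oldConsumer.connect()
    await newConsumer.connect()

    await oldConsumer.subscribe({ topics: ['session_recording_events'] })
    await newConsumer.subscribe({ topics: ['session_recording_events_v2'] })

    // Reading from both at once means the cutover doesn't need multiple orchestrated releases.
    await oldConsumer.run({ eachMessage: handleSnapshotMessage })
    await newConsumer.run({ eachMessage: handleSnapshotMessage })
}
```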

We have

  1. recordings consumer
  2. blob ingestion consumer

recordings consumer

  • writes individual events to the session_recording_events table
  • these writes are relatively immune to duplication, because ClickHouse will de-duplicate rows with the same team_id, toHour(timestamp), session_id, timestamp, uuid

It also writes summarised data to the session_replay_events table, so that we can eventually stop writing to the session_recording_events table

Neither consumer cares particularly about anything Kafka-specific, e.g. the offset value

blob ingestion consumer

  • reads messages from Kafka
  • does not auto-commit offsets
  • buffers them on disk (and a little in memory)
  • periodically flushes to S3
  • and on an S3 flush might commit an offset back to Kafka
  • uses the fact that offsets always increment to track progress (a simplified sketch of this pattern follows)
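To make the offset handling concrete, here's a heavily simplified TypeScript sketch of that batch-on-disk / flush-to-S3 / commit-on-flush pattern. It's kafkajs-flavoured; the file paths, names and S3 step are placeholders rather than the real consumer code:

```typescript
import { Kafka, EachMessagePayload } from 'kafkajs'
import * as fs from 'fs'

const kafka = new Kafka({ clientId: 'blob-ingester', brokers: ['main-msk:9092'] })
const consumer = kafka.consumer({ groupId: 'session-recordings-blob' })

// Highest offset buffered to disk so far, per partition. Progress tracking relies
// on this only ever increasing, which is exactly what a topic switch upsets.
const latestBufferedOffset = new Map<number, bigint>()

export async function run() {
    await consumer.connect()
    await consumer.subscribe({ topics: ['session_recording_events'] })

    await consumer.run({
        autoCommit: false, // offsets are only committed after a successful S3 flush
        eachMessage: async ({ partition, message }: EachMessagePayload) => {
            // 1. statefully batch: append the snapshot data to an on-disk buffer
            fs.appendFileSync(`/tmp/recordings-buffer-${partition}`, message.value ?? Buffer.alloc(0))
            latestBufferedOffset.set(partition, BigInt(message.offset))
        },
    })
}

// 2. periodic flush: push the buffered data to S3, then commit the offset to Kafka
export async function flushPartition(topic: string, partition: number) {
    // await uploadBufferToS3(partition)  // hypothetical upload step
    const offset = latestBufferedOffset.get(partition)
    if (offset !== undefined) {
        // Commit offset + 1 so the consumer resumes after the last flushed message.
        await consumer.commitOffsets([{ topic, partition, offset: (offset + 1n).toString() }])
    }
}
```

The load-bearing assumption is the last bullet: committed offsets only ever move forward on a given topic/partition.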

Here's the one that will "break", because I guess the Kafka offsets will effectively reset to 0 when the data starts arriving on a fresh topic

"Break" here means that we would write to S3 even if we had already written to S3 for any session that has messages on both topics.

If we wanted to be super resilient, we'd track sessions by topic and not accept messages on the new topic for sessions that started on the old topic. Let that "drain", then tidy up the code once all active sessions are on the new topic.
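A rough sketch of that guard (names and structure entirely hypothetical): pin each session to the topic it was first seen on, and ignore its messages from any other topic until it drains:

```typescript
// Hypothetical drain guard: a session is only ever accepted from the topic it was
// first seen on, so in-flight sessions keep draining from the old topic while
// brand-new sessions land on the new one.
const sessionTopic = new Map<string, string>()

export function shouldAccept(sessionId: string, topic: string): boolean {
    const pinned = sessionTopic.get(sessionId)
    if (pinned === undefined) {
        sessionTopic.set(sessionId, topic)
        return true
    }
    return pinned === topic
}

// Entries would need removing when a session is flushed or expires, otherwise the map
// grows without bound. Once no active sessions are pinned to the old topic, the guard
// and the old subscription can both be deleted.
```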

A quick google suggests that you can set partition offsets but that feels more confusing 😅

TBH, right now we're double processing anyway as we work through all the differences between the theory and practice of the new consumer. So, if we can cut over relatively quickly then we're not adding much to the risk.

Without wanting to speak for @benjackwhite too much, I don't think we know kafka well enough to do this all by ourselves 😅

So, I'm thinking

  1. Set up a separate capture deployment that only handles session recording traffic and writes to a different MSK cluster? (a hypothetical sketch of the producer-side config follows this list)
  2. & 3. (at the same time) Teach the two consumers to use multiple Kafka clusters/topics, so they read from both at the same time.
  4. Test that in dev.
  5. Roll that out to EU.
  6. Roll that out to prod.
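For step 1, the producer-side change might be as small as pointing a recordings-only capture deployment at a different broker list. This is an illustrative TypeScript sketch rather than capture's real code, and every name and env var in it is made up:

```typescript
import { Kafka } from 'kafkajs'

// The recordings-only capture deployment would set KAFKA_BROKERS to the new
// MSK cluster's brokers; nothing else about the write path needs to change.
const kafka = new Kafka({
    clientId: 'capture-recordings',
    brokers: (process.env.KAFKA_BROKERS ?? 'localhost:9092').split(','),
})

const producer = kafka.producer()

export async function start() {
    await producer.connect()
}

// Snapshot data arrives here already compressed and chunked, as in the diagram above.
export async function sendSnapshotChunk(sessionId: string, chunk: Buffer) {
    await producer.send({
        topic: process.env.RECORDINGS_TOPIC ?? 'session_recording_events_v2',
        messages: [{ key: sessionId, value: chunk }],
    })
}
```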
