Description
Current infra for recordings
```mermaid
flowchart TB
    clickhouse_replay_summary_table[(Session replay summary table)]
    received_snapshot_event[clients send snapshot data]
    kafka_recordings_topic>Kafka recordings topic]
    kafka_clickhouse_replay_summary_topic>Kafka clickhouse replay summary topic]

    subgraph recordings ingestion workload
        recordings_consumer-->kafka_clickhouse_replay_summary_topic
        kafka_clickhouse_replay_summary_topic-->clickhouse_replay_summary_table
    end

    subgraph recordings blob ingestion workload
        recordings_blob_consumer-->|1. statefully batch|disk
        disk-->recordings_blob_consumer
        recordings_blob_consumer-->|2. periodic flush|blob_storage
    end

    received_snapshot_event --> capture
    capture -->|compress and chunk to fit into kafka messages| kafka_recordings_topic
    kafka_recordings_topic --> recordings_consumer
    kafka_recordings_topic --> recordings_blob_consumer
```
Let's assume the first change is to teach the consumers to listen to more than one topic so we can cut over without needing to orchestrate multiple releases
I'm also going to assume that we can't avoid some messages being written to both topics during a cutover
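As a rough sketch of what "listen to more than one topic" could look like, assuming a kafkajs-style consumer (kafkajs ≥ 2.0 for the `topics` array); the broker address, topic names, and group id are placeholders, not the real configuration:

```typescript
import { Kafka, EachMessagePayload } from 'kafkajs'

// Placeholder names throughout: swap in the real brokers/topics/group id.
const kafka = new Kafka({ clientId: 'recordings-consumer', brokers: ['kafka:9092'] })
const consumer = kafka.consumer({ groupId: 'session-recordings' })

async function handleSnapshotMessage({ topic, partition, message }: EachMessagePayload): Promise<void> {
  // Same handling regardless of which topic the message arrived on.
  console.log(`processing ${topic}[${partition}] offset=${message.offset}`)
}

export async function run(): Promise<void> {
  await consumer.connect()
  // Subscribe to both the old and the new topic so the cutover doesn't need
  // a coordinated release: whichever topic capture writes to, we still consume it.
  await consumer.subscribe({ topics: ['session_recording_events', 'session_recording_events_v2'] })
  await consumer.run({ eachMessage: handleSnapshotMessage })
}
```

The handler stays the same; only the subscription changes, which is what makes a cutover without orchestrated releases plausible.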
We have two consumers:
- recordings consumer
- blob ingestion consumer
recordings consumer
- writes individual events to the session_recording_events table
  - these are relatively immune to duplication
    - because CH will de-duplicate based on team_id, toHour(timestamp), session_id, timestamp, uuid
- also writes summarised data to the session_replay_events table, so that we can eventually stop writing to the session_recording_events table
  - this path is less able to handle duplication because of the aggregation (see the sketch after this list)
  - CH will ignore writes it thinks are duplicates (we can play with https://clickhouse.com/docs/en/operations/settings/merge-tree-settings#replicated-deduplication-window to tune this if needed)
  - we can tolerate some duplicates: counts will be a little off, but if we entirely miss a session then the user loses data
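To make the "counts will be off" point concrete, here's a toy version of the aggregation; the field names are invented, not the real schema:

```typescript
// Invented field names, purely to illustrate why the aggregated table is more
// duplicate-sensitive than the raw events table.
interface SnapshotEvent {
  sessionId: string
  timestamp: number
  clickCount: number
}

interface ReplaySummary {
  sessionId: string
  firstTimestamp: number
  lastTimestamp: number
  clickCount: number
}

function summarise(sessionId: string, events: SnapshotEvent[]): ReplaySummary {
  return events.reduce(
    (acc, e) => ({
      sessionId,
      firstTimestamp: Math.min(acc.firstTimestamp, e.timestamp),
      lastTimestamp: Math.max(acc.lastTimestamp, e.timestamp),
      // Additive fields are the problem: consuming the same event twice inflates
      // the count, and once it is baked into the aggregate row there is no
      // per-event uuid left to de-duplicate on.
      clickCount: acc.clickCount + e.clickCount,
    }),
    { sessionId, firstTimestamp: Infinity, lastTimestamp: -Infinity, clickCount: 0 }
  )
}
```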
Neither consumer cares much about Kafka specifics like the exact offset value, although the blob consumer does rely on offsets always increasing (more on that below).
blob ingestion consumer
- reads messages from Kafka
- does not auto-commit
- buffers them on disk (and a little in memory)
- periodically flushes to S3
  - and on an S3 flush might commit an offset to Kafka
- uses the fact that the offset will always increment to track progress (rough sketch of this pattern below)
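A rough sketch of that batch-on-disk / flush-to-S3 shape, not the real implementation: the file layout, the S3 helper, and the commit callback are stand-ins, and choosing a genuinely safe offset to commit is more involved than this:

```typescript
import { appendFile } from 'fs/promises'

interface SessionBuffer {
  topic: string
  partition: number
  lastOffset: string // highest Kafka offset buffered for this session so far
  filePath: string   // placeholder path, not the real layout
}

const buffers = new Map<string, SessionBuffer>()

// 1. statefully batch: append each snapshot chunk to a per-session file on disk.
export async function bufferMessage(
  topic: string,
  partition: number,
  offset: string,
  sessionId: string,
  payload: Buffer
): Promise<void> {
  const buffer = buffers.get(sessionId) ?? {
    topic,
    partition,
    lastOffset: offset,
    filePath: `/tmp/recordings/${sessionId}.jsonl`,
  }
  buffer.lastOffset = offset
  buffers.set(sessionId, buffer)
  await appendFile(buffer.filePath, payload)
}

// 2. periodic flush: push each session's file to S3 and (maybe) commit an offset
// back to Kafka so the data we've persisted isn't re-read after a restart.
export async function periodicFlush(
  uploadToS3: (filePath: string) => Promise<void>,
  commitOffset: (topic: string, partition: number, offset: string) => Promise<void>
): Promise<void> {
  for (const [sessionId, buffer] of buffers) {
    await uploadToS3(buffer.filePath)
    await commitOffset(buffer.topic, buffer.partition, buffer.lastOffset)
    buffers.delete(sessionId)
  }
}
```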
The blob ingestion consumer is the one that will "break", because (I guess) the Kafka offset will reset to 0 on the new topic.
"Break" here means that we would write to S3 again for any session that has messages on both topics, even though we had already written that session's data to S3.
If we wanted to be super resilient we'd make a change to track sessions by topic and not accept messages on the new topic if they were for sessions that started on the old topic (something like the sketch below). Let that "drain" and then tidy up the code once all active sessions are on the new topic.
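A minimal sketch of that guard, assuming we can key on session id and topic (the topic names are placeholders); in practice the map would need to be persisted, or at least evicted once the old-topic sessions have drained:

```typescript
// Placeholder topic names.
const OLD_TOPIC = 'session_recording_events'
const NEW_TOPIC = 'session_recording_events_v2'

// sessionId -> topic the session was first seen on.
const sessionTopic = new Map<string, string>()

export function shouldProcess(sessionId: string, topic: string): boolean {
  const firstSeenOn = sessionTopic.get(sessionId) ?? topic
  sessionTopic.set(sessionId, firstSeenOn)
  // A session that started on the old topic keeps being handled there; its
  // duplicate messages on the new topic are dropped until the session drains.
  return !(firstSeenOn === OLD_TOPIC && topic === NEW_TOPIC)
}
```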
A quick google suggests that you can set partition offsets but that feels more confusing 😅
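For reference, the "set partition offsets" route would look roughly like this with a kafkajs admin client (group, topic, partition, and offset values are placeholders, and the consumer group has to be stopped while its offsets are changed), which is part of why it feels more confusing:

```typescript
import { Kafka } from 'kafkajs'

const admin = new Kafka({ clientId: 'ops', brokers: ['kafka:9092'] }).admin()

export async function seedOffsets(): Promise<void> {
  await admin.connect()
  // Manually position the consumer group on the new topic.
  await admin.setOffsets({
    groupId: 'session-recordings-blob',
    topic: 'session_recording_events_v2',
    partitions: [{ partition: 0, offset: '0' }],
  })
  await admin.disconnect()
}
```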
TBH, right now we're double processing anyway as we work through all the differences between the theory and practice of the new consumer. So, if we can cut over relatively quickly then we're not adding much to the risk.
Without wanting to speak for @benjackwhite too much, I don't think we know kafka well enough to do this all by ourselves 😅
So, I'm thinking:

- 1: set up a separate capture deployment that only handles session recording traffic and writes to a different MSK cluster?
- 2 and 3 (at the same time): teach the two consumers to use multiple Kafka clusters/topics; they read from both at the same time (rough sketch below)
- 4: test that in dev
- 5: roll that out to EU
- 6: roll that out to prod
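A sketch of what steps 2 and 3 might look like, assuming a kafkajs-style client: one consumer per MSK cluster, both feeding the same handler, so the workload doesn't care which cluster a message came from. Broker lists, topic names, and the group id are placeholders.

```typescript
import { Kafka, Consumer, EachMessagePayload } from 'kafkajs'

// Placeholder cluster/topic configuration.
const clusters = [
  { name: 'old-msk', brokers: ['old-broker:9092'], topic: 'session_recording_events' },
  { name: 'new-msk', brokers: ['new-broker:9092'], topic: 'session_recording_events_v2' },
]

async function handleMessage({ topic, partition, message }: EachMessagePayload): Promise<void> {
  // The existing processing logic goes here, unchanged.
  console.log(`processing ${topic}[${partition}] offset=${message.offset}`)
}

export async function runAll(): Promise<Consumer[]> {
  const consumers: Consumer[] = []
  for (const cluster of clusters) {
    const kafka = new Kafka({ clientId: `recordings-${cluster.name}`, brokers: cluster.brokers })
    const consumer = kafka.consumer({ groupId: 'session-recordings' })
    await consumer.connect()
    await consumer.subscribe({ topics: [cluster.topic] })
    // Each consumer commits offsets against its own cluster; the handler is shared.
    await consumer.run({ eachMessage: handleMessage })
    consumers.push(consumer)
  }
  return consumers
}
```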