source-posthog: turn persons into an incremental stream#3957

Merged
nicolaslazo merged 2 commits intomainfrom
nlazo/source-posthog-fix-persons-oom
Mar 16, 2026

@nicolaslazo
Contributor

Description:

PostHog's persons stream is a snapshot stream. It is currently filling up storage space in some reactors because it accumulates thousands of documents in memory without checkpointing its progress.

This PR implements secondary cursor support for HogQL queries. This allows us to rely on last_seen_at for update timestamps, falling back to created_at when not available.
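
As a rough illustration of the fallback described above, here is a minimal Python sketch. `PersonSketch` is a hypothetical stand-in, not the connector's actual `Person` model, and the PR itself applies the secondary cursor in the HogQL query; this only shows the selection rule.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class PersonSketch:
    """Hypothetical stand-in for the connector's Person model."""

    id: str
    created_at: datetime
    last_seen_at: Optional[datetime] = None

    def get_cursor(self) -> datetime:
        # Prefer last_seen_at as the update timestamp; fall back to
        # created_at when last_seen_at is not available.
        if self.last_seen_at is not None:
            return self.last_seen_at
        return self.created_at


created = datetime(2026, 1, 1, tzinfo=timezone.utc)
seen = datetime(2026, 2, 1, tzinfo=timezone.utc)
never_seen = PersonSketch("a", created)
seen_later = PersonSketch("b", created, seen)
```

Here `never_seen.get_cursor()` returns `created_at`, while `seen_later.get_cursor()` returns `last_seen_at`.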

Workflow steps:

(How does one use this feature, and how has it changed)

Documentation links affected:

(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)

Notes for reviewers:

Tested locally through flowctl preview.

@nicolaslazo nicolaslazo requested a review from a team March 4, 2026 21:06
@nicolaslazo nicolaslazo self-assigned this Mar 4, 2026
@nicolaslazo nicolaslazo force-pushed the nlazo/source-posthog-fix-persons-oom branch from 20e74eb to b38edd1 Compare March 4, 2026 21:12
Member

@Alex-Bair Alex-Bair left a comment

Had a few items to address before approving. This comment about adding new projects to the Persons resource state looks like it also applies to the Events stream. I missed that when I reviewed the connector initially, but we'll want to improve that handling soon, either in this PR or in a follow-up.

Comment thread estuary-cdk/estuary_cdk/capture/common.py Outdated
Comment on lines +18 to +24
project_id:
  anyOf:
    - type: integer
    - type: 'null'
  default: null
  description: The PostHog project this document belongs to
  title: Project Id
Member

_meta/project_id is part of the composite key, so we shouldn't allow it to be nullable. SQL destinations reject nullable primary keys, which means a nullable key would prevent SQL materializations from materializing Persons out of the box.

Comment thread source-posthog/source_posthog/api.py Outdated
Comment on lines +417 to +441
async for item in _query_hogql(
Person,
new_cursor,
cutoff,
base_url,
project_id,
http,
log,
):
batch_count += 1
doc_count += 1
item_cursor = item.get_cursor()
new_cursor = max(new_cursor, item_cursor)

while True:
async for item in _query_hogql(
Person, project_cursor, None, base_url, project_id, http, log
):
doc_count += 1
project_cursor = item.get_cursor()
yield item

yield item
if batch_count < HOGQL_PAGE_SIZE:
break

if datetime.now(tz=UTC) > backfill_timeout:
log.info(
f"{BACKFILL_TIMEOUT_PERIOD.total_seconds() / 60} "
+ "minutes have elapsed, emitting a checkpoint"
)
break
Member

Thinking through an edge case: what happens if there is a long run of records all with the same cursor value? Is it possible that the following could happen?

  • We fetch the first handful of records at cursor N.
  • We stop fetching the remaining records at cursor N because the BACKFILL_TIMEOUT_PERIOD has elapsed.
  • We yield N as the next cursor value.
  • We reinvoke backfill_persons and have it start fetching records with cursor values greater than N. Meaning, we miss the remaining records at cursor N that would have taken more than BACKFILL_TIMEOUT_PERIOD to fetch.

That looks possible to me, but let me know if you disagree. I've handled the above situation in other connectors by only breaking out of the pagination loop early when both BACKFILL_TIMEOUT_PERIOD has elapsed and the next document's cursor value is greater than the previous document's cursor value. Here's an example from source-klaviyo-native where I implemented that logic:

# If we see a document updated more recently than the previous
# document we emitted, checkpoint the previous documents.
if (
    count >= CHECKPOINT_INTERVAL and
    doc_cursor > last_seen_dt
):
    yield dt_to_str(last_seen_dt)
    count = 0
    # If backfill_incremental_resources has been running for more than TARGET_FETCH_PAGE_INVOCATION_RUN_TIME
    # minutes, then yield control back to the CDK after a checkpoint. This forces backfills to check the
    # stopping event every so often & gracefully exit if it's set.
    if time.time() - start_time >= TARGET_FETCH_PAGE_INVOCATION_RUN_TIME:
        return
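
To make the edge case concrete, here is a minimal, self-contained sketch of the "only stop at a cursor boundary" rule. It is not the klaviyo or PostHog implementation: `backfill_until_boundary`, the dict-shaped documents, and the `deadline_passed` callback are all hypothetical, and the input is assumed to be sorted by cursor in ascending order.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Iterable, Iterator, Optional, Union


def backfill_until_boundary(
    docs: Iterable[dict],
    deadline_passed: Callable[[], bool],
) -> Iterator[Union[dict, datetime]]:
    """Yield documents, then a single checkpoint cursor (a datetime).

    Assumes `docs` is sorted by doc["cursor"] ascending. Stops early only
    when the deadline has passed AND the next document's cursor is strictly
    greater than the last emitted one, so a run of equal cursors is never split.
    """
    last_seen: Optional[datetime] = None
    for doc in docs:
        cursor = doc["cursor"]
        at_boundary = last_seen is not None and cursor > last_seen
        if at_boundary and deadline_passed():
            yield last_seen  # checkpoint everything emitted so far
            return
        yield doc
        last_seen = cursor
    if last_seen is not None:
        yield last_seen  # final checkpoint after the last document


t1 = datetime(2026, 3, 1, tzinfo=timezone.utc)
t2 = t1 + timedelta(seconds=1)
docs = [
    {"id": 1, "cursor": t1},
    {"id": 2, "cursor": t1},
    {"id": 3, "cursor": t1},
    {"id": 4, "cursor": t2},
]
# Even with the deadline already passed, all three docs at t1 are emitted
# before the checkpoint at t1; only the doc at t2 is left for the next call.
emitted = list(backfill_until_boundary(docs, lambda: True))
```

Resuming from the yielded checkpoint with a "greater than" filter is then safe, because no document at that cursor value was left behind.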

Contributor Author

I've already had to tackle a similar issue in source-quickbooks. As mentioned in our 1:1, I'll try to extract the checkpointing logic into a decorator abstract enough that it could become a CDK feature 👍

Member

That backfill_timeout decorator is interesting! It looks like it may be general enough to use elsewhere as long as the item's cursor value is a datetime that's exposed via the get_cursor method. It may also be worth having a comment or docstring stating that the decorator should only be used when the backfill_fn yields documents in ascending order, just so it's easier for us to remember the specific situations this decorator should be used in. Let's monitor how it does, and if we like it, we can figure out how to abstract it into the CDK somehow.
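
For reference, a decorator along these lines could look roughly like the sketch below. This is not the code in this PR: `backfill_timeout_sketch`, `ItemSketch`, and the choice to simply stop iterating (leaving the checkpoint to the caller) are assumptions made for illustration. It carries the ascending-order requirement in its docstring, as suggested above.

```python
import asyncio
import functools
import time
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


def backfill_timeout_sketch(period: timedelta):
    """Hypothetical decorator for async-generator backfill functions.

    Only use this when backfill_fn yields items in ascending cursor order
    and each item exposes a datetime via get_cursor(). Once `period` has
    elapsed, iteration stops at the next cursor boundary, never in the
    middle of a run of items that share a cursor value.
    """

    def decorator(backfill_fn):
        @functools.wraps(backfill_fn)
        async def wrapper(*args, **kwargs):
            deadline = time.monotonic() + period.total_seconds()
            last_cursor = None
            gen = backfill_fn(*args, **kwargs)
            try:
                async for item in gen:
                    cursor = item.get_cursor()
                    at_boundary = last_cursor is not None and cursor > last_cursor
                    if at_boundary and time.monotonic() >= deadline:
                        break
                    yield item
                    last_cursor = cursor
            finally:
                await gen.aclose()  # close the wrapped generator on early exit

        return wrapper

    return decorator


@dataclass
class ItemSketch:
    """Hypothetical item type exposing get_cursor()."""

    id: int
    cursor: datetime

    def get_cursor(self) -> datetime:
        return self.cursor


t1 = datetime(2026, 3, 1, tzinfo=timezone.utc)
items = [ItemSketch(1, t1), ItemSketch(2, t1), ItemSketch(3, t1 + timedelta(seconds=1))]


async def collect(period: timedelta) -> list:
    @backfill_timeout_sketch(period)
    async def fake_backfill():
        for item in items:
            yield item

    return [item async for item in fake_backfill()]
```

With a zero timeout, `asyncio.run(collect(timedelta(0)))` still returns both items at `t1` and stops before the third; with a generous timeout it returns all three.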

Comment thread source-posthog/source_posthog/resources.py
Comment thread source-posthog/source_posthog/models.py Outdated
Some linters complain about the `dict` variant in the PageCursor
definition not having any type parameters set.
@nicolaslazo nicolaslazo force-pushed the nlazo/source-posthog-fix-persons-oom branch 2 times, most recently from a30b4a9 to fec8523 Compare March 9, 2026 17:57
@nicolaslazo nicolaslazo requested a review from Alex-Bair March 9, 2026 18:13
Member

@Alex-Bair Alex-Bair left a comment

LGTM % small nit of regenerating the schema.yaml files with flowctl raw discover to reflect that project_id is now non-nullable.

Note: We'll need to perform dataflow resets for all production captures' Persons binding due to the change in the key field.

@nicolaslazo nicolaslazo force-pushed the nlazo/source-posthog-fix-persons-oom branch from fec8523 to 1e97768 Compare March 14, 2026 03:43
@nicolaslazo nicolaslazo force-pushed the nlazo/source-posthog-fix-persons-oom branch from 1e97768 to 400330f Compare March 16, 2026 12:26
@nicolaslazo nicolaslazo merged commit 0929557 into main Mar 16, 2026
118 of 126 checks passed
@nicolaslazo nicolaslazo deleted the nlazo/source-posthog-fix-persons-oom branch March 16, 2026 15:06
@Alex-Bair Alex-Bair mentioned this pull request Mar 18, 2026
4 tasks