Skip to content

estuary-cdk: make Task.checkpoint async to improve capture concurrency#3747

Merged
Alex-Bair merged 1 commit intomainfrom
bair/estuary-cdk-async-checkpoints-serialize-emission
Jan 15, 2026
Merged

estuary-cdk: make Task.checkpoint async to improve capture concurrency#3747
Alex-Bair merged 1 commit intomainfrom
bair/estuary-cdk-async-checkpoints-serialize-emission

Conversation

@Alex-Bair
Copy link
Copy Markdown
Member

@Alex-Bair Alex-Bair commented Jan 12, 2026

Description:

When multiple tasks run concurrently in a CDK-based capture, the synchronous Task.checkpoint() method blocks the entire asyncio event loop while writing buffered documents to stdout. This means when one task flushes a large buffer, all other tasks are blocked from fetching data, processing rows, or building up their own document buffers.

We can make Task.checkpoint async in order to improve concurrency. Blocking I/O when flushing a buffer is run with asyncio.to_thread, freeing the main event loop to run other coroutines while a buffer is being flushed. A module-level asyncio.Lock is used to serialize emissions to stdout. This prevents documents and checkpoints from being interleaved or corrupted between tasks trying to flush their buffers.

With this change, while one task holds the lock and emits its buffer, other tasks can continue fetching data from the API, process/parse responses, and capture documents to build up their own buffers. When the emitting task releases the lock, another waiting task can begin flushing its buffer.

This commit makes Task.checkpoint() async, which is a breaking change. All connectors calling Task.checkpoint must now await it. The connectors in the connectors repo affected by this are updated in this commit as well.

Workflow steps:

(How does one use this feature, and how has it changed)

Documentation links affected:

(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)

Notes for reviewers:

Tested on a local stack. Confirmed Task.checkpoint no longer blocks the asyncio event loop.

Note: I also made BaseCaptureConnector.checkpoint and BaseCaptureConnector._emit async as well and made them use the same asyncio.Lock to ensure they don't emit to stdout at the same time as any Tasks. Right now, that won't ever happen because BaseCaptureConnector._emit is only called before Task creation, but it's a good future-proofing idea to ensure stdout is behind a lock no matter who's writing to it.

@Alex-Bair Alex-Bair force-pushed the bair/estuary-cdk-async-checkpoints-serialize-emission branch 2 times, most recently from 788d368 to 2cb906f Compare January 12, 2026 21:46
When multiple CDK-basked captures run tasks concurrently, the
synchronous Task.checkpoint() method blocks the entire asyncio event
loop while writing buffered documents to stdout. This means when one
task flushes its buffer, all other tasks are blocked from fetching
data, processing records, or building up their own document buffers.

We can make `Task.checkpoint` async in order to make flushing a buffer
non-blocking to other tasks. Blocking I/O when flushing a buffer is now
run with asyncio.to_thread, freeing the main event loop to run other
coroutines while a buffer is being flushed. A module-level asyncio.Lock
is used to serialize emissions to stdout. This ensures only a single
task flushes its buffer at a time, preveing data from multiple buffers
from being interleaved or corrupted while being written to stdout.

With this change, while one task holds the lock and emits its buffer,
other tasks can continue fetching data from the API, process/parse
responses, and capture documents to build up their own buffers. When
the emitting task releases the lock, another waiting task can begin
flushing its buffer.

This commit makes `Task.checkpoint()` async, which is a breaking change.
All connectors calling `Task.checkpoint` must now await it. The connectors
in the `connectors` repo affected by this are updated in this commit as well.
@Alex-Bair Alex-Bair force-pushed the bair/estuary-cdk-async-checkpoints-serialize-emission branch from 2cb906f to 0ed92a6 Compare January 13, 2026 18:53
@Alex-Bair Alex-Bair marked this pull request as ready for review January 13, 2026 19:00
@Alex-Bair Alex-Bair requested a review from a team January 13, 2026 19:00
Copy link
Copy Markdown
Contributor

@JustinASmith JustinASmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me and aligns with the discussions on Slack! 🚀

@Alex-Bair Alex-Bair merged commit a93702a into main Jan 15, 2026
168 of 173 checks passed
@Alex-Bair Alex-Bair deleted the bair/estuary-cdk-async-checkpoints-serialize-emission branch January 15, 2026 13:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants