source monday: use dict page cursor for items backfill#3068
source monday: use dict page cursor for items backfill#3068JustinASmith merged 3 commits intomainfrom
Conversation
0a0e93d to
8089097
Compare
|
@Alex-Bair I have this PR in separate commits for easier review. Will likely consolidate into three commits. One for CDK changes, one for I will also address some of the items Copiolot Review pointed out once I do a final force push. |
Alex-Bair
left a comment
There was a problem hiding this comment.
Looks really good, nice job! I had some minor nits, mostly aimed around answering questions I know I'll have down the road once I lose context around the reason for these changes.
I haven't approved yet because I'd like to take the CDK changes for a spin & make sure I'm not missing any edge cases. I'll try to do a little testing either this weekend or on Monday so you're not waiting on me for too long.
|
Thanks @Alex-Bair! I'll get to reviewing your comments in more detail soon. Good points though. Having more comments and docstrings will help future us! |
Implements in-place JSON merge patching for cursor state updates, allowing efficient incremental tracking and checkpointing. Streamlines backfill logic to properly handle structured cursors and emit precise state patches for improved runtime efficiency. Addresses need for flexible cursor management and efficient state updates in paged result processing.
Simplifies the logic for fetching items from boards by removing the iterator class, consolidating the process into standalone async functions, and clarifying error handling for invalid inputs. Fixes a bug in cursor collection to support both board-level and next-page cursors, enhancing reliability and maintainability of pagination. Removes unused logic related to tracking the oldest updated board, focusing on streamlined item retrieval.
9ceead6 to
7729065
Compare
This addresses a fundamental limitation and issue with the prior implementation by using a dictionary for tracking boards that need their items backfilled. This leverages the CDK so that the initial page of all boards is first yielded and only patches for processed boards are yieled on subsequent invocations fo `fetch_items_page`. This improves the reliability, efficiency and consistency in backfilling items, while working around the API limitations that Monday.com presents.
7729065 to
8155d52
Compare
There was a problem hiding this comment.
Pull Request Overview
This pull request introduces dict-based page cursors for the Monday.com connector's items backfill functionality, replacing the previous timestamp-based approach. The changes improve granular progress tracking and fault tolerance by enabling JSON merge patches for efficient incremental state updates.
- Implements structured cursor support in the Estuary CDK with JSON merge patch functionality for efficient state updates
- Refactors Monday.com items backfill to use board-level tracking instead of timestamp-based pagination
- Simplifies the items fetching logic by removing the
BoardItemIteratorwrapper class
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| estuary-cdk/estuary_cdk/utils.py | Adds JSON merge patch utility function for RFC 7396-compliant dictionary updates |
| estuary-cdk/estuary_cdk/capture/common.py | Introduces dict-based cursor support with marker constants and merge patch handling in backfill logic |
| source-monday/source_monday/models.py | Adds ItemsBackfillCursor dataclass for type-safe cursor operations and removes unused items_count field |
| source-monday/source_monday/api.py | Refactors fetch_items_page to use dict-based cursors and board-level tracking instead of timestamp pagination |
| source-monday/source_monday/graphql/items.py | Removes BoardItemIterator class and simplifies to direct function calls with improved cursor collection |
| source-monday/source_monday/graphql/boards.py | Increases query limit from 500 to 10,000 and removes items_count from GraphQL query |
| source-monday/source_monday/graphql/init.py | Updates exports to replace BoardItemIterator with get_items_from_boards function |
Description:
See #3047 (#3047 (comment)) for more context.
This pull request introduces enhancements to cursor handling, backfill logic, and JSON merge patch functionality in the Estuary CDK and Source Monday integrations. The changes aim to improve efficiency, scalability, and fault tolerance, particularly in handling paginated data and structured cursors for the
itemsstream. This PR also simplified theitems.pyBoardItemIteratorby removing the unnecessary class that wrapped the functions. Below is a categorized summary of the most important changes:Cursor Handling Enhancements
CURSOR_MARKERconstant and utility functions (is_cursor_dict,make_cursor_dict,pop_cursor_marker) to support dict-based cursors for structured and granular progress tracking. (estuary-cdk/estuary_cdk/capture/common.py, estuary-cdk/estuary_cdk/capture/common.pyL53-R78)PageCursortype to includedictfor structured cursors and added support for JSON merge patches to enable efficient incremental updates. (estuary-cdk/estuary_cdk/capture/common.py, estuary-cdk/estuary_cdk/capture/common.pyL53-R78)Backfill Logic Improvements
_binding_backfill_taskfunction to handle dict-based cursors, including logic for JSON merge patches and proper checkpointing of incremental updates. (estuary-cdk/estuary_cdk/capture/common.py, estuary-cdk/estuary_cdk/capture/common.pyL882-R912)_initialize_connector_stateas a helper function for better encapsulation of connector state initialization. (estuary-cdk/estuary_cdk/capture/common.py, estuary-cdk/estuary_cdk/capture/common.pyR869)JSON Merge Patch Implementation
json_merge_patchutility function to apply RFC 7396-compliant JSON merge patches, enabling efficient in-place updates to dictionaries. (estuary-cdk/estuary_cdk/utils.py, estuary-cdk/estuary_cdk/utils.pyR4-R38)Source Monday Integration Updates
BoardItemIteratorwithget_items_from_boardsfor improved cursor handling and streamlined item fetching. (source-monday/source_monday/api.py, [1] [2]fetch_items_pagefunction to use dict-based cursors for granular progress tracking and efficient recovery. (source-monday/source_monday/api.py, source-monday/source_monday/api.pyL320-R399)GraphQL Query and Cursor Processing Refinements
fetch_boards_minimalquery limit from 500 to 10,000 for better scalability. (source-monday/source_monday/graphql/boards.py, source-monday/source_monday/graphql/boards.pyL54-R65)CursorCollectorclass by removing unused attributes and improving cursor processing for nested structures likenext_items_page. (source-monday/source_monday/graphql/items.py, [1] [2]Workflow steps:
(How does one use this feature, and how has it changed)
Documentation links affected:
(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)
Notes for reviewers:
I tested this on a local stack.
itemsstream works as expected.fetch_items_pageinvocations fetched items for all boards in the current invocation and emitted a patch checkpoint for those boards in the form{"board1": None, "board2": None}to remove those boards from the connector state and not bloat the recovery log.gazctlto read the Gazette recovery log and verified at that lower-level the recovery log only contained the initial full checkpointed pager cursor dictionary followed by incremental patches or updates which were setting board IDs tonull(i.e., removing them from the state by merge patching).This change is