
source-monday: refactor and simplify to fix data discrepancies#3020

Merged: JustinASmith merged 1 commit into main from js/monday-simplification, Jul 10, 2025

Conversation

@JustinASmith (Contributor) commented Jul 3, 2025

Description:

This PR refactors the Monday.com source connector's GraphQL architecture to address data discrepancy issues and improve performance through modular design and intelligent caching.

Key Changes

🏗️ Architectural Refactoring

  • Modular GraphQL structure: Restructured monolithic graphql.py (1109 lines) into specialized modules:
    • graphql/activity_logs.py - Activity log operations
    • graphql/boards.py - Board management
    • graphql/items/ - Items processing with caching
    • graphql/query_executor.py - Centralized query execution with IncrementalJsonProcessor
    • graphql/constants.py - Shared constants

🚀 Performance & Memory Improvements

  • Memory-efficient streaming: Implemented IncrementalJsonProcessor for large GraphQL responses
  • Item cache system: Intelligent caching for backfill operations, reducing API calls and enabling meaningful checkpoints
  • API complexity management: Proactive GraphQL complexity tracking to prevent rate limits
  • Optimized batch processing: Configurable batch sizes maximizing API limits while maintaining low memory usage (<100MB during 10K+ board and 250K+ items backfill)
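The IncrementalJsonProcessor itself is not shown in this thread; the following is a minimal stdlib sketch of the underlying idea only (decode complete objects out of a growing buffer instead of loading the whole response). The function name and the bare-array assumption are illustrative, not the connector's actual implementation:

```python
import json

def iter_json_array(chunks):
    """Yield complete objects from a JSON array delivered in chunks.

    Toy illustration of incremental parsing: memory stays proportional
    to one item plus any incomplete tail, not the whole response.
    Handles only a bare top-level array of objects.
    """
    decoder = json.JSONDecoder()
    buf = ""
    opened = False
    for chunk in chunks:
        buf += chunk
        if not opened:
            start = buf.find("[")
            if start == -1:
                continue  # still waiting for the array to open
            buf = buf[start + 1:]
            opened = True
        while True:
            buf = buf.lstrip(", \n\t")
            if not buf or buf[0] == "]":
                break  # end of array, or nothing buffered yet
            try:
                obj, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                break  # incomplete object; wait for the next chunk
            yield obj
            buf = buf[end:]
```

The real processor additionally navigates a JSON path into the GraphQL response envelope and captures the remainder for complexity metadata; this sketch shows only the streaming mechanism.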

🔧 Data Consistency Fixes

  • Robust timestamp handling: Improved parsing of Monday.com's 17-digit timestamps
  • Better incremental sync: Enhanced boundary condition handling and cursor tracking
  • Reliable state management: Consistent checkpointing for resumable operations
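As a rough illustration of the 17-digit timestamp handling (a sketch only; the actual parser lives in source_monday/utils.py and may differ): activity logs report created_at as a 17-digit integer, which this sketch assumes is Unix time scaled by 10^7.

```python
from datetime import datetime, timezone

def parse_monday_timestamp(raw: str) -> datetime:
    # Assumption: the 17-digit value is Unix seconds * 10^7
    # (100-nanosecond resolution), as seen in activity log created_at.
    if len(raw) != 17 or not raw.isdigit():
        raise ValueError(f"expected a 17-digit timestamp, got {raw!r}")
    return datetime.fromtimestamp(int(raw) / 10_000_000, tz=timezone.utc)
```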

🛠️ Code Quality Improvements

  • Enhanced error handling: Centralized error management with improved logging and debugging
  • Separation of concerns: Clear component boundaries and extracted common utilities
  • API compliance: Better adherence to Monday.com API best practices

Impact

Before: Data discrepancies due to performance bottlenecks preventing checkpoints on large accounts, inefficient API usage causing rate limits, and complex tightly-coupled code.

After: Reliable data synchronization with meaningful progress tracking, optimized API usage with proactive rate limit management, and maintainable modular architecture.

Testing

  • ✅ Tested on local stack with 10K+ boards and 250K+ items backfill
  • ✅ Memory usage monitored and kept under 100MB
  • ✅ All qualifying boards (active/archived) backfilled efficiently in less than 1 hour
  • ✅ All qualifying items backfilled effectively in less than 2 hours
  • ✅ API rate limits respected with proactive delays; CDK rate-limit backoff can still occur, but far less frequently

Workflow steps: No changes to user workflow - connector maintains same interface and functionality.

Documentation links affected: None - internal refactoring only.

Notes for reviewers:

  • Focus on the modular architecture in source_monday/graphql/ directory
  • Item cache implementation provides significant performance gains for large accounts
  • All functionality preserved while improving reliability and maintainability

This change is Reviewable

@JustinASmith JustinASmith marked this pull request as ready for review July 3, 2025 21:55
@JustinASmith JustinASmith requested a review from Alex-Bair July 3, 2025 21:56
@JustinASmith (Contributor, Author):

@Alex-Bair This is marked as Ready for Review. However, I still need to finish the PR description (mostly AI-generated at the moment) and force-push a single commit describing the changes concisely.

@JustinASmith JustinASmith requested a review from Copilot July 3, 2025 22:02


@Alex-Bair (Member) left a comment:

I still need to review graphql.py and utils.py, but I had a few comments/questions so far.

Comment thread source-monday/source_monday/item_cache.py Outdated
Comment thread source-monday/source_monday/api.py Outdated
Comment thread source-monday/source_monday/api.py Outdated
Comment thread source-monday/source_monday/api.py Outdated
yield item
log.debug(
    f"Item {item.id} marked as deleted (updated: {item.updated_at})"
)
window_has_updates = True
@Alex-Bair (Member) commented Jul 3, 2025:

Should window_has_updates be reassigned to True whether or not the item was deleted? If so, both that reassignment and the max_updated_at_in_window one could be moved one level up, like:

if window_start <= item.updated_at <= window_end:
   window_has_updates = True
   max_updated_at_in_window = max(
       max_updated_at_in_window, item.updated_at
   )
   if item.state == "deleted":
       item.meta_ = Item.Meta(op="d")
       yield item
   else:
       item_ids_to_fetch.add(item.id)

@JustinASmith (Contributor, Author):

Good question! The way I have it should be correct, though. The reason is that I only set window_has_updates to True when we actually yield something from the function. I could probably rename it, or just use a docs_count counter and check docs_count > 0 to determine when we need to move the cursor forward.

@JustinASmith (Contributor, Author):

I'll use a counter to make it a bit more clear when it is used and why.

@JustinASmith (Contributor, Author):

I will also remove items_yielded, since it is not correctly tracked in all phases of this function's logic.
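A hypothetical, much-simplified sketch of the counter approach discussed in this thread (only window_start/window_end, the deleted-item handling, and the cursor-advance behavior come from the discussion; the dict-based items and the helper name are invented for illustration):

```python
from datetime import datetime, timedelta, timezone

def scan_window(items, window_start, window_end):
    """Toy version of the windowed scan: count only documents actually
    emitted, and advance the cursor past the window when nothing was
    (safe to advance even when empty, given the API's log retention)."""
    docs_count = 0
    max_updated_at = window_start
    item_ids_to_fetch = set()
    emitted = []
    for item in items:
        if window_start <= item["updated_at"] <= window_end:
            max_updated_at = max(max_updated_at, item["updated_at"])
            if item["state"] == "deleted":
                emitted.append({**item, "op": "d"})
                docs_count += 1  # counted only when a doc is emitted
            else:
                item_ids_to_fetch.add(item["id"])
    if docs_count > 0:
        next_cursor = max_updated_at + timedelta(seconds=1)
    else:
        next_cursor = window_end
    return emitted, item_ids_to_fetch, next_cursor
```

In the real connector the non-deleted items are fetched and yielded later, which would also bump the counter; this sketch omits that phase.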

Comment thread source-monday/source_monday/api.py Outdated
yield max_updated_at_in_window + timedelta(seconds=1)
else:
log.debug("Incremental sync complete. No updates found.")
yield window_end
@Alex-Bair (Member):

Why do we need to move the cursor forward if there are no updates? That would be necessary if the cursor expires after a while or if the source system has limited data retention (like Stripe only keeping ~30 days of events), and it might be necessary here to reduce API calls somehow. My suspicion is that fetch_items_changes checks a fixed date window so the binding can make smaller, incremental updates that require fewer API calls if it falls behind the present?

@JustinASmith (Contributor, Author):

Great question! This is actually a critical design decision related to Monday.com's API limitations. Monday.com only retains the most recent 10,000 activity logs per board. If we don't move the cursor forward when there are no updates, we risk getting stuck in a scenario where:

  1. Log Retention Issue: If we keep checking the same time window (or barely move the cursor) repeatedly without updates, and meanwhile new activity logs are being generated on the board, the older logs (that we haven't processed yet) could get purged under the 10,000-log limit.
  2. Data Loss Prevention: If we stayed at the same cursor position and the board had high activity, we could miss processing logs that existed during our previous check but were later purged due to the 10K retention limit.

Since the check for updates uses both a minimal query on updated_at and the activity_logs, I am confident we can safely move the log cursor forward. However, if you see something I do not, now that you know my reasoning, let me know!

I'll add a comment in the code to explain this Monday.com-specific behavior for future me :)

@Alex-Bair (Member):

Gotcha, it makes sense then that we have to advance the cursor. Thanks for adding that comment!

Comment thread source-monday/source_monday/models.py Outdated
Comment thread source-monday/source_monday/item_cache.py Outdated
@Alex-Bair (Member) left a comment:

Had a few more comments/questions. I like the overall direction you're heading, simplifying the connector so it's easier to figure out what's happening & troubleshoot when something goes wrong.

Comment thread source-monday/source_monday/graphql.py Outdated
)
raise ValueError("Query modification failed. Cannot find query body.")

if complexity_cache.next_query_delay > 0:
@Alex-Bair (Member):

Correct me if I missed it, but it doesn't look like accessing or setting complexity_cache is coordinated between multiple execute_query invocations. Does this complexity budget tracking strategy work without some kind of coordination? I imagine having separate execute_query coroutines accessing complexity_cache without any coordination would experience read-modify-write race conditions. Multiple coroutines can read the same shared state, make decisions based on that state, and then modify it - but the state they read becomes stale as soon as another coroutine modifies it.

If this works better than the current strategy, let's go with it. Managing complexity limits for GraphQL APIs seems pretty tricky, and I bet properly managing this with some kind of complexity reservation system/leaky bucket algorithm would take some time to work out.

@JustinASmith (Contributor, Author):

I removed complexity tracking and rate limiting except for setting a LOW_COMPLEXITY_BUDGET threshold of 10K to delay and wait for Monday's rate limit bucket to refill before continuing to request data. This seemed to work well enough and reduced the complexity of my prior implementation.

I also implemented a leaky bucket rate limiter, but that too was not worth the complexity. The main goal was to reduce the chance we hit many HTTP 429 errors and subsequently rack up longer and longer delays in the CDK's rate limiter, which is just an exponential backoff strategy.

@JustinASmith (Contributor, Author):

Also, there should not be an issue with multiple coroutines using this, since it is handled per-query. There may still be HTTP 429 errors that come up, but this threshold-and-delay approach worked fine in testing, even with many coroutines using the query executor.
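A hedged sketch of the threshold-and-delay approach described here. LOW_COMPLEXITY_BUDGET and the 10K value come from the comment above; the complexity field names (after, reset_in_x_seconds) are assumptions about the Monday.com response shape, not verified here:

```python
import asyncio

LOW_COMPLEXITY_BUDGET = 10_000  # threshold from the discussion above

async def maybe_wait_for_budget(complexity: dict, log=print) -> float:
    """If the remaining complexity budget is low, sleep until Monday's
    bucket refills; return how long we slept (0.0 if we didn't)."""
    remaining = complexity.get("after", 0)
    reset_in = complexity.get("reset_in_x_seconds", 0)
    if remaining < LOW_COMPLEXITY_BUDGET and reset_in > 0:
        log(f"complexity budget low ({remaining}); sleeping {reset_in}s")
        await asyncio.sleep(reset_in)
        return reset_in
    return 0.0
```

Because the check runs per-query against the latest complexity info returned by the API, stale reads between coroutines only risk an occasional extra 429, which the CDK's backoff already handles.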

Comment thread source-monday/source_monday/utils.py Outdated
Comment thread source-monday/source_monday/graphql.py Outdated
@JustinASmith JustinASmith requested a review from Copilot July 10, 2025 01:42


@JustinASmith JustinASmith force-pushed the js/monday-simplification branch from fcefbbd to 61dd163 Compare July 10, 2025 14:11
Copilot AI left a comment:

Pull Request Overview

This PR refactors the Monday.com connector to improve modularity, performance, and data consistency by:

  • Splitting the monolithic GraphQL file into domain-specific modules under source_monday/graphql/
  • Adding an item cache system and streaming processor for memory-efficient large-response handling
  • Updating default sync intervals and enhancing timestamp parsing and complexity tracking

Reviewed Changes

Copilot reviewed 17 out of 18 changed files in this pull request and generated 1 comment.

Summary per file:

  • tests/snapshots/*: Updated expected sync intervals in snapshots
  • test.flow.yaml: Aligned default intervals for all resources
  • source_monday/utils.py: Added robust 17-digit timestamp parser
  • source_monday/resources.py: Changed full/incremental fetch to use query constants and updated default intervals
  • source_monday/models.py: Streamlined models, added complexity info fields
  • source_monday/graphql/query_executor.py: Centralized GraphQL executor with complexity limits and retries
  • source_monday/graphql/items/*: New modular item fetch logic and cache adapter
  • source_monday/graphql/boards.py: Batch-aware board fetchers
  • source_monday/graphql/activity_logs.py: Activity log fetch with improved timestamp filtering
  • source_monday/graphql/constants.py: Introduced centralized GraphQL queries
  • source_monday/api.py: Switched to new fetch and cache session API
  • source_monday/__init__.py: Minor signature adjustment for request_class
  • pyproject.toml: Added aiostream dependency
Comments suppressed due to low confidence (2)

source-monday/source_monday/graphql/items/item_cache.py:216

  • [nitpick] This helper returns an empty async generator but is undocumented. Adding a one-line docstring to explain that it exists purely to satisfy the async generator signature would improve readability.
    async def _empty_generator(self) -> AsyncGenerator[Item, None]:

source-monday/source_monday/graphql/items/item_cache.py:1

  • The new ItemCacheSession and its methods (process_page, _stream_items_from_cache, etc.) are critical for backfill performance but lack direct test coverage. Consider adding unit tests that simulate multiple boards/items and edge cases such as no remaining boards or items.
import itertools

        return

    except GraphQLQueryError as e:
        complexity_error = False
Copilot AI commented Jul 10, 2025:

The complexity_error flag is initialized but never set to True when a complexity-related error is detected. This prevents the retry logic from distinguishing complexity errors; consider setting complexity_error = True inside the relevant if error.extensions.complexity > error.extensions.maxComplexity or other complexity branches.

@JustinASmith (Contributor, Author):

@Alex-Bair this is a good catch by copilot. I am going to remove this complexity_error variable, keeping the same logic. This is what I'll force push once you are done reviewing.

diff --git a/source-monday/source_monday/graphql/query_executor.py b/source-monday/source_monday/graphql/query_executor.py
index d5f424e0..153c34cc 100644
--- a/source-monday/source_monday/graphql/query_executor.py
+++ b/source-monday/source_monday/graphql/query_executor.py
@@ -167,7 +167,6 @@ async def execute_query(
             return
 
         except GraphQLQueryError as e:
-            complexity_error = False
             for error in e.errors:
                 if error.extensions:
                     if error.extensions.complexity and error.extensions.maxComplexity:
@@ -198,9 +197,10 @@ async def execute_query(
                         },
                     )
                     await asyncio.sleep(COMPLEXITY_RESET_WAIT_SECONDS)
+                    attempt += 1
                     break
 
-            if not complexity_error and attempt == MAX_RETRY_ATTEMPTS:
+            if attempt == MAX_RETRY_ATTEMPTS:
                 log.error(
                     "GraphQL streaming query failed permanently",
                     {
@@ -213,23 +213,21 @@ async def execute_query(
                 )
                 raise
 
-            if not complexity_error:
-                retry_delay = attempt * 2
-                log.warning(
-                    "GraphQL query failed - retrying with exponential backoff",
-                    {
-                        "error": str(e),
-                        "attempt": attempt,
-                        "max_attempts": MAX_RETRY_ATTEMPTS,
-                        "query_preview": modified_query[:100] + "..." if len(modified_query) > 100 else modified_query,
-                        "variables": variables,
-                        "json_path": json_path,
-                        "response_model": cls.__name__,
-                        "retry_delay_seconds": retry_delay,
-                    },
-                )
-                await asyncio.sleep(retry_delay)
-
+            retry_delay = attempt * 2
+            log.warning(
+                "GraphQL query failed - retrying with exponential backoff",
+                {
+                    "error": str(e),
+                    "attempt": attempt,
+                    "max_attempts": MAX_RETRY_ATTEMPTS,
+                    "query_preview": modified_query[:100] + "..." if len(modified_query) > 100 else modified_query,
+                    "variables": variables,
+                    "json_path": json_path,
+                    "response_model": cls.__name__,
+                    "retry_delay_seconds": retry_delay,
+                },
+            )
+            await asyncio.sleep(retry_delay)
             attempt += 1
 
         except Exception as e:

@Alex-Bair (Member) left a comment:

LGTM % question around fewer fields in the capture snapshot for items.

Comment thread source-monday/source_monday/api.py Outdated
return dt.strftime(DATETIME_STRING_FORMAT)


def _str_to_dt(string: str) -> datetime:
@Alex-Bair (Member):

nit: It doesn't look like _str_to_dt is used anywhere?

Comment thread source-monday/source_monday/api.py Outdated
Comment on lines +50 to +68
async def _get_or_create_item_cache_session(
    http: HTTPSession, log: Logger, cutoff: datetime
) -> ItemCacheSession:
    global _item_cache_session

    if _item_cache_session is None or _item_cache_session.cutoff != cutoff:
        log.debug(f"Creating new item cache session for cutoff: {cutoff}")
        _item_cache_session = ItemCacheSession(
            http=http,
            log=log,
            cutoff=cutoff,
            boards_batch_size=100,
            items_batch_size=500,
        )
        await _item_cache_session.initialize()

    return _item_cache_session
@Alex-Bair (Member):

nit: Instead of having _get_or_create_item_cache_session initialize the global variable _item_cache_session, can we instead initialize an ItemCacheSession when we're creating the items resource in resources.py and thread it into fetch_items_page? I think that'd be clearer, especially since the ItemCacheSession is only used within fetch_items_page and nowhere else.
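The reviewer's suggestion could be sketched roughly like this (the class and function bodies are stand-ins; only the names ItemCacheSession, fetch_items_page, and resources.py come from the PR):

```python
import functools

class ItemCacheSession:
    """Stand-in for the real session; only the cutoff field is modeled."""
    def __init__(self, cutoff):
        self.cutoff = cutoff

def fetch_items_page(session: ItemCacheSession, page: int):
    # placeholder body; the real function streams items via the session
    return (session.cutoff, page)

def make_items_resource(cutoff):
    # Build the session once where the resource is defined (resources.py)
    # and thread it in, instead of lazily creating a module-level global.
    session = ItemCacheSession(cutoff=cutoff)
    return functools.partial(fetch_items_page, session)
```

This keeps the session's lifetime tied to the resource that uses it, which is easier to reason about than a global created on first call.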

@Alex-Bair (Member):

nit: None of the changes in this file are needed, are they?

@JustinASmith (Contributor, Author):

I blame Claude Code. 😆 I'll revert that since it was just things that satisfy mypy and/or reordering imports.

Comment thread source-monday/source_monday/api.py Outdated
Comment on lines +280 to +284
else:
    log.warning(
        f"Item {item.id} updated_at {item.updated_at} is outside the sync window "
        f"({window_start} to {window_end}). This should not happen with the query filter."
    )
@Alex-Bair (Member):

nit: Trying to confirm my understanding. We have to add a day to end since the API uses a daily granularity, and we're expecting there to be some items returned that have been updated after window_end. We're filtering those out client-side, which makes sense. But that would mean we're expecting to reach this warning log in the else branch occasionally. The log message sounds like this should never happen, but it seems like it's actually expected?

@JustinASmith (Contributor, Author):

Nice catch. That is true: it is expected, and I'll remove that log message 🤦. That is just what will happen with the daily granularity constraint and needing to move window_end a day forward.
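The daily-granularity workaround being discussed, as a minimal sketch (the function names are invented; the widen-by-one-day query bound and the silent client-side filter are what the thread describes):

```python
from datetime import date, datetime, timedelta, timezone

def query_bounds(window_start: datetime, window_end: datetime):
    """The API filter accepts whole days only: widen the end by a day so
    no in-window items are missed; extras are dropped client-side."""
    return window_start.date(), window_end.date() + timedelta(days=1)

def in_window(updated_at: datetime, window_start: datetime, window_end: datetime) -> bool:
    # Client-side filter: out-of-window items from the widened query are
    # expected, so they are dropped without a warning.
    return window_start <= updated_at <= window_end
```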


json_path: str,
query: str,
variables: dict[str, Any] | None = None,
remainder_cls: type[TRemainder] = GraphQLResponseRemainder, # type: ignore[assignment]
@Alex-Bair (Member):

nit: It doesn't look like we need the #type: ignore[assignment], do we? If I remove it, I don't see any type errors/warnings.

Comment on lines -124 to -165
"assets": [],
"board": {
"id": "9323816949",
"name": "jjjjj"
"id": "8431289505"
},
"column_values": [
{
"id": "person",
"text": "",
"type": "people",
"value": null
},
{
"id": "status",
"text": "Working on it",
"type": "status",
"value": "{\"index\":0,\"post_id\":null,\"changed_at\":\"2019-03-01T17:24:57.321Z\"}"
},
{
"id": "date4",
"text": "2025-06-04",
"type": "date",
"value": "{\"date\":\"2025-06-04\",\"icon\":null,\"changed_at\":\"2025-06-06T20:38:20.505Z\"}"
}
],
"created_at": "2025-06-06T20:38:19Z",
"creator_id": "71985416",
"group": {
"id": "topics"
},
"id": "9323816986",
"name": "Item 1",
"parent_item": null,
"state": "archived",
"subitems": [],
"subscribers": [
{
"id": "71985416"
}
],
"updated_at": "redacted",
"updates": []
@Alex-Bair (Member):

There are a lot of fields that are no longer returned for items. What's the reason for that?

@JustinASmith (Contributor, Author):

Oops. That was because I simplified the query during final testing to get data quicker and forgot to add it back. Note that this was only for a final test; I tested with the full queries too.

@JustinASmith (Contributor, Author):

Note that the captured board/items are different, since the prior snapshot was for a deleted board, and deleted boards should not be captured since their items/subitems cannot be queried.

…erformance and memory improvements

• Restructure monolithic graphql.py (1109 lines) into specialized modules:
  - graphql/activity_logs.py for activity log operations
  - graphql/boards.py for board management
  - graphql/items/ package with items.py and item_cache.py
  - graphql/query_executor.py for centralized query execution with IncrementalJsonProcessor
  - graphql/constants.py for shared constants
• Implement IncrementalJsonProcessor for memory-efficient streaming of large GraphQL responses
• Add item cache system for efficient items backfill operations and reduced API calls
• Enhance API client with improved error handling, complexity management, and debugging capabilities
• Improve code quality, API compliance, and maintainability through modular design
• Update models and resources to support modular architecture
• Add utility functions for common operations
• Update test snapshots to reflect architectural changes

This refactoring addresses data discrepancy (missing data) issues and maintainability concerns
while significantly improving performance and memory efficiency for data synchronization operations.
@JustinASmith JustinASmith force-pushed the js/monday-simplification branch from 61dd163 to e85e629 Compare July 10, 2025 16:26
@JustinASmith JustinASmith merged commit f9bc82b into main Jul 10, 2025
93 of 103 checks passed