Conversation
Add @overload signatures to IncrementalCSVProcessor allowing it to be used without a Pydantic model. When a Pydantic model is not used, raw dicts are yielded instead, with no validation performed. This enables connectors to bypass Pydantic validation overhead for high-throughput CSV processing where the performance cost of validation outweighs its benefits.
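A minimal sketch of the overload trick the commit describes, assuming an async-iterator shape for the processor; class internals and the CSV-parsing details here are illustrative, not the connector's actual implementation:

```python
import asyncio
import csv
import io
from typing import Any, AsyncIterator, Generic, Optional, TypeVar, overload

T = TypeVar("T")


class IncrementalCSVProcessor(Generic[T]):
    """Sketch: yields model instances when a model class is supplied,
    raw dicts (no validation) when it is not."""

    # The @overload pair narrows the element type at call sites:
    # passing a model class gives an iterator of that model; passing
    # nothing gives an iterator of dict[str, Any].
    @overload
    def __init__(self: "IncrementalCSVProcessor[T]", data: str, model: type[T]) -> None: ...
    @overload
    def __init__(self: "IncrementalCSVProcessor[dict[str, Any]]", data: str, model: None = None) -> None: ...

    def __init__(self, data: str, model: Optional[type] = None) -> None:
        self._rows = csv.DictReader(io.StringIO(data))
        self._model = model

    def __aiter__(self) -> AsyncIterator[T]:
        return self

    async def __anext__(self) -> T:
        try:
            row = next(self._rows)
        except StopIteration:
            raise StopAsyncIteration
        if self._model is None:
            # No model: skip validation entirely, yield the raw dict.
            return row  # type: ignore[return-value]
        return self._model(**row)
```

Type checkers resolve `async for doc in IncrementalCSVProcessor(data)` to `dict[str, Any]` and `async for doc in IncrementalCSVProcessor(data, MyModel)` to `MyModel`, which is what lets `__anext__`'s return type narrow without a cast.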
…on for documents
Pydantic validation is arguably not necessary for this connector. It primarily processes CSVs and performs very light transformations (converting empty strings to `None`, converting boolean fields to actual booleans, and adding the `_meta.op` field). Yielding validated Pydantic model instances also causes the CDK to serialize documents with `model_dump_json()`, which performs additional validation. This connector has high CPU usage, and removing Pydantic validation should speed it up. By performing the light transformation itself and yielding raw dicts, the connector uses the CDK's faster `orjson` serialization pathway when capturing documents, avoiding the overhead of Pydantic validation and serialization. This reduces steady-state CPU usage from ~80% to ~35% when capturing a single data-heavy binding.
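The "light transformation" listed above can be sketched as a plain function over a CSV row dict. This is an assumption-laden illustration: the boolean field names and the `op` value are hypothetical, not taken from the connector.

```python
from typing import Any

# Hypothetical set of columns known to hold booleans; the real connector
# would derive this from the binding's schema.
BOOLEAN_FIELDS = {"IsDeleted"}


def transform_row(row: dict[str, Any], op: str = "u") -> dict[str, Any]:
    """Apply the connector's light per-row transformation without Pydantic:
    empty strings -> None, boolean columns -> real bools, attach _meta.op."""
    out: dict[str, Any] = {}
    for key, value in row.items():
        if value == "":
            out[key] = None
        elif key in BOOLEAN_FIELDS:
            out[key] = str(value).lower() in ("true", "1", "yes")
        else:
            out[key] = value
    out["_meta"] = {"op": op}
    return out
```

Because the result is a plain dict, the CDK can hand it straight to its `orjson` serialization path instead of round-tripping through `model_dump_json()`.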
nicolaslazo
approved these changes
Jan 11, 2026
Contributor
nicolaslazo
left a comment
LGTM. I really like that overload trick to narrow down those `__anext__` signatures.
…n incremental_csv_processor.py
Description:
When capturing documents, source-dynamics-365-finance-and-operations is CPU bound: the connector can't capture documents faster than it reads rows from CSVs. Yielding Pydantic model instances as documents adds validation and serialization overhead that's arguably not necessary for source-dynamics-365-finance-and-operations; we're doing very light transformations on a small handful of fields, and all others are left unchanged. Yielding dicts instead avoids some of this overhead and goes through the faster orjson serialization path in the CDK.
Workflow steps:
(How does one use this feature, and how has it changed)
Documentation links affected:
(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)
Notes for reviewers:
Tested on a local stack and with benchmark tests. Confirmed:
py-spy flame graphs show that Pydantic validation and serialization no longer take up a significant portion of CPU time when capturing documents.
docker stats shows steady-state CPU usage dropped from ~80% to ~35% after this change when capturing from a single binding.