-
Notifications
You must be signed in to change notification settings - Fork 24
feat(cdk): add CombinedExtractor #551
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
/autofix
|
Warning Rate limit exceeded@lazebnyi has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 3 minutes and 24 seconds before requesting another review. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📒 Files selected for processing (1)
📝 Walkthrough""" WalkthroughA new Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant DeclarativeSource
participant ModelToComponentFactory
participant CombinedExtractor
participant SubExtractor1
participant SubExtractor2
User->>DeclarativeSource: Define stream with CombinedExtractor
DeclarativeSource->>ModelToComponentFactory: Build extractor from model
ModelToComponentFactory->>CombinedExtractor: Instantiate with sub-extractors
CombinedExtractor->>SubExtractor1: extract_records(response)
CombinedExtractor->>SubExtractor2: extract_records(response)
CombinedExtractor->>CombinedExtractor: Zip and merge records
CombinedExtractor-->>DeclarativeSource: Yield merged records
Would you like me to provide a more detailed sequence diagram focusing on the internal merging logic of ✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 8
🧹 Nitpick comments (17)
airbyte_cdk/sources/declarative/extractors/combined_extractor.py (2)
1-3
: Update copyright year.The copyright notice shows 2025, but the current year is 2024.
-# Copyright (c) 2025 Airbyte, Inc., all rights reserved. +# Copyright (c) 2024 Airbyte, Inc., all rights reserved.
13-35
: Well-documented extractor implementation with a clear docstring.The class documentation clearly explains the purpose and behavior of the extractor. Could you consider adding information about edge cases to the docstring? Such as:
- What happens when extractors produce different numbers of records?
- How are key conflicts handled when merging records from different extractors?
airbyte_cdk/sources/declarative/extractors/key_value_extractor.py (1)
38-42
: Edge-case: infinite loop whenkeys
is empty.If
keys_extractor
yields an empty list,len(keys)
is0
;islice(values, 0)
returns an empty iterator, leavingchunk
empty and the loop spinning forever because thebreak
is never reached.Maybe bail out early?
keys = list(self.keys_extractor.extract_records(response)) - values = self.values_extractor.extract_records(response) + if not keys: + return # nothing to map, silently exit + values = self.values_extractor.extract_records(response)Would that make sense?
airbyte_cdk/sources/declarative/resolvers/config_components_resolver.py (2)
124-125
: Guard againstproduct()
on empty input?If
self.stream_configs
is an empty list,all_indexed_streams
becomes empty andproduct(*all_indexed_streams)
raisesTypeError
.
Perhaps return an empty list (or[{}]
) in that scenario?all_indexed_streams = list(prepare_streams()) if not all_indexed_streams: return [{}] return [merge_combination(c) for c in product(*all_indexed_streams)]Would that make the resolver more robust?
🧰 Tools
🪛 GitHub Actions: Linters
[error] 124-124: Call to untyped function "prepare_streams" in typed context [no-untyped-call]
[error] 125-125: Call to untyped function "merge_combination" in typed context [no-untyped-call]
154-159
:create_or_update
path handling silently ignores falsy values – intentional?
if parsed_value and not updated and resolved_component.create_or_update:
• A legitimate falsy value (e.g.,
0
,False
,""
) will block path creation.
• Should the check beupdated is False
instead of the truthiness ofparsed_value
?Just checking intent, wdyt?
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (5)
1744-1768
: PolishKeyValueExtractor
schema definition
The block currently uses placeholder text fordescription
and an enum value with extra whitespace (enum: [ KeyValueExtractor ]
). Could we clarify thedescription
, remove the leading space in the enum, and include anexamples
section underproperties
to illustrate typical usage? wdyt?
1769-1791
: EnhanceCombinedExtractor
schema clarity
Similar toKeyValueExtractor
, theCombinedExtractor
definition has a placeholderdescription
and an enum with surrounding spaces (enum: [ CombinedExtractor ]
). Should we refine the description to explain its purpose, normalize the enum formatting, and add example usages for theextractors
array? wdyt?
2359-2364
: Defineschema_filter
description and examples
The newschema_filter
property uses a placeholderdescription
. Could we provide a meaningful description (e.g. "Filter to apply to schema records before merging") and add examples showing aRecordFilter
orCustomRecordFilter
in action? wdyt?
4053-4055
: Improvecreate_or_update
property documentation
Thecreate_or_update
flag lacks a descriptive comment. Could we add adescription
to explain its behavior (e.g. "If true, existing fields will be updated; otherwise only new fields are created")? wdyt?
4122-4126
: Add examples for multi-configstream_config
Thestream_config
property now accepts either a singleStreamConfig
or an array. Could we include examples in the schema comments to illustrate both cases and guide users? wdyt?airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)
2247-2259
: Same default-argument pattern & nil-safety increate_combined_extractor
For the same reasons as above, could we:
- Change the
decoder
default toNone
and lazy-instanciate it?- (Optional) Validate
model.extractors
is not empty to fail fast.- decoder: Optional[Decoder] = JsonDecoder(parameters={}), + decoder: Optional[Decoder] = None, ... - extractors = [ + if decoder is None: + decoder = JsonDecoder(parameters={}) + + extractors = [ self._create_component_from_model(model=extractor, decoder=decoder, config=config) for extractor in model.extractors ]airbyte_cdk/sources/declarative/models/declarative_component_schema.py (6)
1481-1482
: Add validation metadata forcreate_or_update
?The new flag looks great. Would you consider wrapping it in
Field(..., description="...", title="Create Or Update")
to keep the rich-schema experience consistent with the other attributes, wdyt?
1499-1500
: Field name vs. type plurality
stream_config
now accepts a list, yet the attribute remains singular.
To avoid confusion for users authoring manifests, would renaming it tostream_configs
(and adding an alias for backwards compatibility) make sense, wdyt?
1847-1856
: KeyValueExtractor – consider min-items & richer docs
- Should we enforce at least one key-value pair by adding
min_items=1
onkeys_extractor
/values_extractor
?- The
description="placeholder"
texts will surface in the JSON-schema. Maybe flesh them out to guide connector authors, wdyt?Both are tiny polish items but improve UX.
1858-1864
: CombinedExtractor – guarantee non-emptyextractors
list
extractors
is required but could still be empty. Addingmin_items=1
inField()
would prevent a silent mis-configuration. Worth tightening?
Also, expanding the placeholder description will help future users.
1903-1903
: Forward-ref update for dependant models
RecordSelector
now referencesCombinedExtractor
; you addedCombinedExtractor.update_forward_refs()
(👍).
For full safety, should we also callRecordSelector.update_forward_refs()
so that external callers importing only that class don’t have to?
2436-2438
:schema_filter
placeholder descriptionNice addition! Would you like to replace
"placeholder"
with something actionable (e.g. “Filter to drop/keep schema properties based on a record-level predicate”)?
This will read better in generated docs, wdyt?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting
📒 Files selected for processing (10)
airbyte_cdk/sources/declarative/declarative_component_schema.yaml
(6 hunks)airbyte_cdk/sources/declarative/extractors/__init__.py
(2 hunks)airbyte_cdk/sources/declarative/extractors/combined_extractor.py
(1 hunks)airbyte_cdk/sources/declarative/extractors/key_value_extractor.py
(1 hunks)airbyte_cdk/sources/declarative/manifest_declarative_source.py
(1 hunks)airbyte_cdk/sources/declarative/models/declarative_component_schema.py
(7 hunks)airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
(9 hunks)airbyte_cdk/sources/declarative/resolvers/components_resolver.py
(2 hunks)airbyte_cdk/sources/declarative/resolvers/config_components_resolver.py
(6 hunks)airbyte_cdk/sources/declarative/schema/dynamic_schema_loader.py
(4 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (5)
airbyte_cdk/sources/declarative/extractors/__init__.py (4)
airbyte_cdk/sources/declarative/extractors/combined_extractor.py (1)
CombinedExtractor
(14-45)airbyte_cdk/sources/declarative/models/declarative_component_schema.py (3)
CombinedExtractor
(1858-1863)DpathExtractor
(696-709)KeyValueExtractor
(1847-1855)airbyte_cdk/sources/declarative/extractors/http_selector.py (1)
HttpSelector
(13-37)airbyte_cdk/sources/declarative/extractors/key_value_extractor.py (1)
KeyValueExtractor
(15-42)
airbyte_cdk/sources/declarative/extractors/key_value_extractor.py (3)
airbyte_cdk/sources/declarative/extractors/record_extractor.py (1)
RecordExtractor
(12-27)airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)
KeyValueExtractor
(1847-1855)airbyte_cdk/sources/declarative/extractors/combined_extractor.py (1)
extract_records
(38-45)
airbyte_cdk/sources/declarative/extractors/combined_extractor.py (3)
airbyte_cdk/sources/declarative/extractors/record_extractor.py (1)
RecordExtractor
(12-27)airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)
CombinedExtractor
(1858-1863)airbyte_cdk/sources/declarative/extractors/key_value_extractor.py (1)
extract_records
(34-42)
airbyte_cdk/sources/declarative/resolvers/config_components_resolver.py (1)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)
StreamConfig
(1485-1494)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4)
airbyte_cdk/sources/declarative/extractors/combined_extractor.py (1)
CombinedExtractor
(14-45)airbyte_cdk/sources/declarative/models/declarative_component_schema.py (20)
CombinedExtractor
(1858-1863)DpathExtractor
(696-709)KeyValueExtractor
(1847-1855)Config
(134-135)Config
(148-149)Config
(162-163)Config
(176-177)Config
(194-195)Config
(208-209)Config
(222-223)Config
(236-237)Config
(250-251)Config
(264-265)Config
(278-279)Config
(292-293)Config
(308-309)Config
(322-323)Config
(336-337)Config
(370-371)DynamicSchemaLoader
(2429-2458)airbyte_cdk/sources/declarative/extractors/key_value_extractor.py (1)
KeyValueExtractor
(15-42)airbyte_cdk/sources/declarative/schema/dynamic_schema_loader.py (1)
DynamicSchemaLoader
(120-297)
🪛 GitHub Actions: Linters
airbyte_cdk/sources/declarative/schema/dynamic_schema_loader.py
[error] 182-182: Argument 1 to "filter_records" of "RecordFilter" has incompatible type "ItemsView[str, Any]"; expected "Iterable[Mapping[str, Any]]" [arg-type]
[error] 183-183: Invalid index type "int" for "Mapping[str, Any]"; expected type "str" [index]
airbyte_cdk/sources/declarative/resolvers/config_components_resolver.py
[error] 97-97: Function is missing a return type annotation [no-untyped-def]
[error] 98-98: Function is missing a type annotation [no-untyped-def]
[error] 103-103: Function is missing a type annotation [no-untyped-def]
[error] 106-106: Function is missing a return type annotation [no-untyped-def]
[error] 108-108: Call to untyped function "resolve_path" in typed context [no-untyped-call]
[error] 110-110: Call to untyped function "normalize_configs" in typed context [no-untyped-call]
[error] 112-112: Unsupported left operand type for + ("object") [operator]
[error] 113-113: Need type annotation for "item" [var-annotated]
[error] 113-113: Argument 1 to "enumerate" has incompatible type "object"; expected "Iterable[Never]" [arg-type]
[error] 115-115: Function is missing a type annotation [no-untyped-def]
[error] 124-124: Call to untyped function "prepare_streams" in typed context [no-untyped-call]
[error] 125-125: Call to untyped function "merge_combination" in typed context [no-untyped-call]
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
[error] 2476-2476: Argument 1 to "_create_component_from_model" of "ModelToComponentFactory" has incompatible type "RecordFilter | CustomRecordFilter | None"; expected "BaseModel" [arg-type]
airbyte_cdk/sources/declarative/models/declarative_component_schema.py
[error] 1493-1493: Missing type parameters for generic type "List" [type-arg]
⏰ Context from checks skipped due to timeout of 90000ms (4)
- GitHub Check: Check: 'source-amplitude' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (7)
airbyte_cdk/sources/declarative/extractors/__init__.py (2)
5-8
: Imports for new extractor classes are added correctly.The imports for the new
CombinedExtractor
andKeyValueExtractor
classes follow the existing import pattern and alphabetical ordering, which is good for code organization.
16-25
: Updated__all__
list to include new extractor classes.The new extractor classes are correctly added to the
__all__
list, maintaining alphabetical ordering which helps with maintainability.airbyte_cdk/sources/declarative/resolvers/components_resolver.py (2)
25-25
: New parametercreate_or_update
added with sensible default.The boolean parameter is added with a default of
False
, which maintains backward compatibility. This parameter will control whether new config paths should be created when setting values during component resolution.
38-38
: Matchingcreate_or_update
parameter added toResolvedComponentMappingDefinition
.Good consistency in adding the same parameter with the same default value to both dataclasses.
airbyte_cdk/sources/declarative/manifest_declarative_source.py (1)
265-265
: Good refactoring for cleaner code.Using the
dynamic_streams
property instead of directly calling_dynamic_stream_configs
improves code readability and follows good OOP principles by encapsulating the implementation details.airbyte_cdk/sources/declarative/extractors/combined_extractor.py (1)
38-45
: Clean implementation for combining multiple extractors.The implementation is concise and effective. It creates iterables from each extractor, zips them together (handling the case where they have different lengths by stopping at the shortest), and merges the records using
dict.update()
.Note that when merging dictionaries, if the same key exists in multiple records, later records will overwrite values from earlier records. This is the expected behavior of
dict.update()
, but worth being aware of.airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)
3369-3374
: ConfirmRecordSelector
supports new extractors
Great to seeCombinedExtractor
andKeyValueExtractor
added to theextractor
union. This aligns with the new definitions and will enable composite extraction patterns.
airbyte_cdk/sources/declarative/schema/dynamic_schema_loader.py
Outdated
Show resolved
Hide resolved
airbyte_cdk/sources/declarative/extractors/key_value_extractor.py
Outdated
Show resolved
Hide resolved
airbyte_cdk/sources/declarative/resolvers/config_components_resolver.py
Outdated
Show resolved
Hide resolved
airbyte_cdk/sources/declarative/declarative_component_schema.yaml
Outdated
Show resolved
Hide resolved
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
Outdated
Show resolved
Hide resolved
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
Outdated
Show resolved
Hide resolved
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
Outdated
Show resolved
Hide resolved
airbyte_cdk/sources/declarative/models/declarative_component_schema.py
Outdated
Show resolved
Hide resolved
/autofix
|
/autofix
|
…q/airbyte-python-cdk into lazebnyi/add-combined-extractor
/autofix
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🔭 Outside diff range comments (1)
unit_tests/sources/declarative/extractors/test_combined_extractor.py (1)
105-139
: 🛠️ Refactor suggestionConsider adding tests for edge cases
The test verifies basic functionality, but there are several edge cases that aren't covered:
- What happens if extractors return different numbers of records?
- How are key conflicts handled if two extractors extract records with the same keys?
- How does the extractor handle errors in one of the sub-extractors?
Would it be valuable to add tests for these scenarios?
🧹 Nitpick comments (4)
unit_tests/sources/declarative/extractors/test_combined_extractor.py (4)
1-3
: Is the copyright year intended to be 2025?I noticed the copyright year is set to 2025, which is in the future. Should this be updated to the current year?
-# Copyright (c) 2025 Airbyte, Inc., all rights reserved. +# Copyright (c) 2023 Airbyte, Inc., all rights reserved.wdyt?
46-53
: Unused test configurationThis configuration constant
_CONFIG_WITH_STREAM_DUPLICATION
is defined but not used in any tests. Are you planning to add more tests that will use this, or can it be removed?
105-108
: Consider more descriptive test nameThe test function name
test_combined_extractor
could be more specific about what aspect of the extractor it's testing (merging records from multiple extractors). Maybe something liketest_combined_extractor_merges_multiple_extractions
would be more descriptive?
137-138
: Add specific failure messages to assertionsAdding custom failure messages to assertions helps with diagnosing test failures:
-assert len(records) == 1 -assert records[0].data == {"customer_segment": "enterprise", "metrics": {"revenue": 295000}} +assert len(records) == 1, "CombinedExtractor should produce exactly one record" +assert records[0].data == {"customer_segment": "enterprise", "metrics": {"revenue": 295000}}, "Combined record data doesn't match expected structure"What do you think?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting
📒 Files selected for processing (1)
unit_tests/sources/declarative/extractors/test_combined_extractor.py
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (9)
- GitHub Check: Check: 'source-amplitude' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: SDM Docker Image Build
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Analyze (python)
🔇 Additional comments (3)
unit_tests/sources/declarative/extractors/test_combined_extractor.py (3)
92-95
: The test implementation matches the declared featureThe combined extractor configuration looks good - it creates a combination of two DpathExtractor instances that will extract from different paths in the response.
122-128
: Test data setup looks goodThe mocked HTTP response contains appropriate test data with both dimensions and metrics fields that will be extracted by the two different extractors.
137-138
: The combined extractor preserves the nested structureI noticed that in the assertion, the combined record has
"metrics": {"revenue": 295000}
rather than flattening the metrics data. This seems to be working as designed, but it's different from the example in the PR description where both extractors' outputs are flattened into a single level.For clarity, should we add a comment about how the CombinedExtractor preserves nested structures from individual extractors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking over how this extractor is used, I think this feels better suited as a custom component used by source-google-analytics
. The main part to me that feels particularly specific to one connector is the merging of all the combined extractor records into a single record instance and emitting just one record.
Unless we have other examples where this extractor would be used by other connectors, I propose that we simplify things by defining this implementation in within source-google-analytics-data-api
. And if we see a need to promote this back to the CDK at a later time we can.
Since the language supports CustomRecordExtractor
it seems like we don't have any blocker on just defining this in components.py
A custom component will be used for the migration instead of this implementation. |
What
Resolved: https://github.com/airbytehq/airbyte-internal-issues/issues/13046
Introduce a
CombinedExtractor
that merges the output of multiple sub-extractors into a single record.How
Accepts a list of
RecordExtractor
instances.For each response:
zip()
to pair records at the same index across all extractors.dict.update()
.Example
If one extractor yields
{"name": "Alice", "age": 30}
and another yields{"country": "US"}
, the combined result will be:Summary by CodeRabbit