-
Notifications
You must be signed in to change notification settings - Fork 20
feat(cdk): add CombinedExtractor #551
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
/autofix
|
📝 WalkthroughWalkthroughA new Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant DeclarativeSource
participant ModelToComponentFactory
participant CombinedExtractor
participant SubExtractor1
participant SubExtractor2
User->>DeclarativeSource: Define stream with CombinedExtractor
DeclarativeSource->>ModelToComponentFactory: Build extractor from model
ModelToComponentFactory->>CombinedExtractor: Instantiate with sub-extractors
CombinedExtractor->>SubExtractor1: extract_records(response)
CombinedExtractor->>SubExtractor2: extract_records(response)
CombinedExtractor->>CombinedExtractor: Zip and merge records
CombinedExtractor-->>DeclarativeSource: Yield merged records
Would you like me to provide a more detailed sequence diagram focusing on the internal merging logic of 📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
⏰ Context from checks skipped due to timeout of 90000ms (6)
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 8
🧹 Nitpick comments (17)
airbyte_cdk/sources/declarative/extractors/combined_extractor.py (2)
1-3
: Update copyright year.The copyright notice shows 2025, but the current year is 2024.
-# Copyright (c) 2025 Airbyte, Inc., all rights reserved. +# Copyright (c) 2024 Airbyte, Inc., all rights reserved.
13-35
: Well-documented extractor implementation with a clear docstring.The class documentation clearly explains the purpose and behavior of the extractor. Could you consider adding information about edge cases to the docstring? Such as:
- What happens when extractors produce different numbers of records?
- How are key conflicts handled when merging records from different extractors?
airbyte_cdk/sources/declarative/extractors/key_value_extractor.py (1)
38-42
: Edge-case: infinite loop whenkeys
is empty.If
keys_extractor
yields an empty list,len(keys)
is0
;islice(values, 0)
returns an empty iterator, leavingchunk
empty and the loop spinning forever because thebreak
is never reached.Maybe bail out early?
keys = list(self.keys_extractor.extract_records(response)) - values = self.values_extractor.extract_records(response) + if not keys: + return # nothing to map, silently exit + values = self.values_extractor.extract_records(response)Would that make sense?
airbyte_cdk/sources/declarative/resolvers/config_components_resolver.py (2)
124-125
: Guard againstproduct()
on empty input?If
self.stream_configs
is an empty list,all_indexed_streams
becomes empty andproduct(*all_indexed_streams)
raisesTypeError
.
Perhaps return an empty list (or[{}]
) in that scenario?all_indexed_streams = list(prepare_streams()) if not all_indexed_streams: return [{}] return [merge_combination(c) for c in product(*all_indexed_streams)]Would that make the resolver more robust?
🧰 Tools
🪛 GitHub Actions: Linters
[error] 124-124: Call to untyped function "prepare_streams" in typed context [no-untyped-call]
[error] 125-125: Call to untyped function "merge_combination" in typed context [no-untyped-call]
154-159
:create_or_update
path handling silently ignores falsy values – intentional?
if parsed_value and not updated and resolved_component.create_or_update:
• A legitimate falsy value (e.g.,
0
,False
,""
) will block path creation.
• Should the check beupdated is False
instead of the truthiness ofparsed_value
?Just checking intent, wdyt?
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (5)
1744-1768
: PolishKeyValueExtractor
schema definition
The block currently uses placeholder text fordescription
and an enum value with extra whitespace (enum: [ KeyValueExtractor ]
). Could we clarify thedescription
, remove the leading space in the enum, and include anexamples
section underproperties
to illustrate typical usage? wdyt?
1769-1791
: EnhanceCombinedExtractor
schema clarity
Similar toKeyValueExtractor
, theCombinedExtractor
definition has a placeholderdescription
and an enum with surrounding spaces (enum: [ CombinedExtractor ]
). Should we refine the description to explain its purpose, normalize the enum formatting, and add example usages for theextractors
array? wdyt?
2359-2364
: Defineschema_filter
description and examples
The newschema_filter
property uses a placeholderdescription
. Could we provide a meaningful description (e.g. "Filter to apply to schema records before merging") and add examples showing aRecordFilter
orCustomRecordFilter
in action? wdyt?
4053-4055
: Improvecreate_or_update
property documentation
Thecreate_or_update
flag lacks a descriptive comment. Could we add adescription
to explain its behavior (e.g. "If true, existing fields will be updated; otherwise only new fields are created")? wdyt?
4122-4126
: Add examples for multi-configstream_config
Thestream_config
property now accepts either a singleStreamConfig
or an array. Could we include examples in the schema comments to illustrate both cases and guide users? wdyt?airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)
2247-2259
: Same default-argument pattern & nil-safety increate_combined_extractor
For the same reasons as above, could we:
- Change the
decoder
default toNone
and lazy-instanciate it?- (Optional) Validate
model.extractors
is not empty to fail fast.- decoder: Optional[Decoder] = JsonDecoder(parameters={}), + decoder: Optional[Decoder] = None, ... - extractors = [ + if decoder is None: + decoder = JsonDecoder(parameters={}) + + extractors = [ self._create_component_from_model(model=extractor, decoder=decoder, config=config) for extractor in model.extractors ]airbyte_cdk/sources/declarative/models/declarative_component_schema.py (6)
1481-1482
: Add validation metadata forcreate_or_update
?The new flag looks great. Would you consider wrapping it in
Field(..., description="...", title="Create Or Update")
to keep the rich-schema experience consistent with the other attributes, wdyt?
1499-1500
: Field name vs. type plurality
stream_config
now accepts a list, yet the attribute remains singular.
To avoid confusion for users authoring manifests, would renaming it tostream_configs
(and adding an alias for backwards compatibility) make sense, wdyt?
1847-1856
: KeyValueExtractor – consider min-items & richer docs
- Should we enforce at least one key-value pair by adding
min_items=1
onkeys_extractor
/values_extractor
?- The
description="placeholder"
texts will surface in the JSON-schema. Maybe flesh them out to guide connector authors, wdyt?Both are tiny polish items but improve UX.
1858-1864
: CombinedExtractor – guarantee non-emptyextractors
list
extractors
is required but could still be empty. Addingmin_items=1
inField()
would prevent a silent mis-configuration. Worth tightening?
Also, expanding the placeholder description will help future users.
1903-1903
: Forward-ref update for dependant models
RecordSelector
now referencesCombinedExtractor
; you addedCombinedExtractor.update_forward_refs()
(👍).
For full safety, should we also callRecordSelector.update_forward_refs()
so that external callers importing only that class don’t have to?
2436-2438
:schema_filter
placeholder descriptionNice addition! Would you like to replace
"placeholder"
with something actionable (e.g. “Filter to drop/keep schema properties based on a record-level predicate”)?
This will read better in generated docs, wdyt?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting
📒 Files selected for processing (10)
airbyte_cdk/sources/declarative/declarative_component_schema.yaml
(6 hunks)airbyte_cdk/sources/declarative/extractors/__init__.py
(2 hunks)airbyte_cdk/sources/declarative/extractors/combined_extractor.py
(1 hunks)airbyte_cdk/sources/declarative/extractors/key_value_extractor.py
(1 hunks)airbyte_cdk/sources/declarative/manifest_declarative_source.py
(1 hunks)airbyte_cdk/sources/declarative/models/declarative_component_schema.py
(7 hunks)airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
(9 hunks)airbyte_cdk/sources/declarative/resolvers/components_resolver.py
(2 hunks)airbyte_cdk/sources/declarative/resolvers/config_components_resolver.py
(6 hunks)airbyte_cdk/sources/declarative/schema/dynamic_schema_loader.py
(4 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (5)
airbyte_cdk/sources/declarative/extractors/__init__.py (4)
airbyte_cdk/sources/declarative/extractors/combined_extractor.py (1)
CombinedExtractor
(14-45)airbyte_cdk/sources/declarative/models/declarative_component_schema.py (3)
CombinedExtractor
(1858-1863)DpathExtractor
(696-709)KeyValueExtractor
(1847-1855)airbyte_cdk/sources/declarative/extractors/http_selector.py (1)
HttpSelector
(13-37)airbyte_cdk/sources/declarative/extractors/key_value_extractor.py (1)
KeyValueExtractor
(15-42)
airbyte_cdk/sources/declarative/extractors/key_value_extractor.py (3)
airbyte_cdk/sources/declarative/extractors/record_extractor.py (1)
RecordExtractor
(12-27)airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)
KeyValueExtractor
(1847-1855)airbyte_cdk/sources/declarative/extractors/combined_extractor.py (1)
extract_records
(38-45)
airbyte_cdk/sources/declarative/extractors/combined_extractor.py (3)
airbyte_cdk/sources/declarative/extractors/record_extractor.py (1)
RecordExtractor
(12-27)airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)
CombinedExtractor
(1858-1863)airbyte_cdk/sources/declarative/extractors/key_value_extractor.py (1)
extract_records
(34-42)
airbyte_cdk/sources/declarative/resolvers/config_components_resolver.py (1)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)
StreamConfig
(1485-1494)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4)
airbyte_cdk/sources/declarative/extractors/combined_extractor.py (1)
CombinedExtractor
(14-45)airbyte_cdk/sources/declarative/models/declarative_component_schema.py (20)
CombinedExtractor
(1858-1863)DpathExtractor
(696-709)KeyValueExtractor
(1847-1855)Config
(134-135)Config
(148-149)Config
(162-163)Config
(176-177)Config
(194-195)Config
(208-209)Config
(222-223)Config
(236-237)Config
(250-251)Config
(264-265)Config
(278-279)Config
(292-293)Config
(308-309)Config
(322-323)Config
(336-337)Config
(370-371)DynamicSchemaLoader
(2429-2458)airbyte_cdk/sources/declarative/extractors/key_value_extractor.py (1)
KeyValueExtractor
(15-42)airbyte_cdk/sources/declarative/schema/dynamic_schema_loader.py (1)
DynamicSchemaLoader
(120-297)
🪛 GitHub Actions: Linters
airbyte_cdk/sources/declarative/schema/dynamic_schema_loader.py
[error] 182-182: Argument 1 to "filter_records" of "RecordFilter" has incompatible type "ItemsView[str, Any]"; expected "Iterable[Mapping[str, Any]]" [arg-type]
[error] 183-183: Invalid index type "int" for "Mapping[str, Any]"; expected type "str" [index]
airbyte_cdk/sources/declarative/resolvers/config_components_resolver.py
[error] 97-97: Function is missing a return type annotation [no-untyped-def]
[error] 98-98: Function is missing a type annotation [no-untyped-def]
[error] 103-103: Function is missing a type annotation [no-untyped-def]
[error] 106-106: Function is missing a return type annotation [no-untyped-def]
[error] 108-108: Call to untyped function "resolve_path" in typed context [no-untyped-call]
[error] 110-110: Call to untyped function "normalize_configs" in typed context [no-untyped-call]
[error] 112-112: Unsupported left operand type for + ("object") [operator]
[error] 113-113: Need type annotation for "item" [var-annotated]
[error] 113-113: Argument 1 to "enumerate" has incompatible type "object"; expected "Iterable[Never]" [arg-type]
[error] 115-115: Function is missing a type annotation [no-untyped-def]
[error] 124-124: Call to untyped function "prepare_streams" in typed context [no-untyped-call]
[error] 125-125: Call to untyped function "merge_combination" in typed context [no-untyped-call]
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
[error] 2476-2476: Argument 1 to "_create_component_from_model" of "ModelToComponentFactory" has incompatible type "RecordFilter | CustomRecordFilter | None"; expected "BaseModel" [arg-type]
airbyte_cdk/sources/declarative/models/declarative_component_schema.py
[error] 1493-1493: Missing type parameters for generic type "List" [type-arg]
⏰ Context from checks skipped due to timeout of 90000ms (4)
- GitHub Check: Check: 'source-amplitude' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (7)
airbyte_cdk/sources/declarative/extractors/__init__.py (2)
5-8
: Imports for new extractor classes are added correctly.The imports for the new
CombinedExtractor
andKeyValueExtractor
classes follow the existing import pattern and alphabetical ordering, which is good for code organization.
16-25
: Updated__all__
list to include new extractor classes.The new extractor classes are correctly added to the
__all__
list, maintaining alphabetical ordering which helps with maintainability.airbyte_cdk/sources/declarative/resolvers/components_resolver.py (2)
25-25
: New parametercreate_or_update
added with sensible default.The boolean parameter is added with a default of
False
, which maintains backward compatibility. This parameter will control whether new config paths should be created when setting values during component resolution.
38-38
: Matchingcreate_or_update
parameter added toResolvedComponentMappingDefinition
.Good consistency in adding the same parameter with the same default value to both dataclasses.
airbyte_cdk/sources/declarative/manifest_declarative_source.py (1)
265-265
: Good refactoring for cleaner code.Using the
dynamic_streams
property instead of directly calling_dynamic_stream_configs
improves code readability and follows good OOP principles by encapsulating the implementation details.airbyte_cdk/sources/declarative/extractors/combined_extractor.py (1)
38-45
: Clean implementation for combining multiple extractors.The implementation is concise and effective. It creates iterables from each extractor, zips them together (handling the case where they have different lengths by stopping at the shortest), and merges the records using
dict.update()
.Note that when merging dictionaries, if the same key exists in multiple records, later records will overwrite values from earlier records. This is the expected behavior of
dict.update()
, but worth being aware of.airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)
3369-3374
: ConfirmRecordSelector
supports new extractors
Great to seeCombinedExtractor
andKeyValueExtractor
added to theextractor
union. This aligns with the new definitions and will enable composite extraction patterns.
airbyte_cdk/sources/declarative/schema/dynamic_schema_loader.py
Outdated
Show resolved
Hide resolved
airbyte_cdk/sources/declarative/extractors/key_value_extractor.py
Outdated
Show resolved
Hide resolved
airbyte_cdk/sources/declarative/resolvers/config_components_resolver.py
Outdated
Show resolved
Hide resolved
airbyte_cdk/sources/declarative/declarative_component_schema.yaml
Outdated
Show resolved
Hide resolved
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
Outdated
Show resolved
Hide resolved
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
Outdated
Show resolved
Hide resolved
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
Outdated
Show resolved
Hide resolved
airbyte_cdk/sources/declarative/models/declarative_component_schema.py
Outdated
Show resolved
Hide resolved
/autofix
|
/autofix
|
What
Resolved: https://github.com/airbytehq/airbyte-internal-issues/issues/13046
Introduce a
CombinedExtractor
that merges the output of multiple sub-extractors into a single record.How
Accepts a list of
RecordExtractor
instances.For each response:
zip()
to pair records at the same index across all extractors.dict.update()
.Example
If one extractor yields
{"name": "Alice", "age": 30}
and another yields{"country": "US"}
, the combined result will be:Summary by CodeRabbit
Summary by CodeRabbit