-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: Branch for Files + Metadata spike protocol changes #111
base: main
Are you sure you want to change the base?
Changes from 5 commits
dfeb195
edc46db
bc9eb42
e0eed9f
35c28ad
b307145
3e4dbc3
aeb37dc
cd0a191
b69e4ce
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,3 @@ | ||
# x-release-please-start-version | ||
VERSION=0.14.2 | ||
VERSION=0.14.3-SNAPSHOT | ||
# x-release-please-end |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -76,6 +76,9 @@ definitions: | |
meta: | ||
description: Information about this record added mid-sync | ||
"$ref": "#/definitions/AirbyteRecordMessageMeta" | ||
file_reference: | ||
description: A reference to assoicated file to be moved with along with the record data. | ||
"$ref": "#/definitions/AirbyteRecordMessageFileReference" | ||
AirbyteRecordMessageMeta: | ||
type: object | ||
additionalProperties: true | ||
|
@@ -122,6 +125,22 @@ definitions: | |
- SOURCE_RETRIEVAL_ERROR | ||
# Errors casting to appropriate type | ||
- DESTINATION_TYPECAST_ERROR | ||
AirbyteRecordMessageFileReference: | ||
type: object | ||
additionalProperties: true | ||
properties: | ||
file_url: | ||
type: string | ||
description: |- | ||
The absolute path to the referenced file. | ||
file_relative_path: | ||
type: string | ||
description: |- | ||
The relative path to the referenced file in source. | ||
file_size_bytes: | ||
type: integer | ||
description: |- | ||
The size of the referenced file in bytes. | ||
AirbyteStateMessage: | ||
type: object | ||
additionalProperties: true | ||
|
@@ -484,6 +503,10 @@ definitions: | |
If the stream is resumable or not. Should be set to true if the stream supports incremental. Defaults to false. | ||
Primarily used by the Platform in Full Refresh to determine if a Full Refresh stream should actually be treated as incremental within a job. | ||
type: boolean | ||
is_file_based: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit: if we're representing files as something that can be included as part of standard records, maybe isntead of a whole stream being "file based" this is better expressed as "has files" or "has attachments" There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good call out. We can include that in the next iteration. Tentatively thinking |
||
type: boolean | ||
description: |- | ||
This stream represents a series of discrete files and their metadata. | ||
ConfiguredAirbyteCatalog: | ||
type: object | ||
additionalProperties: true | ||
|
@@ -545,6 +568,10 @@ definitions: | |
If this is null, it means that the platform is not supporting the refresh and it is expected that no extra id will be added to the records and no data from previous generation will be cleanup. | ||
" | ||
type: integer | ||
copy_associated_file: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this is lacking if we want to keep the current behavior as this implies that we will always sync the metadata. Today the behavior is "copy file only". It seems like we want to add two behaviors: "sync metadata as record" and "sync metadata as record and copy file". It feels like a simple boolean is not able to express those three options. Therefore, I have a couple of questions:
I think the source can always send the metadata as it can be filtered out by the platform like we does when a returned field is not part of the configured catalog. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe something like
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. is this use-case equivalent to de-selecting the metadata fields you dont want, or is this a materially different way to sync? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It sounds like it would be the equivalent yes There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Yea, I was thinking we may be able to just get by using the existing "column selection" feature to effectively provide this for us. cc @davinchia @matteogp to assist with product questions There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
||
type: boolean | ||
description: |- | ||
If the stream is_file_based, the associated file will be copied to local disk and passed via reference along with its metadata. | ||
SyncMode: | ||
type: string | ||
enum: | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit/Question: If these file references will always be part of a stream that describes the file in more detail (original URI, mime-type, etc) wouldn't
file_size_bytes
be in therecord.data
json? Is that information duplicated here so that the destination and/or platform can decide if the file is too big (or small?) to deal with without deserializingrecord.data
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question: these top level fields are for internal airbyte consumption. This one specific would be consumed by the orchestrator to be tallied for billing purposes. The
record.data
may also contain this information, but I do not believe we want to be introspecting data fields (which could theoretically be de-selected / filtered) for internal book-keeping.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that the current source-s3 implementation return
{"file_url": absolute_file_path, "bytes": file_size, "file_relative_path": file_relative_path}
. Do we need the relative path?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good question, I can add both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maxi297 do you have a link to where we define the currently used properties today?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a bit spread as it is defined in each source but here is the example for source-s3: https://github.com/airbytehq/airbyte/blob/ab088457429302b11d8909cad90631ad0f2a9f64/airbyte-integrations/connectors/source-s3/source_s3/v4/stream_reader.py#L261.
We also add some information as part of the CDK: https://github.com/airbytehq/airbyte-python-cdk/blob/837913f6372a6465472fae269dac7042f336945f/airbyte_cdk/sources/file_based/stream/default_file_based_stream.py#L148-L154
This is being assembled as a FileTransferMessage here: https://github.com/airbytehq/airbyte-python-cdk/blob/837913f6372a6465472fae269dac7042f336945f/airbyte_cdk/sources/utils/record_helper.py#L39-L42
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see
in the dest codebase, of which only
file_relative_path
andfile_relative_path
. So I'm planning on only adding those for now.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the PR to include
file_url
andfile_relative_path