-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: Branch for Files + Metadata spike protocol changes #111
base: main
Are you sure you want to change the base?
Conversation
Hey there and thank you for opening this pull request! 👋🏼 We require pull request titles to follow the Conventional Commits specification and it looks like your proposed title needs to be adjusted. Details:
|
@@ -545,6 +564,10 @@ definitions: | |||
If this is null, it means that the platform is not supporting the refresh and it is expected that no extra id will be added to the records and no data from previous generation will be cleanup. | |||
" | |||
type: integer | |||
copy_associated_file: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is lacking if we want to keep the current behavior as this implies that we will always sync the metadata. Today the behavior is "copy file only". It seems like we want to add two behaviors: "sync metadata as record" and "sync metadata as record and copy file". It feels like a simple boolean is not able to express those three options. Therefore, I have a couple of questions:
- Is the hypothesis that we want to keep the current behavior true? It feels like if we want to give flexibility on reducing destination costs, it could be interesting for the user
- If so, what would be an optimal way to represent that?
I think the source can always send the metadata as it can be filtered out by the platform like we does when a returned field is not part of the configured catalog.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe something like
Some ConfiguredAirbyteCatalog Stream:
{
stream {
...
"supported_sync_file_metadata_modes": [None, "only_metadata", "metadata_with_file_upload"]
...
}
....
sync_file_metadata_mode: "only-metadata"
...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this use-case equivalent to de-selecting the metadata fields you dont want, or is this a materially different way to sync?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It sounds like it would be the equivalent yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the source can always send the metadata as it can be filtered out by the platform like we does when a returned field is not part of the configured catalog.
Yea, I was thinking we may be able to just get by using the existing "column selection" feature to effectively provide this for us. cc @davinchia @matteogp to assist with product questions
file_size_bytes: | ||
type: integer | ||
description: |- | ||
The size of the referenced file in bytes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit/Question: If these file references will always be part of a stream that describes the file in more detail (original URI, mime-type, etc) wouldn't file_size_bytes
be in the record.data
json? Is that information duplicated here so that the destination and/or platform can decide if the file is too big (or small?) to deal with without deserializing record.data
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question: these top level fields are for internal airbyte consumption. This one specific would be consumed by the orchestrator to be tallied for billing purposes. The record.data
may also contain this information, but I do not believe we want to be introspecting data fields (which could theoretically be de-selected / filtered) for internal book-keeping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that the current source-s3 implementation return {"file_url": absolute_file_path, "bytes": file_size, "file_relative_path": file_relative_path}
. Do we need the relative path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good question, I can add both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maxi297 do you have a link to where we define the currently used properties today?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a bit spread as it is defined in each source but here is the example for source-s3: https://github.com/airbytehq/airbyte/blob/ab088457429302b11d8909cad90631ad0f2a9f64/airbyte-integrations/connectors/source-s3/source_s3/v4/stream_reader.py#L261.
We also add some information as part of the CDK: https://github.com/airbytehq/airbyte-python-cdk/blob/837913f6372a6465472fae269dac7042f336945f/airbyte_cdk/sources/file_based/stream/default_file_based_stream.py#L148-L154
This is being assembled as a FileTransferMessage here: https://github.com/airbytehq/airbyte-python-cdk/blob/837913f6372a6465472fae269dac7042f336945f/airbyte_cdk/sources/utils/record_helper.py#L39-L42
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see
"file_url" to fileMessage.fileUrl,
"file_relative_path" to fileMessage.fileRelativePath,
"source_file_url" to fileMessage.sourceFileUrl,
"modified" to fileMessage.modified,
"bytes" to fileMessage.bytes,
in the dest codebase, of which only file_relative_path
and file_relative_path
. So I'm planning on only adding those for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the PR to include file_url
and file_relative_path
@@ -76,6 +76,9 @@ definitions: | |||
meta: | |||
description: Information about this record added mid-sync | |||
"$ref": "#/definitions/AirbyteRecordMessageMeta" | |||
fileReference: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: shouldn't this be file_reference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh actually it seems we mix styles like emitted_at
and connectionStatus
, so probably not an issue I guess.
Still, I favor the snake case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of the keys on AirybteMessage are in camelCase for some reason, so I tried to follow the existing patterns. Upon further reading, however, I see # todo (cgardens) - prefer snake case for field names.
, so maybe snake case is preferred. I'll swap to snake case for now.
This reverts commit 3e4dbc3.
6872f62
to
b69e4ce
Compare
@@ -484,6 +503,10 @@ definitions: | |||
If the stream is resumable or not. Should be set to true if the stream supports incremental. Defaults to false. | |||
Primarily used by the Platform in Full Refresh to determine if a Full Refresh stream should actually be treated as incremental within a job. | |||
type: boolean | |||
is_file_based: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: if we're representing files as something that can be included as part of standard records, maybe isntead of a whole stream being "file based" this is better expressed as "has files" or "has attachments"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call out. We can include that in the next iteration. Tentatively thinking has_file_attachment
@@ -545,6 +564,10 @@ definitions: | |||
If this is null, it means that the platform is not supporting the refresh and it is expected that no extra id will be added to the records and no data from previous generation will be cleanup. | |||
" | |||
type: integer | |||
copy_associated_file: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this use-case equivalent to de-selecting the metadata fields you dont want, or is this a materially different way to sync?
What
Adds properties to
AirbyteStream
andConfiguredAirbyteStream
to describe whether a stream is file based and whether any associated files should be copiedAdds
AirbyteRecordMessageFileRefrence
struct to be included as optional fieldfileReference
toAirybteRecordMessage
for passing through copied file reference to the destinationUnblocks further spike work.
Next steps
Once agreed upon we can publish a "fake" version and the teams can start hacking