Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Branch for Files + Metadata spike protocol changes #111

Draft
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

tryangul
Copy link

What

  1. Adds properties to AirbyteStream and ConfiguredAirbyteStream to describe whether a stream is file based and whether any associated files should be copied

  2. Adds AirbyteRecordMessageFileRefrence struct to be included as optional field fileReference to AirybteRecordMessage for passing through copied file reference to the destination

Unblocks further spike work.

Next steps

Once agreed upon we can publish a "fake" version and the teams can start hacking

Copy link
Contributor

github-actions bot commented Mar 24, 2025

Hey there and thank you for opening this pull request! 👋🏼

We require pull request titles to follow the Conventional Commits specification and it looks like your proposed title needs to be adjusted.

Details:

Unknown release type "RFC" found in pull request title "RFC: Branch for Files + Metadata spike protocol changes". 

Available types:
 - feat: A new feature
 - fix: A bug fix
 - docs: Documentation only changes
 - style: Changes that do not affect the meaning of the code (white-space, formatting, missing semi-colons, etc)
 - refactor: A code change that neither fixes a bug nor adds a feature
 - perf: A code change that improves performance
 - test: Adding missing tests or correcting existing tests
 - build: Changes that affect the build system or external dependencies (example scopes: gulp, broccoli, npm)
 - ci: Changes to our CI configuration files and scripts (example scopes: Travis, Circle, BrowserStack, SauceLabs)
 - chore: Other changes that don't modify src or test files
 - revert: Reverts a previous commit

@@ -545,6 +564,10 @@ definitions:
If this is null, it means that the platform is not supporting the refresh and it is expected that no extra id will be added to the records and no data from previous generation will be cleanup.
"
type: integer
copy_associated_file:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is lacking if we want to keep the current behavior as this implies that we will always sync the metadata. Today the behavior is "copy file only". It seems like we want to add two behaviors: "sync metadata as record" and "sync metadata as record and copy file". It feels like a simple boolean is not able to express those three options. Therefore, I have a couple of questions:

  • Is the hypothesis that we want to keep the current behavior true? It feels like if we want to give flexibility on reducing destination costs, it could be interesting for the user
  • If so, what would be an optimal way to represent that?

I think the source can always send the metadata as it can be filtered out by the platform like we does when a returned field is not part of the configured catalog.

Copy link

@aldogonzalez8 aldogonzalez8 Mar 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe something like

Some ConfiguredAirbyteCatalog Stream:
{
   stream {
      ...
      "supported_sync_file_metadata_modes": [None, "only_metadata", "metadata_with_file_upload"]
      ...
   }
   ....
   sync_file_metadata_mode: "only-metadata"
   ...
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this use-case equivalent to de-selecting the metadata fields you dont want, or is this a materially different way to sync?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like it would be the equivalent yes

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the source can always send the metadata as it can be filtered out by the platform like we does when a returned field is not part of the configured catalog.

Yea, I was thinking we may be able to just get by using the existing "column selection" feature to effectively provide this for us. cc @davinchia @matteogp to assist with product questions

Comment on lines +136 to +139
file_size_bytes:
type: integer
description: |-
The size of the referenced file in bytes.
Copy link
Contributor

@evantahler evantahler Mar 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit/Question: If these file references will always be part of a stream that describes the file in more detail (original URI, mime-type, etc) wouldn't file_size_bytes be in the record.data json? Is that information duplicated here so that the destination and/or platform can decide if the file is too big (or small?) to deal with without deserializing record.data?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question: these top level fields are for internal airbyte consumption. This one specific would be consumed by the orchestrator to be tallied for billing purposes. The record.data may also contain this information, but I do not believe we want to be introspecting data fields (which could theoretically be de-selected / filtered) for internal book-keeping.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the current source-s3 implementation return {"file_url": absolute_file_path, "bytes": file_size, "file_relative_path": file_relative_path}. Do we need the relative path?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good question, I can add both.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maxi297 do you have a link to where we define the currently used properties today?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see

                "file_url" to fileMessage.fileUrl,
                "file_relative_path" to fileMessage.fileRelativePath,
                "source_file_url" to fileMessage.sourceFileUrl,
                "modified" to fileMessage.modified,
                "bytes" to fileMessage.bytes,

in the dest codebase, of which only file_relative_path and file_relative_path. So I'm planning on only adding those for now.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the PR to include file_url and file_relative_path

@@ -76,6 +76,9 @@ definitions:
meta:
description: Information about this record added mid-sync
"$ref": "#/definitions/AirbyteRecordMessageMeta"
fileReference:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: shouldn't this be file_reference?

Copy link

@aldogonzalez8 aldogonzalez8 Mar 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh actually it seems we mix styles like emitted_atand connectionStatus, so probably not an issue I guess.

Still, I favor the snake case.

Copy link
Author

@tryangul tryangul Mar 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the keys on AirybteMessage are in camelCase for some reason, so I tried to follow the existing patterns. Upon further reading, however, I see # todo (cgardens) - prefer snake case for field names., so maybe snake case is preferred. I'll swap to snake case for now.

@tryangul tryangul force-pushed the rbroughan/stream-based-file-support-WIP branch from 6872f62 to b69e4ce Compare March 24, 2025 23:14
@@ -484,6 +503,10 @@ definitions:
If the stream is resumable or not. Should be set to true if the stream supports incremental. Defaults to false.
Primarily used by the Platform in Full Refresh to determine if a Full Refresh stream should actually be treated as incremental within a job.
type: boolean
is_file_based:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: if we're representing files as something that can be included as part of standard records, maybe isntead of a whole stream being "file based" this is better expressed as "has files" or "has attachments"

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call out. We can include that in the next iteration. Tentatively thinking has_file_attachment

@@ -545,6 +564,10 @@ definitions:
If this is null, it means that the platform is not supporting the refresh and it is expected that no extra id will be added to the records and no data from previous generation will be cleanup.
"
type: integer
copy_associated_file:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this use-case equivalent to de-selecting the metadata fields you dont want, or is this a materially different way to sync?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants