RFC: Branch for Files + Metadata spike protocol changes #111

tryangul · 2025-03-24T19:24:58Z

What

Adds properties to AirbyteStream and ConfiguredAirbyteStream to describe whether a stream is file based and whether any associated files should be copied.
Adds an AirbyteRecordMessageFileReference struct, included as the optional field fileReference on AirbyteRecordMessage, for passing the copied file reference through to the destination.

Unblocks further spike work.

Next steps

Once agreed upon, we can publish a "fake" version and the teams can start hacking.

github-actions · 2025-03-24T19:25:13Z

Hey there and thank you for opening this pull request! 👋🏼

We require pull request titles to follow the Conventional Commits specification and it looks like your proposed title needs to be adjusted.

Details:

Unknown release type "RFC" found in pull request title "RFC: Branch for Files + Metadata spike protocol changes". 

Available types:
 - feat: A new feature
 - fix: A bug fix
 - docs: Documentation only changes
 - style: Changes that do not affect the meaning of the code (white-space, formatting, missing semi-colons, etc)
 - refactor: A code change that neither fixes a bug nor adds a feature
 - perf: A code change that improves performance
 - test: Adding missing tests or correcting existing tests
 - build: Changes that affect the build system or external dependencies (example scopes: gulp, broccoli, npm)
 - ci: Changes to our CI configuration files and scripts (example scopes: Travis, Circle, BrowserStack, SauceLabs)
 - chore: Other changes that don't modify src or test files
 - revert: Reverts a previous commit

maxi297 · 2025-03-24T19:53:58Z

protocol-models/src/main/resources/airbyte_protocol/airbyte_protocol.yaml

@@ -545,6 +564,10 @@ definitions:
          If this is null, it means that the platform is not supporting the refresh and it is expected that no extra id will be added to the records and no data from previous generation will be cleanup.
          "
        type: integer
+      copy_associated_file:


I think this is lacking if we want to keep the current behavior, as this implies that we will always sync the metadata. Today the behavior is "copy file only". It seems like we want to add two behaviors: "sync metadata as record" and "sync metadata as record and copy file". A simple boolean cannot express those three options. Therefore, I have a couple of questions:

Is the hypothesis that we want to keep the current behavior true? If we want to give users flexibility to reduce destination costs, it could be interesting for them.

If so, what would be an optimal way to represent that?

I think the source can always send the metadata, as it can be filtered out by the platform like it does when a returned field is not part of the configured catalog.

Maybe something like

Some ConfiguredAirbyteCatalog Stream:

    {
      stream {
        ...
        "supported_sync_file_metadata_modes": [None, "only_metadata", "metadata_with_file_upload"]
        ...
      }
      ....
      sync_file_metadata_mode: "only_metadata"
      ...
    }
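To illustrate why a boolean falls short, here is a minimal Python sketch of the three behaviors described in this comment. The names `FileSyncMode`, `SyncPlan`, and `plan_for` are hypothetical and not part of the protocol; the only point is that three behaviors map onto two independent actions, which a single `copy_associated_file` flag cannot encode.

```python
from enum import Enum
from typing import NamedTuple


class FileSyncMode(Enum):
    # Hypothetical names for the three behaviors discussed in the comment.
    FILE_ONLY = "file_only"                           # today's behavior: copy the file only
    METADATA_ONLY = "only_metadata"                   # sync metadata as a record
    METADATA_WITH_FILE = "metadata_with_file_upload"  # sync metadata and copy the file


class SyncPlan(NamedTuple):
    sync_metadata: bool
    copy_file: bool


def plan_for(mode: FileSyncMode) -> SyncPlan:
    """Translate a mode into the two independent actions it implies."""
    return SyncPlan(
        sync_metadata=mode is not FileSyncMode.FILE_ONLY,
        copy_file=mode is not FileSyncMode.METADATA_ONLY,
    )
```

If column selection ends up covering the metadata side, the mode collapses back to a single flag for the file copy.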

Is this use-case equivalent to de-selecting the metadata fields you don't want, or is this a materially different way to sync?

It sounds like it would be equivalent, yes.

I think the source can always send the metadata, as it can be filtered out by the platform like it does when a returned field is not part of the configured catalog.

Yea, I was thinking we may be able to just get by using the existing "column selection" feature to effectively provide this for us. cc @davinchia @matteogp to assist with product questions
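A rough sketch of what "column selection" filtering could look like for file metadata, assuming the platform drops any top-level field of `record.data` that is not selected in the configured catalog. The function name and the sample field names are illustrative only and do not come from the platform code.

```python
from typing import Any, Dict, Set


def filter_record_data(data: Dict[str, Any], selected_fields: Set[str]) -> Dict[str, Any]:
    """Keep only the top-level fields that the user selected for this stream."""
    return {key: value for key, value in data.items() if key in selected_fields}


# Hypothetical metadata record emitted by a file-based source.
metadata = {"id": 1, "file_name": "report.pdf", "mime_type": "application/pdf"}

# De-selecting "mime_type" removes it before the record reaches the destination.
filtered = filter_record_data(metadata, {"id", "file_name"})
```

Under this approach the source always emits the full metadata and the user narrows it down per field, rather than through a dedicated sync mode.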

evantahler · 2025-03-24T19:58:34Z

protocol-models/src/main/resources/airbyte_protocol/airbyte_protocol.yaml

+      file_size_bytes:
+        type: integer
+        description: |-
+          The size of the referenced file in bytes.


Nit/Question: If these file references will always be part of a stream that describes the file in more detail (original URI, mime-type, etc) wouldn't file_size_bytes be in the record.data json? Is that information duplicated here so that the destination and/or platform can decide if the file is too big (or small?) to deal with without deserializing record.data?

Good question: these top-level fields are for internal Airbyte consumption. This one specifically would be consumed by the orchestrator to be tallied for billing purposes. The record.data may also contain this information, but I do not believe we want to be introspecting data fields (which could theoretically be de-selected / filtered) for internal book-keeping.

Note that the current source-s3 implementation returns {"file_url": absolute_file_path, "bytes": file_size, "file_relative_path": file_relative_path}. Do we need the relative path?
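As a sketch of that book-keeping, assuming messages arrive as parsed JSON dictionaries and the reference sits on the record under the snake_case key `file_reference` (the spelling the field was later renamed to in this PR), a tally could read only the top-level reference and never look inside `record.data`. The function name is hypothetical and this is not the orchestrator's actual code.

```python
from typing import Any, Dict, Iterable


def tally_file_bytes(messages: Iterable[Dict[str, Any]]) -> int:
    """Sum file_size_bytes across record messages that carry a file reference."""
    total = 0
    for message in messages:
        record = message.get("record") or {}
        file_reference = record.get("file_reference")
        if file_reference:
            # Only the top-level reference is read; record["data"] is never inspected.
            total += int(file_reference.get("file_size_bytes", 0))
    return total
```

Because the size is read from the reference, the total stays the same even if a size field inside `record.data` is de-selected by the user.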

good question, I can add both.

@maxi297 do you have a link to where we define the currently used properties today?

It is a bit spread as it is defined in each source but here is the example for source-s3: https://github.com/airbytehq/airbyte/blob/ab088457429302b11d8909cad90631ad0f2a9f64/airbyte-integrations/connectors/source-s3/source_s3/v4/stream_reader.py#L261.

We also add some information as part of the CDK: https://github.com/airbytehq/airbyte-python-cdk/blob/837913f6372a6465472fae269dac7042f336945f/airbyte_cdk/sources/file_based/stream/default_file_based_stream.py#L148-L154

This is being assembled as a FileTransferMessage here: https://github.com/airbytehq/airbyte-python-cdk/blob/837913f6372a6465472fae269dac7042f336945f/airbyte_cdk/sources/utils/record_helper.py#L39-L42

I see

    "file_url" to fileMessage.fileUrl,
    "file_relative_path" to fileMessage.fileRelativePath,
    "source_file_url" to fileMessage.sourceFileUrl,
    "modified" to fileMessage.modified,
    "bytes" to fileMessage.bytes,

in the dest codebase, of which only file_url and file_relative_path appear to be used. So I'm planning on only adding those for now.

Updated the PR to include file_url and file_relative_path
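For orientation, a minimal Python sketch of a record carrying the reference fields named in this thread (`file_url`, `file_relative_path`, `file_size_bytes`). The class name, the helper, and the sample values are hypothetical, and the message envelope is simplified; the YAML in the PR remains the source of truth for the schema.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class FileReference:
    # Field names follow the thread; the class itself is illustrative only.
    file_url: str
    file_relative_path: str
    file_size_bytes: int


def build_record(stream: str, data: Dict[str, Any], emitted_at: int,
                 reference: FileReference) -> Dict[str, Any]:
    """Assemble a simplified RECORD message with an optional file reference attached."""
    return {
        "type": "RECORD",
        "record": {
            "stream": stream,
            "data": data,
            "emitted_at": emitted_at,
            "file_reference": asdict(reference),
        },
    }


# Sample values are made up for illustration.
example = build_record(
    stream="documents",
    data={"id": 1},
    emitted_at=1742843098000,
    reference=FileReference("/tmp/staging/a.pdf", "docs/a.pdf", 2048),
)
```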

aldogonzalez8 · 2025-03-24T21:44:24Z

protocol-models/src/main/resources/airbyte_protocol/airbyte_protocol.yaml

@@ -76,6 +76,9 @@ definitions:
      meta:
        description: Information about this record added mid-sync
        "$ref": "#/definitions/AirbyteRecordMessageMeta"
+      fileReference:


nit: shouldn't this be file_reference?

Oh, actually it seems we mix styles, like emitted_at and connectionStatus, so probably not an issue I guess.

Still, I favor the snake case.

Some of the keys on AirbyteMessage are in camelCase for some reason, so I tried to follow the existing patterns. Upon further reading, however, I see # todo (cgardens) - prefer snake case for field names., so maybe snake case is preferred. I'll swap to snake case for now.

pedroslopez · 2025-03-26T17:18:10Z

protocol-models/src/main/resources/airbyte_protocol/airbyte_protocol.yaml

@@ -484,6 +503,10 @@ definitions:
          If the stream is resumable or not. Should be set to true if the stream supports incremental. Defaults to false.
          Primarily used by the Platform in Full Refresh to determine if a Full Refresh stream should actually be treated as incremental within a job.
        type: boolean
+      is_file_based:


nit: if we're representing files as something that can be included as part of standard records, maybe instead of a whole stream being "file based" this is better expressed as "has files" or "has attachments".

Good callout. We can include that in the next iteration. Tentatively thinking has_file_attachment.

pedroslopez · 2025-03-26T17:20:03Z

protocol-models/src/main/resources/airbyte_protocol/airbyte_protocol.yaml

@@ -545,6 +564,10 @@ definitions:
          If this is null, it means that the platform is not supporting the refresh and it is expected that no extra id will be added to the records and no data from previous generation will be cleanup.
          "
        type: integer
+      copy_associated_file:


Is this use-case equivalent to de-selecting the metadata fields you don't want, or is this a materially different way to sync?

tryangul added 2 commits March 24, 2025 12:19

Add proposed changes to stream configuration and the record.

dfeb195

Data -> Reference

edc46db

tryangul requested review from gosusnp, maxi297, pedroslopez, davinchia and aldogonzalez8 March 24, 2025 19:24

maxi297 reviewed Mar 24, 2025

evantahler reviewed Mar 24, 2025

Add relative_path. local_path -> file_url.

bc9eb42

aldogonzalez8 reviewed Mar 24, 2025

tryangul added 7 commits March 24, 2025 14:44

Add increment version with -SNAPSHOT tag.

e0eed9f

fileReference -> file_reference.

35c28ad

Use purely numerical, but clearly fake version to fix publish.

b307145

See if CI publishes more orthodox version = 0.14.3

3e4dbc3

Revert "See if CI publishes more orthodox version = 0.14.3"

aeb37dc

This reverts commit 3e4dbc3.

cat version before gradle publish.

cd0a191

Echo VERSION.

b69e4ce

tryangul force-pushed the rbroughan/stream-based-file-support-WIP branch from 6872f62 to b69e4ce Compare March 24, 2025 23:14

pedroslopez reviewed Mar 26, 2025
