Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Branch for Files + Metadata spike protocol changes #111

Draft
wants to merge 10 commits into
base: main
Choose a base branch
from
2 changes: 1 addition & 1 deletion .env
Original file line number Diff line number Diff line change
@@ -1,3 +1,3 @@
# x-release-please-start-version
VERSION=0.14.2
VERSION=0.14.1337
# x-release-please-end
5 changes: 4 additions & 1 deletion .github/workflows/pull-request-publish.yml
Original file line number Diff line number Diff line change
Expand Up @@ -47,7 +47,10 @@ jobs:
CLOUDREPO_USER: ${{ secrets.CLOUDREPO_USER }}
CLOUDREPO_PASSWORD: ${{ secrets.CLOUDREPO_PASSWORD }}
# will use the version in .env
run: ./gradlew publish
run: |
cat .env
echo VERSION: $VERSION
./gradlew publish

# python
- name: Generate Python Protocol Classes
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -76,6 +76,9 @@ definitions:
meta:
description: Information about this record added mid-sync
"$ref": "#/definitions/AirbyteRecordMessageMeta"
file_reference:
description: A reference to assoicated file to be moved with along with the record data.
"$ref": "#/definitions/AirbyteRecordMessageFileReference"
AirbyteRecordMessageMeta:
type: object
additionalProperties: true
Expand Down Expand Up @@ -122,6 +125,22 @@ definitions:
- SOURCE_RETRIEVAL_ERROR
# Errors casting to appropriate type
- DESTINATION_TYPECAST_ERROR
AirbyteRecordMessageFileReference:
type: object
additionalProperties: true
properties:
file_url:
type: string
description: |-
The absolute path to the referenced file.
file_relative_path:
type: string
description: |-
The relative path to the referenced file in source.
file_size_bytes:
type: integer
description: |-
The size of the referenced file in bytes.
Comment on lines +140 to +143
Copy link
Contributor

@evantahler evantahler Mar 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit/Question: If these file references will always be part of a stream that describes the file in more detail (original URI, mime-type, etc) wouldn't file_size_bytes be in the record.data json? Is that information duplicated here so that the destination and/or platform can decide if the file is too big (or small?) to deal with without deserializing record.data?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question: these top level fields are for internal airbyte consumption. This one specific would be consumed by the orchestrator to be tallied for billing purposes. The record.data may also contain this information, but I do not believe we want to be introspecting data fields (which could theoretically be de-selected / filtered) for internal book-keeping.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the current source-s3 implementation return {"file_url": absolute_file_path, "bytes": file_size, "file_relative_path": file_relative_path}. Do we need the relative path?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good question, I can add both.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maxi297 do you have a link to where we define the currently used properties today?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see

                "file_url" to fileMessage.fileUrl,
                "file_relative_path" to fileMessage.fileRelativePath,
                "source_file_url" to fileMessage.sourceFileUrl,
                "modified" to fileMessage.modified,
                "bytes" to fileMessage.bytes,

in the dest codebase, of which only file_relative_path and file_relative_path. So I'm planning on only adding those for now.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the PR to include file_url and file_relative_path

AirbyteStateMessage:
type: object
additionalProperties: true
Expand Down Expand Up @@ -484,6 +503,10 @@ definitions:
If the stream is resumable or not. Should be set to true if the stream supports incremental. Defaults to false.
Primarily used by the Platform in Full Refresh to determine if a Full Refresh stream should actually be treated as incremental within a job.
type: boolean
is_file_based:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: if we're representing files as something that can be included as part of standard records, maybe isntead of a whole stream being "file based" this is better expressed as "has files" or "has attachments"

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call out. We can include that in the next iteration. Tentatively thinking has_file_attachment

type: boolean
description: |-
This stream represents a series of discrete files and their metadata.
ConfiguredAirbyteCatalog:
type: object
additionalProperties: true
Expand Down Expand Up @@ -545,6 +568,10 @@ definitions:
If this is null, it means that the platform is not supporting the refresh and it is expected that no extra id will be added to the records and no data from previous generation will be cleanup.
"
type: integer
copy_associated_file:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is lacking if we want to keep the current behavior as this implies that we will always sync the metadata. Today the behavior is "copy file only". It seems like we want to add two behaviors: "sync metadata as record" and "sync metadata as record and copy file". It feels like a simple boolean is not able to express those three options. Therefore, I have a couple of questions:

  • Is the hypothesis that we want to keep the current behavior true? It feels like if we want to give flexibility on reducing destination costs, it could be interesting for the user
  • If so, what would be an optimal way to represent that?

I think the source can always send the metadata as it can be filtered out by the platform like we does when a returned field is not part of the configured catalog.

Copy link

@aldogonzalez8 aldogonzalez8 Mar 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe something like

Some ConfiguredAirbyteCatalog Stream:
{
   stream {
      ...
      "supported_sync_file_metadata_modes": [None, "only_metadata", "metadata_with_file_upload"]
      ...
   }
   ....
   sync_file_metadata_mode: "only-metadata"
   ...
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this use-case equivalent to de-selecting the metadata fields you dont want, or is this a materially different way to sync?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like it would be the equivalent yes

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the source can always send the metadata as it can be filtered out by the platform like we does when a returned field is not part of the configured catalog.

Yea, I was thinking we may be able to just get by using the existing "column selection" feature to effectively provide this for us. cc @davinchia @matteogp to assist with product questions

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At our sync with @matteogp friday, we came to the conclusion supporting the "move files without metadata" use-case was not mission critical and could be added later if necessary. Recap notes of that meeting here

type: boolean
description: |-
If the stream is_file_based, the associated file will be copied to local disk and passed via reference along with its metadata.
SyncMode:
type: string
enum:
Expand Down
Loading