feat(file-mode-api: add filename extractor component #453
I have a concern about file naming. Can you add more context to this? I would like to make sure we avoid collisions.
    relative_path = self._filename_extractor.eval(self.config, record=record)
    relative_path = relative_path.lstrip("/")
    file_relative_path = Path(relative_path)

    full_path = files_directory / file_relative_path
Should we have the stream name somewhere in there? It feels like multiple streams could have a file with the same name.
Even more than this, should we have a unique ID per file? It feels like there could even be two files in the same stream with the same name...
Well, for Zendesk Support, we actually do, e.g.:
filename_extractor: "{{ record.relative_path }}/{{ record.file_name }}/"
Interpolates as:
hc/article_attachments/"attachments_id"/"name_of_the_file.extension"
This works for this specific endpoint in Zendesk, but I can see it is not guaranteed for every future connector. So I guess we can let the user add any extra path, but have the component prefix the path with the stream name and the attachment/file ID.
OK, so what we are saying is that it is the developer's responsibility to make sure there are no clashes. Could we take this concern off the developer and handle it ourselves?
Regarding timing: I'm not 100% sure we need this right now, and maybe we can make filename_extractor optional in the future when we find a solution to this. Off the top of my head, I can only see one way: relying on the stream declaring a PK. That seemed to be common when I checked Confluence, Jira and Salesforce, so maybe this is viable in the future.
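For reference, the PK-based idea could be sketched roughly as below. This is only an illustration, not the CDK implementation: `path_from_primary_key` and its parameters are hypothetical names, and it assumes the stream's primary key is available as a list of top-level record fields.

```python
from pathlib import PurePosixPath
from typing import Any, Mapping, Sequence


def path_from_primary_key(
    stream_name: str,
    record: Mapping[str, Any],
    primary_key: Sequence[str],
    file_name: str,
) -> PurePosixPath:
    """Hypothetical helper: namespace a file by stream name and primary key values."""
    # Replace "/" inside key values so a key cannot introduce extra path segments.
    key_parts = [str(record[field]).replace("/", "_") for field in primary_key]
    return PurePosixPath(stream_name, *key_parts, file_name)


# Two files with the same name in the same stream no longer share a path,
# as long as their primary keys differ.
first = path_from_primary_key("attachments", {"id": 1}, ["id"], "report.pdf")
second = path_from_primary_key("attachments", {"id": 2}, ["id"], "report.pdf")
```

This only helps for streams that declare a PK; streams without one would still need another source of uniqueness.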
> so what we are saying is that it is the developer's responsibility to make sure there are no clashes. Could we take this concern off the developer and handle it ourselves?
No, I didn't make myself clear. I'm sorry about that. To reduce the risk of collisions, I will add the stream name + unique ID on the backend (CDK).
What is the logic for the unique ID? Autogenerated UUID?
I think we can:
- Add the stream name to the path ourselves, reducing collision risk.
- Make filename_extractor optional so the developer can include a unique ID. There is a risk that they could get it wrong, but we can add some documentation to the component.
- Use an autogenerated UUID if filename_extractor is not present.
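The three points above could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the code in this PR: `build_file_relative_path` is a hypothetical name, and `extracted_path` stands in for whatever filename_extractor evaluates to (or `None` when the component is not configured).

```python
import uuid
from pathlib import PurePosixPath
from typing import Optional


def build_file_relative_path(
    stream_name: str,
    extracted_path: Optional[str] = None,
) -> PurePosixPath:
    """Hypothetical helper: return a relative path prefixed with the stream name."""
    # Mirror the diff under review: strip leading slashes so the path stays relative.
    relative = extracted_path.lstrip("/") if extracted_path else ""
    if not relative:
        # No usable extractor output: fall back to an autogenerated UUID as the file name.
        relative = uuid.uuid4().hex
    return PurePosixPath(stream_name) / relative


with_extractor = build_file_relative_path(
    "article_attachments", "/hc/article_attachments/123/report.pdf"
)
without_extractor = build_file_relative_path("article_attachments")
```

With an extractor, uniqueness inside the stream still depends on what the developer puts in the path; only the UUID fallback is unique by construction.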
Accepting under the premise that we are fine with the connector developer ensuring there are no file collisions for now.
@maxi297 This is still true with slight modifications that we can refine in the future:
Merged commit 68480b7 into aldogonzalez8/poc-emit-file-reference-record
Resolves https://github.com/airbytehq/airbyte-internal-issues/issues/12196