Apply obstore as storage backend #3033
base: master
Conversation
Code Review Agent Run #39883a: Actionable Suggestions - 7, Additional Suggestions - 3
Changelist by Bito: This pull request implements the following key changes.
driver_pod=self.driver_pod,
executor_pod=self.executor_pod,
Consider adding driver_pod and executor_pod to the with_overrides method to maintain consistency with the constructor parameters.
Code suggestion
Check the AI-generated fix before applying
@@ -56,6 +56,8 @@ def with_overrides(
new_spark_conf: Optional[Dict[str, str]] = None,
new_hadoop_conf: Optional[Dict[str, str]] = None,
new_databricks_conf: Optional[Dict[str, Dict]] = None,
+ driver_pod: Optional[K8sPod] = None,
+ executor_pod: Optional[K8sPod] = None,
) -> "SparkJob":
if not new_spark_conf:
new_spark_conf = self.spark_conf
@@ -65,6 +67,12 @@ def with_overrides(
if not new_databricks_conf:
new_databricks_conf = self.databricks_conf
+ if not driver_pod:
+ driver_pod = self.driver_pod
+
+ if not executor_pod:
+ executor_pod = self.executor_pod
+
return SparkJob(
spark_type=self.spark_type,
application_file=self.application_file,
Code Review Run #39883a
Is this a valid issue, or was it incorrectly flagged by the Agent?
- it was incorrectly flagged
driverPod=self.driver_pod.to_flyte_idl() if self.driver_pod else None,
executorPod=self.executor_pod.to_flyte_idl() if self.executor_pod else None,
Consider adding null checks for to_flyte_idl() calls on driver_pod and executor_pod to avoid potential NoneType errors.
Code suggestion
Before:
driverPod=self.driver_pod.to_flyte_idl() if self.driver_pod else None,
executorPod=self.executor_pod.to_flyte_idl() if self.executor_pod else None,
After:
driverPod=self.driver_pod.to_flyte_idl() if self.driver_pod and hasattr(self.driver_pod, 'to_flyte_idl') else None,
executorPod=self.executor_pod.to_flyte_idl() if self.executor_pod and hasattr(self.executor_pod, 'to_flyte_idl') else None,
Code Review Run #39883a
Is this a valid issue, or was it incorrectly flagged by the Agent?
- it was incorrectly flagged
flytekit/core/data_persistence.py
Outdated
if "file" in path:
    # no bucket for file
    return "", path
The condition if "file" in path may match paths containing "file" anywhere in the string, not just the protocol. Consider using if get_protocol(path) == "file" for more precise protocol checking.
Code suggestion
Before:
if "file" in path:
    # no bucket for file
    return "", path
After:
if get_protocol(path) == "file":
    # no bucket for file
    return "", path
Code Review Run #39883a
Is this a valid issue, or was it incorrectly flagged by the Agent?
- it was incorrectly flagged
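For context, the false positive the agent describes can be shown with a small sketch. The simplified get_protocol below is a hypothetical stand-in for the real helper, assumed to parse the scheme before "://":

```python
# Hypothetical stand-in for the real get_protocol helper: take the
# scheme before "://", defaulting to "file" for local paths.
def get_protocol(path: str) -> str:
    if "://" in path:
        return path.split("://", 1)[0]
    return "file"

s3_path = "s3://my-bucket/profile.txt"
print("file" in s3_path)       # True: the substring check wrongly matches "profile.txt"
print(get_protocol(s3_path))   # s3: the precise protocol check
```

An s3 path whose filename merely contains the word "file" would be misrouted by the substring check, which is the scenario the suggestion is guarding against.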
flytekit/core/data_persistence.py
Outdated
support_types = ["s3", "gs", "abfs"]
if protocol in support_types:
    file_path = "/".join(path_li[1:])
    return (bucket, file_path)
else:
    return bucket, path
The list of supported storage types support_types = ['s3', 'gs', 'abfs'] could be defined as a module-level constant since it's used for validation. Consider moving it outside the function to improve maintainability.
Code suggestion
@@ -53,1 +53,2 @@
_ANON = "anon"
+SUPPORTED_STORAGE_TYPES = ["s3", "gs", "abfs"]
@@ -136,2 +136,1 @@
- support_types = ["s3", "gs", "abfs"]
- if protocol in support_types:
+ if protocol in SUPPORTED_STORAGE_TYPES:
Code Review Run #39883a
Is this a valid issue, or was it incorrectly flagged by the Agent?
- it was incorrectly flagged
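For readers following along, the splitting behavior under discussion can be sketched as follows. This is a simplified, assumed reconstruction, not the exact flytekit implementation:

```python
SUPPORTED_STORAGE_TYPES = ["s3", "gs", "abfs"]

def split_path(path: str):
    # For object-store URIs, peel the bucket off the front of the path;
    # for anything else, return the path untouched.
    if "://" not in path:
        return "", path                      # no bucket for local files
    protocol, rest = path.split("://", 1)
    path_li = rest.split("/")
    bucket = path_li[0]
    if protocol in SUPPORTED_STORAGE_TYPES:
        return bucket, "/".join(path_li[1:])
    return bucket, path

print(split_path("s3://my-bucket/a/b.txt"))  # ('my-bucket', 'a/b.txt')
print(split_path("/tmp/local.txt"))          # ('', '/tmp/local.txt')
```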
flytekit/core/data_persistence.py
Outdated
kwargs["store"] = store

if anonymous:
    kwargs[_ANON] = True
Consider using kwargs[_ANON] = anonymous instead of hardcoding True to maintain consistency with the input parameter value.
Code suggestion
Before: kwargs[_ANON] = True
After: kwargs[_ANON] = anonymous
Code Review Run #39883a
Is this a valid issue, or was it incorrectly flagged by the Agent?
- it was incorrectly flagged
flytekit/core/data_persistence.py
Outdated
bucket, to_path_file_only = split_path(to_path)
file_system = await self.get_async_filesystem_for_path(to_path, bucket)
Consider validating the bucket parameter before passing it to get_async_filesystem_for_path(). An empty bucket could cause issues with certain storage backends. Similar issues were also found in:
- flytekit/core/data_persistence.py (line 318)
- flytekit/core/data_persistence.py (line 521)
- flytekit/core/data_persistence.py (line 308)
Code suggestion
Before:
bucket, to_path_file_only = split_path(to_path)
file_system = await self.get_async_filesystem_for_path(to_path, bucket)
After:
bucket, to_path_file_only = split_path(to_path)
protocol = get_protocol(to_path)
if protocol in ['s3', 'gs', 'abfs'] and not bucket:
    raise ValueError(f'Bucket cannot be empty for {protocol} protocol')
file_system = await self.get_async_filesystem_for_path(to_path, bucket)
Code Review Run #39883a
Is this a valid issue, or was it incorrectly flagged by the Agent?
- it was incorrectly flagged
Code Review Agent Run #8926b7: Actionable Suggestions - 0
successfully run it on local, not yet tested remote
Signed-off-by: machichima <[email protected]>
Force-pushed from 7c76cc6 to 17bde4a
Code Review Agent Run #0b7f4d: Actionable Suggestions - 4, Additional Suggestions - 1
flytekit/core/data_persistence.py
Outdated
bucket, to_path_file_only = split_path(to_path)
file_system = await self.get_async_filesystem_for_path(to_path, bucket)
Consider extracting the bucket and path splitting logic into a separate method to improve code reusability and maintainability. The split_path function is used in multiple places and could be encapsulated better.
Code suggestion
Before:
bucket, to_path_file_only = split_path(to_path)
file_system = await self.get_async_filesystem_for_path(to_path, bucket)
After:
bucket, path = self._split_and_get_bucket_path(to_path)
file_system = await self.get_async_filesystem_for_path(to_path, bucket)
Code Review Run #0b7f4d
Is this a valid issue, or was it incorrectly flagged by the Agent?
- it was incorrectly flagged
flytekit/core/data_persistence.py
Outdated
bucket, from_path_file_only = split_path(from_path)
file_system = await self.get_async_filesystem_for_path(from_path, bucket)
Consider handling the case where split_path() returns an empty bucket for non-file protocols. Currently, passing an empty bucket to get_async_filesystem_for_path() could cause issues with cloud storage access.
Code suggestion
Before:
bucket, from_path_file_only = split_path(from_path)
file_system = await self.get_async_filesystem_for_path(from_path, bucket)
After:
bucket, from_path_file_only = split_path(from_path)
protocol = get_protocol(from_path)
if protocol not in ['file'] and not bucket:
    raise ValueError(f'Empty bucket not allowed for protocol {protocol}')
file_system = await self.get_async_filesystem_for_path(from_path, bucket)
Code Review Run #0b7f4d
Is this a valid issue, or was it incorrectly flagged by the Agent?
- it was incorrectly flagged
flytekit/core/data_persistence.py
Outdated
fsspec.register_implementation("s3", AsyncFsspecStore)
fsspec.register_implementation("gs", AsyncFsspecStore)
fsspec.register_implementation("abfs", AsyncFsspecStore)
Consider moving the fsspec implementation registrations to a more appropriate initialization location, such as a module-level __init__.py or a dedicated setup function. This would improve code organization and make the registrations more discoverable.
Code suggestion
Check the AI-generated fix before applying
fsspec.register_implementation("s3", AsyncFsspecStore) | |
fsspec.register_implementation("gs", AsyncFsspecStore) | |
fsspec.register_implementation("abfs", AsyncFsspecStore) | |
def register_fsspec_implementations(): | |
fsspec.register_implementation("s3", AsyncFsspecStore) | |
fsspec.register_implementation("gs", AsyncFsspecStore) | |
fsspec.register_implementation("abfs", AsyncFsspecStore) | |
# Call during module initialization | |
register_fsspec_implementations() |
Code Review Run #0b7f4d
Is this a valid issue, or was it incorrectly flagged by the Agent?
- it was incorrectly flagged
Specify the class properties for each file storage
Signed-off-by: machichima <[email protected]>
flytekit/core/data_persistence.py
Outdated
_ANON = "anon"
_FSSPEC_S3_KEY_ID = "access_key_id"
_FSSPEC_S3_SECRET = "secret_access_key"
_ANON = "skip_signature"
Suggested change:
Before: _ANON = "skip_signature"
After: _SKIP_SIGNATURE = "skip_signature"
pyproject.toml
Outdated
@@ -38,6 +38,7 @@ dependencies = [
    "marshmallow-jsonschema>=0.12.0",
    "mashumaro>=3.15",
    "msgpack>=1.1.0",
    "obstore",
Could we remove adlfs, gcsfs, and s3fs as well?
Sure!
flytekit/core/data_persistence.py
Outdated
if azure_cfg.tenant_id:
    kwargs["tenant_id"] = azure_cfg.tenant_id
kwargs[_ANON] = anonymous
return kwargs
Don't we need to return anything for gcs? If not, could we just remove this function?
Yes, there's no need to return any config as long as the GOOGLE_APPLICATION_CREDENTIALS env variable is set. The original flytekit version only adds "token" for anonymous mode, which is not needed in obstore. I will remove this function then.
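A sketch of the point being made: with credentials resolved from the environment, the GCS config helper has nothing left to return. The function name and the existence check are illustrative assumptions, not flytekit's API:

```python
import os

def get_gcs_kwargs() -> dict:
    # Auth is resolved by the backend from GOOGLE_APPLICATION_CREDENTIALS,
    # so there are no storage options left to pass through.
    creds = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if creds and not os.path.exists(creds):
        raise FileNotFoundError(f"credentials file not found: {creds}")
    return {}
```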
flytekit/core/data_persistence.py
Outdated
@@ -664,6 +750,9 @@ async def async_put_data(
put_data = loop_manager.synced(async_put_data)

register(["s3", "gs", "abfs"])
Do we need abfss?
Yes, I just added it. Also updated here so that it is more obvious that abfss is needed:
flytekit/flytekit/core/data_persistence.py, line 280 in 6346fc3:
elif protocol in ("abfs", "abfss"):
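One way to read the referenced line: abfss is folded into the same branch as abfs, so both schemes share one Azure configuration. A tiny illustrative sketch (the normalization function is an assumption for exposition, not flytekit's code):

```python
def canonical_protocol(protocol: str) -> str:
    # Treat abfss (the TLS variant) the same as abfs, so a single
    # Azure store registration covers both URI schemes.
    if protocol in ("abfs", "abfss"):
        return "abfs"
    return protocol

print(canonical_protocol("abfss"))  # abfs
print(canonical_protocol("s3"))     # s3
```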
Dockerfile.dev
Outdated
@@ -40,6 +40,7 @@ RUN SETUPTOOLS_SCM_PRETEND_VERSION_FOR_FLYTEKIT=$PSEUDO_VERSION \
    -e /flytekit \
    -e /flytekit/plugins/flytekit-deck-standard \
    -e /flytekit/plugins/flytekit-flyteinteractive \
    obstore==0.3.0b9 \
Is this needed, or can we put it in a dev requirements file? If we need it, should we bump to 0.6.0?
Oh sorry, I forgot to change this. I'll update this and pyproject.toml as well to 0.6.0.
Force-pushed from 28d137c to c90a2ff
Tracking issue
Related to flyteorg/flyte#4081
Why are the changes needed?
Use a Rust/PyO3 package, obstore, as the storage backend for cloud storage. This reduces the dependency size and enables users to use their own versions of s3fs, gsfs, abfs, ...
What changes were proposed in this pull request?
Use obstore as the storage backend to replace s3fs, gsfs, and abfs.
How was this patch tested?
Setup process
Screenshots
Performance
Check all the applicable boxes
Related PRs
Docs link
Summary by Bito
This PR implements obstore as the new storage backend for Flytekit, replacing existing implementations for S3, GCS, and Azure. The changes focus on improving storage option handling and bucket information management, with significant updates to core storage components and the data persistence layer. Enhanced test coverage has been added to verify the new implementation and ensure compatibility.
Unit tests added: True
Estimated effort to review (1-5, lower is better): 5