
Apply obstore as storage backend #3033


Open
machichima wants to merge 40 commits into master

Conversation

@machichima (Contributor) commented Jan 4, 2025

Tracking issue

Related to flyteorg/flyte#4081

Why are the changes needed?

Use a Rust/PyO3 package, obstore, as the storage backend for cloud storage. This shrinks the dependency footprint and lets users install their own versions of s3fs, gcsfs, adlfs, and similar packages.

What changes were proposed in this pull request?

Use obstore as the storage backend, replacing s3fs, gcsfs, and adlfs.
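
For context, a minimal sketch of the approach, assuming obstore's fsspec adapter accepts a pre-built store via a store kwarg (as this PR's internals suggest). The MinIO endpoint and credentials are made-up placeholders, not configuration from this PR:

import fsspec
from obstore.fsspec import AsyncFsspecStore
from obstore.store import S3Store

# Route "s3://" URLs through obstore's fsspec adapter instead of s3fs.
fsspec.register_implementation("s3", AsyncFsspecStore, clobber=True)

# Build an object store for one bucket; here, a local MinIO endpoint.
# Exact option names may differ between obstore versions.
store = S3Store(
    "my-bucket",
    endpoint="http://localhost:9000",
    access_key_id="minio",
    secret_access_key="miniostorage",
    client_options={"allow_http": True},
)

# The filesystem wraps the pre-built store, mirroring how this PR
# passes the store through fsspec kwargs.
fs = fsspec.filesystem("s3", store=store)
fs.put("local.txt", "my-bucket/remote.txt")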

How was this patch tested?

Setup process

Performance

  • put file to MinIO (benchmark screenshot: put_file_runtime)

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

Summary by Bito

This PR implements obstore as the new storage backend for Flytekit, replacing the existing implementations for S3, GCS, and Azure. The changes focus on improving storage option handling and bucket information management, with significant updates to core storage components and the data persistence layer. Enhanced test coverage has been added to verify the new implementation and ensure compatibility.

Unit tests added: True

Estimated effort to review (1-5, lower is better): 5

@flyte-bot (Contributor) commented Jan 4, 2025

Code Review Agent Run #39883a

Actionable Suggestions - 7
  • plugins/flytekit-spark/flytekitplugins/spark/models.py - 2
    • Missing pod parameters in with_overrides method · Line 79-80
    • Consider adding null validation checks · Line 193-194
  • flytekit/core/data_persistence.py - 5
Additional Suggestions - 3
  • flytekit/core/data_persistence.py - 2
    • Consider optimizing bucket extraction timing · Line 521-522
    • Consider combining empty dict initializations · Line 59-60
  • plugins/flytekit-spark/tests/test_spark_task.py - 1
Review Details
  • Files reviewed - 5 · Commit Range: 64c6c79..0187150
    • Dockerfile.dev
    • flytekit/core/data_persistence.py
    • plugins/flytekit-spark/flytekitplugins/spark/models.py
    • plugins/flytekit-spark/flytekitplugins/spark/task.py
    • plugins/flytekit-spark/tests/test_spark_task.py
  • Files skipped - 0
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • MyPy (Static Code Analysis) - ✔︎ Successful
    • Astral Ruff (Static Code Analysis) - ✔︎ Successful

AI Code Review powered by Bito

@flyte-bot (Contributor) commented Jan 4, 2025

Changelist by Bito

This pull request implements the following key changes.

Key Change → Files Impacted

Feature Improvement - Storage Backend Migration to obstore
  • data_persistence.py - Refactored storage backend implementation to use obstore instead of direct cloud storage SDKs
  • obstore_filesystem.py - Added new filesystem implementations for S3, GCS, and Azure using obstore
  • basic_dfs.py - Updated storage options handling to support obstore integration
  • sd_transformers.py - Modified storage path handling for Polars integration

Testing - Enhanced Test Coverage for Storage Operations
  • test_data.py - Added comprehensive tests for obstore integration; updated GCS and Azure storage option tests with mock implementations
  • test_data_persistence.py - Enhanced Azure storage provider tests with base64 encoding and a mock store
  • test_flyte_directory.py - Updated S3 filesystem tests to use the new obstore implementation

Other Improvements - Dependency Updates
  • Dockerfile.dev - Added obstore dependency
  • pyproject.toml - Added obstore package dependency

Comment on lines 79 to 80
driver_pod=self.driver_pod,
executor_pod=self.executor_pod,

Missing pod parameters in with_overrides method

Consider adding driver_pod and executor_pod to the with_overrides method to maintain consistency with the constructor parameters.

Code suggestion
Check the AI-generated fix before applying
 @@ -56,6 +56,8 @@ def with_overrides(
          new_spark_conf: Optional[Dict[str, str]] = None,
          new_hadoop_conf: Optional[Dict[str, str]] = None,
          new_databricks_conf: Optional[Dict[str, Dict]] = None,
 +        driver_pod: Optional[K8sPod] = None,
 +        executor_pod: Optional[K8sPod] = None,
      ) -> "SparkJob":
          if not new_spark_conf:
              new_spark_conf = self.spark_conf
 @@ -65,6 +67,12 @@ def with_overrides(
          if not new_databricks_conf:
              new_databricks_conf = self.databricks_conf
 
 +        if not driver_pod:
 +            driver_pod = self.driver_pod
 +
 +        if not executor_pod:
 +            executor_pod = self.executor_pod
 +
          return SparkJob(
              spark_type=self.spark_type,
              application_file=self.application_file,

Code Review Run #39883a


Is this a valid issue, or was it incorrectly flagged by the Agent?

  • it was incorrectly flagged

Comment on lines 193 to 194
driverPod=self.driver_pod.to_flyte_idl() if self.driver_pod else None,
executorPod=self.executor_pod.to_flyte_idl() if self.executor_pod else None,

Consider adding null validation checks

Consider adding null checks for to_flyte_idl() calls on driver_pod and executor_pod to avoid potential NoneType errors.

Code suggestion
Check the AI-generated fix before applying
Suggested change
-driverPod=self.driver_pod.to_flyte_idl() if self.driver_pod else None,
-executorPod=self.executor_pod.to_flyte_idl() if self.executor_pod else None,
+driverPod=self.driver_pod.to_flyte_idl() if self.driver_pod and hasattr(self.driver_pod, 'to_flyte_idl') else None,
+executorPod=self.executor_pod.to_flyte_idl() if self.executor_pod and hasattr(self.executor_pod, 'to_flyte_idl') else None,

Code Review Run #39883a


Is this a valid issue, or was it incorrectly flagged by the Agent?

  • it was incorrectly flagged

Comment on lines 119 to 121
if "file" in path:
# no bucket for file
return "", path

Improve file protocol detection precision

The condition if "file" in path may match paths containing 'file' anywhere in the string, not just the protocol. Consider using if get_protocol(path) == "file" for more precise protocol checking.

Code suggestion
Check the AI-generated fix before applying
Suggested change
-if "file" in path:
-    # no bucket for file
-    return "", path
+if get_protocol(path) == "file":
+    # no bucket for file
+    return "", path

Code Review Run #39883a


Is this a valid issue, or was it incorrectly flagged by the Agent?

  • it was incorrectly flagged

Comment on lines 136 to 141
support_types = ["s3", "gs", "abfs"]
if protocol in support_types:
    file_path = "/".join(path_li[1:])
    return (bucket, file_path)
else:
    return bucket, path

Consider moving storage types to constant

The list of supported storage types support_types = ['s3', 'gs', 'abfs'] could be defined as a module-level constant since it's used for validation. Consider moving it outside the function to improve maintainability.

Code suggestion
Check the AI-generated fix before applying
 @@ -53,1 +53,2 @@
  _ANON = "anon"
 +SUPPORTED_STORAGE_TYPES = ["s3", "gs", "abfs"]
 @@ -136,2 +136,1 @@
 -        support_types = ["s3", "gs", "abfs"]
 -        if protocol in support_types:
 +        if protocol in SUPPORTED_STORAGE_TYPES:

Code Review Run #39883a


Is this a valid issue, or was it incorrectly flagged by the Agent?

  • it was incorrectly flagged
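
To illustrate the intended contract of split_path (hypothetical examples inferred from the snippets above, not tests from this PR):

# Cloud protocols have the bucket split off; "file" paths keep an empty bucket.
assert split_path("s3://my-bucket/dir/data.parquet") == ("my-bucket", "dir/data.parquet")
assert split_path("file:///tmp/data.parquet") == ("", "file:///tmp/data.parquet")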

kwargs["store"] = store

if anonymous:
kwargs[_ANON] = True

Consider using anonymous parameter for _ANON

Consider using kwargs[_ANON] = anonymous instead of hardcoding True to maintain consistency with the input parameter value.

Code suggestion
Check the AI-generated fix before applying
Suggested change
-kwargs[_ANON] = True
+kwargs[_ANON] = anonymous

Code Review Run #39883a


Is this a valid issue, or was it incorrectly flagged by the Agent?

  • it was incorrectly flagged

Comment on lines 433 to 434
bucket, to_path_file_only = split_path(to_path)
file_system = await self.get_async_filesystem_for_path(to_path, bucket)

Consider validating bucket before filesystem call

Consider validating the bucket parameter before passing it to get_async_filesystem_for_path(). An empty bucket could cause issues with certain storage backends. Similar issues were also found in:

  • flytekit/core/data_persistence.py (line 318)
  • flytekit/core/data_persistence.py (line 521)
  • flytekit/core/data_persistence.py (line 308)
Code suggestion
Check the AI-generated fix before applying
Suggested change
 bucket, to_path_file_only = split_path(to_path)
+protocol = get_protocol(to_path)
+if protocol in ['s3', 'gs', 'abfs'] and not bucket:
+    raise ValueError(f'Bucket cannot be empty for {protocol} protocol')
 file_system = await self.get_async_filesystem_for_path(to_path, bucket)

Code Review Run #39883a


Is this a valid issue, or was it incorrectly flagged by the Agent?

  • it was incorrectly flagged

@flyte-bot (Contributor) commented Jan 4, 2025

Code Review Agent Run #8926b7

Actionable Suggestions - 0
Review Details
  • Files reviewed - 1 · Commit Range: 0187150..7c76cc6
    • pyproject.toml
  • Files skipped - 0
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • MyPy (Static Code Analysis) - ✔︎ Successful
    • Astral Ruff (Static Code Analysis) - ✔︎ Successful

AI Code Review powered by Bito

@flyte-bot (Contributor) commented Jan 5, 2025

Code Review Agent Run #0b7f4d

Actionable Suggestions - 4
  • flytekit/core/data_persistence.py - 4
Additional Suggestions - 1
  • flytekit/core/data_persistence.py - 1
    • Consider combining dictionary initializations · Line 59-60
Review Details
  • Files reviewed - 3 · Commit Range: 58ba73c..353f000
    • Dockerfile.dev
    • flytekit/core/data_persistence.py
    • pyproject.toml
  • Files skipped - 0
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • MyPy (Static Code Analysis) - ✔︎ Successful
    • Astral Ruff (Static Code Analysis) - ✔︎ Successful

AI Code Review powered by Bito

Comment on lines 433 to 434
bucket, to_path_file_only = split_path(to_path)
file_system = await self.get_async_filesystem_for_path(to_path, bucket)

Consider extracting path splitting logic

Consider extracting the bucket and path splitting logic into a separate method to improve code reusability and maintainability. The split_path function is used in multiple places and could be encapsulated better.

Code suggestion
Check the AI-generated fix before applying
Suggested change
-bucket, to_path_file_only = split_path(to_path)
+bucket, path = self._split_and_get_bucket_path(to_path)
 file_system = await self.get_async_filesystem_for_path(to_path, bucket)

Code Review Run #0b7f4d


Is this a valid issue, or was it incorrectly flagged by the Agent?

  • it was incorrectly flagged

Comment on lines 391 to 392
bucket, from_path_file_only = split_path(from_path)
file_system = await self.get_async_filesystem_for_path(from_path, bucket)

Handle empty bucket case for storage

Consider handling the case where split_path() returns an empty bucket for non-file protocols. Passing an empty bucket to get_async_filesystem_for_path() could cause issues with cloud storage access.

Code suggestion
Check the AI-generated fix before applying
Suggested change
 bucket, from_path_file_only = split_path(from_path)
+protocol = get_protocol(from_path)
+if protocol not in ['file'] and not bucket:
+    raise ValueError(f'Empty bucket not allowed for protocol {protocol}')
 file_system = await self.get_async_filesystem_for_path(from_path, bucket)

Code Review Run #0b7f4d


Is this a valid issue, or was it incorrectly flagged by the Agent?

  • it was incorrectly flagged

Comment on lines 735 to 737
fsspec.register_implementation("s3", AsyncFsspecStore)
fsspec.register_implementation("gs", AsyncFsspecStore)
fsspec.register_implementation("abfs", AsyncFsspecStore)

Consider relocating fsspec implementation registrations

Consider moving the fsspec implementation registrations to a more appropriate initialization location, such as a module-level __init__.py or a dedicated setup function. This would improve code organization and make the registrations more discoverable.

Code suggestion
Check the AI-generated fix before applying
Suggested change
-fsspec.register_implementation("s3", AsyncFsspecStore)
-fsspec.register_implementation("gs", AsyncFsspecStore)
-fsspec.register_implementation("abfs", AsyncFsspecStore)
+def register_fsspec_implementations():
+    fsspec.register_implementation("s3", AsyncFsspecStore)
+    fsspec.register_implementation("gs", AsyncFsspecStore)
+    fsspec.register_implementation("abfs", AsyncFsspecStore)
+
+# Call during module initialization
+register_fsspec_implementations()

Code Review Run #0b7f4d


Is this a valid issue, or was it incorrectly flagged by the Agent?

  • it was incorrectly flagged

Specify the class properties for each file storage

Signed-off-by: machichima <[email protected]>
@flyte-bot (Contributor) commented Mar 4, 2025

Code Review Agent Run Status

  • Limitations and other issues: ❌ Failure - Bito Code Review Agent didn't review this pull request automatically because it exceeded the size limit. No action is needed if you didn't intend for the agent to review it. Otherwise, you can initiate the review by typing /review in a comment below.

Signed-off-by: machichima <[email protected]>
@flyte-bot (Contributor) commented Mar 6, 2025

Code Review Agent Run Status

  • Limitations and other issues: ❌ Failure - Bito Code Review Agent didn't review this pull request automatically because it exceeded the size limit. No action is needed if you didn't intend for the agent to review it. Otherwise, you can initiate the review by typing /review in a comment below.

Signed-off-by: machichima <[email protected]>
Signed-off-by: machichima <[email protected]>
@flyte-bot (Contributor) commented Mar 6, 2025

Code Review Agent Run Status

  • Limitations and other issues: ❌ Failure - Bito Code Review Agent didn't review this pull request automatically because it exceeded the size limit. No action is needed if you didn't intend for the agent to review it. Otherwise, you can initiate the review by typing /review in a comment below.

_ANON = "anon"
_FSSPEC_S3_KEY_ID = "access_key_id"
_FSSPEC_S3_SECRET = "secret_access_key"
_ANON = "skip_signature"
Member

Suggested change
-_ANON = "skip_signature"
+_SKIP_SIGNATURE = "skip_signature"
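
Presumably the motivation is that _ANON is already bound to "anon" a few lines up, so rebinding it to "skip_signature" would silently change the earlier constant's meaning.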

pyproject.toml Outdated
@@ -38,6 +38,7 @@ dependencies = [
"marshmallow-jsonschema>=0.12.0",
"mashumaro>=3.15",
"msgpack>=1.1.0",
"obstore",
Member

Could we remove adlfs, gcsfs, and s3fs as well?

Contributor Author

Sure!

if azure_cfg.tenant_id:
    kwargs["tenant_id"] = azure_cfg.tenant_id
kwargs[_ANON] = anonymous
return kwargs
Member

Do we need to return anything for GCS? If not, could we just remove this function?

Contributor Author

Yes, there's no need to return any config as long as the GOOGLE_APPLICATION_CREDENTIALS env variable is set. The original flytekit version only adds "token" for anonymous mode, which is not needed in obstore.
I will remove this function then
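
For illustration, a hedged sketch of what that means for callers, assuming obstore's GCSStore resolves application-default credentials from the environment as the reply above describes (the key path is a placeholder):

import os
from obstore.store import GCSStore

# Hypothetical: with this env var set, no extra storage options are needed.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account.json"
store = GCSStore("my-bucket")  # credentials picked up from the environment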

@@ -664,6 +750,9 @@ async def async_put_data(
put_data = loop_manager.synced(async_put_data)


register(["s3", "gs", "abfs"])
Member

Do we need abfss?

Contributor Author

Yes, I just added it. I also updated it here so that it is more obvious that abfss is needed:

elif protocol in ("abfs", "abfss"):
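
A sketch of the resulting registration (illustrative; it mirrors the register(["s3", "gs", "abfs"]) call above rather than quoting the PR verbatim):

import fsspec
from obstore.fsspec import AsyncFsspecStore

# Register the obstore-backed filesystem for every supported protocol,
# including both Azure schemes ("abfs" and "abfss").
for protocol in ("s3", "gs", "abfs", "abfss"):
    fsspec.register_implementation(protocol, AsyncFsspecStore, clobber=True)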

@flyte-bot (Contributor) commented Mar 18, 2025

Code Review Agent Run Status

  • Limitations and other issues: ❌ Failure - Bito Code Review Agent didn't review this pull request automatically because it exceeded the size limit. No action is needed if you didn't intend for the agent to review it. Otherwise, you can initiate the review by typing /review in a comment below.

Signed-off-by: machichima <[email protected]>
@flyte-bot (Contributor) commented Mar 19, 2025

Code Review Agent Run Status

  • Limitations and other issues: ❌ Failure - Bito Code Review Agent didn't review this pull request automatically because it exceeded the size limit. No action is needed if you didn't intend for the agent to review it. Otherwise, you can initiate the review by typing /review in a comment below.

Signed-off-by: machichima <[email protected]>
@flyte-bot (Contributor) commented Mar 20, 2025

Code Review Agent Run Status

  • Limitations and other issues: ❌ Failure - Bito Code Review Agent didn't review this pull request automatically because it exceeded the size limit. No action is needed if you didn't intend for the agent to review it. Otherwise, you can initiate the review by typing /review in a comment below.

@flyte-bot (Contributor) commented Mar 20, 2025

Code Review Agent Run Status

  • Limitations and other issues: ❌ Failure - Bito Code Review Agent didn't review this pull request automatically because it exceeded the size limit. No action is needed if you didn't intend for the agent to review it. Otherwise, you can initiate the review by typing /review in a comment below.

@flyteorg deleted a comment from flyte-bot on Apr 7, 2025
Dockerfile.dev Outdated
@@ -40,6 +40,7 @@ RUN SETUPTOOLS_SCM_PRETEND_VERSION_FOR_FLYTEKIT=$PSEUDO_VERSION \
-e /flytekit \
-e /flytekit/plugins/flytekit-deck-standard \
-e /flytekit/plugins/flytekit-flyteinteractive \
obstore==0.3.0b9 \
Contributor

Is this needed, or can we put it in a dev requirements file? If we need it, should we bump to 0.6.0?

Contributor Author

Oh sorry, I forgot to change this. I'll update this and pyproject.toml to 0.6.0 as well.

On Apr 8, 2025, @machichima changed the title from "[WIP] Apply obstore as storage backend" to "Apply obstore as storage backend".
@flyte-bot (Contributor) commented Apr 8, 2025

Code Review Agent Run Status

  • Limitations and other issues: ❌ Failure - Bito Code Review Agent didn't review this pull request automatically because it exceeded the size limit. No action is needed if you didn't intend for the agent to review it. Otherwise, you can initiate the review by typing /review in a comment below.
