[Discussion] Object Store Composition #16

Closed
@tustvold

Problem

Initially the ObjectStore API was relatively simple, consisting of a few methods to interact with object stores. As such, many systems took this abstraction and used it as a generic IO abstraction. This is good, and it is what the crate was designed for.

As people wanted additional functionality, such as instrumentation, caching or concurrency limiting, this was implemented by creating ObjectStore implementations that wrap the existing ones. Again this worked well.

However, over time the ObjectStore API has grown, and now has 8 required methods and a further 10 methods with default implementations. This creates a number of challenges for this wrapper based approach for composition.

API Surface

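As a wrapper must avoid "despecializing" methods, i.e. silently falling back to a default implementation where the wrapped store provides an optimized one, it must implement all 18 methods. Not only is this burdensome, it also creates upgrade hazards as new methods are added, potentially in non-breaking versions.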
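
To make the despecialization hazard concrete, here is a minimal sketch. The `Store` trait, `Backend` and `Wrapper` are hypothetical, much-reduced stand-ins and not the real ObjectStore API: the trait has one required method and one provided method, the backend specializes the provided method, and the wrapper forwards only the required one.

```rust
use std::cell::Cell;

/// Hypothetical, much-reduced stand-in for a store trait (not the real API).
trait Store {
    /// Required: fetch a single byte range.
    fn get_range(&self, start: usize, end: usize) -> Vec<u8>;

    /// Provided: the default issues one `get_range` call per range.
    fn get_ranges(&self, ranges: &[(usize, usize)]) -> Vec<Vec<u8>> {
        ranges.iter().map(|&(s, e)| self.get_range(s, e)).collect()
    }
}

/// Hypothetical backend that counts the requests it serves.
struct Backend {
    data: Vec<u8>,
    requests: Cell<usize>,
}

impl Store for Backend {
    fn get_range(&self, start: usize, end: usize) -> Vec<u8> {
        self.requests.set(self.requests.get() + 1);
        self.data[start..end].to_vec()
    }

    /// Specialized: serves every range from a single request.
    fn get_ranges(&self, ranges: &[(usize, usize)]) -> Vec<Vec<u8>> {
        self.requests.set(self.requests.get() + 1);
        ranges.iter().map(|&(s, e)| self.data[s..e].to_vec()).collect()
    }
}

/// Wrapper that forwards only the required method.
struct Wrapper<S>(S);

impl<S: Store> Store for Wrapper<S> {
    fn get_range(&self, start: usize, end: usize) -> Vec<u8> {
        self.0.get_range(start, end)
    }
    // `get_ranges` is not forwarded, so the trait default is used and the
    // backend's specialized implementation is silently bypassed.
}

/// Number of backend requests needed to fetch three ranges.
fn requests_for_three_ranges(wrapped: bool) -> usize {
    let backend = Backend { data: (0u8..100).collect(), requests: Cell::new(0) };
    let ranges = [(0, 10), (20, 30), (40, 50)];
    if wrapped {
        let wrapper = Wrapper(backend);
        let _ = wrapper.get_ranges(&ranges);
        wrapper.0.requests.get()
    } else {
        let _ = backend.get_ranges(&ranges);
        backend.requests.get()
    }
}
```

In this sketch the unwrapped backend serves the three ranges with one request, whereas the wrapped one issues three. The code compiles either way, which is exactly why a newly added default method is an upgrade hazard for existing wrappers.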

Additional Context

As the logic within these wrappers has grown more complex, the need has arisen to pass additional information through to this logic. This motivates requests like #17.

Interface Creep

In many places the ObjectStore interface gets used as the abstraction for components that don't actually require the full breadth of ObjectStore functionality. There is no need, for example, for a parquet reader to depend on more than the ability to fetch ranges of bytes.

This leads to perverse "ObjectStore" implementations that actually only implement, say, get functionality. Similarly, in contexts like apache/datafusion#14286 it creates complexities around how to shim the full ObjectStore interface, despite the operators in question only using a very small subset of this functionality.
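
As a rough illustration of what a narrower dependency could look like, the sketch below uses a hypothetical `FetchRange` trait (synchronous for brevity, and not an existing trait in object_store or parquet). The `read_tail` function stands in for a reader that only needs byte ranges, for example to read a file footer.

```rust
use std::ops::Range;

/// Hypothetical narrow trait: the only capabilities this reader needs.
trait FetchRange {
    /// Total size of the underlying object in bytes.
    fn size(&self) -> usize;
    /// Fetch the bytes in `range`.
    fn fetch(&self, range: Range<usize>) -> Vec<u8>;
}

/// In-memory source; no list, put, delete or copy required.
struct InMemory(Vec<u8>);

impl FetchRange for InMemory {
    fn size(&self) -> usize {
        self.0.len()
    }

    fn fetch(&self, range: Range<usize>) -> Vec<u8> {
        self.0[range].to_vec()
    }
}

/// Reads the trailing `n` bytes of the source (or all of it, if shorter).
fn read_tail<R: FetchRange>(source: &R, n: usize) -> Vec<u8> {
    let size = source.size();
    let start = size.saturating_sub(n);
    source.fetch(start..size)
}
```

An implementor here has two methods to write rather than eighteen, and nothing has to be shimmed or left unimplemented.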

Request Correlation

As the ObjectStore logic has gotten more sophisticated, incorporating automatic retries, request batching, etc., the relationship between an ObjectStore method call and the requests it actually issues has become rather fuzzy. This makes implementing instrumentation, concurrency limiting, tokio task dispatch, etc. at this API boundary increasingly inaccurate and problematic.
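
The mismatch can be sketched with a toy retry loop. Everything below is hypothetical (`FlakyTransport`, `RetryingStore`, the `"503"` error string); it only illustrates that a counter placed around the method call and a counter placed on the transport can disagree.

```rust
use std::cell::Cell;

/// Hypothetical transport that fails a fixed number of times, then succeeds.
struct FlakyTransport {
    failures_left: Cell<u32>,
    requests_sent: Cell<u32>,
}

impl FlakyTransport {
    fn send(&self) -> Result<&'static str, &'static str> {
        self.requests_sent.set(self.requests_sent.get() + 1);
        if self.failures_left.get() > 0 {
            self.failures_left.set(self.failures_left.get() - 1);
            Err("503")
        } else {
            Ok("payload")
        }
    }
}

/// Hypothetical store whose `get` retries transparently.
struct RetryingStore {
    transport: FlakyTransport,
    max_retries: u32,
}

impl RetryingStore {
    fn get(&self) -> Result<&'static str, &'static str> {
        let mut attempt = 0;
        loop {
            match self.transport.send() {
                Ok(body) => return Ok(body),
                // Out of retries: surface the last error.
                Err(e) if attempt >= self.max_retries => return Err(e),
                Err(_) => attempt += 1,
            }
        }
    }
}

/// Returns (calls seen by a wrapper around `get`, requests actually sent).
fn calls_vs_requests(failures: u32) -> (u32, u32) {
    let store = RetryingStore {
        transport: FlakyTransport {
            failures_left: Cell::new(failures),
            requests_sent: Cell::new(0),
        },
        max_retries: 3,
    };
    // A wrapper at the method boundary observes exactly one call.
    let method_calls = 1;
    let _ = store.get();
    (method_calls, store.transport.requests_sent.get())
}
```

With two transient failures, a wrapper around `get` sees one call while three requests go over the wire, so a metric or concurrency limit applied at the method boundary undercounts by a factor of three.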

Thoughts

I personally think we should encourage a move away from this wrapper based form of composition and instead do the following:

  • Encourage use of specialized traits like parquet's AsyncFileReader that reflect what a given component actually needs, and can evolve independently of ObjectStore
  • Add additional functionality for injecting logic into the HTTP request path (#6056) allowing
    • More accurate instrumentation
    • More accurate concurrency limiting
    • Potential sophistication w.r.t tokio runtime dispatch
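
One possible shape for the second suggestion, sketched under loud assumptions: `HttpSend`, `FakeClient`, `Counting` and `Store` are all hypothetical names, synchronous for brevity, and not the design proposed in #6056. The idea is that logic is injected around a single request method rather than around every store method.

```rust
use std::cell::Cell;

/// Hypothetical single-method request interface (illustrative only).
trait HttpSend {
    fn send(&self, url: &str) -> Result<String, String>;
}

/// Stand-in for a real HTTP client.
struct FakeClient;

impl HttpSend for FakeClient {
    fn send(&self, url: &str) -> Result<String, String> {
        Ok(format!("body of {}", url))
    }
}

/// Middleware that counts every request passing through it.
struct Counting<T> {
    inner: T,
    count: Cell<u64>,
}

impl<T: HttpSend> HttpSend for Counting<T> {
    fn send(&self, url: &str) -> Result<String, String> {
        self.count.set(self.count.get() + 1);
        self.inner.send(url)
    }
}

/// Hypothetical store that is generic over its request path.
struct Store<T> {
    http: T,
}

impl<T: HttpSend> Store<T> {
    /// One call, one request.
    fn get(&self, path: &str) -> Result<String, String> {
        self.http.send(path)
    }

    /// One call, one request per path.
    fn get_many(&self, paths: &[&str]) -> Result<Vec<String>, String> {
        paths.iter().map(|p| self.http.send(p)).collect()
    }
}

/// Issues two store calls and returns the number of requests observed.
fn count_requests() -> u64 {
    let store = Store {
        http: Counting { inner: FakeClient, count: Cell::new(0) },
    };
    let _ = store.get("a");
    let _ = store.get_many(&["b", "c", "d"]);
    store.http.count.get()
}
```

Here the middleware implements one method and still observes all four requests issued by the two store calls. A concurrency limiter or runtime dispatcher could presumably sit at the same point.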

I can't help feeling that right now ObjectStore is stuck between trying to expose the functionality of object stores in a portable and ergonomic fashion, whilst also trying to provide some sort of generic all-purpose IO subsystem abstraction, and I suspect these may be incompatible goals.

Tagging @alamb @crepererum @Xuanwo @waynr @kylebarron
