[Discussion] Object Store Composition

**Problem**

Initially the ObjectStore API was relatively simple, consisting of a few methods to interact with object stores. As such, many systems took this abstraction and used it as a generic IO abstraction. This is good, and it is what the crate was designed for.

As people wanted additional functionality, such as instrumentation, caching or concurrency limiting, this was implemented by creating ObjectStore implementations that wrap the existing ones. Again this worked well.

However, over time the ObjectStore API has grown, and it now has 8 required methods and a further 10 methods with default implementations. This creates a number of challenges for this wrapper-based approach to composition.

***API Surface***

As a wrapper must avoid "despecializing" methods (silently replacing an inner store's optimized override with the trait's default implementation), it must implement all 18 methods. Not only is this burdensome, it also creates upgrade hazards as new methods are added, potentially in non-breaking versions.
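To make the despecialization hazard concrete, here is a minimal sketch. It does not use the real crate: `Store`, `Backend`, `Lazy` and `Full` are hypothetical, heavily simplified, synchronous stand-ins, and the request counter exists only for illustration.

```rust
use std::cell::Cell;
use std::ops::Range;

/// Hypothetical, simplified stand-in for the ObjectStore trait.
trait Store {
    /// Required method: fetch a single byte range.
    fn get_range(&self, range: Range<usize>) -> Vec<u8>;

    /// Default method: falls back to one request per range.
    fn get_ranges(&self, ranges: &[Range<usize>]) -> Vec<Vec<u8>> {
        ranges.iter().map(|r| self.get_range(r.clone())).collect()
    }
}

/// Backend that specializes `get_ranges` to serve all ranges in one request.
struct Backend {
    data: Vec<u8>,
    requests: Cell<usize>,
}

impl Store for Backend {
    fn get_range(&self, range: Range<usize>) -> Vec<u8> {
        self.requests.set(self.requests.get() + 1);
        self.data[range].to_vec()
    }

    fn get_ranges(&self, ranges: &[Range<usize>]) -> Vec<Vec<u8>> {
        // A single (pretend) request covers every range.
        self.requests.set(self.requests.get() + 1);
        ranges.iter().map(|r| self.data[r.clone()].to_vec()).collect()
    }
}

/// Wrapper that forwards only the required method. It compiles fine, but
/// `get_ranges` now resolves to the trait default: the backend's
/// specialization is silently lost.
struct Lazy<S>(S);

impl<S: Store> Store for Lazy<S> {
    fn get_range(&self, range: Range<usize>) -> Vec<u8> {
        self.0.get_range(range)
    }
}

/// Wrapper that forwards every method, preserving the specialization.
struct Full<S>(S);

impl<S: Store> Store for Full<S> {
    fn get_range(&self, range: Range<usize>) -> Vec<u8> {
        self.0.get_range(range)
    }

    fn get_ranges(&self, ranges: &[Range<usize>]) -> Vec<Vec<u8>> {
        self.0.get_ranges(ranges)
    }
}
```

In this sketch, fetching three ranges through `Lazy` issues three backend requests, whereas `Full` issues one. The compiler gives no warning in the `Lazy` case, and the same thing would happen to `Full` if a new default method were added to the trait later.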

***Additional Context***

As the logic within these wrappers has grown more complex, there comes the need to pass additional information through to this logic. This motivates requests like apache/arrow-rs-object-store#17

***Interface Creep***

In many places the ObjectStore interface gets used as the abstraction for components that don't actually require the full breadth of ObjectStore functionality. There is no need, for example, for a parquet reader to depend on more than the ability to fetch ranges of bytes. 

This leads to perverse "ObjectStore" implementations that actually only implement, say, get functionality. Similarly, in contexts like https://github.com/apache/datafusion/pull/14286 it creates complexities around how to shim the full ObjectStore interface, despite the operators in question only using a very small subset of this functionality.

***Request Correlation***

As the ObjectStore logic has become more sophisticated, incorporating automatic retries, request batching, etc., the relationship between an ObjectStore method call and the underlying requests has become rather fuzzy. This makes implementing instrumentation, concurrency limiting, tokio task dispatch, etc. at this API boundary increasingly inaccurate or problematic.
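As a rough illustration of the mismatch, the sketch below models a store that retries internally. Everything here is hypothetical and synchronous for brevity (`Transport`, `Flaky`, `Counted` and `RetryingStore` are not types from the crate); the point is only that a counter at the method boundary and a counter at the request boundary disagree.

```rust
use std::cell::Cell;

/// Hypothetical stand-in for the HTTP layer beneath a store.
trait Transport {
    fn send(&self, path: &str) -> Result<Vec<u8>, String>;
}

/// Transport that fails a fixed number of times, then succeeds.
struct Flaky {
    failures: Cell<u32>,
}

impl Transport for Flaky {
    fn send(&self, path: &str) -> Result<Vec<u8>, String> {
        if self.failures.get() > 0 {
            self.failures.set(self.failures.get() - 1);
            return Err(format!("transient error fetching {}", path));
        }
        // Echo the path back as the "object body".
        Ok(path.as_bytes().to_vec())
    }
}

/// Transport decorator that counts every request actually issued.
struct Counted<T> {
    inner: T,
    requests: Cell<u32>,
}

impl<T: Transport> Transport for Counted<T> {
    fn send(&self, path: &str) -> Result<Vec<u8>, String> {
        self.requests.set(self.requests.get() + 1);
        self.inner.send(path)
    }
}

/// Simplified store whose `get` retries internally, so one method call
/// may fan out into several requests.
struct RetryingStore<T> {
    transport: T,
    max_attempts: u32,
}

impl<T: Transport> RetryingStore<T> {
    fn get(&self, path: &str) -> Result<Vec<u8>, String> {
        let mut last = Err("no attempts made".to_string());
        for _ in 0..self.max_attempts {
            last = self.transport.send(path);
            if last.is_ok() {
                break;
            }
        }
        last
    }
}
```

With a `Flaky` transport configured to fail twice, a single `get` call results in three requests. A wrapper around `get` would observe one operation, while the `Counted` transport observes all three, which is why instrumentation and concurrency limits are more accurate at the request level.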

**Thoughts**

I personally think we should encourage a move away from this wrapper-based form of composition and instead do the following:

* Encourage use of specialized traits like parquet's [AsyncFileReader](https://docs.rs/parquet/latest/parquet/arrow/async_reader/trait.AsyncFileReader.html) that reflect what a given component actually needs, and can evolve independently of ObjectStore
* Add additional functionality for injecting logic into the HTTP request path (#6056) allowing
    * More accurate instrumentation
    * More accurate concurrency limiting
    * Potentially more sophisticated tokio runtime dispatch
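
As a sketch of what the first suggestion could look like, the component below depends on a narrow trait rather than on the full store interface. `RangeReader` and `read_tail` are hypothetical names invented for this example, and the trait is synchronous for brevity; it is not the signature of parquet's AsyncFileReader.

```rust
use std::ops::Range;

/// Hypothetical narrow trait: only what a tail/footer reader needs.
trait RangeReader {
    /// Total size of the underlying object in bytes.
    fn size(&self) -> usize;

    /// Fetch the bytes in `range`, or an error if it is out of bounds.
    fn read_range(&self, range: Range<usize>) -> Result<Vec<u8>, String>;
}

/// An in-memory buffer is enough to satisfy the trait, with no need to
/// shim list, put, delete, copy and so on.
impl RangeReader for Vec<u8> {
    fn size(&self) -> usize {
        self.len()
    }

    fn read_range(&self, range: Range<usize>) -> Result<Vec<u8>, String> {
        self.get(range.clone())
            .map(|bytes| bytes.to_vec())
            .ok_or_else(|| format!("range {:?} out of bounds", range))
    }
}

/// Component that reads the trailing `n` bytes of an object. It compiles
/// against `RangeReader` alone, so it is unaffected by changes to any
/// larger store interface.
fn read_tail<R: RangeReader>(reader: &R, n: usize) -> Result<Vec<u8>, String> {
    let size = reader.size();
    let start = size.saturating_sub(n);
    reader.read_range(start..size)
}
```

An adapter from a real store to such a trait would be a few lines, and test doubles like the `Vec<u8>` implementation above no longer need to pretend to be a full object store.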

I can't help feeling that ObjectStore is currently stuck between two goals: exposing the functionality of object stores in a portable and ergonomic fashion, and providing some sort of generic all-purpose IO subsystem abstraction. I'm not sure these goals are compatible.

Tagging @alamb @crepererum @Xuanwo @waynr @kylebarron 
