Description
Please describe what you are trying to do.
TLDR: let's combine forces rather than all reimplementing caching / chunking / etc in object_store!
The `ObjectStore` trait is flexible, and it is common to compose a stack of `ObjectStore` instances, with one wrapping the underlying stores. For example, the `ThrottledStore` and `LimitStore` provided with the object_store crate do exactly this:
```
┌──────────────────────────────┐
│        ThrottledStore        │
│(adds user configured delays) │
└──────────────────────────────┘
               ▲
               │
               │
┌──────────────────────────────┐
│      Inner ObjectStore       │
│   (for example, AmazonS3)    │
└──────────────────────────────┘
```
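To make the wrapping pattern concrete, here is a minimal sketch. Note that the real `ObjectStore` trait is async and much richer; the simplified trait, the `InMemoryStore`, and the signatures below are my own stand-ins for illustration, not the actual crate API:

```rust
use std::thread::sleep;
use std::time::Duration;

// Simplified stand-in for the real (async) object_store::ObjectStore trait.
trait ObjectStore {
    fn get(&self, path: &str) -> Vec<u8>;
}

// A toy store playing the role of an inner store such as AmazonS3.
struct InMemoryStore;

impl ObjectStore for InMemoryStore {
    fn get(&self, path: &str) -> Vec<u8> {
        format!("contents of {path}").into_bytes()
    }
}

// A throttling wrapper: it delegates every request to an inner store,
// adding a user-configured delay first (in the spirit of ThrottledStore).
struct ThrottledStore<T: ObjectStore> {
    inner: T,
    delay: Duration,
}

impl<T: ObjectStore> ObjectStore for ThrottledStore<T> {
    fn get(&self, path: &str) -> Vec<u8> {
        sleep(self.delay); // user-configured delay
        self.inner.get(path)
    }
}

fn main() {
    // Compose the stack: ThrottledStore wrapping the inner store.
    let store = ThrottledStore {
        inner: InMemoryStore,
        delay: Duration::from_millis(1),
    };
    let data = store.get("bucket/file.parquet");
    assert_eq!(data, b"contents of bucket/file.parquet");
    println!("{}", String::from_utf8(data).unwrap());
}
```

Because the wrapper itself implements `ObjectStore`, the rest of the application is oblivious to how many layers are stacked underneath.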
Many Different Behaviors
There are many types of behaviors that can be implemented this way. Some examples I am aware of:
- The `ThrottledStore` and `LimitStore` provided with the object_store crate
- Run on a different tokio runtime (such as the `DeltaIOStorageBackend` in delta-rs from @ion-elgreco)
- Limit the total size of any individual request (e.g. the `LimitedRequestSizeObjectStore` from "Timeouts reading 'large' files from object stores over 'slow' connections" datafusion#15067)
- Break single large requests into multiple concurrent small requests ("chunking") -- I think @crepererum is working on this in influx
- Cache results of requests locally using memory / disk (see `ObjectStoreMemCache` in influxdb3_core, and this one in slatedb from @criccomini -- thanks @ion-elgreco for the pointer)
- Collect statistics / traces and report metrics (see `ObjectStoreMetrics` in influxdb3_core)
- Visualize object store requests over time
Desired behavior is varied and application specific
Also, depending on the needs of the particular app, the ideal behavior / policy is likely different.
For example,
- In the case of "Timeouts reading 'large' files from object stores over 'slow' connections" datafusion#15067, splitting one large request into several small requests made in series is likely the desired approach (to maximize the chance that they succeed)
- If you are trying to maximize read bandwidth in a cloud server setting, splitting up ("Chunking") large requests into several parallel ones may be desired
- If you are trying to minimize costs (for example doing bulk reorganizations / compactions on historical data that are not latency sensitive), using a single request for large objects (what is done today) might be desired
- Maybe you want to adapt more dynamically to network and object store conditions as described in Exploiting Cloud Object Storage for High-Performance Analytics
So the point is that I don't think any one individual policy will work for all use cases (though we can certainly discuss changing the default policy)
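Whichever policy is chosen, the core mechanic behind both the sequential and the parallel variants is the same: split one large byte range into bounded sub-ranges, then decide how to issue them. A minimal sketch (the function name and shape are mine, not anything in object_store):

```rust
use std::ops::Range;

/// Split one large byte range into sub-ranges of at most `chunk_size` bytes.
///
/// A "maximize success" policy would issue these one at a time in series;
/// a "maximize bandwidth" policy would issue them concurrently and
/// reassemble the results; a "minimize cost" policy would skip splitting
/// entirely and issue the original range as a single request.
fn split_range(range: Range<u64>, chunk_size: u64) -> Vec<Range<u64>> {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    let mut chunks = Vec::new();
    let mut start = range.start;
    while start < range.end {
        let end = (start + chunk_size).min(range.end);
        chunks.push(start..end);
        start = end;
    }
    chunks
}

fn main() {
    // A 1000-byte read split into 256-byte sub-requests;
    // the final chunk carries the remainder.
    let chunks = split_range(0..1000, 256);
    assert_eq!(chunks, vec![0..256, 256..512, 512..768, 768..1000]);
    println!("{chunks:?}");
}
```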
Since `ObjectStore` is already composable, I already see projects implementing these types of things independently (for example, delta-rs and influxdb_iox both have cross-runtime object stores, and @mildbyte from splitgraph implemented some sort of visualization of object store requests over time).
I believe this is similar to the OpenDAL concept of layers, but @Xuanwo please correct me if I am wrong.
Desired Solution
I would like it to be easier for users of object_store to access such features without having to implement custom wrappers independently and in parallel.
Alternatives
New object_store_util crate
One alternative is to make a new crate, named `object_store_util` or similar (mirroring `futures-util` and `tokio-util`), that has a bunch of these `ObjectStore` combinators.
This could be housed outside of the Apache organization, but I think it would be most valuable for the community if it were inside.
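For illustration, such a crate could expose the combinators through an extension trait, in the spirit of how `StreamExt` in futures-util decorates `Stream`. Everything below is a hypothetical sketch against a simplified stand-in trait, not a proposal for the actual API; the names `Throttled`, `Limited`, and `ObjectStoreExt` are invented:

```rust
use std::time::Duration;

// Simplified stand-in for the real (async) ObjectStore trait.
trait ObjectStore {
    fn get(&self, path: &str) -> Vec<u8>;
}

// Hypothetical combinator wrappers an object_store_util crate might provide.
struct Throttled<T> {
    inner: T,
    delay: Duration,
}

struct Limited<T> {
    inner: T,
    // A real implementation would gate concurrency to this many
    // in-flight requests (e.g. with a semaphore); unused in this sketch.
    #[allow(dead_code)]
    max_requests: usize,
}

impl<T: ObjectStore> ObjectStore for Throttled<T> {
    fn get(&self, path: &str) -> Vec<u8> {
        std::thread::sleep(self.delay);
        self.inner.get(path)
    }
}

impl<T: ObjectStore> ObjectStore for Limited<T> {
    fn get(&self, path: &str) -> Vec<u8> {
        self.inner.get(path)
    }
}

// Blanket extension trait giving every store chainable combinators,
// mirroring how StreamExt decorates every Stream.
trait ObjectStoreExt: ObjectStore + Sized {
    fn throttled(self, delay: Duration) -> Throttled<Self> {
        Throttled { inner: self, delay }
    }
    fn limited(self, max_requests: usize) -> Limited<Self> {
        Limited { inner: self, max_requests }
    }
}

impl<T: ObjectStore> ObjectStoreExt for T {}

struct InMemoryStore;
impl ObjectStore for InMemoryStore {
    fn get(&self, path: &str) -> Vec<u8> {
        path.as_bytes().to_vec()
    }
}

fn main() {
    // Compose a stack declaratively instead of nesting constructors by hand.
    let store = InMemoryStore
        .throttled(Duration::from_millis(1))
        .limited(8);
    assert_eq!(store.get("a/b"), b"a/b");
    println!("ok");
}
```

One design upside of this style is that each combinator stays a plain `ObjectStore` wrapper, so users who prefer explicit construction can still nest the structs directly.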
Add additional policies to the provided implementations
An alternative is to implement more sophisticated default implementations (for example, adding more options to the `AmazonS3` implementation).
One upside of this approach is that it could take advantage of implementation-specific features.
One downside is additional code and configuration complexity, especially as the different strategies all apply to multiple stores (e.g. GCP, S3, and Azure). Another downside is that specifying the policy might be complex (for example, specifying concurrency along with chunking, and under what circumstances each should be used).
Additional context