Multi scan API #13473

Closed · wants to merge 16 commits

Conversation

@anand1976 (Contributor) commented Mar 19, 2025

A multi-scan API for users to pass a set of scan ranges and have the table readers determine the optimal strategy for performing the scans, for example by coalescing IOs across scans. The requested scans should be in increasing key order. The scan start keys and other info are passed to NewMultiScanIterator, which in turn uses the newly added Prepare() interface in Iterator to update the iterator. Prepare() takes a vector of ScanOptions, which contain the start keys and optional upper bounds, as well as user-defined parameters in the property_bag that are passed through as-is to external table readers.

The initial implementation plumbs this through to the ExternalTableReader. This PR also fixes premature destruction of the external table iterator after the first scan of the multi-scan. The LevelIterator treats an invalid iterator as a potential end of file and destroys the table iterator in order to move to the next file. To prevent that, this PR defines the NextAndGetResult interface that the external table iterator must implement. The result returned by NextAndGetResult distinguishes between iterator invalidation due to an out-of-bound key and end of file.

Eventually, I envision the MultiScanIterator being built on top of a producer-consumer-queue-like container, with RocksDB (producer) enqueueing keys and values into the container and the application (consumer) dequeueing them. Unlike a traditional producer-consumer queue, there is no concurrency here. The results will be buffered in the container, and when the buffer is empty a new batch will be read from the child iterators. This allows the virtual function call overhead to be amortized over many entries.
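
For illustration, here is a rough usage sketch. The constructor shapes, the NewMultiScanIterator return type, and the nested iteration style are assumptions pieced together from the snippets discussed later in this PR, not a committed interface:

// Hypothetical sketch only; not the final API.
ReadOptions ro;
std::vector<ScanOptions> scan_options(
    {ScanOptions(Slice("a"), Slice("c")),   // ranges in increasing key order
     ScanOptions(Slice("m"), Slice("p"))});

std::unique_ptr<MultiScan> multi_scan =
    db->NewMultiScanIterator(ro, db->DefaultColumnFamily(), scan_options);

try {
  for (auto scan : *multi_scan) {   // one scan per requested range
    for (auto kv : scan) {          // key-value pairs within a single scan range
      // consume kv
    }
  }
} catch (Status s) {
  // iteration errors surface as a thrown Status in this design
}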

TODO:

  1. Update the internal implementation of Prepare to trim the ScanOptions range based on the intersection with the table key range, taking into consideration unbounded scans and opaque user-defined bounds.
  2. Long term, take advantage of Prepare in BlockBasedTableIterator, at least for the upper bound case.

@anand1976 anand1976 added the WIP Work in progress label Mar 19, 2025
@facebook-github-bot

@anand1976 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@anand1976 anand1976 changed the title [Draft] Multi scan API [RFC] Multi scan API Mar 19, 2025
@facebook-github-bot

@anand1976 has updated the pull request. You must reimport the pull request before landing.

@anand1976

@jaykorean @pdillinger I updated the PR with a modified MultiScanIterator that uses the C++ input iterator interface. Let me know if this looks cleaner.

@anand1976

Let's think through the aspect of this attempting to maximize opportunities for overlapping scans to reuse each other's work. (Even if data ends up in the block cache, there can still be significant seeking work.) Imagine a concurrent tree data structure keeping track of pending latency-tolerant scan requests. We could have select read threads following a kind of elevator algorithm (always forward, though) to serve up pending scan requests in key order, reusing work between overlapping requests. There should be more opportunities for reusing results vs. a multiscan API like this, because late-breaking requests can still be (in a sense) coalesced with pending requests when doing work at or near their range. The downside of course is concurrency management, feeding results with minimal blocking, etc. Thoughts?

That's an interesting idea. The synchronization and coordination across scans in different threads would add some overhead, as well as potentially increase scan latency. Whether it's a net win depends on the workload, I guess. That's what I'd be most worried about.

The optimization I'm looking to enable with the multi-scan is for scans not necessarily overlapping in key range, but in pages. For example, two scans may include adjacent blocks that share the same 4KB page (or some multiple that we determine would benefit from coalescing). Also, I think the multiscan API is sort of orthogonal to what you propose, since it's just a way to capture the user intent.

@facebook-github-bot

@anand1976 has updated the pull request. You must reimport the pull request before landing.

3 similar comments

@anand1976 anand1976 removed the WIP Work in progress label Mar 28, 2025
@facebook-github-bot

@anand1976 has updated the pull request. You must reimport the pull request before landing.

1 similar comment

@facebook-github-bot

@anand1976 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot

@anand1976 has updated the pull request. You must reimport the pull request before landing.

// We may add other options such as prefix scan in the future.
struct ScanOptions {
  // The scan range. Mandatory for start to be set, limit is optional
  RangeOpt range;

Contributor Author

Is RangeOpt the right thing to use here? Or should we define something with an optional string for start and limit? If ScanOptions is cached in the iterator, we should probably do a deep copy if using the former.

@anand1976 (Contributor Author) commented Mar 30, 2025

Another thing to think about - for internal iterators, start key should be the internal key whereas RangeOpt has the user key. Maybe we need an InternalRangeOpt for internal usage while using RangeOpt in the API.

Contributor

Is RangeOpt the right thing to use here?

Probably yes until SliceBound is available for iterator bounds. However, I'm not clear on the use case for no limit with MultiScan, at least multiple ranges with no limit.

Maybe we need an InternalRangeOpt for internal usage

Probably a good idea. Potentially relevant to some uses of MaybeAddTimestampsToRange.

Contributor Author

However, I'm not clear on the use case for no limit with MultiScan, at least multiple ranges with no limit.

It could be some custom termination condition specified in property_bag (I'm thinking weight could be one of those instead of having an explicit field).

@facebook-github-bot

@anand1976 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot

@anand1976 has updated the pull request. You must reimport the pull request before landing.

1 similar comment

@anand1976 anand1976 changed the title [RFC] Multi scan API Multi scan API Mar 31, 2025
@pdillinger (Contributor) left a comment

I believe the PR description could use an update.

A lot of my feedback could be delayed to later PRs, but looking for a bit of cleanup/clarification.


namespace ROCKSDB_NAMESPACE {

#if 0

Contributor

Remove?

Contributor Author

Yes. Just left it here for the review as an alternative API. I can remove it now.


explicit MultiScanIterator(std::unique_ptr<Iterator>&& db_iter)
    : db_iter_(std::move(db_iter)) {}

Contributor

I believe all of these Iterators wrapping a db Iterator should be made non-copyable to avoid side effects across copies.

Or maybe they do need to be copyable. Anyway, I don't want them to accidentally sometimes / maybe work with various STL algorithms when they should ideally fail to compile. Do they just satisfy LegacyInputIterator?

Contributor Author

I don't know if LegacyInputIterator needs to have a copy constructor. It should be reasonable to make it non-copyable.
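
A minimal sketch of the move-only approach, with stand-in types rather than the actual PR code:

#include <memory>
#include <utility>

struct Iterator {  // stand-in for rocksdb::Iterator
  virtual ~Iterator() = default;
};

// A wrapper over the DB iterator that is move-only, so an accidental copy
// (e.g. inside an STL algorithm) fails to compile instead of silently
// sharing the underlying iterator state.
class ScanIteratorSketch {
 public:
  explicit ScanIteratorSketch(std::unique_ptr<Iterator>&& db_iter)
      : db_iter_(std::move(db_iter)) {}

  ScanIteratorSketch(const ScanIteratorSketch&) = delete;
  ScanIteratorSketch& operator=(const ScanIteratorSketch&) = delete;
  ScanIteratorSketch(ScanIteratorSketch&&) = default;
  ScanIteratorSketch& operator=(ScanIteratorSketch&&) = default;

 private:
  std::unique_ptr<Iterator> db_iter_;
};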

return *this;
}

bool operator==(ScanIterator other) const { return idx_ == other.idx_; }

Contributor

For this and SingleIterator, the operator== / operator!= semantics I believe are potentially hazardous. Can we take advantage of this?

As of C++17, the types of the begin-expr and the end-expr do not have to be the same, and in fact the type of the end-expr does not have to be an iterator: it just needs to be able to be compared for inequality with one.

So how about we use std::nullptr_t / nullptr as the end() value for these types of iterators?

Contributor Author

Interesting. Let me try that.
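
A small sketch of the nullptr-sentinel idea with stand-in types (the real ScanIterator would return RocksDB keys and values):

#include <cstddef>
#include <iostream>

// An input-style iterator whose end() is simply nullptr. Since C++17, range-for
// allows begin() and end() to have different types, so the sentinel only needs
// to be comparable for (in)equality with the iterator.
class SketchIterator {
 public:
  explicit SketchIterator(int remaining) : remaining_(remaining) {}
  int operator*() const { return remaining_; }              // stand-in for a KV
  SketchIterator& operator++() { --remaining_; return *this; }
  bool operator==(std::nullptr_t) const { return remaining_ == 0; }
  bool operator!=(std::nullptr_t) const { return remaining_ != 0; }

 private:
  int remaining_;
};

class SketchScan {
 public:
  SketchIterator begin() const { return SketchIterator(3); }
  std::nullptr_t end() const { return nullptr; }
};

int main() {
  for (int v : SketchScan{}) {  // compiles with -std=c++17
    std::cout << v << "\n";
  }
  return 0;
}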

void Prepare(const std::vector<ScanOptions>* scan_opts) override {
  scan_opts_ = scan_opts;
  if (file_iter_.iter()) {
    file_iter_.Prepare(scan_opts_);

Contributor

So far this is just plumbing that is not taken advantage of, correct?

Contributor Author

Yes. Main purpose is to plumb it through to the external table reader. Eventually we'll want to take advantage of it in block based table as well.

//
// If the sequence of Seeks is interrupted by seeking to some other target
// key, then the iterator is free to discard anything done during Prepare.
virtual void Prepare(const std::vector<ScanOptions>* scan_opts) = 0;

Contributor

It seems like we should Prepare a range of ScanOptions rather than a vector, so that we don't have to materialize a vector of what's (potentially) relevant to a particular file.
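
One way to express that suggestion, sketched with stand-in types and assuming C++20 std::span is available (a (pointer, count) pair would serve the same purpose in C++17):

#include <span>

struct ScanOptionsSketch {};  // stand-in for the ScanOptions struct in this PR

class InternalIteratorSketch {
 public:
  virtual ~InternalIteratorSketch() = default;

  // Takes any contiguous sub-range of scan options, so a level/file iterator
  // can hand a table iterator only the scans that overlap that file without
  // materializing a fresh std::vector.
  virtual void Prepare(std::span<const ScanOptionsSketch> scan_opts) = 0;
};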

// }
// } catch (Status s) {
// }
class MultiScanIterator {

Contributor

I found it difficult to read the giant nesting of classes below to deduce the relationship between MultiScanIterator, ScanIterator, Scan, and SingleIterator. Could you add some diagram or hierarchy or elaboration to the code example to make that clear?

Also should MultiScanIterator just be called MultiScan since it doesn't behave like an STL or RocksDB iterator?

ReadOptions ro;
std::vector<ScanOptions> scan_options(
    {ScanOptions(key_ranges[0], key_ranges[1]),
     ScanOptions(key_ranges[2], key_ranges[3])});

Contributor

Should probably also test the "no limit" case and cases with overlap

Contributor

I see you added overlap, but I don't see a "no limit" case

Contributor Author

Sorry, forgot to git commit it.

// of key-value pairs. In the case of an ExternalTableReader, the weight is
// passed through to the table reader and the interpretation is up to the
// reader implementation.
uint64_t weight = 0;

Contributor

Is weight maybe coming back in some form?

Contributor Author

I'm leaving it to the user to specify it in the property_bag. It'll be an out of band contract between the application and the external table reader. I don't see any possible use for it inside RocksDB. My initial thought was to define it and leave the interpretation flexible, but now I'm thinking that doesn't really accomplish anything. If we ever want to implement a count based scan, we can add a specific option for it at that point.


// This structure encapsulates the result of NextAndGetResult()
struct IterateResult {
Slice key;

Contributor

It seems like we have an opportunity to do something like IOBuf in the handling of keys and values. In many/most cases we will have materialized a key, and maybe in some cases a value, that we should be able to hand off to the user to make use of as long as they want without deep copying. (Perhaps "unnecessary copying" is more of a CPU overhead for scans than "unnecessary virtual calls". Hmm.)

Probably not a simple change, but perhaps something for the TODO list.

Contributor Author

I don't think we do any unnecessary copying today. We just return the value Slice and the corresponding block is pinned. Is it the "make use of as long as they want" part that you're focusing on? Because we don't guarantee the pinning once the iterator is advanced.
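
A tiny sketch of the lifetime contract described here, with std::string_view standing in for Slice (an illustration of usage, not PR code): the returned slice is only valid while the backing block stays pinned, i.e. until the iterator advances, so a caller that needs the data longer copies it out first.

#include <string>
#include <string_view>

// Deep-copies a key (or value) slice into owned storage so it remains valid
// after the iterator advances and the backing block may be unpinned.
std::string RetainKey(std::string_view key_slice) {
  return std::string(key_slice);
}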

@facebook-github-bot

@anand1976 has updated the pull request. You must reimport the pull request before landing.

@anand1976 anand1976 requested a review from pdillinger April 2, 2025 17:09
@pdillinger (Contributor) left a comment

Overall LGTM

const ReadOptions& _read_options, ColumnFamilyHandle* column_family,
const std::vector<ScanOptions>& scan_opts) {
std::unique_ptr<Iterator> iter(NewIterator(_read_options, column_family));
iter->Prepare(scan_opts);
std::unique_ptr<MultiScanIterator> ms_iter(
new MultiScanIterator(scan_opts, std::move(iter)));
std::unique_ptr<MultiScan> ms_iter(new MultiScan(scan_opts, std::move(iter)));

Contributor

Nit: prefer make_unique


// This structure encapsulates the result of NextAndGetResult()
struct IterateResult {
Slice key;

Contributor

Probably worth a comment about limited lifetime of key.

const ReadOptions& /*options*/, ColumnFamilyHandle* /*column_family*/,
const std::vector<ScanOptions>& /*scan_opts*/) {
std::unique_ptr<Iterator> iter(NewErrorIterator(Status::NotSupported()));
std::unique_ptr<MultiScan> ms_iter(new MultiScan(std::move(iter)));

Contributor

Nit: prefer make_unique

// ---
// |
// ScanIterator <-- std::input_iterator (returns the KVs of a single
// scan range)

Contributor

👍👍


@@ -0,0 +1,223 @@
// Copyright (c) Meta Platforms, Inc. and affiliates.

Contributor

Nit: a better file name might now be multi_scan.h

@facebook-github-bot

@anand1976 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot

@anand1976 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot

@anand1976 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot

@anand1976 merged this pull request in 24e2b05.

anand1976 added a commit that referenced this pull request Apr 3, 2025
Summary:
A multi scan API for users to pass a set of scan ranges and have the table readers determine the optimal strategy for performing the scans. This might include coalescing of IOs across scans, for example. The requested scans should be in increasing key order. The scan start keys and other info are passed to NewMultiScanIterator, which in turn uses the newly added Prepare() interface in Iterator to update the iterator. The Prepare() takes a vector of ScanOptions, which contain the start keys and optional upper bounds, as well as user-defined parameters in the property_bag that are passed through as-is to external table readers.

The initial implementation plumbs this through to the ExternalTableReader. This PR also fixes an issue of premature destruction of the external table iterator after the first scan of the multi-scan. The `LevelIterator` treats an invalid iterator as a potential end of file and destroys the table iterator in order to move to the next file. To prevent that, this PR defines the `NextAndGetResult` interface that the external table iterator must implement. The result returned by `NextAndGetResult` differentiates between iterator invalidation due to out of bound vs end of file.

Eventually, I envision the `MultiScanIterator` to be built on top of a producer-consumer queue like container, with RocksDB (producer) enqueueing keys and values into the container and the application (consumer) dequeueing them. Unlike a traditional producer consumer queue, there is no concurrency here. The results will be buffered in the container, and when the buffer is empty a new batch will be read from the child iterators. This will allow the virtual function call overhead to be amortized over many entries.

TODO (in future PRs):
1. Update the internal implementation of Prepare to trim the ScanOptions range based on the intersection with the table key range, taking into consideration unbounded scans and opaque user-defined bounds.
2. Long term, take advantage of Prepare in BlockBasedTableIterator, at least for the upper bound case.

Pull Request resolved: #13473

Reviewed By: pdillinger

Differential Revision: D71447559

Pulled By: anand1976

fbshipit-source-id: 31668abb0c529aa1ac1738ae46c36cbddf9148f1
ybtsdst pushed a commit to ybtsdst/rocksdb that referenced this pull request Apr 27, 2025