Multi scan API #13473

Closed · wants to merge 16 commits

Conversation

@anand1976 (Contributor) commented Mar 19, 2025

A multi-scan API for users to pass a set of scan ranges and have the table readers determine the optimal strategy for performing the scans, for example by coalescing IOs across scans. The requested scans should be in increasing key order. The scan start keys and other info are passed to NewMultiScanIterator, which in turn uses the newly added Prepare() interface in Iterator to update the iterator. Prepare() takes a vector of ScanOptions, which contain the start keys and optional upper bounds, as well as user-defined parameters in the property_bag that are passed through as-is to external table readers.

The initial implementation plumbs this through to the ExternalTableReader. This PR also fixes premature destruction of the external table iterator after the first scan of the multi-scan. The LevelIterator treats an invalid iterator as a potential end of file and destroys the table iterator in order to move to the next file. To prevent that, this PR defines the NextAndGetResult interface that the external table iterator must implement. The result returned by NextAndGetResult distinguishes between iterator invalidation due to an out-of-bound key and end of file.

Eventually, I envision the MultiScanIterator being built on top of a producer-consumer-queue-like container, with RocksDB (producer) enqueueing keys and values into the container and the application (consumer) dequeueing them. Unlike a traditional producer-consumer queue, there is no concurrency here. The results will be buffered in the container, and when the buffer is empty a new batch will be read from the child iterators. This allows the virtual function call overhead to be amortized over many entries.
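
For illustration, here is a rough usage sketch. The constructor shapes, the NewMultiScanIterator return type, and the nested iteration style are assumptions pieced together from the snippets discussed later in this PR, not a committed interface:

// Hypothetical sketch only; not the final API.
ReadOptions ro;
std::vector<ScanOptions> scan_options(
    {ScanOptions(Slice("a"), Slice("c")),   // ranges in increasing key order
     ScanOptions(Slice("m"), Slice("p"))});

std::unique_ptr<MultiScan> multi_scan =
    db->NewMultiScanIterator(ro, db->DefaultColumnFamily(), scan_options);

try {
  for (auto scan : *multi_scan) {   // one scan per requested range
    for (auto kv : scan) {          // key-value pairs within a single scan range
      // consume kv
    }
  }
} catch (Status s) {
  // iteration errors surface as a thrown Status in this design
}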

TODO:

  1. Update the internal implementation of Prepare to trim the ScanOptions range based on the intersection with the table key range, taking into consideration unbounded scans and opaque user-defined bounds.
  2. Long term, take advantage of Prepare in BlockBasedTableIterator, at least for the upper bound case.

@anand1976 anand1976 added the WIP Work in progress label Mar 19, 2025
@facebook-github-bot

@anand1976 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@anand1976 anand1976 changed the title [Draft] Multi scan API [RFC] Multi scan API Mar 19, 2025
@facebook-github-bot

@anand1976 has updated the pull request. You must reimport the pull request before landing.

@anand1976

@jaykorean @pdillinger I updated the PR with a modified MultiScanIterator that uses the C++ input iterator interface. Let me know if this looks cleaner.

@anand1976

Let's think through the aspect of this attempting to maximize opportunities for overlapping scans to reuse each other's work. (Even if data ends up in the block cache, there can still be significant seeking work.) Imagine a concurrent tree data structure keeping track of pending latency-tolerant scan requests. We could have select read threads following a kind of elevator algorithm (always forward, though) to serve up pending scan requests in key order, reusing work between overlapping requests. There should be more opportunities for reusing results vs. a multiscan API like this, because late-breaking requests can still be (in a sense) coalesced with pending requests when doing work at or near their range. The downside of course is concurrency management, feeding results with minimal blocking, etc. Thoughts?

That's an interesting idea. The synchronization and coordination across scans in different threads would add some overhead, as well as potentially increase scan latency. Whether it's a net win depends on the workload, I guess. That's what I'd be most worried about.

The optimization I'm looking to enable with the multi-scan is for scans not necessarily overlapping in key range, but in pages. For example, two scans may include adjacent blocks that share the same 4KB page (or some multiple that we determine would benefit from coalescing). Also, I think the multiscan API is sort of orthogonal to what you propose, since it's just a way to capture the user intent.

@facebook-github-bot

@anand1976 has updated the pull request. You must reimport the pull request before landing.

3 similar comments

@anand1976 anand1976 removed the WIP Work in progress label Mar 28, 2025
@facebook-github-bot

@anand1976 has updated the pull request. You must reimport the pull request before landing.

1 similar comment

@facebook-github-bot

@anand1976 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot

@anand1976 has updated the pull request. You must reimport the pull request before landing.

// We may add other options such as prefix scan in the future.
struct ScanOptions {
  // The scan range. Mandatory for start to be set, limit is optional
  RangeOpt range;

Contributor Author

Is RangeOpt the right thing to use here? Or should we define something with an optional string for start and limit? If ScanOptions is cached in the iterator, we should probably do a deep copy if using the former.

@anand1976 (Contributor Author) commented Mar 30, 2025

Another thing to think about - for internal iterators, start key should be the internal key whereas RangeOpt has the user key. Maybe we need an InternalRangeOpt for internal usage while using RangeOpt in the API.

Contributor

Is RangeOpt the right thing to use here?

Probably yes until SliceBound is available for iterator bounds. However, I'm not clear on the use case for no limit with MultiScan, at least multiple ranges with no limit.

Maybe we need an InternalRangeOpt for internal usage

Probably a good idea. Potentially relevant to some uses of MaybeAddTimestampsToRange.

Contributor Author

However, I'm not clear on the use case for no limit with MultiScan, at least multiple ranges with no limit.

It could be some custom termination condition specified in property_bag (I'm thinking weight could be one of those instead of having an explicit field).

@facebook-github-bot

@anand1976 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot

@anand1976 has updated the pull request. You must reimport the pull request before landing.

1 similar comment

@anand1976 anand1976 changed the title [RFC] Multi scan API Multi scan API Mar 31, 2025
@pdillinger (Contributor) left a comment

I believe the PR description could use an update.

A lot of my feedback could be delayed to later PRs, but looking for a bit of cleanup/clarification.


namespace ROCKSDB_NAMESPACE {

#if 0

Contributor

Remove?

Contributor Author

Yes. Just left it here for the review as an alternative API. I can remove it now.


explicit MultiScanIterator(std::unique_ptr<Iterator>&& db_iter)
    : db_iter_(std::move(db_iter)) {}

Contributor

I believe all of these Iterators wrapping a db Iterator should be made non-copyable to avoid side effects across copies.

Or maybe they do need to be copyable. Anyway, I don't want them to accidentally sometimes / maybe work with various STL algorithms when they should ideally fail to compile. Do they just satisfy LegacyInputIterator?

Contributor Author

I don't know if LegacyInputIterator needs to have a copy constructor. It should be reasonable to make it non-copyable.
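
A minimal sketch of the move-only approach, with stand-in types rather than the actual PR code:

#include <memory>
#include <utility>

struct Iterator {  // stand-in for rocksdb::Iterator
  virtual ~Iterator() = default;
};

// A wrapper over the DB iterator that is move-only, so an accidental copy
// (e.g. inside an STL algorithm) fails to compile instead of silently
// sharing the underlying iterator state.
class ScanIteratorSketch {
 public:
  explicit ScanIteratorSketch(std::unique_ptr<Iterator>&& db_iter)
      : db_iter_(std::move(db_iter)) {}

  ScanIteratorSketch(const ScanIteratorSketch&) = delete;
  ScanIteratorSketch& operator=(const ScanIteratorSketch&) = delete;
  ScanIteratorSketch(ScanIteratorSketch&&) = default;
  ScanIteratorSketch& operator=(ScanIteratorSketch&&) = default;

 private:
  std::unique_ptr<Iterator> db_iter_;
};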

return *this;
}

bool operator==(ScanIterator other) const { return idx_ == other.idx_; }

Contributor

For this and SingleIterator, the operator== / operator!= semantics I believe are potentially hazardous. Can we take advantage of this?

As of C++17, the types of the begin-expr and the end-expr do not have to be the same, and in fact the type of the end-expr does not have to be an iterator: it just needs to be able to be compared for inequality with one.

So how about we use std::nullptr_t / nullptr as the end() value for these types of iterators?

Contributor Author

Interesting. Let me try that.
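
A small sketch of the nullptr-sentinel idea with stand-in types (the real ScanIterator would return RocksDB keys and values):

#include <cstddef>
#include <iostream>

// An input-style iterator whose end() is simply nullptr. Since C++17, range-for
// allows begin() and end() to have different types, so the sentinel only needs
// to be comparable for (in)equality with the iterator.
class SketchIterator {
 public:
  explicit SketchIterator(int remaining) : remaining_(remaining) {}
  int operator*() const { return remaining_; }              // stand-in for a KV
  SketchIterator& operator++() { --remaining_; return *this; }
  bool operator==(std::nullptr_t) const { return remaining_ == 0; }
  bool operator!=(std::nullptr_t) const { return remaining_ != 0; }

 private:
  int remaining_;
};

class SketchScan {
 public:
  SketchIterator begin() const { return SketchIterator(3); }
  std::nullptr_t end() const { return nullptr; }
};

int main() {
  for (int v : SketchScan{}) {  // compiles with -std=c++17
    std::cout << v << "\n";
  }
  return 0;
}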

void Prepare(const std::vector<ScanOptions>* scan_opts) override {
  scan_opts_ = scan_opts;
  if (file_iter_.iter()) {
    file_iter_.Prepare(scan_opts_);

Contributor

So far this is just plumbing that is not taken advantage of, correct?

Contributor Author

Yes. Main purpose is to plumb it through to the external table reader. Eventually we'll want to take advantage of it in block based table as well.

//
// If the sequence of Seeks is interrupted by seeking to some other target
// key, then the iterator is free to discard anything done during Prepare.
virtual void Prepare(const std::vector<ScanOptions>* scan_opts) = 0;

Contributor

It seems like we should Prepare a range of ScanOptions rather than a vector, so that we don't have to materialize a vector of what's (potentially) relevant to a particular file.
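
One way to express that suggestion, sketched with stand-in types and assuming C++20 std::span is available (a (pointer, count) pair would serve the same purpose in C++17):

#include <span>

struct ScanOptionsSketch {};  // stand-in for the ScanOptions struct in this PR

class InternalIteratorSketch {
 public:
  virtual ~InternalIteratorSketch() = default;

  // Takes any contiguous sub-range of scan options, so a level/file iterator
  // can hand a table iterator only the scans that overlap that file without
  // materializing a fresh std::vector.
  virtual void Prepare(std::span<const ScanOptionsSketch> scan_opts) = 0;
};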

// }
// } catch (Status s) {
// }
class MultiScanIterator {

Contributor

I found it difficult to read the giant nesting of classes below to deduce the relationship between MultiScanIterator, ScanIterator, Scan, and SingleIterator. Could you add some diagram or hierarchy or elaboration to the code example to make that clear?

Also should MultiScanIterator just be called MultiScan since it doesn't behave like an STL or RocksDB iterator?

ReadOptions ro;
std::vector<ScanOptions> scan_options(
    {ScanOptions(key_ranges[0], key_ranges[1]),
     ScanOptions(key_ranges[2], key_ranges[3])});

Contributor

Should probably also test the "no limit" case and cases with overlap

Contributor

I see you added overlap, but I don't see a "no limit" case

Contributor Author

Sorry, forgot to git commit it.

// of key-value pairs. In the case of an ExternalTableReader, the weight is
// passed through to the table reader and the interpretation is up to the
// reader implementation.
uint64_t weight = 0;

Contributor

Is weight maybe coming back in some form?

Contributor Author

I'm leaving it to the user to specify it in the property_bag. It'll be an out of band contract between the application and the external table reader. I don't see any possible use for it inside RocksDB. My initial thought was to define it and leave the interpretation flexible, but now I'm thinking that doesn't really accomplish anything. If we ever want to implement a count based scan, we can add a specific option for it at that point.


// This structure encapsulates the result of NextAndGetResult()
struct IterateResult {
Slice key;

Contributor

It seems like we have an opportunity to do something like IOBuf in the handling of keys and values. In many/most cases we will have materialized a key, and maybe in some cases a value, that we should be able to hand off to the user to make use of as long as they want without deep copying. (Perhaps "unnecessary copying" is more of a CPU overhead for scans than "unnecessary virtual calls". Hmm.)

Probably not a simple change, but perhaps something for the TODO list.

Contributor Author

I don't think we do any unnecessary copying today. We just return the value Slice and the corresponding block is pinned. Is it the "make use of as long as they want" part that you're focusing on? Because we don't guarantee the pinning once the iterator is advanced.
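
A tiny sketch of the lifetime contract described here, with std::string_view standing in for Slice (an illustration of usage, not PR code): the returned slice is only valid while the backing block stays pinned, i.e. until the iterator advances, so a caller that needs the data longer copies it out first.

#include <string>
#include <string_view>

// Deep-copies a key (or value) slice into owned storage so it remains valid
// after the iterator advances and the backing block may be unpinned.
std::string RetainKey(std::string_view key_slice) {
  return std::string(key_slice);
}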

@facebook-github-bot

@anand1976 has updated the pull request. You must reimport the pull request before landing.

@anand1976 anand1976 requested a review from pdillinger April 2, 2025 17:09
@pdillinger (Contributor) left a comment

Overall LGTM

const ReadOptions& _read_options, ColumnFamilyHandle* column_family,
const std::vector<ScanOptions>& scan_opts) {
std::unique_ptr<Iterator> iter(NewIterator(_read_options, column_family));
iter->Prepare(scan_opts);
std::unique_ptr<MultiScanIterator> ms_iter(
new MultiScanIterator(scan_opts, std::move(iter)));
std::unique_ptr<MultiScan> ms_iter(new MultiScan(scan_opts, std::move(iter)));

Contributor

Nit: prefer make_unique


// This structure encapsulates the result of NextAndGetResult()
struct IterateResult {
Slice key;

Contributor

Probably worth a comment about limited lifetime of key.

const ReadOptions& /*options*/, ColumnFamilyHandle* /*column_family*/,
const std::vector<ScanOptions>& /*scan_opts*/) {
std::unique_ptr<Iterator> iter(NewErrorIterator(Status::NotSupported()));
std::unique_ptr<MultiScan> ms_iter(new MultiScan(std::move(iter)));

Contributor

Nit: prefer make_unique

// ---
// |
// ScanIterator <-- std::input_iterator (returns the KVs of a single
// scan range)

Contributor

👍👍


@@ -0,0 +1,223 @@
// Copyright (c) Meta Platforms, Inc. and affiliates.

Contributor

Nit: a better file name might now be multi_scan.h

@facebook-github-bot

@anand1976 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot

@anand1976 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot

@anand1976 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot

@anand1976 merged this pull request in 24e2b05.

anand1976 added a commit that referenced this pull request Apr 3, 2025
Summary:
A multi scan API for users to pass a set of scan ranges and have the table readers determine the optimal strategy for performing the scans. This might include coalescing of IOs across scans, for example. The requested scans should be in increasing key order. The scan start keys and other info are passed to NewMultiScanIterator, which in turn uses the newly added Prepare() interface in Iterator to update the iterator. The Prepare() takes a vector of ScanOptions, which contain the start keys and optional upper bounds, as well as user-defined parameters in the property_bag that are passed through as-is to external table readers.

The initial implementation plumbs this through to the ExternalTableReader. This PR also fixes an issue of premature destruction of the external table iterator after the first scan of the multi-scan. The `LevelIterator` treats an invalid iterator as a potential end of file and destroys the table iterator in order to move to the next file. To prevent that, this PR defines the `NextAndGetResult` interface that the external table iterator must implement. The result returned by `NextAndGetResult` differentiates between iterator invalidation due to out of bound vs end of file.

Eventually, I envision the `MultiScanIterator` to be built on top of a producer-consumer queue like container, with RocksDB (producer) enqueueing keys and values into the container and the application (consumer) dequeueing them. Unlike a traditional producer consumer queue, there is no concurrency here. The results will be buffered in the container, and when the buffer is empty a new batch will be read from the child iterators. This will allow the virtual function call overhead to be amortized over many entries.

TODO (in future PRs):
1. Update the internal implementation of Prepare to trim the ScanOptions range based on the intersection with the table key range, taking into consideration unbounded scans and opaque user-defined bounds.
2. Long term, take advantage of Prepare in BlockBasedTableIterator, at least for the upper bound case.

Pull Request resolved: #13473

Reviewed By: pdillinger

Differential Revision: D71447559

Pulled By: anand1976

fbshipit-source-id: 31668abb0c529aa1ac1738ae46c36cbddf9148f1
ybtsdst pushed a commit to ybtsdst/rocksdb that referenced this pull request Apr 27, 2025