Multi scan API #13473
Conversation
@anand1976 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@anand1976 has updated the pull request. You must reimport the pull request before landing.
@jaykorean @pdillinger I updated the PR with a modified
That's an interesting idea. The synchronization and coordination across scans in different threads would add some overhead, as well as potentially increase scan latency. Whether it's a net win depends on the workload, I guess. That's what I'd be most worried about. The optimization I'm looking to enable with the multi-scan is for scans not necessarily overlapping in key range, but in pages. For example, 2 scans may include adjacent blocks that share the same 4KB page (or some multiple that we determine would benefit from coalescing). Also, I think the multiscan API is sort of orthogonal to what you propose, since it's just a way to capture the user intent.
@anand1976 has updated the pull request. You must reimport the pull request before landing.
3 similar comments
@anand1976 has updated the pull request. You must reimport the pull request before landing.
1 similar comment
@anand1976 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@anand1976 has updated the pull request. You must reimport the pull request before landing.
// We may add other options such as prefix scan in the future.
struct ScanOptions {
  // The scan range. start is mandatory to set, limit is optional
  RangeOpt range;
Is RangeOpt the right thing to use here? Or should we define something with an optional string for start and limit? If ScanOptions is cached in the iterator, we should probably do a deep copy if using the former.
Another thing to think about - for internal iterators, start key should be the internal key whereas RangeOpt has the user key. Maybe we need an InternalRangeOpt for internal usage while using RangeOpt in the API.
> Is RangeOpt the right thing to use here?

Probably yes until SliceBound is available for iterator bounds. However, I'm not clear on the use case for no limit with MultiScan, at least multiple ranges with no limit.
> Maybe we need an InternalRangeOpt for internal usage

Probably a good idea. Potentially relevant to some uses of MaybeAddTimestampsToRange.
> However, I'm not clear on the use case for no limit with MultiScan, at least multiple ranges with no limit.

It could be some custom termination condition specified in property_bag (I'm thinking weight could be one of those instead of having an explicit field).
@anand1976 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@anand1976 has updated the pull request. You must reimport the pull request before landing.
1 similar comment
I believe the PR description could use an update.
A lot of my feedback could be delayed to later PRs, but looking for a bit of cleanup/clarification.
namespace ROCKSDB_NAMESPACE {

#if 0
Remove?
Yes. Just left it here for the review as an alternative API. I can remove it now.
explicit MultiScanIterator(std::unique_ptr<Iterator>&& db_iter)
    : db_iter_(std::move(db_iter)) {}
I believe all of these Iterators wrapping a db Iterator should be made non-copyable to avoid side effects across copies.
Or maybe they do need to be copyable. Anyway, I don't want them to accidentally sometimes / maybe work with various STL algorithms when they should ideally fail to compile. Do they just satisfy LegacyInputIterator?
I don't know if LegacyInputIterator needs to have a copy constructor. It should be reasonable to make it non-copyable.
  return *this;
}

bool operator==(ScanIterator other) const { return idx_ == other.idx_; }
For this and SingleIterator, the operator== / operator!= semantics I believe are potentially hazardous. Can we take advantage of this?

> As of C++17, the types of the begin-expr and the end-expr do not have to be the same, and in fact the type of the end-expr does not have to be an iterator: it just needs to be able to be compared for inequality with one.

So how about we use std::nullptr_t (i.e. nullptr) as the end() value for these types of iterators?
Interesting. Let me try that.
void Prepare(const std::vector<ScanOptions>* scan_opts) override {
  scan_opts_ = scan_opts;
  if (file_iter_.iter()) {
    file_iter_.Prepare(scan_opts_);
So far this is just plumbing that is not taken advantage of, correct?
Yes. Main purpose is to plumb it through to the external table reader. Eventually we'll want to take advantage of it in block based table as well.
include/rocksdb/external_table.h (outdated):

//
// If the sequence of Seeks is interrupted by seeking to some other target
// key, then the iterator is free to discard anything done during Prepare.
virtual void Prepare(const std::vector<ScanOptions>* scan_opts) = 0;
It seems like we should Prepare a range of ScanOptions rather than a vector, so that we don't have to materialize a vector of what's (potentially) relevant to a particular file.
//   }
// } catch (Status s) {
// }
class MultiScanIterator {
I found it difficult to read the giant nesting of classes below to deduce the relationship between MultiScanIterator, ScanIterator, Scan, and SingleIterator. Could you add some diagram or hierarchy or elaboration to the code example to make that clear?
Also, should MultiScanIterator just be called MultiScan, since it doesn't behave like an STL or RocksDB iterator?
ReadOptions ro;
std::vector<ScanOptions> scan_options(
    {ScanOptions(key_ranges[0], key_ranges[1]),
     ScanOptions(key_ranges[2], key_ranges[3])});
Should probably also test the "no limit" case and cases with overlap
I see you added overlap, but I don't see a "no limit" case
Sorry, forgot to git commit it.
// of key-value pairs. In the case of an ExternalTableReader, the weight is
// passed through to the table reader and the interpretation is up to the
// reader implementation.
uint64_t weight = 0;
Is weight maybe coming back in some form?
I'm leaving it to the user to specify it in the property_bag. It'll be an out of band contract between the application and the external table reader. I don't see any possible use for it inside RocksDB. My initial thought was to define it and leave the interpretation flexible, but now I'm thinking that doesn't really accomplish anything. If we ever want to implement a count based scan, we can add a specific option for it at that point.
// This structure encapsulates the result of NextAndGetResult()
struct IterateResult {
  Slice key;
It seems like we have an opportunity to do something like IOBuf in the handling of keys and values. In many/most cases we will have materialized a key, and maybe in some cases a value, that we should be able to hand off to the user to make use of as long as they want without deep copying. (Perhaps "unnecessary copying" is more of a CPU overhead for scans than "unnecessary virtual calls". Hmm.)
Probably not a simple change, but perhaps something for the TODO list.
I don't think we do any unnecessary copying today. We just return the value Slice and the corresponding block is pinned. Is it the "make use of as long as they want" part that you're focusing on? Because we don't guarantee the pinning once the iterator is advanced.
@anand1976 has updated the pull request. You must reimport the pull request before landing.
Overall LGTM
db/db_impl/db_impl.cc (outdated):

    const ReadOptions& _read_options, ColumnFamilyHandle* column_family,
    const std::vector<ScanOptions>& scan_opts) {
  std::unique_ptr<Iterator> iter(NewIterator(_read_options, column_family));
  iter->Prepare(scan_opts);
-  std::unique_ptr<MultiScanIterator> ms_iter(
-      new MultiScanIterator(scan_opts, std::move(iter)));
+  std::unique_ptr<MultiScan> ms_iter(new MultiScan(scan_opts, std::move(iter)));
Nit: prefer make_unique
// This structure encapsulates the result of NextAndGetResult()
struct IterateResult {
  Slice key;
Probably worth a comment about limited lifetime of key.
include/rocksdb/db.h (outdated):

    const ReadOptions& /*options*/, ColumnFamilyHandle* /*column_family*/,
    const std::vector<ScanOptions>& /*scan_opts*/) {
  std::unique_ptr<Iterator> iter(NewErrorIterator(Status::NotSupported()));
  std::unique_ptr<MultiScan> ms_iter(new MultiScan(std::move(iter)));
Nit: prefer make_unique
//      ---
//       |
// ScanIterator <-- std::input_iterator (returns the KVs of a single
//                  scan range)
👍👍
// Copyright (c) Meta Platforms, Inc. and affiliates.
Nit: a better file name might now be multi_scan.h
@anand1976 has updated the pull request. You must reimport the pull request before landing.
@anand1976 has updated the pull request. You must reimport the pull request before landing.
@anand1976 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@anand1976 merged this pull request in 24e2b05.
Summary: A multi scan API for users to pass a set of scan ranges and have the table readers determine the optimal strategy for performing the scans. This might include coalescing of IOs across scans, for example. The requested scans should be in increasing key order. The scan start keys and other info are passed to NewMultiScanIterator, which in turn uses the newly added Prepare() interface in Iterator to update the iterator. Prepare() takes a vector of ScanOptions, which contain the start keys and optional upper bounds, as well as user-defined parameters in the property_bag that are passed through as-is to external table readers.

The initial implementation plumbs this through to the ExternalTableReader. This PR also fixes an issue of premature destruction of the external table iterator after the first scan of the multi-scan. The `LevelIterator` treats an invalid iterator as a potential end of file and destroys the table iterator in order to move to the next file. To prevent that, this PR defines the `NextAndGetResult` interface that the external table iterator must implement. The result returned by `NextAndGetResult` differentiates between iterator invalidation due to out of bound vs end of file.

Eventually, I envision the `MultiScanIterator` being built on top of a producer-consumer-queue-like container, with RocksDB (producer) enqueueing keys and values into the container and the application (consumer) dequeueing them. Unlike a traditional producer-consumer queue, there is no concurrency here. The results will be buffered in the container, and when the buffer is empty a new batch will be read from the child iterators. This will allow the virtual function call overhead to be amortized over many entries.

TODO (in future PRs):
1. Update the internal implementation of Prepare to trim the ScanOptions range based on the intersection with the table key range, taking into consideration unbounded scans and opaque user-defined bounds.
2. Long term, take advantage of Prepare in BlockBasedTableIterator, at least for the upper bound case.

Pull Request resolved: #13473
Reviewed By: pdillinger
Differential Revision: D71447559
Pulled By: anand1976
fbshipit-source-id: 31668abb0c529aa1ac1738ae46c36cbddf9148f1