[RFC] Multi scan API #13473
@anand1976 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
    const ReadOptions& /*options*/, const std::vector<ScanDesc>& scans) {
  std::unique_ptr<Iterator> iter(NewErrorIterator(Status::NotSupported()));
  std::unique_ptr<MultiScanIterator> ms_iter(
      new MultiScanIterator(scans, std::move(iter)));
We are passing in the scans as `const &`, so I assume that we expect these ScanDescriptors to be available during the time of iteration?
I'm treating it like the `ReadOptions`. Lifetime is not guaranteed for the life of the iterator, so the iterator should cache them.
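The caching approach described in this reply can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's code: `OwnedScanDesc`, `CachingScanIterator`, and `MakeFromTemporary` are hypothetical names, and the descriptor here owns its key bytes as `std::string`. If the real descriptor holds non-owning views such as `Slice`, copying the descriptor alone would not be enough and the key bytes would need to be copied too.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for the PR's ScanDesc. It owns its key bytes, so a
// copy stays valid after the caller's vector is destroyed.
struct OwnedScanDesc {
  std::string start;                 // lower bound of the scan
  std::optional<std::string> limit;  // upper bound, if any
};

// Hypothetical iterator shell: takes the descriptors by const reference,
// like the API under review, but stores its own copy of them.
class CachingScanIterator {
 public:
  explicit CachingScanIterator(const std::vector<OwnedScanDesc>& scans)
      : scans_(scans) {}  // copies the vector, does not keep a reference

  std::size_t num_scans() const { return scans_.size(); }
  const std::string& start_of(std::size_t i) const {
    return scans_.at(i).start;
  }

 private:
  std::vector<OwnedScanDesc> scans_;  // cached copy
};

// Builds an iterator from a local vector that is destroyed on return.
inline CachingScanIterator MakeFromTemporary() {
  std::vector<OwnedScanDesc> scans = {
      OwnedScanDesc{"bar", std::string("baz")},
      OwnedScanDesc{"foo", std::nullopt}};
  return CachingScanIterator(scans);
}
```

The iterator returned by `MakeFromTemporary()` remains usable because it never refers back to the caller's vector.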
nit: wondering if `ScanOptions` would be a better name.
};

// An iterator that returns results from multiple scan ranges. The ranges are
// expected to be in increasing sorted order. The application on top of RocksDB
"ranges are expected to be in increasing sorted order".
I guess the ScanDescriptors as input don't need to be in sorted order, but `MultiScanIterator` will give the result in sorted order for the user? Asking because "foo" > "bar".
If the DB has keys ["bar", "baz", "foo", "quux"]
In your example, the iteration will give us "bar" -> "baz" -> "foo" -> "quux" -> "foo" -> "quux"?
Good catch. I'll fix the mistake.
    NextScan();
  }

  void Dequeue(Slice& key, PinnableSlice& value) {
We aren't retrieving from queued items, so I feel there could be a better name than `Dequeue()`, but cannot think of one. `Next(key, value)` doesn't sound right either. Wondering what @pdillinger thinks.
I welcome any suggestions here. I also thought of using the C++ generator concept and having nested iterators. But the user needs to check status, and that makes it a bit awkward.
- We could consider using exceptions in some specific places, not internally but narrowly in exposing results to public APIs.
- I'm skeptical that the virtual call overhead in the public API layers is that high, relative to vcall and other costs internally that are needed to get ordered entries across the various levels. I suspect the proposed future buffering could optimize regular iterators as well as these. More detail? Disagree?
- Let's think through the aspect of this attempting to maximize opportunities for overlapping scans to reuse each other's work. (Even if data ends up in the block cache, there can still be significant seeking work.) Imagine a concurrent tree data structure keeping track of pending latency-tolerant scan requests. We could have select read threads following a kind of elevator algorithm (always forward though) to serve up pending scan requests in key order, reusing work between overlapping requests. There should be more opportunities for reusing results vs. a multiscan API like this, because late-breaking requests can still be (in a sense) coalesced with pending requests when doing work at or near their range. The downside of course is concurrency management, feeding results with minimal blocking, etc. Thoughts?
> We could consider using exceptions in some specific places, not internally but narrowly in exposing results to public APIs.

Yeah, this would be a more elegant way. Looks like we're doing this already in some places (internally though) - https://github.com/search?q=repo%3Afacebook%2Frocksdb+%22throw+std%22&type=code. Should be fine to do it in public APIs.

> I'm skeptical that the virtual call overhead in the public API layers is that high, relative to vcall and other costs internally that are needed to get ordered entries across the various levels. I suspect the proposed future buffering could optimize regular iterators as well as these. More detail? Disagree?

I'm hoping to eliminate the vcall overhead in the internal iterators as well with the proposed buffering. So eventually we shouldn't be paying the vcall cost of `Next()`/`Valid()`/`status()`/`key()`/`value()` for every key. My rough idea is to have a buffer embedded in `LevelIterator` and `DBIter`, and shared and refilled in batches by the `BlockBasedTableIterator` and `MergingIterator` respectively. Might require some prototyping to find the best way to do it and get an idea of the perf improvement.
@anand1976 has updated the pull request. You must reimport the pull request before landing.
@jaykorean @pdillinger I updated the PR with a modified
That's an interesting idea. The synchronization and coordination across scans in different threads would add some overhead, as well as potentially increase scan latency. Whether it's a net win depends on the workload, I guess. That's what I'd be most worried about. The optimization I'm looking to enable with the multi-scan is for scans not necessarily overlapping in key range, but in pages. For example, two scans may include adjacent blocks that share the same 4KB page (or some multiple that we determine would benefit from coalescing). Also, I think the multiscan API is sort of orthogonal to what you propose, since it's just a way to capture the user intent.
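The page-level coalescing mentioned here could look roughly like the sketch below. It is only an illustration of the idea, not how the table readers do or will do it: `BlockRead`, `IoRequest`, and `CoalesceByPage` are hypothetical names, and the sketch assumes the block reads are already sorted by file offset, have nonzero size, and that `page_size` is nonzero.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical description of one block read: byte offset and length in a file.
struct BlockRead {
  uint64_t offset;
  uint64_t size;
};

// Hypothetical page-aligned I/O request issued to the file.
struct IoRequest {
  uint64_t offset;
  uint64_t size;
};

// Rounds each read out to page boundaries and merges reads whose page-aligned
// spans overlap, i.e. reads that share at least one page.
// Assumes `reads` is sorted by offset, sizes are nonzero, and page_size > 0.
inline std::vector<IoRequest> CoalesceByPage(
    const std::vector<BlockRead>& reads, uint64_t page_size) {
  std::vector<IoRequest> out;
  for (const BlockRead& r : reads) {
    const uint64_t begin = r.offset / page_size * page_size;
    const uint64_t end =
        (r.offset + r.size + page_size - 1) / page_size * page_size;
    if (!out.empty() && begin < out.back().offset + out.back().size) {
      // Shares a page with the previous request: extend it.
      const uint64_t merged_end =
          std::max(out.back().offset + out.back().size, end);
      out.back().size = merged_end - out.back().offset;
    } else {
      out.push_back(IoRequest{begin, end - begin});
    }
  }
  return out;
}
```

With a 4096-byte page, two blocks at offsets 0 and 3000 (1000 bytes each) fall in the same page and become one request, while a block at offset 9000 gets its own.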
  ScanIterator(const std::vector<ScanDesc>& scans, Iterator* db_iter)
      : scans_(scans), idx_(0), db_iter_(db_iter) {
    db_iter_->Seek(*scans_[idx_++].range.start);
Do we want to throw InvalidArg if `idx_ >= scans_.size()`?
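One way to read this suggestion is a guard in the constructor before the first `Seek`. The sketch below uses hypothetical stand-in types (`DemoScan`, `DemoDbIter`, `GuardedScanIterator`) rather than the PR's classes, and it throws `std::invalid_argument` purely for illustration; whether the real code should throw or report an invalid-argument status is the open question in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-ins, not RocksDB types.
struct DemoScan {
  std::string start;
};

struct DemoDbIter {
  std::string last_seek;
  void Seek(const std::string& target) { last_seek = target; }
};

class GuardedScanIterator {
 public:
  GuardedScanIterator(const std::vector<DemoScan>& scans, DemoDbIter* db_iter)
      : scans_(scans), idx_(0), db_iter_(db_iter) {
    // Reject an empty scan list before indexing into it.
    if (idx_ >= scans_.size()) {
      throw std::invalid_argument("at least one scan range is required");
    }
    db_iter_->Seek(scans_[idx_++].start);
  }

  std::size_t next_index() const { return idx_; }

 private:
  std::vector<DemoScan> scans_;
  std::size_t idx_;
  DemoDbIter* db_iter_;
};

// Returns true if constructing with no scans throws std::invalid_argument.
inline bool ThrowsOnEmpty() {
  DemoDbIter db;
  std::vector<DemoScan> none;
  try {
    GuardedScanIterator it(none, &db);
  } catch (const std::invalid_argument&) {
    return true;
  }
  return false;
}
```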
namespace ROCKSDB_NAMESPACE {

// Descriptor for a RocksDB scan request. Only forward scans for now.
nit: we can add UNDER CONSTRUCTION or EXPERIMENTAL
  class SingleIterator {
   public:
    using self_type = SingleIterator;
Question for my learning. What does this `using ...` do here?
These are part of the iterator_traits, if I'm not mistaken. They provide a uniform interface to access the iterator, so callers can do things like `for (auto iter : container)`, `std::find`, etc.
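As a rough illustration of what those member aliases enable, here is a small self-contained iterator that standard algorithms accept. `IntCursor` and `IndexOf` are hypothetical names for this sketch and are unrelated to the PR's `SingleIterator`; `self_type` is just a local convenience alias, while the other five are the ones `std::iterator_traits` looks for.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <type_traits>
#include <vector>

// Hypothetical forward iterator over a contiguous run of ints.
class IntCursor {
 public:
  using self_type = IntCursor;  // convenience alias, not used by the library
  // Member aliases that std::iterator_traits<IntCursor> picks up.
  using iterator_category = std::forward_iterator_tag;
  using value_type = int;
  using difference_type = std::ptrdiff_t;
  using pointer = const int*;
  using reference = const int&;

  IntCursor() = default;
  explicit IntCursor(const int* p) : p_(p) {}

  reference operator*() const { return *p_; }
  pointer operator->() const { return p_; }
  self_type& operator++() {
    ++p_;
    return *this;
  }
  self_type operator++(int) {
    self_type tmp = *this;
    ++p_;
    return tmp;
  }
  bool operator==(const self_type& o) const { return p_ == o.p_; }
  bool operator!=(const self_type& o) const { return p_ != o.p_; }

 private:
  const int* p_ = nullptr;
};

static_assert(
    std::is_same<std::iterator_traits<IntCursor>::value_type, int>::value,
    "iterator_traits reads value_type from the member alias");

// Uses std::find and std::distance, both of which rely on the aliases above.
inline std::ptrdiff_t IndexOf(const std::vector<int>& v, int x) {
  IntCursor first(v.data());
  IntCursor last(v.data() + v.size());
  IntCursor hit = std::find(first, last, x);
  return hit == last ? -1 : std::distance(first, hit);
}
```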
A multi scan API for users to pass a set of scan ranges and have the table readers determine the optimal strategy for performing the scans. This might include coalescing of IOs across scans, for example. The requested scans should be in increasing key order.
Eventually, I envision the `MultiScanIterator` to be built on top of a producer-consumer-queue-like container, with RocksDB (producer) enqueueing keys and values into the container and the application (consumer) dequeueing them. Unlike a traditional producer-consumer queue, there is no concurrency here. The results will be buffered in the container, and when the buffer is empty a new batch will be read from the child iterators. This will allow the virtual function call overhead to be amortized over many entries.
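The buffering idea in this description might be sketched as follows. Everything here is an assumption made for illustration: `BatchSource`, `VectorSource`, and `BufferedScan` are hypothetical names, keys and values are plain `std::string`, and status reporting is left out. The point is only that the consumer pays one virtual call per batch rather than per entry.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, std::string>;  // key, value

// Hypothetical child-iterator interface: one virtual call fills many entries.
class BatchSource {
 public:
  virtual ~BatchSource() = default;
  // Appends up to max_entries entries to *out.
  // Returns false if nothing was appended (source exhausted).
  virtual bool FillBatch(std::size_t max_entries, std::vector<Entry>* out) = 0;
};

// Hypothetical in-memory producer used to exercise the buffer.
class VectorSource : public BatchSource {
 public:
  explicit VectorSource(std::vector<Entry> data) : data_(std::move(data)) {}

  bool FillBatch(std::size_t max_entries, std::vector<Entry>* out) override {
    ++calls_;
    std::size_t n = 0;
    while (pos_ < data_.size() && n < max_entries) {
      out->push_back(data_[pos_++]);
      ++n;
    }
    return n > 0;
  }

  std::size_t calls() const { return calls_; }

 private:
  std::vector<Entry> data_;
  std::size_t pos_ = 0;
  std::size_t calls_ = 0;
};

// Single-threaded buffer: the consumer dequeues entries, and a refill from
// the source happens only when the buffer is empty.
class BufferedScan {
 public:
  BufferedScan(BatchSource* src, std::size_t batch_size)
      : src_(src), batch_size_(batch_size) {}

  // Returns false once the source has no more entries.
  bool Dequeue(std::string* key, std::string* value) {
    if (next_ == buf_.size()) {
      buf_.clear();
      next_ = 0;
      if (!src_->FillBatch(batch_size_, &buf_)) {
        return false;
      }
    }
    *key = std::move(buf_[next_].first);
    *value = std::move(buf_[next_].second);
    ++next_;
    return true;
  }

 private:
  BatchSource* src_;
  std::size_t batch_size_;
  std::vector<Entry> buf_;
  std::size_t next_ = 0;
};
```

With five entries and a batch size of two, the consumer makes six `Dequeue` calls (five successful, one returning false) while the source sees four `FillBatch` calls.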