-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Index backfill optimization to only read columns present in index definition #29928
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Index backfill optimization to only read columns present in index definition #29928
Conversation
…n the index instead of the entire row. Behind a GUC flag called yb_enable_index_backfill_column_projection.
✅ Deploy Preview for infallible-bardeen-164bc9 ready!Built without sensitive environment variables
To edit notification comments on pull requests, go to your Netlify project configuration. |
| /* | ||
| * For YB relations, use the optimized scan function that only | ||
| * fetches columns needed by the index. This avoids reading the | ||
| * entire row from DocDB during index build/backfill. | ||
| */ | ||
| uint32 flags = SO_TYPE_SEQSCAN | SO_ALLOW_PAGEMODE | | ||
| SO_ALLOW_STRAT; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic was copied from tableam.h#table_beginscan_strat
| * with many columns where only a few are indexed. | ||
| */ | ||
| TableScanDesc | ||
| ybc_heap_beginscan_for_index_build(Relation relation, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this duplicates code from ybcBeginScan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to Jason's comment.
It would be good to call into ybcBeginScan directly, if possible.
ybcBeginScan populates the required_attrs for the scan from the targetlist of the param Scan *pg_scan_plan. Please check if constructing a pg_scan_plan object is feasible.
The target list can be populated using the code that you have below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I refactored both functions to use some shared logic helpers in a8937a1. Let me know what you think.
|
Hi @austenLacy, Thanks for working on this optimization! I'll be reviewing this PR. This optimization touches a very critical path in our code base, so the review process will be quite detailed. |
karthik-ramanathan-3006
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this optimization.
I took an initial pass at the PR, and have left some comments/suggestions.
A good way to test your code would be to run the following collection of unit tests:
./yb_build.sh --cxx-test pgwrapper_pg_index_backfill-test
You can run the above command both with and without your optimizations in order to test for regressions.
| * rather than all columns from the base table. | ||
| * Default is false (beta feature). | ||
| */ | ||
| extern bool yb_enable_index_backfill_column_projection; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This flag can be defined as a GUC.
As an example, please take a look at yb_enable_inplace_index_update in guc.c (ref: guc.c)
The GUC can also be marked true, by default, for now.
When we get closed to merging the PR, we can make a decision on its default value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in b40157a
| @@ -0,0 +1,133 @@ | |||
| -- | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for adding tests.
The way I am thinking about it, this optimization has two aspects -- performance and correctness:
- Correctness: We want to ensure that we're fetching the correct columns as part of the scan
- Performance: We want to validate that the data returned as part of the heap scan is minimized (when compared to the case where all columns are fetched by the scan).
Could you give a shot at implementing the latter?
The index backfill C++ test might be a good place for it (Ref: pg_index_backfill-test.cc)
This GUC might be useful to implement these tests: yb_fetch_size_limit
This GUC controls how much data is returned by a single scan. If we define the all the columns of the table being indexed to be of a fixed size, then we can calculate how many rows would be returned by the scan (with/without the optimization) and therefore how many round trips (RPCs) would be required to fetch all the matching rows from the table. The test could then assert that the number of RPCs has reduced by the expected number when the optimization is turned on.
Feel free to propose alternatives.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes a lot of sense. I can look into implementing a test like that.
| * with many columns where only a few are indexed. | ||
| */ | ||
| TableScanDesc | ||
| ybc_heap_beginscan_for_index_build(Relation relation, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to Jason's comment.
It would be good to call into ybcBeginScan directly, if possible.
ybcBeginScan populates the required_attrs for the scan from the targetlist of the param Scan *pg_scan_plan. Please check if constructing a pg_scan_plan object is feasible.
The target list can be populated using the code that you have below.
@karthik-ramanathan-3006 Thank you for your initial review! Totally understand this is a sensitive bit of code so no need to rush. |
…to explicitly define the inclusion of key and non-key index columns
TLDR
Related issue: #29906
This PR introduces an index backfill optimization to only read columns present in index definition different from the default behavior which is to read all columns. This optimization should help significantly for wide (i.e. many columns) tables, tables with large blob type columns (e.g. jsonb, text, etc) and reduces network i/o by only transmitting data required to create the index.
Details
Adds a new specialized scan function
ybc_heap_beginscan_for_index_build()that usesIndexInfoto determine exactly which columns are needed and only requests those from DocDB.The function identifies required columns from:
ii_IndexAttrNumbers) - columns directly referenced in the index keyii_Expressions) - columns used in index expressions like(col1 + col2)ii_Predicate) - columns used in WHERE clauses likeWHERE col1 > 50ybctid(needed for index entry construction)The feature is gated behind a new GUC
yb_enable_index_backfill_column_projection.Building and testing locally
Compile just postgres changes (much faster than full build)
Full build to run tests
Test against a locally running cluster with SQL and expected output
Run java tests