Skip to content

Conversation

@austenLacy
Copy link
Contributor

TLDR

Related issue: #29906

This PR introduces an index backfill optimization to only read columns present in index definition different from the default behavior which is to read all columns. This optimization should help significantly for wide (i.e. many columns) tables, tables with large blob type columns (e.g. jsonb, text, etc) and reduces network i/o by only transmitting data required to create the index.

Details

Adds a new specialized scan function ybc_heap_beginscan_for_index_build() that uses IndexInfo to determine exactly which columns are needed and only requests those from DocDB.

The function identifies required columns from:

  1. Direct index columns (ii_IndexAttrNumbers) - columns directly referenced in the index key
  2. Expression index columns (ii_Expressions) - columns used in index expressions like (col1 + col2)
  3. Partial index predicates (ii_Predicate) - columns used in WHERE clauses like WHERE col1 > 50
  4. System columns - always includes ybctid (needed for index entry construction)

The feature is gated behind a new GUC yb_enable_index_backfill_column_projection.

Building and testing locally

Compile just postgres changes (much faster than full build)

./yb_build.sh release --target postgres --skip-java

Full build to run tests

./yb_build.sh release --skip-java

Test against a locally running cluster with SQL and expected output

# clean up any existing cluster if necessary
pkill -9 yb-master 2>/dev/null || true
pkill -9 yb-tserver 2>/dev/null || true
rm -rf /tmp/ybdata

# create single-node cluster
./bin/yb-ctl create --rf 1 --data_dir /tmp/ybdata \
  --master_flags "webserver_port=7001" \
  --tserver_flags "webserver_port=9001"

# run the test SQL and verify the output
./build/release-clang-dynamic-arm64/postgres/bin/ysqlsh \
  -h 127.0.0.1 -p 5433 -U yugabyte -d yugabyte \
  -f src/postgres/src/test/regress/sql/yb.orig.index_backfill_column_projection.sql

# clean up when done
./bin/yb-ctl destroy --data_dir /tmp/ybdata

Run java tests

# Run the full index regression test suite
./yb_build.sh release --java-test 'org.yb.pgsql.TestPgRegressIndex#schedule' --scb --sj

…n the index

instead of the entire row. Behind a GUC flag called yb_enable_index_backfill_column_projection.
@netlify
Copy link

netlify bot commented Jan 9, 2026

Deploy Preview for infallible-bardeen-164bc9 ready!

Built without sensitive environment variables

Name Link
🔨 Latest commit b39fb98
🔍 Latest deploy log https://app.netlify.com/projects/infallible-bardeen-164bc9/deploys/696166efd5688c000773825c
😎 Deploy Preview https://deploy-preview-29928--infallible-bardeen-164bc9.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

Comment on lines +1268 to +1274
/*
* For YB relations, use the optimized scan function that only
* fetches columns needed by the index. This avoids reading the
* entire row from DocDB during index build/backfill.
*/
uint32 flags = SO_TYPE_SEQSCAN | SO_ALLOW_PAGEMODE |
SO_ALLOW_STRAT;
Copy link
Contributor Author

@austenLacy austenLacy Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic was copied from tableam.h#table_beginscan_strat

* with many columns where only a few are indexed.
*/
TableScanDesc
ybc_heap_beginscan_for_index_build(Relation relation,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this duplicates code from ybcBeginScan

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to Jason's comment.

It would be good to call into ybcBeginScan directly, if possible.
ybcBeginScan populates the required_attrs for the scan from the targetlist of the param Scan *pg_scan_plan. Please check if constructing a pg_scan_plan object is feasible.
The target list can be populated using the code that you have below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refactored both functions to use some shared logic helpers in a8937a1. Let me know what you think.

@karthik-ramanathan-3006
Copy link
Contributor

Hi @austenLacy,

Thanks for working on this optimization! I'll be reviewing this PR.
Please feel free to reach out via Slack or Github in case you have any questions.

This optimization touches a very critical path in our code base, so the review process will be quite detailed.
Please bear with me! Thanks for your patience!

Copy link
Contributor

@karthik-ramanathan-3006 karthik-ramanathan-3006 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this optimization.

I took an initial pass at the PR, and have left some comments/suggestions.
A good way to test your code would be to run the following collection of unit tests:

./yb_build.sh --cxx-test pgwrapper_pg_index_backfill-test

You can run the above command both with and without your optimizations in order to test for regressions.

* rather than all columns from the base table.
* Default is false (beta feature).
*/
extern bool yb_enable_index_backfill_column_projection;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This flag can be defined as a GUC.
As an example, please take a look at yb_enable_inplace_index_update in guc.c (ref: guc.c)

The GUC can also be marked true, by default, for now.
When we get closed to merging the PR, we can make a decision on its default value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in b40157a

@@ -0,0 +1,133 @@
--
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for adding tests.
The way I am thinking about it, this optimization has two aspects -- performance and correctness:

  • Correctness: We want to ensure that we're fetching the correct columns as part of the scan
  • Performance: We want to validate that the data returned as part of the heap scan is minimized (when compared to the case where all columns are fetched by the scan).

Could you give a shot at implementing the latter?
The index backfill C++ test might be a good place for it (Ref: pg_index_backfill-test.cc)

This GUC might be useful to implement these tests: yb_fetch_size_limit
This GUC controls how much data is returned by a single scan. If we define the all the columns of the table being indexed to be of a fixed size, then we can calculate how many rows would be returned by the scan (with/without the optimization) and therefore how many round trips (RPCs) would be required to fetch all the matching rows from the table. The test could then assert that the number of RPCs has reduced by the expected number when the optimization is turned on.

Feel free to propose alternatives.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes a lot of sense. I can look into implementing a test like that.

* with many columns where only a few are indexed.
*/
TableScanDesc
ybc_heap_beginscan_for_index_build(Relation relation,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to Jason's comment.

It would be good to call into ybcBeginScan directly, if possible.
ybcBeginScan populates the required_attrs for the scan from the targetlist of the param Scan *pg_scan_plan. Please check if constructing a pg_scan_plan object is feasible.
The target list can be populated using the code that you have below.

@austenLacy
Copy link
Contributor Author

This optimization touches a very critical path in our code base, so the review process will be quite detailed.
Please bear with me! Thanks for your patience!

@karthik-ramanathan-3006 Thank you for your initial review! Totally understand this is a sensitive bit of code so no need to rush.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants