Index backfill optimization to only read columns present in index definition #29928

austenLacy · 2026-01-09T20:37:00Z

TLDR

Related issue: #29906

This PR introduces an index backfill optimization that reads only the columns present in the index definition, instead of the default behavior of reading all columns. This should help significantly for wide tables (i.e. many columns) and for tables with large columns (e.g. jsonb, text), and it reduces network I/O by transmitting only the data required to create the index.

Details

Adds a new specialized scan function ybc_heap_beginscan_for_index_build() that uses IndexInfo to determine exactly which columns are needed and only requests those from DocDB.

The function identifies required columns from:

Direct index columns (ii_IndexAttrNumbers) - columns directly referenced in the index key
Expression index columns (ii_Expressions) - columns used in index expressions like (col1 + col2)
Partial index predicates (ii_Predicate) - columns used in WHERE clauses like WHERE col1 > 50
System columns - always includes ybctid (needed for index entry construction)

The feature is gated behind a new GUC yb_enable_index_backfill_column_projection.
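
As a rough illustration of the column selection described above, here is a minimal, self-contained model in C. It is a sketch, not the PR's code: `IndexColumnsModel`, `add_attrs`, and `required_columns` are hypothetical names, the expression and predicate columns are given as plain arrays (in PostgreSQL they are node trees that have to be walked), and ybctid is assumed to be fetched separately as a system column, so it is not part of the mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ATTRS 64            /* limit of this model only, not of YugabyteDB */

/* Hypothetical, simplified stand-in for the IndexInfo fields listed above. */
typedef struct IndexColumnsModel
{
    const int  *key_attnums;    /* like ii_IndexAttrNumbers; 0 = expression column */
    size_t      n_key;
    const int  *expr_attnums;   /* columns referenced by ii_Expressions */
    size_t      n_expr;
    const int  *pred_attnums;   /* columns referenced by ii_Predicate */
    size_t      n_pred;
} IndexColumnsModel;

/* Set one bit per 1-based attribute number, skipping expression slots (0). */
static uint64_t
add_attrs(uint64_t mask, const int *attnums, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        if (attnums[i] == 0)
            continue;
        assert(attnums[i] >= 1 && attnums[i] <= MAX_ATTRS);
        mask |= UINT64_C(1) << (attnums[i] - 1);
    }
    return mask;
}

/*
 * Return a bitmask of user columns the backfill scan should request.
 * With the gate off, every column of the table is requested (old behavior).
 */
uint64_t
required_columns(const IndexColumnsModel *idx, int natts, bool projection_enabled)
{
    uint64_t    mask = 0;

    assert(natts >= 1 && natts <= MAX_ATTRS);
    if (!projection_enabled)
        return natts == MAX_ATTRS ? UINT64_MAX : (UINT64_C(1) << natts) - 1;

    mask = add_attrs(mask, idx->key_attnums, idx->n_key);
    mask = add_attrs(mask, idx->expr_attnums, idx->n_expr);
    mask = add_attrs(mask, idx->pred_attnums, idx->n_pred);
    return mask;
}
```

For an index on `(col2, (col4 + col5)) WHERE col2 > 50 AND col7 IS NOT NULL` over a 10-column table, this model selects columns 2, 4, 5, and 7 when the gate is on and all 10 columns when it is off.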

Building and testing locally

Compile just postgres changes (much faster than full build)

./yb_build.sh release --target postgres --skip-java

Full build to run tests

./yb_build.sh release --skip-java

Test against a locally running cluster with SQL and expected output

# clean up any existing cluster if necessary
pkill -9 yb-master 2>/dev/null || true
pkill -9 yb-tserver 2>/dev/null || true
rm -rf /tmp/ybdata

# create single-node cluster
./bin/yb-ctl create --rf 1 --data_dir /tmp/ybdata \
  --master_flags "webserver_port=7001" \
  --tserver_flags "webserver_port=9001"

# run the test SQL and verify the output
./build/release-clang-dynamic-arm64/postgres/bin/ysqlsh \
  -h 127.0.0.1 -p 5433 -U yugabyte -d yugabyte \
  -f src/postgres/src/test/regress/sql/yb.orig.index_backfill_column_projection.sql

# clean up when done
./bin/yb-ctl destroy --data_dir /tmp/ybdata

Run java tests

# Run the full index regression test suite
./yb_build.sh release --java-test 'org.yb.pgsql.TestPgRegressIndex#schedule' --scb --sj

netlify · 2026-01-09T20:38:46Z

✅ Deploy Preview for infallible-bardeen-164bc9 ready!

Built without sensitive environment variables

Name	Link
🔨 Latest commit	`b39fb98`
🔍 Latest deploy log	https://app.netlify.com/projects/infallible-bardeen-164bc9/deploys/696166efd5688c000773825c
😎 Deploy Preview	https://deploy-preview-29928--infallible-bardeen-164bc9.netlify.app

austenLacy · 2026-01-09T20:46:33Z

src/postgres/src/backend/access/heap/heapam_handler.c

+			/*
+			 * For YB relations, use the optimized scan function that only
+			 * fetches columns needed by the index. This avoids reading the
+			 * entire row from DocDB during index build/backfill.
+			 */
+			uint32		flags = SO_TYPE_SEQSCAN | SO_ALLOW_PAGEMODE |
+								SO_ALLOW_STRAT;


This logic was copied from tableam.h#table_beginscan_strat
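
For readers without the source at hand, upstream `table_beginscan_strat` builds its flags conditionally from two booleans. The sketch below models that shape in isolation. The `DEMO_` constants and `demo_strat_scan_flags` are hypothetical names with illustrative bit values (the real `ScanOptions` values live in `access/tableam.h` and differ by PostgreSQL version), and since the hunk above is truncated, this does not claim to show what the PR does with `allow_sync`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values only; not the real ScanOptions enum. */
enum
{
    DEMO_SO_TYPE_SEQSCAN   = 1 << 0,
    DEMO_SO_ALLOW_STRAT    = 1 << 1,
    DEMO_SO_ALLOW_SYNC     = 1 << 2,
    DEMO_SO_ALLOW_PAGEMODE = 1 << 3
};

/* Mirror of the flag-building pattern in table_beginscan_strat. */
uint32_t
demo_strat_scan_flags(bool allow_strat, bool allow_sync)
{
    uint32_t    flags = DEMO_SO_TYPE_SEQSCAN | DEMO_SO_ALLOW_PAGEMODE;

    if (allow_strat)
        flags |= DEMO_SO_ALLOW_STRAT;   /* permit a bulk-read buffer strategy */
    if (allow_sync)
        flags |= DEMO_SO_ALLOW_SYNC;    /* permit synchronized seqscans */
    return flags;
}
```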

jasonyb · 2026-01-09T21:12:35Z

src/postgres/src/backend/access/yb_access/yb_scan.c

+ * with many columns where only a few are indexed.
+ */
+TableScanDesc
+ybc_heap_beginscan_for_index_build(Relation relation,


this duplicates code from ybcBeginScan

I refactored both functions to use some shared logic helpers in a8937a1. Let me know what you think.

karthik-ramanathan-3006 · 2026-01-12T15:01:08Z

Hi @austenLacy,

Thanks for working on this optimization! I'll be reviewing this PR.
Please feel free to reach out via Slack or GitHub in case you have any questions.

This optimization touches a very critical path in our code base, so the review process will be quite detailed.
Please bear with me! Thanks for your patience!

karthik-ramanathan-3006

Thanks for working on this optimization.

I took an initial pass at the PR, and have left some comments/suggestions.
A good way to test your code would be to run the following collection of unit tests:

./yb_build.sh --cxx-test pgwrapper_pg_index_backfill-test

You can run the above command both with and without your optimizations in order to test for regressions.

karthik-ramanathan-3006 · 2026-01-12T15:06:04Z

src/postgres/src/include/pg_yb_utils.h

+ * rather than all columns from the base table.
+ * Default is false (beta feature).
+ */
+extern bool yb_enable_index_backfill_column_projection;


This flag can be defined as a GUC.
As an example, please take a look at yb_enable_inplace_index_update in guc.c (ref: guc.c)

The GUC can also default to true for now.
When we get closer to merging the PR, we can make a decision on its default value.

Added in b40157a

karthik-ramanathan-3006 · 2026-01-12T15:17:46Z

src/postgres/src/test/regress/sql/yb.orig.index_backfill_column_projection.sql

@@ -0,0 +1,133 @@
+--


Thank you for adding tests.
The way I am thinking about it, this optimization has two aspects -- performance and correctness:

Correctness: We want to ensure that we're fetching the correct columns as part of the scan

Performance: We want to validate that the data returned as part of the heap scan is minimized (when compared to the case where all columns are fetched by the scan).

Could you take a shot at implementing the latter?
The index backfill C++ test might be a good place for it (Ref: pg_index_backfill-test.cc)

This GUC might be useful to implement these tests: yb_fetch_size_limit
This GUC controls how much data is returned by a single scan response. If we define all the columns of the table being indexed to be of a fixed size, we can calculate how many rows each response would return (with and without the optimization), and therefore how many round trips (RPCs) are required to fetch all the matching rows from the table. The test could then assert that the number of RPCs drops by the expected amount when the optimization is turned on.
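
The arithmetic behind that idea can be sketched as follows. This is a simplified model, not the test itself: `rows_per_fetch` and `expected_rpcs` are hypothetical helpers, and the model assumes a single scan stream, fixed-width rows, no per-row encoding overhead, and no separate row-count limit, so a real test would need some slack around the expected numbers.

```c
#include <assert.h>
#include <stdint.h>

/* Rows that fit in one response under a byte limit (at least one row). */
uint64_t
rows_per_fetch(uint64_t fetch_limit_bytes, uint64_t row_bytes)
{
    uint64_t    n;

    assert(row_bytes > 0);
    n = fetch_limit_bytes / row_bytes;
    return n == 0 ? 1 : n;
}

/* Round trips needed to stream total_rows rows: ceil(total_rows / rows per fetch). */
uint64_t
expected_rpcs(uint64_t total_rows, uint64_t fetch_limit_bytes, uint64_t row_bytes)
{
    uint64_t    per_fetch = rows_per_fetch(fetch_limit_bytes, row_bytes);

    return (total_rows + per_fetch - 1) / per_fetch;
}
```

With a 1 MiB limit and 10,000 rows, a 1024-byte full row gives 1024 rows per response and 10 round trips, while a 16-byte projected row fits all rows in a single response. A test built this way could assert on the difference between the two counts.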

Feel free to propose alternatives.

That makes a lot of sense. I can look into implementing a test like that.

karthik-ramanathan-3006 · 2026-01-12T15:46:48Z

src/postgres/src/backend/access/yb_access/yb_scan.c

+ * with many columns where only a few are indexed.
+ */
+TableScanDesc
+ybc_heap_beginscan_for_index_build(Relation relation,


+1 to Jason's comment.

It would be good to call into ybcBeginScan directly, if possible.
ybcBeginScan populates the required_attrs for the scan from the targetlist of the param Scan *pg_scan_plan. Please check if constructing a pg_scan_plan object is feasible.
The target list can be populated using the code that you have below.

austenLacy · 2026-01-12T15:57:54Z

This optimization touches a very critical path in our code base, so the review process will be quite detailed.
Please bear with me! Thanks for your patience!

@karthik-ramanathan-3006 Thank you for your initial review! Totally understand this is a sensitive bit of code so no need to rush.

Adds optimization in index backfill that only reads columns present i…

b39fb98

…n the index instead of the entire row. Behind a GUC flag called yb_enable_index_backfill_column_projection.

karthik-ramanathan-3006 self-requested a review January 12, 2026 14:46

karthik-ramanathan-3006 requested changes Jan 12, 2026

austenLacy added 4 commits January 12, 2026 14:07

set exec_params regardless of index scan type

8be0002

configure yb_enable_index_backfill_column_projection properly as a GUC

b40157a

add clarity to ybc_heap_beginscan_for_index_build func documentation …

fb424ac

…to explicitly define the inclusion of key and non-key index columns

DRY code between ybcBeginScan and ybc_heap_beginscan_for_index_build

a8937a1
