[YSQL]  Degraded Performance for IN Queries on Hash Columns for tables with High Tablet Counts

Jira Link: [DB-21357](https://yugabyte.atlassian.net/browse/DB-21357)
### Description

YugabyteDB supports batching for IN queries on hash keys; however, this mechanism is bypassed when tables have a high tablet count resulting degraded performance.

**Example to show the Batching on Hash Column**

**Setup:**

```
create table foo_3_tablets  (h int, v1 int, v2 int, v3 int, primary key (h hash)) split into 3 tablets;
insert into foo_3_tablets select i,i,i,i from generate_series(1,100000) as i;
```

The following example demonstrates how batching optimizes performance by allowing a single read request to fetch all necessary rows from every tablet.

**explain (analyze, dist, timing off)  select * from foo_3_tablets where h in (1,2,3,4,5,6);**
```
                                                  QUERY PLAN

-----------------------------------------------------------------------------------------------
 Index Scan using foo_3_tablets_pkey on foo_3_tablets  (cost=0.00..4.12 rows=1 width=16) (actual rows=6 loops
=1)
   Index Cond: (h = ANY ('{1,2,3,4,5,6}'::integer[]))
   Storage Table Read Requests: 1
   Storage Table Read Ops: 3
   Storage Table Rows Scanned: 6
 Planning Time: 0.060 ms
 Execution Time: 0.713 ms

```

**Example which shows that Batching on Hash Column is bypassed  with high tablet count** 

**Setup** 

```
create table foo_144_tablets (h int primary key , v1 int, v2 int, v3 int) split into 144 tablets;
insert into foo_144_tablets select i,i,i,i from generate_series(1,100000) as i;
```

Following query issues 256 Read Request, i.e  one Read Request for each value in the IN clause thus resulting in much higher latency.

**EXPLAIN (ANALYZE, DIST, TIMING OFF) SELECT * FROM foo_144_tablets WHERE
h IN (SELECT generate_series(1, 256));**

```
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------
 YB Batched Nested Loop Join  (cost=27.55..69.07 rows=256 width=16) (actual rows=256 loops=1)
   Join Filter: (foo_144_tablets.h = (generate_series(1, 256)))
   ->  HashAggregate  (cost=4.50..6.50 rows=200 width=4) (actual rows=256 loops=1)
         Group Key: generate_series(1, 256)
         Batches: 1  Memory Usage: 61kB
         ->  ProjectSet  (cost=0.00..1.30 rows=256 width=4) (actual rows=256 loops=1)
               ->  Result *RESULT*  (cost=0.00..0.01 rows=1 width=0) (actual rows=1 loops=1)
   ->  Index Scan using foo_144_tablets_pkey on foo_144_tablets  (cost=23.05..62.56 rows=200 width=16) (act
ual rows=256 loops=1)
         Index Cond: (h = ANY (ARRAY[(generate_series(1, 256)), $1, $2, ..., $1023]))
         Storage Table Read Requests: 256
         Storage Table Read Ops: 256
         Storage Table Rows Scanned: 256
 Planning Time: 0.701 ms
 Execution Time: 663.094 ms
```

**Mitigation :** 
Increase the `work_mem` to higher value.  Based on the number of tablets this can be increased to higher value to avoid the regressions.
Default value for `work_mem` is 4MB .


**Batching (efficient) Execution with higher `work_mem`** , One read request  retrieves all relevant rows across every tablet.

```
set work_mem = '8MB';

EXPLAIN (ANALYZE, DIST, TIMING OFF) SELECT * FROM foo_144_tablets WHERE h IN (SELECT generate_series(1, 256));
                                                           QUERY PLAN

-----------------------------------------------------------------------------------------------------------
----------------------
 YB Batched Nested Loop Join  (cost=27.55..69.07 rows=256 width=16) (actual rows=256 loops=1)
   Join Filter: (foo_144_tablets.h = (generate_series(1, 256)))
   ->  HashAggregate  (cost=4.50..6.50 rows=200 width=4) (actual rows=256 loops=1)
         Group Key: generate_series(1, 256)
         Batches: 1  Memory Usage: 61kB
         ->  ProjectSet  (cost=0.00..1.30 rows=256 width=4) (actual rows=256 loops=1)
               ->  Result *RESULT*  (cost=0.00..0.01 rows=1 width=0) (actual rows=1 loops=1)
   ->  Index Scan using foo_144_tablets_pkey on foo_144_tablets  (cost=23.05..62.56 rows=200 width=16) (act
ual rows=256 loops=1)
         Index Cond: (h = ANY (ARRAY[(generate_series(1, 256)), $1, $2, ..., $1023]))
         Storage Table Read Requests: 1
         Storage Table Read Ops: 126
         Storage Table Rows Scanned: 256
 Planning Time: 0.638 ms
 Execution Time: 11.444 ms

```


### Issue Type

kind/bug

### Warning: Please confirm that this issue does not contain any sensitive information

- [x] I confirm this issue does not contain any sensitive information.

[DB-21357]: https://yugabyte.atlassian.net/browse/DB-21357?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[YSQL] Degraded Performance for IN Queries on Hash Columns for tables with High Tablet Counts #31565

Description

Issue Type

Warning: Please confirm that this issue does not contain any sensitive information

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[YSQL] Degraded Performance for IN Queries on Hash Columns for tables with High Tablet Counts #31565

Description

Description

Issue Type

Warning: Please confirm that this issue does not contain any sensitive information

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions