[Enhancement] Improve Pulsar Broker cache defaults to get better out-of-the-box performance

### Search before asking

- [X] I searched in the [issues](https://github.com/apache/pulsar/issues) and found nothing similar.


Mailing list discussion thread: https://lists.apache.org/thread/5od69114jfrzo9dkbllxycq8o7ns341y

### Motivation

The default Pulsar broker cache settings are not optimal, so tuning the broker cache is crucial. Besides degrading performance for Pulsar use cases, the defaults waste CPU and cause unnecessary network transfer costs in cloud environments.
Tuning the Pulsar broker cache improves performance and reduces costs, especially with high fan-out use cases, Key_Shared subscriptions, and tiered storage.

### Solution

The following settings would make better defaults:

- `maxMessagePublishBufferSizeInMB` - not broker cache related, but it needs an explicit value when fine-tuning broker cache settings so that direct memory OOM can be avoided. The default is 50% of direct memory. Set to `500`.
- `managedLedgerCacheSizeMB` - the default is 20% of direct memory. It's better to set an explicit value to avoid direct memory OOM. Set to `512`.
- `managedLedgerMaxReadsInFlightSizeInMB` - this feature is disabled by default. It's useful for avoiding direct memory OOM, which is a known issue with the default `dispatcherDispatchMessagesInSubscriptionThread=true` setting unless `managedLedgerMaxReadsInFlightSizeInMB` is set. Set to `500`. The value should be higher than `dispatcherMaxReadBatchSize` * `maxMessageSize`.
- `managedLedgerCacheEvictionTimeThresholdMillis` - the default `1000` is too low. Set to `10000`.
- `managedLedgerCacheEvictionIntervalMs` - the default `10` is too low. Set to `5000` to avoid spending a lot of CPU on cache eviction.
- `managedLedgerMinimumBacklogCursorsForCaching` - the default `0` disables caching for backlog cursors (catch-up reads). Set to `3`.
- `managedLedgerMinimumBacklogEntriesForCaching` - the default `1000` is far too high. Set to `1`.
- `managedLedgerMaxBacklogBetweenCursorsForCaching` - the default `10000` is far too low. Set to `2147483647` to disable the limit completely.
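
To illustrate why explicit sizes help, here is a rough sketch that compares the percentage-based defaults with the proposed fixed values for a given direct memory size. The helper names (`default_sizes_mb`, `explicit_budget_mb`, `fits_in_direct_memory`) are hypothetical. Treating the sum of the three limits as a memory budget is a simplifying assumption, since other direct memory consumers in the broker are ignored.

```python
# Proposed explicit values from the list above, in MB.
EXPLICIT_MB = {
    "maxMessagePublishBufferSizeInMB": 500,
    "managedLedgerCacheSizeMB": 512,
    "managedLedgerMaxReadsInFlightSizeInMB": 500,
}


def default_sizes_mb(direct_memory_mb: int) -> dict:
    """Sizes the percentage-based defaults resolve to (50% and 20%)."""
    return {
        "maxMessagePublishBufferSizeInMB": direct_memory_mb * 50 // 100,
        "managedLedgerCacheSizeMB": direct_memory_mb * 20 // 100,
    }


def explicit_budget_mb(settings: dict = EXPLICIT_MB) -> int:
    """Sum of the explicit limits (assumption: they add up independently)."""
    return sum(settings.values())


def fits_in_direct_memory(direct_memory_mb: int, settings: dict = EXPLICIT_MB) -> bool:
    """Rough check only: other direct memory users are not accounted for."""
    return explicit_budget_mb(settings) <= direct_memory_mb


if __name__ == "__main__":
    for direct_memory_mb in (1024, 2048, 4096):
        print(direct_memory_mb, default_sizes_mb(direct_memory_mb),
              explicit_budget_mb(), fits_in_direct_memory(direct_memory_mb))
```

Under these assumptions the three explicit limits add up to 1512 MB, so a broker with less direct memory than that would likely need smaller values.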

Sample settings for broker cache tuning:

YAML format:
```yaml
  maxMessagePublishBufferSizeInMB: 500
  managedLedgerCacheSizeMB: 512
  managedLedgerMaxReadsInFlightSizeInMB: 500
  managedLedgerCacheEvictionTimeThresholdMillis: 10000
  managedLedgerCacheEvictionIntervalMs: 5000
  managedLedgerMinimumBacklogCursorsForCaching: 3
  managedLedgerMinimumBacklogEntriesForCaching: 1
  managedLedgerMaxBacklogBetweenCursorsForCaching: 2147483647
```

`broker.conf` format:
```properties
maxMessagePublishBufferSizeInMB=500
managedLedgerCacheSizeMB=512
managedLedgerMaxReadsInFlightSizeInMB=500
managedLedgerCacheEvictionTimeThresholdMillis=10000
managedLedgerCacheEvictionIntervalMs=5000
managedLedgerMinimumBacklogCursorsForCaching=3
managedLedgerMinimumBacklogEntriesForCaching=1
managedLedgerMaxBacklogBetweenCursorsForCaching=2147483647
```

`managedLedgerMaxReadsInFlightSizeInMB` must be set to a value that is higher than `dispatcherMaxReadBatchSize` * `maxMessageSize`. Otherwise, reads can fail with the error `Time-out elapsed while acquiring enough permits on the memory limiter to read from ledger [ledgerid], [topic], estimated read size [read size] bytes for [dispatcherMaxReadBatchSize] entries (check managedLedgerMaxReadsInFlightSizeInMB)`.
`dispatcherMaxReadBatchSize` defaults to 100 and `maxMessageSize` defaults to 5 MB (specified in bytes).
A separate issue, https://github.com/apache/pulsar/issues/23482, addresses the problem that occurs when `managedLedgerMaxReadsInFlightSizeInMB` < `dispatcherMaxReadBatchSize` * `maxMessageSize`.
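
The constraint above can be checked with a little arithmetic. The following is a minimal sketch, assuming 1 MB means 1024 * 1024 bytes and that `maxMessageSize` defaults to 5 * 1024 * 1024 bytes; the function name `worst_case_read_mb` is hypothetical and not part of Pulsar.

```python
BYTES_PER_MB = 1024 * 1024  # assumption: MB settings use binary megabytes


def worst_case_read_mb(dispatcher_max_read_batch_size: int = 100,
                       max_message_size_bytes: int = 5 * BYTES_PER_MB) -> int:
    """Return dispatcherMaxReadBatchSize * maxMessageSize in MB, rounded up."""
    worst_case_bytes = dispatcher_max_read_batch_size * max_message_size_bytes
    # Integer ceiling division so partial megabytes are not lost.
    return -(-worst_case_bytes // BYTES_PER_MB)


if __name__ == "__main__":
    # Defaults: 100 entries * 5 MB per entry.
    print(worst_case_read_mb())
    # Hypothetical cluster that raises maxMessageSize to 10 MB.
    print(worst_case_read_mb(100, 10 * BYTES_PER_MB))
```

With the default values the product works out to 500 MB under these assumptions, so the sample value of `500` sits right at the boundary. Clusters that raise `dispatcherMaxReadBatchSize` or `maxMessageSize` would need a proportionally larger `managedLedgerMaxReadsInFlightSizeInMB`.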


### Alternatives

_No response_

### Anything else?

The broker cache hit rate can be monitored with the Grafana dashboard found at https://github.com/datastax/pulsar-helm-chart/blob/master/helm-chart-sources/pulsar/grafana-dashboards/broker-cache-by-broker.json (Apache 2.0 license). The broker cache also impacts offloading. Offloading can be monitored with the dashboard available at https://github.com/apache/pulsar/blob/master/grafana/dashboards/offloader.json .

### Are you willing to submit a PR?

- [ ] I'm willing to submit a PR!
