
[Bug] Update Netty allocator options to avoid OOM when using EntryFilters #25021

@geniusjoe

Description


Search before reporting

  • I searched in the issues and found nothing similar.

Read release policy

  • I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.

User environment

Broker version: 3.0 LTS, 4.0 LTS
Broker operating system and hardware type: 5.4.241-1-tlinux4 (cloud vendor host Linux distribution, similar to CentOS)
Broker Java version: JDK 21
Client library type: Java
Client library version: apache-pulsar-3.0.4
Client Operating system and hardware type: 5.4.241-1-tlinux4
Client Java version: openjdk version "17.0.12" 2024-07-23 LTS

Issue Description

The brokers of one of the Pulsar clusters we maintain (16C32G spec) frequently experience native memory OOM.
However, the memory usage shown by jcmd <broker-pid> VM.native_memory summary does not match the process RES reported by top. Typically, just before an OOM occurs, top shows around 31 GB of memory usage, while the total in the jcmd <broker-pid> VM.native_memory summary is only about 20 GB. Broker restarts due to OOM are abnormal, and we need to find out the reason.


Error messages

The process was killed by the Kubernetes OOMKiller. kubectl describe pod shows the following in the "Last State" section: CGroup OOM encountered, victim process: java, pid: <broker-pid>

Reproducing the issue

I believe the OOM is related to the broker frequently reading large messages from bookies and then quickly releasing them. This scenario is more likely to occur in broadcast use cases (a large number of subscriptions with a high filter ratio). Below is one way to reproduce it (a client-side sketch follows the list):

  1. Use 1 broker with 8C16G hardware and a 5 Gbps NIC, plus 3 bookies; configure E:Qw:Qa = 2:2:1 to improve bookie message throughput
  2. Create approximately 1000 subscriptions, each with exactly one consumer, and add unique properties to each subscription
  3. Send a 1 MB message every second, with properties matching only one randomly chosen subscription among the 1000
  4. Configure broker EntryFilters so that each message is delivered only to the single subscription whose properties match it, while all other subscriptions do not receive it
  5. Make sure the broker ledger cache always misses
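
For illustration, here is a minimal client-side sketch of steps 2–4. The topic name, the "target" property key, and the local serviceUrl are assumptions, ConsumerBuilder#subscriptionProperties is assumed to be available in the client version used, and the broker-side EntryFilter is not shown:

```java
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

public class BroadcastFilterRepro {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")                      // assumption: local broker
                .build();

        String topic = "persistent://public/default/broadcast-repro";       // hypothetical topic
        int subscriptionCount = 1000;

        // Step 2: ~1000 subscriptions, one consumer each, each tagged with a unique property.
        for (int i = 0; i < subscriptionCount; i++) {
            client.newConsumer()
                    .topic(topic)
                    .subscriptionName("sub-" + i)
                    .subscriptionProperties(Map.of("target", "sub-" + i))
                    .subscribe();
        }

        // Step 3: one ~1 MB message per second whose "target" property matches exactly one subscription.
        Producer<byte[]> producer = client.newProducer().topic(topic).create();
        byte[] payload = new byte[1024 * 1024];
        while (true) {
            String target = "sub-" + ThreadLocalRandom.current().nextInt(subscriptionCount);
            producer.newMessage().property("target", target).value(payload).send();
            Thread.sleep(1000);
        }
        // Step 4 (broker side, not shown): an EntryFilter configured so that an entry is accepted
        // only for the subscription whose "target" property equals the message's "target" property,
        // i.e. 999 of 1000 dispatch attempts are filtered out.
    }
}
```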

Additional information

I believe this issue is mainly related to the memory allocation mechanism of Netty's ByteBuf allocator:

  1. When messages are large and the filtering ratio is high, Pulsar's default configuration of dispatcherMaxReadSizeBytes=5MB and dispatcherMaxBatchSize=100 means a single readMoreEntries() batch can easily reach the 5 MB maximum read size. With the current E:Qw:Qa = 2:2:1 mode, a single read operation only goes through two channel eventLoop threads. Since the default Netty chunk size is 4 MB (DEFAULT_PAGE_SIZE << DEFAULT_MAX_ORDER), the existing chunks in the Netty PoolArena frequently have insufficient space, requiring new chunks to be allocated from native memory (see the chunk-size arithmetic sketch after this list).

  2. The newly allocated chunks may be fully occupied by the current readMoreEntries() operation. Due to the high message filtering ratio, it is very likely that all messages read in the batch are filtered out and never delivered to consumers, so the entry byteBufs are released in the broker almost at the same time.

  3. Netty's default thread-local cache threshold, maxCachedBufferCapacity, is 32 KB. A byteBuf larger than that cannot be kept in the thread-local cache; it is released directly back to the chunk. When all entries read in a batch are larger than 32 KB and released at the same time, this triggers Netty's chunk recycling strategy, which frees the chunk and returns its memory to the OS. In JFR (Java Flight Recorder) I observed Netty allocating and releasing memory very frequently.

  4. This behavior may depend on the JDK runtime, Linux distribution, or glibc malloc() implementation, but I believe it is a common problem. In my test environment, when the JDK allocates direct memory smaller than 64 MB (i.e., each chunk allocation is 4 MB), pmap -x <broker-pid> | sort -n -k3 shows that the OS always maps a memory segment of 61444 KB. When chunks are allocated and released too frequently (at a rate of roughly 20 times per second), the OS cannot reclaim memory as fast as it is being requested, which results in the native memory size seen via top being significantly larger than in the jcmd <broker-pid> VM.native_memory summary.
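
For reference, a small sketch of the chunk-size arithmetic behind points 1 and 3, assuming Netty's default 8 KiB page size (the chunk size is simply pageSize << maxOrder):

```java
public class ChunkSizeMath {
    public static void main(String[] args) {
        int pageSize = 8 * 1024;                 // io.netty.allocator.pageSize default: 8 KiB

        int chunkNettyDefault = pageSize << 9;   // maxOrder = 9  (Netty >= 4.1.76)  -> 4 MiB
        int chunkOldDefault   = pageSize << 11;  // maxOrder = 11 (pre-4.1.76)       -> 16 MiB
        int chunkProposed     = pageSize << 13;  // maxOrder = 13 (proposed below)   -> 64 MiB

        System.out.println(chunkNettyDefault >> 20);  // 4
        System.out.println(chunkOldDefault   >> 20);  // 16
        System.out.println(chunkProposed     >> 20);  // 64

        // A readMoreEntries() batch capped by dispatcherMaxReadSizeBytes = 5 MB does not fit
        // inside a single 4 MiB chunk, so fresh chunks keep being allocated from native memory
        // and freed again as soon as the filtered entries are released. A 64 MiB chunk can hold
        // the whole batch, letting chunks live longer instead of bouncing between Netty and the OS.
    }
}
```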

To solve this issue, we could update the following Netty-related options:
-Dio.netty.allocator.maxOrder=13 -Dio.netty.allocator.maxCachedBufferCapacity=8388608

  1. The maxOrder parameter controls the maximum size of a single chunk. Raising the default from 4 MB to 64 MB increases the number of byteBufs that can be stored in a single chunk; in particular, a 64 MB chunk can accommodate a default maximum 5 MB single message.
    Netty reduced the default chunk size from 16 MB to 4 MB in version 4.1.76 (Reduce the default PooledByteBufAllocator chunk size from 16 MiB to 4 MiB netty/netty#12108), mainly because the Netty community felt that most scenarios do not need large chunks in each PoolArena and that large chunks can waste memory: "If we don't need a lot of memory, and certainly not compared to the number of cores on a system, then this will take up more memory than necessary, since each chunk is 16 MiB." I don't think that reasoning fits the Pulsar use case well.
    In practice, the Logstash community has already reverted to the original default of 16 MB: Handle Netty's change of default value for maxOrder elastic/logstash#15765

  2. Adjusting the maxCachedBufferCapacity parameter raises the maximum size of a byteBuf that can be held in a thread-local cache from the default 32 KB to 8 MB. As a result, the byteBuf backing a default maximum 5 MB message can now be cached thread-locally, which extends the lifecycle of chunks. Increasing the cache threshold also reduces how often the Pulsar eventGroup threads' thread-local caches request memory from the PoolArena, which in turn decreases lock waiting time during concurrent allocation in the PoolArena. A small verification sketch follows this list.
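
As a quick sanity check (a sketch, assuming the static accessors on PooledByteBufAllocator; not part of the proposed broker change), the effective allocator settings can be printed after starting the JVM with the flags above:

```java
import io.netty.buffer.PooledByteBufAllocator;

public class AllocatorSettingsCheck {
    public static void main(String[] args) {
        // Run with:
        //   java -Dio.netty.allocator.maxOrder=13 \
        //        -Dio.netty.allocator.maxCachedBufferCapacity=8388608 \
        //        AllocatorSettingsCheck
        int pageSize = PooledByteBufAllocator.defaultPageSize();
        int maxOrder = PooledByteBufAllocator.defaultMaxOrder();
        System.out.println("pageSize           = " + pageSize);               // expected: 8192
        System.out.println("maxOrder           = " + maxOrder);               // expected: 13
        System.out.println("chunkSize          = " + (pageSize << maxOrder)); // expected: 67108864 (64 MiB)
        System.out.println("maxCachedBufferCap = "
                + System.getProperty("io.netty.allocator.maxCachedBufferCapacity")); // expected: 8388608
    }
}
```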

After adjusting the Netty options above, I observed a significant reduction in Netty native memory allocation events in JFR (Java Flight Recorder), with the frequency dropping to about one allocation and deallocation every 1.5 minutes. Furthermore, the memory usage reported by jcmd <broker-pid> VM.native_memory summary is now almost the same as what top displays. With this configuration the operating system has enough time to reclaim native memory, and there have been no OOM events since the adjustment.


Are you willing to submit a PR?

  • I'm willing to submit a PR!

Metadata

Labels

type/bug
