Skip to content

Conversation

@lluwm
Copy link
Contributor

@lluwm lluwm commented Nov 14, 2025

Problem Statement

We estimate the size of pubsubmessage and based on it to compute the aggregate size of drainer buffer queue. A limit is set on the queue to control its occupied heapsize. In a recent OOM heap dump analysis, we found that this estimation can inaccurate e.g. buffer size is about 400MB, whereas the limit setting is only 10MB, and the reason is that we use the sum of key, value, and pubSubPosition to estimate an ImmutablePubSubMessage size. However, when examining the dump, in this particular case, the major contributor was PubSubMessageHeaders which was about 16KB and we don't take it into account today.

Solution

This PR includes the header size inside getHeapSize estimation for ImmutablePubSubMessage.

Code changes

  • Added new code behind a config. If so list the config names and their default values in the PR description.
  • Introduced new log lines.
    • Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues.
  • Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
  • No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
  • Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
  • Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

  • New unit tests added.
  • New integration tests added.
  • Modified or extended existing tests.
  • Verified backward compatibility (if applicable).

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.
  • Yes. Clearly explain the behavior change and its impact.

…psize estimation

We estimate the size of pubsubmessage and based on it to compute the aggregate size of drainer buffer queue.
A limit is set on the queueto control its occupied heapsize. In a recent OOM heap dump analysis, we found
that this estimation can inaccurate e.g. buffer size is about 400MB, whereas the limit setting is only 10MB,
and the reason is that we use the sum of key, value, and pubSubPosition to estimate an ImmutablePubSubMessage size.
However, when examining the dump, in this particular case, the major contributor was PubSubMessageHeaders which was
about 16KB and we don't take it into account today.

This PR includes the header size inside getHeapSize estimation for ImmutablePubSubMessage.
Copilot AI review requested due to automatic review settings November 14, 2025 06:57
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR addresses an inaccurate heap size estimation in ImmutablePubSubMessage that caused buffer size limits to be ineffective, leading to memory issues (e.g., 400MB actual buffer size vs 10MB configured limit).

Key Changes:

  • Modified getHeapSize() method to include PubSubMessageHeaders in the heap size calculation

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot AI review requested due to automatic review settings December 8, 2025 19:36
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

doReturn(mockSensor).when(mockMetricRepo).sensor(anyString(), any());
taskPollingService = Executors.newFixedThreadPool(1);
storeBufferService = new StoreBufferService(
3,
Copy link

Copilot AI Dec 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The buffer capacity increase from 10000 to 100000 (10x) appears to be a significant change for a test configuration. While this change may be necessary due to the more accurate heap size calculation now including headers, it would be helpful to add a comment explaining why this specific value was chosen and how it relates to the heap size estimation fix. This will help future maintainers understand the relationship between this test configuration and the heap size calculation change.

Suggested change
3,
3,
// Increased buffer capacity from 10,000 to 100,000 (10x) due to more accurate heap size calculation now including headers.
// This value ensures the test configuration remains valid with the heap size estimation fix.

Copilot uses AI. Check for mistakes.
return SHALLOW_CLASS_OVERHEAD + InstanceSizeEstimator.getObjectSize(key)
+ InstanceSizeEstimator.getObjectSize(value) + InstanceSizeEstimator.getObjectSize(pubSubPosition);
+ InstanceSizeEstimator.getObjectSize(value) + InstanceSizeEstimator.getObjectSize(pubSubPosition)
+ (pubSubMessageHeaders != null ? InstanceSizeEstimator.getObjectSize(pubSubMessageHeaders) : 0);
Copy link

Copilot AI Dec 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The heap size calculation now includes pubSubMessageHeaders, but there are no tests specifically verifying that getHeapSize() correctly accounts for headers when they are present. Consider adding a test case in InstanceSizeEstimatorTest that creates an ImmutablePubSubMessage with headers (using the 7-parameter constructor) to ensure the heap size estimation is accurate when headers are included. This is important since the PR description indicates headers can be as large as 16KB and are now a significant contributor to memory usage.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant