[Bug] maxMessagePublishBufferSizeInMB permits leak can stall and timeout connections #23921

### Search before asking

- [x] I searched in the [issues](https://github.com/apache/pulsar/issues) and found nothing similar.


### Read release policy

- [x] I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.


### Version

master branch code analysis

### Minimal reproduce step

`org.apache.pulsar.broker.service.ServerCnx#completedSendOperation` might not get called in some error cases.
The pending bytes are tracked per IO thread, so each missed call leaks permits, and message publishing could eventually stop for all connections using that IO thread.

The broker `maxMessagePublishBufferSizeInMB` limit is split into a `maxPendingBytesPerThread` limit:
https://github.com/apache/pulsar/blob/3fce3097c76a9c8cb64cf3d8d87f6e050e6cb3a5/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java#L342-L343
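As a rough worked example of that split, the sketch below repeats the arithmetic from the linked lines (`getMaxMessagePublishBufferSizeInMB() * 1024L * 1024L / getNumIOThreads()`). The class and method names, and the 512 MB / 8 thread values, are made up for illustration; they are not Pulsar defaults.

```java
public class PerThreadLimitExample {
    // Same arithmetic as the linked ServerCnx lines: convert the MB limit to bytes,
    // then divide it evenly across the IO threads.
    static long perThreadLimitBytes(int maxMessagePublishBufferSizeInMB, int numIOThreads) {
        return maxMessagePublishBufferSizeInMB * 1024L * 1024L / numIOThreads;
    }

    public static void main(String[] args) {
        // Hypothetical configuration values, chosen only to keep the numbers simple.
        long limit = perThreadLimitBytes(512, 8);
        System.out.println(limit); // 67108864 bytes, i.e. 64 MB per IO thread
    }
}
```

With these assumed values, each IO thread can hold at most 64 MB of pending publish bytes, so a leak only has to reach that much smaller per-thread budget before the thread is affected.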

The pending bytes count is incremented when a message is sent:
https://github.com/apache/pulsar/blob/3fce3097c76a9c8cb64cf3d8d87f6e050e6cb3a5/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java#L3357

It is decremented in the `ServerCnx#completedSendOperation` method:
https://github.com/apache/pulsar/blob/3fce3097c76a9c8cb64cf3d8d87f6e050e6cb3a5/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java#L3376-L3377

If the call to decrement is missing, there will be a leak which will eventually cause all message publishing to stop for all connections using a particular IO thread.
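To make the failure mode concrete, here is a small simulation of that accounting. It is a simplified model, not the Pulsar `PendingBytesPerThreadTracker`: the `Tracker` class, the resume threshold of half the limit, and the "every tenth publish fails without a decrement" pattern are all assumptions for illustration.

```java
public class PendingBytesLeakDemo {
    // Simplified stand-in for a per-IO-thread tracker (hypothetical, not the Pulsar class).
    static final class Tracker {
        final long limit;
        final long resumeThreshold;
        long pendingBytes;
        boolean limitExceeded;

        Tracker(long limit, long resumeThreshold) {
            this.limit = limit;
            this.resumeThreshold = resumeThreshold;
        }

        void increment(int msgSize) {
            pendingBytes += msgSize;
            if (pendingBytes > limit) {
                limitExceeded = true; // in the model, reads on this thread are now paused
            }
        }

        void decrement(int msgSize) {
            pendingBytes -= msgSize;
            if (limitExceeded && pendingBytes <= resumeThreshold) {
                limitExceeded = false; // reads would resume
            }
        }
    }

    // Publishes up to `count` messages. Every `leakEvery`-th publish fails without the
    // matching decrement (leakEvery <= 0 means no leak). Stops once the thread is stalled.
    static Tracker run(int count, int msgSize, int leakEvery, long limit) {
        Tracker tracker = new Tracker(limit, limit / 2);
        for (int i = 1; i <= count; i++) {
            if (tracker.limitExceeded) {
                break; // no further publish requests are read on this thread
            }
            tracker.increment(msgSize);
            boolean leaked = leakEvery > 0 && i % leakEvery == 0;
            if (!leaked) {
                tracker.decrement(msgSize);
            }
        }
        return tracker;
    }

    public static void main(String[] args) {
        Tracker healthy = run(1000, 1024, 0, 50 * 1024);
        Tracker leaking = run(1000, 1024, 10, 50 * 1024);
        System.out.println("healthy: stalled=" + healthy.limitExceeded
                + " pendingBytes=" + healthy.pendingBytes);
        System.out.println("leaking: stalled=" + leaking.limitExceeded
                + " pendingBytes=" + leaking.pendingBytes);
    }
}
```

In the leaking run the counter never drops back below the resume threshold, because the leaked bytes have no in-flight operation left that could release them. That's why the stall would be permanent rather than temporary backpressure.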

**The leak happens here**:
https://github.com/apache/pulsar/blob/2a9d4ac85d8d786979afaa0b965cdb27375ae969/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentTopic.java#L732-L749

There should be a call to `MessagePublishContext#completed` for all exception cases. On the exception path, `MessagePublishContext#completed` calls `ServerCnx#completedSendOperation` here:
https://github.com/apache/pulsar/blob/3d0625ba64294fb0fe7dafc27c7a34883b4be51b/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/Producer.java#L480-L499

The other exception cases contain the required call to `callback.completed` which will call ServerCnx#completedSendOperation:
https://github.com/apache/pulsar/blob/2a9d4ac85d8d786979afaa0b965cdb27375ae969/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentTopic.java#L776-L794
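One possible shape of a fix, sketched with stand-in types: complete the callback on every exit from `addFailed`, including the early return taken while the topic is transferring. This is not the actual patch. The `PublishContext` interface below is a local stand-in, and `pendingAfterFailure`, the `AtomicLong` counter and the 1024-byte message size are hypothetical names and values used only to show the invariant.

```java
import java.util.concurrent.atomic.AtomicLong;

public class AddFailedSketch {
    // Local stand-in for the broker's publish callback; only the method shape is borrowed.
    interface PublishContext {
        void completed(Exception exception, long ledgerId, long entryId);
    }

    // Hypothetical addFailed: every branch completes the callback before returning.
    static void addFailed(boolean transferring, Exception exception, PublishContext callback) {
        if (transferring) {
            // Assumption: completing the callback here is what releases the pending bytes.
            callback.completed(exception, -1, -1);
            return;
        }
        callback.completed(new RuntimeException(exception), -1, -1);
    }

    // Simulates one failed publish and returns the pending bytes left afterwards.
    static long pendingAfterFailure(boolean transferring) {
        AtomicLong pendingBytes = new AtomicLong();
        int msgSize = 1024;
        pendingBytes.addAndGet(msgSize); // models the increment when the send starts
        addFailed(transferring, new Exception("add failed"),
                (e, ledgerId, entryId) -> pendingBytes.addAndGet(-msgSize));
        return pendingBytes.get();
    }

    public static void main(String[] args) {
        System.out.println(pendingAfterFailure(true));  // 0
        System.out.println(pendingAfterFailure(false)); // 0
    }
}
```

The point of the sketch is only the invariant: one increment per send and exactly one completion per send, whichever error branch is taken.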

### What did you expect to see?

There shouldn't be a leak in `maxPendingBytesPerThread` permits which eventually leads to message publishing stopping for all connections using a particular IO thread.

### What did you see instead?

Based on the analysis of the code, there's a leak.

### Anything else?

This might be related to issue #23920 

A heap dump could be used to check if the issue applies. This can be done by searching for `org.apache.pulsar.broker.service.ServerCnx$PendingBytesPerThreadTracker` instances in the heap dump and checking the `pendingBytes` and `limitExceeded` field values.


### Are you willing to submit a PR?

- [x] I'm willing to submit a PR!

Code at `PersistentTopic.java#L732-L749`:

	public synchronized void addFailed(ManagedLedgerException exception, Object ctx) {
	    /* If the topic is being transferred(in the Releasing bundle state),
	     we don't want to forcefully close topic here.
	     Instead, we will rely on the service unit state channel's bundle(topic) transfer protocol.
	     At the end of the transfer protocol, at Owned state, the source broker should close the topic properly.
	     */
	    if (transferring) {
	        if (log.isDebugEnabled()) {
	            log.debug("[{}] Failed to persist msg in store: {} while transferring.",
	                    topic, exception.getMessage(), exception);
	        }
	        return;
	    }

	    PublishContext callback = (PublishContext) ctx;
	    if (exception instanceof ManagedLedgerFencedException) {
	        // If the managed ledger has been fenced, we cannot continue using it. We need to close and reopen
	        close();

Code at `Producer.java#L480-L499`:

	public void completed(Exception exception, long ledgerId, long entryId) {
	    if (exception != null) {
	        final ServerError serverError = getServerError(exception);

	        producer.cnx.execute(() -> {
	            // if the topic is transferring, we don't send error code to the clients.
	            if (producer.getTopic().isTransferring()) {
	                if (log.isDebugEnabled()) {
	                    log.debug("[{}] Received producer exception: {} while transferring.",
	                            producer.getTopic().getName(), exception.getMessage(), exception);
	                }
	            } else if (!(exception instanceof TopicClosedException)) {
	                // For TopicClosed exception there's no need to send explicit error, since the client was
	                // already notified
	                // For TopicClosingOrDeleting exception, a notification will be sent separately
	                long callBackSequenceId = Math.max(highestSequenceId, sequenceId);
	                producer.cnx.getCommandSender().sendSendError(producer.producerId, callBackSequenceId,
	                        serverError, exception.getMessage());
	            }
	            producer.cnx.completedSendOperation(producer.isNonPersistentTopic, msgSize);

Code at `PersistentTopic.java#L776-L794`:

	if (exception instanceof ManagedLedgerAlreadyClosedException) {
	    if (log.isDebugEnabled()) {
	        log.debug("[{}] Failed to persist msg in store: {}", topic, exception.getMessage());
	    }

	    callback.completed(new TopicClosedException(exception), -1, -1);
	    return;

	} else {
	    log.warn("[{}] Failed to persist msg in store: {}", topic, exception.getMessage());
	}

	if (exception instanceof ManagedLedgerTerminatedException && !isMigrated()) {
	    // Signal the producer that this topic is no longer available
	    callback.completed(new TopicTerminatedException(exception), -1, -1);
	} else {
	    // Use generic persistence exception
	    callback.completed(new PersistenceException(exception), -1, -1);
	}


Code at `ServerCnx.java#L342-L343`:

	this.maxPendingBytesPerThread = conf.getMaxMessagePublishBufferSizeInMB() * 1024L * 1024L
	        / conf.getNumIOThreads();

Code at `ServerCnx.java#L3376-L3377`:

	public void completedSendOperation(boolean isNonPersistentTopic, int msgSize) {
	    PendingBytesPerThreadTracker.getInstance().decrementPublishBytes(msgSize, resumeThresholdPendingBytesPerThread);
