Skip to content

Conversation

@emelialei88
Copy link
Collaborator

@emelialei88 emelialei88 commented Oct 24, 2025

This PR introduces several refactors and enables authentication:

  • Adds an Authenticator class to encapsulate authentication logic.
  • Migrates connection initialization logic from InitialConnectionHandler to InitialConnectionContext.

@emelialei88 emelialei88 requested a review from a team as a code owner October 24, 2025 19:53
@emelialei88 emelialei88 marked this pull request as draft October 24, 2025 20:58
@678098 678098 self-assigned this Nov 6, 2025
@678098 678098 self-requested a review November 6, 2025 16:05
@emelialei88 emelialei88 force-pushed the authn/change-negotiator branch 2 times, most recently from d2c4c40 to ea2450d Compare November 12, 2025 18:51
@emelialei88 emelialei88 marked this pull request as ready for review November 12, 2025 19:38
Copy link
Collaborator

@dorjesinpo dorjesinpo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interim review.
we have pairs of the following:
decodeInitialConnectionMessage,
scheduleRead,
readCallback
readBlob,
processBlob,
complete
... in both InitialConnectionHandler and InitialConnectionContext.

The point of FSM-like state management is that everything happens in handleEvent and there is only one place where we call complete
Otherwise, we can't rely on controlling actions by the state

Is it possible to engage handleEvent immediately in InitialConnectionHandler::handleInitialConnection and call InitialConnectionContext::complete from handleEvent only.

We seem to have InitialConnectionContext::handleInitialConnection but we call InitialConnectionHandler::handleInitialConnection instead which does not seem to support auth?

context));

// Register as observer of the channel to get the 'onClose'
bsl::weak_ptr<InitialConnectionContext> context_wp =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the motivation behind the use of weak_ptr?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously we have InitialConnectionContext bind to onClose, which means it won't be destroyed until the channel closes. It's works but I was hoping to clean it up once the initial connection part finishes, to make the lifetime match with the semantic.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since there is d_initialConnectionContextCache now, maybe we do not need to bind bsl::shared_ptr<InitialConnectionContext> anymore?

bslmt::LockGuard<bslmt::Mutex> guard(&d_mutex); // LOCKED

if (d_state != AuthenticationState::e_AUTHENTICATING) {
errorDescription << "State not AUTHENTICATING (was " << d_state << ")";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was -> is

@dorjesinpo dorjesinpo assigned emelialei88 and unassigned dorjesinpo Nov 14, 2025
@emelialei88 emelialei88 force-pushed the authn/change-negotiator branch 2 times, most recently from a009820 to e4183da Compare November 17, 2025 15:58
Copy link
Collaborator Author

@emelialei88 emelialei88 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Response

context));

// Register as observer of the channel to get the 'onClose'
bsl::weak_ptr<InitialConnectionContext> context_wp =
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously we have InitialConnectionContext bind to onClose, which means it won't be destroyed until the channel closes. It's works but I was hoping to clean it up once the initial connection part finishes, to make the lifetime match with the semantic.

Copy link
Collaborator

@dorjesinpo dorjesinpo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interim review

// BDE
#include <bdlbb_pooledblobbufferfactory.h>
#include <bsl_iostream.h>
#include <bsl_memory.h>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just double checking if this is necessary

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indirectly included in mqbmock_cluster.h but maybe we shouldn't depend on indirect includes. Needed for shared_ptr.

}

void InitialConnectionContext::handleEvent(
int statusCode,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the meaning of statusCode is not clear. It seems that all errors are communicated as InitialConnectionEvent::e_ERROR and TCPSessionFactory::initialConnectionComplete just needs statusCode != 0 (which is not enforced by InitialConnectionEvent::e_ERROR case).
Unless we want to communicate exact value to bmqio::Channel::close for some reason, the code can be simplified by removing statusCode argument.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the errors that occur during initial connection don’t print the error code or logs until initialConnectionComplete is called. If we were to follow this convention, it still feels necessary to pass the error code through handleEvent to initialConnectionComplete so it can be properly logged.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood. But d_state is primary and statusCode is secondary. On the line 710, there is no check for statusCode, if it is 0, there will be no complete call.

Maybe, we should rely on InitialConnectionState::e_FAILED and InitialConnectionState::e_NEGOTIATED being the only final states and call complete as soon as we reach them. Assuming, any other state is transitory and awaits for more event(s).

context));

// Register as observer of the channel to get the 'onClose'
bsl::weak_ptr<InitialConnectionContext> context_wp =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since there is d_initialConnectionContextCache now, maybe we do not need to bind bsl::shared_ptr<InitialConnectionContext> anymore?

} // close mutex lock guard // UNLOCK

// Keep track of active channels, for logging purposes
++d_nbActiveChannels;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moving ++d_nbActiveChannels changes the semantics of it. It used to track all channels, including incomplete ones.

Copy link
Collaborator Author

@emelialei88 emelialei88 Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might be the race condition we discussed earlier, though I don’t recall the exact scenario. All the tests are passing now — possibly because the fuzz tests have sped up and no longer reproduce the issue. Definitely a good reminder for me to document these cases better next time.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't recall having any problems with d_nbActiveChannels, we do not seem to check it, only log

Copy link
Contributor

@chrisbeard chrisbeard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gave the authenticator interfaces a first pass, looking good overall, some comments about the implementation details and minor design bits.

Comment on lines 152 to 153
/// The authentication message received.
bmqp_ctrlmsg::AuthenticationMessage d_authenticationMessage;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the motivation for storing this for the lifetime of the connection? We should not store this material once authentication completes. Can that be avoided?

Comment on lines 155 to 156
/// The encoding type used for sending a message. It should match with the
/// encoding type of the received message.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this comment feels like it can be made clearer, perhaps something like:

// The encoding type used for sending authentication responses. The value matches
// the encoding type set by the client during authentication and ensures that all
// responses are sent using the same encoding.

const bsl::shared_ptr<mqbplug::AuthenticationResult>& value);

// NOTE: AuthenticationMessage and encodingType are set only when
// AuthenticationState is e_AUTHENTICATED. authenticationMessage() and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the state mentioned in this comment correct? e_AUTHENTICATED feels wrong here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two functions are called only during reauthentication. The state should be e_AUTHENTICATED at that time.

Comment on lines 77 to 89
/// Produce and send outbound authentication message with the specified
/// `context`. Return 0 on success, or a non-zero error code and populate
/// the specified `errorDescription` with a description of the error
/// otherwise.
virtual int authenticationOutbound(
bsl::ostream& errorDescription,
const bsl::shared_ptr<AuthenticationContext>& context) = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method feels a little out of place. Maybe that means design-wise we should rethink this, leave it out for now, or documentation on the method and interface should better define its purpose.

Comment on lines 67 to 68
const int k_MIN_THREADS = 1; // Minimum number of threads in the thread pool
const int k_MAX_THREADS = 3; // Maximum number of threads in the thread pool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably be configuration-driven.

Comment on lines +1029 to +1031
context->setAuthenticationMessage(authenticationMessage);
context->setAuthenticationEncodingType(
event.authenticationEventEncodingType());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comment here, we should prefer to not store the authentication message for the lifetime of the connection.

int rc = rc_SUCCESS;
bsl::string error;

// Setup error handler based on whether this is initial authn or reauthn
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Setup error handler based on whether this is initial authn or reauthn
// Set up error handler based on whether this is initial authn or reauthn

bsl::optional<bdlb::ScopeExitAny> scopeGuard;

if (isReauthn) {
// For reauthentication: setup error guard to handle failures
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// For reauthentication: setup error guard to handle failures
// For reauthentication: set up error guard to handle failures

channel));
}
else {
// For initial authentication: setup state machine transition
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// For initial authentication: setup state machine transition
// For initial authentication: set up state machine transition

}

bmqu::MemOutStream errorStream;
errorStream << "Authentication timeout after " << lifetime << " ms";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error could be made clearer. If I understand correctly, it's trying to convey that the client did not reauthenticate within before the lifetime expired?

Copy link
Collaborator

@dorjesinpo dorjesinpo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

few comments, mostly minor


BSLS_ASSERT_SAFE(
d_initialConnectionContextCache.contains(initialConnectionContext_p));
bsl::shared_ptr<InitialConnectionContext> initialConnectionContext_sp =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the purpose of this seems to be to keep the object alive until this method returns since there is no other reference left. Is that correct? If so, it is worth a comment.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a comment for d_initialConnectionContextCache in mqbnet::tcpsessionfactory.h.

if (scheduleRc != 0) {
rc = (scheduleRc * 10) + rc_SCHEDULE_REAUTHN_FAILED;
error = scheduleErrStream.str();
return; // RETURN
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is better still to respond (with an error)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like failing to schedule reauthentication is more an internal error than something to notify the client.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should always respond to request, client logic may depend on it. In this case, it looks like an internal error and the response shoudl indicate an error.

mqbnet::AuthenticationState::e_AUTHENTICATING // state
);

context->setAuthenticationContext(authenticationContext);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant something like BSLS_ASSERT_SAFE(!d_authenticationCtxSp); in InitialConnectionContext::setAuthenticationContext

} // close mutex lock guard // UNLOCK

// Keep track of active channels, for logging purposes
++d_nbActiveChannels;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't recall having any problems with d_nbActiveChannels, we do not seem to check it, only log

@dorjesinpo dorjesinpo assigned emelialei88 and unassigned dorjesinpo Nov 20, 2025
Copy link
Collaborator

@dorjesinpo dorjesinpo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still think we always need to respond to auth request even in the case of internal error.
A minor point - since d_initialConnectionContextCache, there no need for bsl::enable_shared_from_this<InitialConnectionContext>

@emelialei88 emelialei88 force-pushed the authn/change-negotiator branch 2 times, most recently from aceb28b to 477467a Compare December 8, 2025 18:28
@emelialei88 emelialei88 assigned dorjesinpo and unassigned emelialei88 Dec 8, 2025
@emelialei88 emelialei88 force-pushed the authn/change-negotiator branch from 477467a to a2d36a6 Compare December 11, 2025 17:04
if (rc != 0) {
errorDescription << "Failed to enqueue authentication job for '"
<< channel->peerUri() << "' [rc: " << rc
<< ", message: " << context->authenticationMessage()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure we want to include the authentication material here.

Comment on lines +268 to +271
const int authnRc = d_authnController_p->authenticate(
authnErrStream,
&result,
authenticationRequest.mechanism(),
authenticationData);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially open question: it doesn't look like we log the result of authentication here or in the authenticationcontroller. Is the expectation that the plugin implementation should log if you want logs?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a related note, control plane requests (e.g. openQueue, configureQueue, etc) have timers that track how long the broker took to handle the request. Do we have a plan for supporting something similar for authN?

Copy link
Collaborator Author

@emelialei88 emelialei88 Dec 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logging added to authenticationcontroller including timing authentication.

@emelialei88 emelialei88 force-pushed the authn/change-negotiator branch 2 times, most recently from 346e3ea to 1baee3d Compare December 12, 2025 20:05
@dorjesinpo
Copy link
Collaborator

All ITs failures are not related to the PR

  1. test_poison_rda_reset_priority_active - Fix[MQB]: downstream clears app too early #1042
  2. test_dead_proxy[multi_node_fsm-strong_consistency] - Fix[mqb]: isFinal race when dropping handle #1052 and Fix[mqb]: do not processPendingContexts on failed reopen #1055

dorjesinpo
dorjesinpo previously approved these changes Dec 17, 2025
Copy link
Collaborator

@dorjesinpo dorjesinpo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you

@emelialei88 emelialei88 force-pushed the authn/change-negotiator branch 2 times, most recently from bbcc3ad to cc5cd69 Compare December 22, 2025 20:24
@emelialei88 emelialei88 force-pushed the authn/change-negotiator branch from f50df82 to cc6bc15 Compare December 29, 2025 20:27
@emelialei88 emelialei88 requested a review from dorjesinpo January 2, 2026 20:02
@emelialei88 emelialei88 force-pushed the authn/change-negotiator branch from cc6bc15 to 48fb1ad Compare January 2, 2026 20:03
Copy link
Collaborator

@dorjesinpo dorjesinpo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one comment about possible improvement

// Transition to the next state
event.value() = InitialConnectionEvent::e_AUTHN_SUCCESS;
if (isReauthn) {
// For reauthentication, we're done
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was this a bug that closed the channel upon reauth?

If so, this function may be restructured to better handle isReauthn fork. For example, call handleEvent in the case of reauth with the transition e_NEGOTIATED -> e_AUTHN_SUCCESS -> e_NEGOTIATED if successful, e_NEGOTIATED -> e_ERROR -> e_FAILED otherwise.

@dorjesinpo dorjesinpo assigned emelialei88 and unassigned dorjesinpo Jan 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants