[BUG] [iOS] CallAgent created after sign-out/sign-in (different user) hangs in Connecting and ends with callEndReason 408/0 #2490

@andrii-lisovskyi

Describe the bug
After signing out the current user, fully disposing CallAgent + CallClient, and signing in a different user with a fresh ACS identity in the same iOS process, the second user's callAgent.join(with: RoomCallLocator) reaches CallState.connecting and stays there for ~90 seconds. The server then ends the call with callEndReason.code = 408, subcode = 0.

Per the official troubleshooting-codes documentation, 408 with no subcode is: "Call controller timed out. Call Controller timed out waiting for protocol messages from user endpoints. Ensure clients are connected and available." — i.e. the ACS Call Controller is waiting for signaling messages from the second CallAgent that never arrive.

This strongly suggests SDK-internal state from the first session is wedging the second CallAgent's signaling channel, even though both the Swift CallAgent and CallClient instances are freshly constructed for User B.

Exception or Stack Trace
There is no thrown exception. callAgent.join returns no error; Call.callEndReason after the 90 s timeout is:

call.callEndReason.code    = 408
call.callEndReason.subcode = 0

To Reproduce
Steps to reproduce the behavior, all in a single iOS process:

  1. Launch the app. Sign in as User A (ACS identity A, ACS token with voip,chat scope, valid invitee in some Room R1).
  2. Construct CallClient and CallAgent for User A. Call callAgent.join(with: RoomCallLocator(roomId: R1.id)).
  3. Confirm the call reaches CallState.connected. Leave the call (call.hangUp(...)).
  4. Sign User A out and dispose the ACS stack:
    callAgent.dispose()
    callAgent = nil
    callClient.dispose()
    callClient = nil
  5. Sign in as User B (different ACS identity B, fresh voip,chat token, valid invitee in some Room R2 — same or different room as R1; outcome is identical).
  6. Construct a fresh CallClient and CallAgent for User B (callAgent is nil at this point). Call callAgent.join(with: RoomCallLocator(roomId: R2.id)).

Observed: join completion fires with no error in ~1 ms, call reaches Connecting in ~100 ms, stays there for 90 s, then Disconnected with callEndReason.code = 408, subcode = 0.

Code Snippet
Minimal shape of the second-user sign-in path:

// User B path (after User A was fully disposed)
let credential = try CommunicationTokenCredential(token: tokenForUserB)
let client = CallClient()
client.createCallAgent(userCredential: credential) { agent, error in
    guard let agent = agent, error == nil else { return }
    let locator = RoomCallLocator(roomId: roomForUserB)
    let opts = JoinCallOptions()
    agent.join(with: locator, joinCallOptions: opts) { call, error in
        // completion fires ok in ~1ms; `call.state` then goes to .connecting
        // and stays there ~90s before reaching .disconnected with code=408 subcode=0
    }
}

Expected behavior
User B's join reaches CallState.connected, the same way User A's did earlier in the same process.

Screenshots
N/A — failure is on the signaling layer, no UI artifact.

Setup (please complete the following information):

  • OS: iOS 18.x (physical device — does not reproduce on Simulator)
  • IDE: Xcode 26.3
  • Version of the Library used: AzureCommunicationCalling 2.18.2 (SwiftPM, latest as of 2026-03-10)

Additional context

Tokens are correct. We decode the JWT payload on receipt. Tokens for User A and User B have different ACS skypeid values (as expected), same resourceId, same voip,chat scope, are fresh (issuedAgo ≤ 1 s) and valid (~24 h). The 408 is not a token issue.
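For anyone repeating this check, the payload inspection can be done with Foundation alone. This is a minimal sketch, not the app's actual code: `decodeJWTPayload` and `tokenAge` are hypothetical helper names, `iat` and `exp` are the standard JWT registered claims, and the payload is only decoded, never signature-verified.

```swift
import Foundation

/// Decodes the payload (second segment) of a JWT into a dictionary.
/// Inspection only: the signature is NOT verified.
func decodeJWTPayload(_ token: String) -> [String: Any]? {
    let segments = token.split(separator: ".", omittingEmptySubsequences: false)
    guard segments.count == 3 else { return nil }

    // JWT segments are unpadded base64url; convert to padded standard base64.
    var base64 = String(segments[1])
        .replacingOccurrences(of: "-", with: "+")
        .replacingOccurrences(of: "_", with: "/")
    let remainder = base64.count % 4
    if remainder != 0 {
        base64 += String(repeating: "=", count: 4 - remainder)
    }

    guard let data = Data(base64Encoded: base64),
          let object = try? JSONSerialization.jsonObject(with: data),
          let claims = object as? [String: Any] else { return nil }
    return claims
}

/// Reads a numeric claim regardless of how the JSON number was bridged.
private func numericClaim(_ value: Any?) -> Double? {
    if let number = value as? NSNumber { return number.doubleValue }
    return value as? Double
}

/// Seconds since issue and until expiry, based on the `iat` and `exp` claims.
func tokenAge(_ claims: [String: Any],
              now: Date = Date()) -> (issuedAgo: TimeInterval, expiresIn: TimeInterval)? {
    guard let iat = numericClaim(claims["iat"]),
          let exp = numericClaim(claims["exp"]) else { return nil }
    let t = now.timeIntervalSince1970
    return (issuedAgo: t - iat, expiresIn: exp - t)
}
```

Comparing `claims["skypeid"]` and `claims["resourceId"]` across the two tokens confirms that the identities differ while the resource stays the same.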

Selected log excerpts (single process, two consecutive sessions):

User A — succeeds:

[ACS-perf] +0ms    connect() called
[ACS-perf] +3ms    CallClient() initialized
[ACS-perf] +454ms  createCallAgent returned (ok)
[ACS-perf] +460ms  callAgent.join returned (ok)
[ACS-perf] +823ms  call.state -> Connecting
[ACS-perf] +4555ms call.state -> Connected

Logout (after User A leaves the call):

[ACS-tearDown] disposing CallAgent/CallClient for session end
[ACS-tearDown] CallAgent/CallClient disposed

User B — fails:

[ACS-perf] +0ms    connect() called
[ACS-perf] +0ms    CallClient() initialized
[ACS-perf] +86ms   createCallAgent returned (ok)
[ACS-perf] +87ms   callAgent.join returned (ok)
[ACS-perf] +107ms  call.state -> Connecting
... 90 seconds of silence ...
[ACS-perf] +91206ms call.state -> Disconnected
call.callEndReason.code = 408, subcode = 0

Note: createCallAgent returns in ~90 ms for User B vs ~450 ms for User A. The ~5× speedup with otherwise-identical setup strongly suggests reused SDK-internal infrastructure that survives dispose().
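To compare these timings on another device, a relative-timestamp logger along the following lines is enough. It is a sketch only: `PerfLog` is a hypothetical helper, not part of the ACS SDK and not the logger that produced the excerpts above, and it does not reproduce their column alignment.

```swift
import Foundation

/// Hypothetical helper: logs events with milliseconds elapsed since `start()`.
final class PerfLog {
    private let tag: String
    private let clock: () -> TimeInterval
    private var origin: TimeInterval = 0

    /// `clock` defaults to monotonic system uptime; it is injectable for tests.
    init(tag: String,
         clock: @escaping () -> TimeInterval = { ProcessInfo.processInfo.systemUptime }) {
        self.tag = tag
        self.clock = clock
    }

    /// Resets the zero point, e.g. at the top of the app's connect() path.
    func start() { origin = clock() }

    /// Prints and returns one line such as "[ACS-perf] +454ms createCallAgent returned (ok)".
    @discardableResult
    func mark(_ event: String) -> String {
        let elapsedMs = Int(((clock() - origin) * 1000).rounded())
        let line = "[\(tag)] +\(elapsedMs)ms \(event)"
        print(line)
        return line
    }
}
```

Calling `start()` once per sign-in and `mark(_:)` in each completion handler gives per-session timelines that can be compared directly between User A and User B.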

What we ruled out using the published code/subcode catalog:

| Code/Subcode | Meaning | Observed? |
|---|---|---|
| 403 / 5828 | "Join isn't authorized - user isn't part of invitee list" | No (both users are valid invitees) |
| 403 / 5829 | "Beyond end time or before start time" | No |
| 403 / 5830 | "Only ACS user can join the Rooms meeting" | No |
| 495 / 4507 | "Invalid ACS token" | No (token decoded and verified valid) |
| 410 / 3112 | "Local media stack or ICE checks failed" | No (not a media/firewall issue) |
| 408 / 10057 | Rooms-specific "callee failed to finalize call setup" | No |
| 408 / 0 | Generic "Call Controller timed out waiting for protocol messages" | Yes ← this is us |

The fact that we get 408/0 rather than the Rooms-specific 408/10057 is significant: ACS isn't classifying us as "the participant disappeared mid-join" — it's saying the second CallAgent isn't driving the signaling protocol on its side.
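The triage above can be encoded as a small lookup so that logs carry the meaning next to the raw numbers. This is a sketch under stated assumptions: `EndReason` and `describeEndReason` are hypothetical names, the strings are abbreviated from the table above rather than taken from an SDK API, and the map covers only the codes listed in this issue.

```swift
/// A (code, subcode) pair as read from `call.callEndReason`.
struct EndReason: Hashable {
    let code: Int
    let subcode: Int
}

/// Meanings abbreviated from the table above; not an exhaustive catalog.
let knownEndReasons: [EndReason: String] = [
    EndReason(code: 403, subcode: 5828): "Join isn't authorized - user isn't part of invitee list",
    EndReason(code: 403, subcode: 5829): "Beyond end time or before start time",
    EndReason(code: 403, subcode: 5830): "Only ACS user can join the Rooms meeting",
    EndReason(code: 495, subcode: 4507): "Invalid ACS token",
    EndReason(code: 410, subcode: 3112): "Local media stack or ICE checks failed",
    EndReason(code: 408, subcode: 10057): "Rooms-specific: callee failed to finalize call setup",
    EndReason(code: 408, subcode: 0): "Generic: Call Controller timed out waiting for protocol messages",
]

/// Returns the known meaning, or a fallback that still shows the raw pair.
func describeEndReason(code: Int, subcode: Int) -> String {
    knownEndReasons[EndReason(code: code, subcode: subcode)]
        ?? "Unlisted end reason \(code)/\(subcode)"
}
```

Keying on the pair rather than on the code alone matters here, because 408/0 and 408/10057 point at different failure modes.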

What we tried in app code:

  1. agent.dispose() then client.dispose() on a background queue on logout. This is required but not sufficient: with only this change, the second user's callAgent.join completion never fires, and the issue surfaces as a silent hang.
  2. Setting callAgent = nil and callClient = nil, and clearing agent.delegate and any singleton CallAgentDelegate owner before dispose. Also required, but no behavior change beyond item 1.
  3. Replacing our app-side singleton holding the CallClient/CallAgent with a brand-new instance after dispose. This changed the symptom from "join callback never returns" to "join callback returns ok in 1 ms, call reaches Connecting, server times out with 408/0 at 90 s" — i.e. it moved the failure from somewhere fully inside the SDK to a now-observable, server-acknowledged signaling stall.
  4. Waiting 30+ s of wall-clock between dispose and the next connect() (MSAL interactive sign-in time, in practice) — does not help. The wedge survives wall-clock time.

What we couldn't do, but would help:

  • Inspect CallAgent.connectionStatus. This property exists on the Android and JavaScript SDKs and is referenced in the Manage calls documentation as the way to detect a Disconnected agent that should be re-created. It is not exposed on the iOS SDK (verified against the public Swift interface and the framework's Obj-C symbol table for 2.18.2). On iOS we have no API to ask "is this CallAgent healthy" before calling join.

What we did not try:

  • agent.unregisterPushNotification — we don't use VoIP push at all, no PKPushRegistry, no CallKit. Including this for completeness.
  • Restarting the app process. We know this would work as a workaround, but it's not acceptable mid-incident for our use case.

Related existing issues:

Asks:

  1. Confirmation of whether this is a known wedge in 2.18.2.
  2. A documented procedure to fully reset the ACS Calling stack inside a single iOS process so that the second createCallAgent produces a fully functional agent.
  3. CallAgent.connectionStatus (or equivalent) exposed on iOS so apps can detect a wedged agent before calling join and avoid showing the user a 90 s spinner that resolves to a 408.
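Until an API like that exists, an app-side stopgap is a watchdog that gives up long before the 90 s server timeout. The following is a minimal sketch, not a fix for the wedge itself: `ConnectingWatchdog` is a hypothetical class, not part of the ACS SDK, and the app would call its two methods from whatever call-state callback it already observes. The 15 s threshold is an arbitrary choice; in the logs above, User A took under 4 s to go from Connecting to Connected.

```swift
import Foundation

/// Hypothetical app-side watchdog; not part of the ACS SDK.
/// Fires `onStall` if the call has not left "connecting" within `timeout`.
/// Call both methods from the same thread or queue.
final class ConnectingWatchdog {
    private let timeout: DispatchTimeInterval
    private let queue: DispatchQueue
    private let onStall: () -> Void
    private var pending: DispatchWorkItem?

    init(timeout: DispatchTimeInterval,
         queue: DispatchQueue = .main,
         onStall: @escaping () -> Void) {
        self.timeout = timeout
        self.queue = queue
        self.onStall = onStall
    }

    /// Call when the call enters the connecting state; restarts the timer.
    func didEnterConnecting() {
        pending?.cancel()
        let item = DispatchWorkItem { [weak self] in self?.onStall() }
        pending = item
        queue.asyncAfter(deadline: .now() + timeout, execute: item)
    }

    /// Call on any other state change (connected, disconnected, and so on).
    func didLeaveConnecting() {
        pending?.cancel()
        pending = nil
    }
}

// Usage sketch: surface an error early instead of a 90 s spinner.
let watchdog = ConnectingWatchdog(timeout: .seconds(15)) {
    print("Call stuck in Connecting; surface an error instead of waiting for the 408")
}
```

This only shortens the wait the user sees. It cannot tell a wedged agent from a slow network, which is why a real connectionStatus signal would still be preferable.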

Happy to share full client logs and a .blog capture privately if useful.

Information Checklist

  • Bug Description Added
  • Repro Steps Added
  • Setup information Added

Labels: Communication, customer-reported, needs-triage, question
