OpenSIPS version you are running
version: opensips 3.6.3 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, F_PARALLEL_MALLOC, DBG_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: d5222226a
main.c compiled on with gcc 12
Description
In an anycast cluster with two OpenSIPS nodes using dialog_replication_cluster and tm_replication_cluster, the dlg_replicated_delete() function does not call run_dlg_callbacks(DLGCB_TERMINATED, ...). This means that when a dialog is terminated on one node and the deletion is replicated to the node that originally created the dialog, modules like pua_dialoginfo never learn about the termination. As a result, PUBLISH with <state>terminated</state> is never sent, leaving stale presentity records in the database and causing BLF (Busy Lamp Field) indicators to remain lit indefinitely after a call ends.
Setup
- Two OpenSIPS 3.6 nodes behind an anycast IP.
- tm_replication_cluster = 1 (anycast TM replication)
- dialog_replication_cluster = 1
- Modules: dialog, presence, pua, pua_dialoginfo, presence_dialoginfo
- pua_dialoginfo with presence_server pointing to the anycast IP
Scenario
- INVITE arrives at Node 1 (via anycast).
create_dialog() and dialoginfo_set("A") are called. The pua_dialoginfo module registers a DLGCB_TERMINATED callback on this dialog. A PUBLISH with <state>confirmed</state> is sent to the local presence module. The pua record (including the ETag) is stored in Node 1's pua hashtable. The dialog is replicated to Node 2.
- Due to asymmetric anycast routing, BYE from the callee (Asterisk) arrives at Node 2.
loose_route() succeeds (dialog was replicated). Node 2 forwards BYE to the caller. The 200 OK from the caller arrives at Node 1 (anycast), which replicates it to Node 2 via tm_replication_cluster.
- Node 2 completes the BYE transaction.
The dialog module fires DLGCB_TERMINATED. pua_dialoginfo calls dialog_publish("terminated", ...) with expires=0. However, the pua module on Node 2 has no matching pua record (it was created on Node 1), so send_publish_int() returns ERR_PUBLISH_NO_RECORD and the PUBLISH is silently dropped:
// modules/pua/send_publish.c, send_publish_int()
if(presentity== NULL)
{
    if(publ->expires== 0)
    {
        LM_DBG("request for a publish with expires 0 and"
                " no record found\n");
        ret = ERR_PUBLISH_NO_RECORD;
        goto error;
    }
- Node 2 replicates the dialog deletion to Node 1.
dlg_replicated_delete() runs on Node 1 — it transitions the dialog state and frees resources, but never calls run_dlg_callbacks(DLGCB_TERMINATED, ...):
// modules/dialog/dlg_replication.c, dlg_replicated_delete()
destroy_linkers(dlg);
remove_dlg_prof_table(dlg, 0);
next_state_dlg(dlg, DLG_EVENT_REQBYE, ...);
// ... remove timer ...
unref_dlg(dlg, 1 + unref); // dialog freed without DLGCB_TERMINATED!
- Result: Node 1 (which has the pua record with the correct ETag) never fires DLGCB_TERMINATED → pua_dialoginfo never sends PUBLISH terminated → the presentity record remains confirmed → BLF stays lit.
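The scenario above can be reduced to a toy model. The following C sketch is not OpenSIPS code: every name in it (toy_node, toy_publish_terminated, toy_local_bye, toy_replicated_delete) is hypothetical and only mirrors the behaviour described in this report, assuming the pua record exists solely on the node that sent the initial PUBLISH.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these names exist in OpenSIPS. */
struct toy_node {
    bool has_dialog;      /* dialog present (created locally or replicated) */
    bool has_pua_record;  /* pua record + ETag, kept only where PUBLISH confirmed was sent */
    int  terminated_sent; /* number of PUBLISH terminated actually sent */
};

/* Models the pua check: an expires=0 PUBLISH is dropped without a local record. */
static bool toy_publish_terminated(struct toy_node *n)
{
    if (!n->has_pua_record)
        return false;            /* analogous to ERR_PUBLISH_NO_RECORD */
    n->has_pua_record = false;
    n->terminated_sent++;
    return true;
}

/* Local BYE processing: the terminated callback runs before the dialog is freed. */
static void toy_local_bye(struct toy_node *n)
{
    (void)toy_publish_terminated(n);
    n->has_dialog = false;
}

/* Replicated delete as described in this report: no callback is run. */
static void toy_replicated_delete(struct toy_node *n)
{
    n->has_dialog = false;
}

/* Runs the scenario; returns true if Node 1 is left with a stale pua record. */
static bool toy_scenario_leaves_stale_record(void)
{
    struct toy_node node1 = { true, true, 0 };   /* created the dialog, sent PUBLISH confirmed */
    struct toy_node node2 = { true, false, 0 };  /* holds only the replicated dialog */

    toy_local_bye(&node2);          /* BYE lands on Node 2 */
    toy_replicated_delete(&node1);  /* deletion is replicated to Node 1 */

    return node1.has_pua_record
        && node1.terminated_sent + node2.terminated_sent == 0;
}
```

In this model toy_scenario_leaves_stale_record() returns true: Node 2 attempts the PUBLISH but has no record, and Node 1 has the record but never attempts the PUBLISH.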
Comparison with local BYE processing
When a dialog ends locally, a termination callback is always run (DLGCB_TERMINATED, or DLGCB_EXPIRED on timeout):
- dlg_handlers.c line 2150: run_dlg_callbacks(DLGCB_TERMINATED, dlg, req, ...)
- dlg_req_within.c line 261: run_dlg_callbacks(DLGCB_TERMINATED, dlg, fake_msg, ...)
- dlg_handlers.c (timeout) line 2555: run_dlg_callbacks(DLGCB_EXPIRED, dlg, fake_msg, ...)
dlg_replicated_delete() is the only termination path that runs neither callback.
Proposed fix
Add run_dlg_callbacks(DLGCB_TERMINATED, ...) to dlg_replicated_delete(), using a fake message and processing context (same pattern as dual_bye_event() in dlg_req_within.c:257-270):
// After remove_dlg_timer() and before unref_dlg():
{
    struct sip_msg *fake_msg;
    context_p old_ctx;
    context_p *new_ctx;
    if (push_new_processing_context(dlg, &old_ctx, &new_ctx,
            &fake_msg) == 0) {
        run_dlg_callbacks(DLGCB_TERMINATED, dlg, fake_msg,
            DLG_DIR_NONE, -1, NULL, 0, 1);
        if (current_processing_ctx == NULL)
            *new_ctx = NULL;
        else
            context_destroy(CONTEXT_GLOBAL, *new_ctx);
        set_global_context(old_ctx);
        release_dummy_sip_msg(fake_msg);
    }
}
This requires adding #include "dlg_req_within.h" to dlg_replication.c.
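One point a patch along these lines may need to consider is whether the callback could fire more than once, for example if a delete is replicated twice. The sketch below is a toy C model, not the dialog module API: the types and functions (toy_dialog, toy_replicated_delete_fixed, toy_count_termination) are hypothetical, and the guard on the state transition is an assumption of this sketch rather than part of the proposed change.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: these types and functions are hypothetical. */
enum toy_state { TOY_CONFIRMED, TOY_DELETED };

typedef void (*toy_cb)(void *param);

struct toy_dialog {
    enum toy_state state;
    toy_cb on_terminated;   /* stands in for a registered DLGCB_TERMINATED callback */
    void *cb_param;
};

/* Example callback: counts how many times termination was reported. */
static void toy_count_termination(void *param)
{
    (*(int *)param)++;
}

/* Replicated delete with the callback added; it fires only on the
 * transition into the deleted state (assumed guard). */
static void toy_replicated_delete_fixed(struct toy_dialog *dlg)
{
    enum toy_state old_state = dlg->state;

    dlg->state = TOY_DELETED;
    if (old_state != TOY_DELETED && dlg->on_terminated != NULL)
        dlg->on_terminated(dlg->cb_param);
}

/* Applies the delete twice and returns how often the callback ran. */
static int toy_count_after_double_delete(void)
{
    int fired = 0;
    struct toy_dialog dlg = { TOY_CONFIRMED, toy_count_termination, &fired };

    toy_replicated_delete_fixed(&dlg);
    toy_replicated_delete_fixed(&dlg);  /* duplicate replication packet */
    return fired;
}
```

With the guard in place the callback runs once in this model, so a single PUBLISH terminated would be sent from the node that holds the pua record.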
Versions affected
- 3.6 branch (confirmed)
- master (confirmed — same code)