fix: resolve race condition in VMQ device deletion acknowledgment #119
davidebriani wants to merge 1 commit into astarte-platform:release-1.3
Conversation
Annopaolo left a comment:

Good catch! I left some very minor comments.
lib/astarte_vmq_plugin/rpc/server.ex
Outdated
```elixir
case Plugin.disconnect_client(client_id, true) do
  :ok ->
    :ok

  # Client wasn't connected, that's fine
  {:error, :not_found} ->
    :ok
end
```
Since we're `:ok`ing it anyway, we can just remove the case:
Suggested change:

```elixir
Plugin.disconnect_client(client_id, true)
```
.tool-versions
Outdated
```diff
@@ -1,2 +1,2 @@
 elixir 1.15.7-otp-26
-erlang 26.1
+erlang 26.1.2
```
How is this related to the fix?
Oops, nice catch!
CHANGELOG.md
Outdated
```markdown
- Race condition in VMQ device deletion acknowledgment that could lead to deletion stalling
  permanently.
```
We're already in Astarte VerneMQ, no need to repeat it.
Suggested change:

```markdown
- Corner case in device deletion acknowledgment that could lead to deletion stalling permanently.
```
Force-pushed 2556cf5 to a9f9c56.
Codecov Report

❌ Patch coverage is

```
@@            Coverage Diff            @@
##          release-1.3     #119   +/- ##
=========================================
- Coverage       79.53%   78.75%   -0.78%
=========================================
  Files              13       13
  Lines             303      306       +3
=========================================
  Hits              241      241
- Misses             62       65       +3
```
Force-pushed a9f9c56 to 369dd9b.
lib/astarte_vmq_plugin.ex
Outdated
```elixir
else
  {:error, reason} ->
    {:error, reason}
```
nit: this `else` clause can be dropped, since `with` already passes `{:error, reason}` through unchanged.
If the intent is to make the spec clear, I suggest adding:

```elixir
@spec ack_device_deletion(String.t(), Astarte.Core.Device.encoded_device_id()) ::
        :ok | {:error, term()}
```
Definitely, thanks!
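For context, a minimal sketch of where the suggested spec would sit; the function head is taken from the diff in this PR, while the abridged body is only a placeholder:

```elixir
@spec ack_device_deletion(String.t(), Astarte.Core.Device.encoded_device_id()) ::
        :ok | {:error, term()}
def ack_device_deletion(realm_name, encoded_device_id) do
  # ... decode the device id, ack in the database, then publish /f ...
end
```

Dialyzer will then check callers against the declared `:ok | {:error, term()}` return type.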
```elixir
{:ok, _} <- Queries.ack_device_deletion(realm_name, decoded_device_id) do
  # Only send /f message after successful database write
  timestamp = now_us_x10_timestamp()
  publish_internal_message(realm_name, encoded_device_id, "/f", "", timestamp)
```
Not related to this PR in particular, but since we're optimizing the process already: I may be missing something here about how the deletion mechanism works, but if we're already replying with `:ok` once we've successfully acked the deletion, why do we need to also publish an internal message? Can't DUP just infer the deletion status from the VMQ plugin response? That would make only one ack (the DUP "start" ack) relevant.
You are raising a valid point.
The :ok reply from VerneMQ already confirms that the device was disconnected and the deletion was acknowledged, so DUP could rely on that instead of waiting for the AMQP message.
I think the extra message was originally kept for async consistency and to maintain the same event-driven pattern we used when all interactions went through the broker: I assume this makes it easier for DUP to realize when it has finished consuming any additional messages that were still in the queue while progressing the device deletion.
In theory we can simplify the flow and drop the internal AMQP publish, relying solely on the GenServer reply.
My suggestion is to evaluate whether we need durability and replayability, e.g. if DUP crashes while doing the GenServer call to VerneMQ, and possibly how we wish to revisit the whole device deletion transaction.
Let's bring the conversation outside of this PR.
Device deletion in Astarte involves a distributed coordination mechanism where the VMQ plugin must acknowledge deletion by setting vmq_ack=true in the database and sending a /f message to trigger final cleanup. The current implementation has a critical race condition that can cause device deletions to stall forever.

The root cause is that the ack_device_deletion/2 function executes operations in the wrong order:

1. Sends /f message to AMQP
2. Writes vmq_ack=true to database

This creates a race condition where:

- /f message is published successfully
- Database write fails (timeout, connection error, etc.)
- DUP processes /f message and sets dup_end_ack=true
- vmq_ack remains false, causing all_ack?() to return false
- Device deletion stalls permanently with no retry mechanism

Additionally, the RPC server always returns :ok regardless of actual operation success, masking errors from callers.

The fix involves:

1. Reordering operations to ensure the database write completes before message publication
2. Using proper error handling with early return on database failures
3. Propagating errors correctly through the RPC server to enable caller retry logic

This ensures atomic behavior: either both operations succeed or both fail, eliminating the race condition that caused permanent deletion stalls.

Breaking change: RPC callers will now receive error responses for failed operations instead of false success indicators. However, error cases are already handled in Astarte since v1.2:
https://github.com/astarte-platform/astarte/blob/35c877efeece31a66576982f9fa30c00b4b801ea/apps/astarte_data_updater_plant/lib/astarte_data_updater_plant/data_updater/impl.ex#L2340

Signed-off-by: Davide Briani <davide.briani@secomind.com>
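The reordered flow described in the commit message can be sketched roughly as follows. This is a sketch, not the PR's exact code: `Queries.ack_device_deletion/2` and `publish_internal_message/5` appear in the diff hunks of this PR, while the `with` shape and the `Device.decode_device_id/1` call are assumptions:

```elixir
def ack_device_deletion(realm_name, encoded_device_id) do
  with {:ok, decoded_device_id} <- Device.decode_device_id(encoded_device_id),
       # 1. Persist vmq_ack = true first; any {:error, reason} from the
       #    database short-circuits the with and is returned to the caller
       {:ok, _} <- Queries.ack_device_deletion(realm_name, decoded_device_id) do
    # 2. Only send the /f message after the database write succeeded,
    #    so DUP can never see dup_end_ack=true while vmq_ack is still false
    timestamp = now_us_x10_timestamp()
    publish_internal_message(realm_name, encoded_device_id, "/f", "", timestamp)
  end
end
```

Because the error tuple now propagates to the RPC caller instead of being swallowed as `:ok`, the caller can retry the acknowledgment after a database failure.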
Force-pushed 369dd9b to 075fbf5.