fix: resolve race condition in VMQ device deletion acknowledgment#119

Open
davidebriani wants to merge 1 commit into astarte-platform:release-1.3 from davidebriani:fix-device-deletion-error-handling

Conversation

@davidebriani
Collaborator

Device deletion in Astarte involves a distributed coordination mechanism where the VMQ plugin must acknowledge deletion by setting vmq_ack=true in the database and sending a /f message to trigger final cleanup. The current implementation has a critical race condition that can cause device deletions to stall forever.

The root cause is that the ack_device_deletion/2 function executes operations in the wrong order:

  1. Sends /f message to AMQP
  2. Writes vmq_ack=true to database

This creates a race condition where:

  • /f message is published successfully
  • Database write fails (timeout, connection error, etc.)
  • DUP processes /f message and sets dup_end_ack=true
  • vmq_ack remains false, causing all_ack?() to return false
  • Device deletion stalls permanently with no retry mechanism

Additionally, the RPC server always returns :ok regardless of actual operation success, masking errors from callers.

The fix involves:

  1. Reordering operations to ensure database write completes before message publication
  2. Using proper error handling with early return on database failures
  3. Propagating errors correctly through RPC server to enable caller retry logic

This ensures atomic behavior: either both operations succeed or both fail, eliminating the race condition that caused permanent deletion stalls.
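The reordered flow can be sketched as follows. This is a minimal sketch, not the actual patch: `DeletionAckSketch`, `decode_device_id`, and the injected `db_ack`/`publish_f` functions are hypothetical stand-ins for the plugin's real `Queries.ack_device_deletion/2` and internal `/f` publish helper, used here so the control flow is testable in isolation.

```elixir
defmodule DeletionAckSketch do
  # Sketch of the corrected ack_device_deletion/2 control flow.
  # `db_ack` and `publish_f` stand in for the real database write and
  # AMQP publication; their actual signatures may differ.
  def ack_device_deletion(realm_name, encoded_device_id, db_ack, publish_f) do
    with {:ok, decoded_device_id} <- decode_device_id(encoded_device_id),
         # 1. Database write first: on failure we return early and never
         #    publish "/f", so DUP cannot set dup_end_ack=true while
         #    vmq_ack is still false.
         {:ok, _} <- db_ack.(realm_name, decoded_device_id) do
      # 2. Only after a successful write do we publish "/f".
      publish_f.(realm_name, encoded_device_id, "/f")
      :ok
    else
      {:error, reason} -> {:error, reason}
    end
  end

  # Hypothetical stub; the plugin uses Astarte.Core.Device decoding.
  defp decode_device_id(encoded) when is_binary(encoded), do: {:ok, encoded}
end
```

With this shape, a failing database write short-circuits the `with` and the publish function is never invoked, which is exactly the atomicity argument above.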

Breaking change: RPC callers will now receive error responses for failed operations instead of false success indicators. However, error cases are already handled by the RPC client in Astarte since v1.2:
https://github.com/astarte-platform/astarte/blob/35c877efeece31a66576982f9fa30c00b4b801ea/apps/astarte_data_updater_plant/lib/astarte_data_updater_plant/data_updater/impl.ex#L2340
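On the RPC-server side, the error propagation described above amounts to forwarding the result instead of discarding it. A minimal sketch (the module and function names are hypothetical; the real server wraps replies in its protobuf reply types):

```elixir
defmodule RpcServerSketch do
  # Instead of always replying :ok, forward whatever the ack function
  # returned, so callers (e.g. DUP) can observe failures and retry.
  # `ack_fun` stands in for the plugin's ack_device_deletion/2.
  def handle_delete_volatile_device(realm_name, encoded_device_id, ack_fun) do
    case ack_fun.(realm_name, encoded_device_id) do
      :ok -> {:ok, :generic_ok_reply}
      {:error, reason} -> {:error, reason}
    end
  end
end
```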

Collaborator

@Annopaolo Annopaolo left a comment


Good catch! I left some very minor comments

Comment on lines +75 to +82
case Plugin.disconnect_client(client_id, true) do
  :ok ->
    :ok

  # Client wasn't connected, that's fine
  {:error, :not_found} ->
    :ok
end
Collaborator


Since we're :oking it anyway, we can just remove the case

Suggested change
case Plugin.disconnect_client(client_id, true) do
  :ok ->
    :ok
  # Client wasn't connected, that's fine
  {:error, :not_found} ->
    :ok
end
Plugin.disconnect_client(client_id, true)

.tool-versions Outdated
@@ -1,2 +1,2 @@
elixir 1.15.7-otp-26
erlang 26.1
erlang 26.1.2
Collaborator


How is this related to the fix?

Collaborator Author


Oops, nice catch

CHANGELOG.md Outdated
Comment on lines +12 to +13
- Race condition in VMQ device deletion acknowledgment that could lead to deletion stalling
permanently.
Collaborator


We're already in Astarte VerneMQ, no need to repeat it

Suggested change
- Race condition in VMQ device deletion acknowledgment that could lead to deletion stalling
permanently.
- Corner case in device deletion acknowledgment that could lead to deletion stalling permanently.

@Annopaolo Annopaolo requested review from a team, mizzet1 and nedimtokic and removed request for a team November 7, 2025 16:04
@davidebriani davidebriani force-pushed the fix-device-deletion-error-handling branch from 2556cf5 to a9f9c56 Compare November 7, 2025 17:00
@codecov

codecov bot commented Nov 7, 2025

Codecov Report

❌ Patch coverage is 0% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.75%. Comparing base (2abcf21) to head (a9f9c56).

Files with missing lines               | Patch % | Lines
lib/astarte_vmq_plugin/rpc/server.ex   | 0.00%   | 5 Missing ⚠️
lib/astarte_vmq_plugin.ex              | 0.00%   | 4 Missing ⚠️
Additional details and impacted files
@@               Coverage Diff               @@
##           release-1.3     #119      +/-   ##
===============================================
- Coverage        79.53%   78.75%   -0.78%     
===============================================
  Files               13       13              
  Lines              303      306       +3     
===============================================
  Hits               241      241              
- Misses              62       65       +3     

☔ View full report in Codecov by Sentry.

@davidebriani davidebriani force-pushed the fix-device-deletion-error-handling branch from a9f9c56 to 369dd9b Compare November 7, 2025 17:04
Comment on lines +192 to +194
else
  {:error, reason} ->
    {:error, reason}
Collaborator


nit

Suggested change
else
  {:error, reason} ->
    {:error, reason}

if the intent is to make the spec clear, I suggest adding

@spec ack_device_deletion(String.t(), Astarte.Core.Device.encoded_device_id()) :: :ok | {:error, term()}

Collaborator Author


Definitely, thanks!

{:ok, _} <- Queries.ack_device_deletion(realm_name, decoded_device_id) do
# Only send /f message after successful database write
timestamp = now_us_x10_timestamp()
publish_internal_message(realm_name, encoded_device_id, "/f", "", timestamp)
Collaborator


not related to this pr in particular, but since we're optimizing the process already:

I may be missing something here with how the deletion mechanism works, but if we're already replying with :ok if we've successfully acked the deletion, why do we need to also publish an internal message? can't dup just infer the deletion status from the vmq plugin response? that would make only one ack (dup "start" ack) relevant

Collaborator Author

@davidebriani davidebriani Nov 10, 2025


You are raising a valid point.

The :ok reply from VerneMQ already confirms that the device was disconnected and the deletion was acknowledged, so DUP could rely on that instead of waiting for the AMQP message.
I think the extra message was originally kept for async consistency and to preserve the event-driven pattern we used when all interactions went through the broker: I assume it makes it easier for DUP to know when it has finished consuming any additional messages that were still queued while progressing the device deletion.

In theory we can simplify the flow and drop the internal AMQP publish, relying solely on the GenServer reply.
My suggestion is to evaluate whether we need durability and replayability, e.g. if DUP crashes while doing the GenServer call to VerneMQ, and possibly how we wish to revisit the whole device deletion transaction.
Let's bring the conversation outside of this PR.

Signed-off-by: Davide Briani <davide.briani@secomind.com>
@davidebriani davidebriani force-pushed the fix-device-deletion-error-handling branch from 369dd9b to 075fbf5 Compare November 10, 2025 08:41
@davidebriani davidebriani requested a review from noaccOS November 10, 2025 08:41
