fix: resolve race condition in VMQ device deletion acknowledgment #119
davidebriani wants to merge 1 commit into astarte-platform:release-1.3
Conversation
Annopaolo left a comment:

Good catch! I left some very minor comments.
lib/astarte_vmq_plugin/rpc/server.ex
Outdated
```elixir
case Plugin.disconnect_client(client_id, true) do
  :ok ->
    :ok

  # Client wasn't connected, that's fine
  {:error, :not_found} ->
    :ok
end
```
Since we're `:ok`ing it anyway, we can just remove the case:
Suggested change:

```elixir
Plugin.disconnect_client(client_id, true)
```
.tool-versions
Outdated
```diff
@@ -1,2 +1,2 @@
 elixir 1.15.7-otp-26
-erlang 26.1
+erlang 26.1.2
```
How is this related to the fix?
Oops, nice catch!
CHANGELOG.md
Outdated
```markdown
- Race condition in VMQ device deletion acknowledgment that could lead to deletion stalling
  permanently.
```
We're already in Astarte VerneMQ, no need to repeat it.
Suggested change:

```markdown
- Corner case in device deletion acknowledgment that could lead to deletion stalling permanently.
```
Force-pushed 2556cf5 to a9f9c56.
Codecov Report

❌ Patch coverage is

```
@@            Coverage Diff            @@
##          release-1.3     #119   +/- ##
=========================================
- Coverage       79.53%   78.75%   -0.78%
=========================================
  Files              13       13
  Lines             303      306       +3
=========================================
  Hits              241      241
- Misses             62       65       +3
```
Force-pushed a9f9c56 to 369dd9b.
lib/astarte_vmq_plugin.ex
Outdated
```elixir
else
  {:error, reason} ->
    {:error, reason}
```
nit: this `else` clause can be dropped, since `with` already passes `{:error, reason}` through unchanged.
If the intent is to make the spec clear, I suggest adding:

```elixir
@spec ack_device_deletion(String.t(), Astarte.Core.Device.encoded_device_id()) ::
        :ok | {:error, term()}
```
Definitely, thanks!
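For context, a minimal sketch of where the suggested spec would sit; the function head is taken from the diff in this PR, while the abridged body is only a placeholder:

```elixir
@spec ack_device_deletion(String.t(), Astarte.Core.Device.encoded_device_id()) ::
        :ok | {:error, term()}
def ack_device_deletion(realm_name, encoded_device_id) do
  # ... decode the device id, ack in the database, then publish /f ...
end
```

Dialyzer will then check callers against the declared `:ok | {:error, term()}` return type.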
```elixir
{:ok, _} <- Queries.ack_device_deletion(realm_name, decoded_device_id) do
  # Only send /f message after successful database write
  timestamp = now_us_x10_timestamp()
  publish_internal_message(realm_name, encoded_device_id, "/f", "", timestamp)
```
Not related to this PR in particular, but since we're optimizing the process already: I may be missing something here about how the deletion mechanism works, but if we're already replying with `:ok` once we've successfully acked the deletion, why do we need to also publish an internal message? Can't DUP just infer the deletion status from the VMQ plugin response? That would make only one ack (the DUP "start" ack) relevant.
You are raising a valid point.
The :ok reply from VerneMQ already confirms that the device was disconnected and the deletion was acknowledged, so DUP could rely on that instead of waiting for the AMQP message.
I think the extra message was originally kept for async consistency and to maintain the same event-driven pattern we used when all interactions went through the broker: I assume this makes it easier for DUP to realize when it has finished consuming any additional messages that were still in the queue while progressing the device deletion.
In theory we can simplify the flow and drop the internal AMQP publish, relying solely on the GenServer reply.
My suggestion is to evaluate whether we need durability and replayability, e.g. if DUP crashes while doing the GenServer call to VerneMQ, and possibly how we wish to revisit the whole device deletion transaction.
Let's bring the conversation outside of this PR.
Device deletion in Astarte involves a distributed coordination mechanism where the VMQ plugin must acknowledge deletion by setting vmq_ack=true in the database and sending a /f message to trigger final cleanup. The current implementation has a critical race condition that can cause device deletions to stall forever.

The root cause is that the ack_device_deletion/2 function executes operations in the wrong order:

1. Sends /f message to AMQP
2. Writes vmq_ack=true to database

This creates a race condition where:

- /f message is published successfully
- Database write fails (timeout, connection error, etc.)
- DUP processes /f message and sets dup_end_ack=true
- vmq_ack remains false, causing all_ack?() to return false
- Device deletion stalls permanently with no retry mechanism

Additionally, the RPC server always returns :ok regardless of actual operation success, masking errors from callers.

The fix involves:

1. Reordering operations to ensure the database write completes before message publication
2. Using proper error handling with early return on database failures
3. Propagating errors correctly through the RPC server to enable caller retry logic

This ensures atomic behavior: either both operations succeed or both fail, eliminating the race condition that caused permanent deletion stalls.

Breaking change: RPC callers will now receive error responses for failed operations instead of false success indicators. However, error cases are already handled in Astarte since v1.2:
https://github.com/astarte-platform/astarte/blob/35c877efeece31a66576982f9fa30c00b4b801ea/apps/astarte_data_updater_plant/lib/astarte_data_updater_plant/data_updater/impl.ex#L2340

Signed-off-by: Davide Briani <davide.briani@secomind.com>
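The reordered flow described in the commit message can be sketched roughly as follows. This is a sketch, not the PR's exact code: `Queries.ack_device_deletion/2` and `publish_internal_message/5` appear in the diff hunks of this PR, while the `with` shape and the `Device.decode_device_id/1` call are assumptions:

```elixir
def ack_device_deletion(realm_name, encoded_device_id) do
  with {:ok, decoded_device_id} <- Device.decode_device_id(encoded_device_id),
       # 1. Persist vmq_ack = true first; any {:error, reason} from the
       #    database short-circuits the with and is returned to the caller
       {:ok, _} <- Queries.ack_device_deletion(realm_name, decoded_device_id) do
    # 2. Only send the /f message after the database write succeeded,
    #    so DUP can never see dup_end_ack=true while vmq_ack is still false
    timestamp = now_us_x10_timestamp()
    publish_internal_message(realm_name, encoded_device_id, "/f", "", timestamp)
  end
end
```

Because the error tuple now propagates to the RPC caller instead of being swallowed as `:ok`, the caller can retry the acknowledgment after a database failure.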
Force-pushed 369dd9b to 075fbf5.