ENG: Fix bad state loops in OSDP#45
Conversation
There was a problem hiding this comment.
Pull request overview
This PR addresses two OSDP failure modes that can leave the ACU/peripheral pair stuck: (1) crashes from non-zero reserved bits in the control byte, and (2) indefinitely persisted partial framing state across poll timeouts.
Changes:
- Make
ControlInfo.decode/1ignore the reserved upper nibble per OSDP spec. - Add defensive handling in
ACUto avoid crashing on malformed frames (log + return error instead). - Flush UART receive/framing state on poll timeouts to prevent partial-frame poisoning across cycles.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| lib/jeff/control_info.ex | Ignores reserved bits when decoding the control info byte. |
| lib/jeff/transport.ex | Flushes UART receive state on read timeouts to reset partial framing. |
| lib/jeff/acu.ex | Wraps decode/recv handling to prevent GenServer crashes on malformed frames. |
| test/control_info_test.exs | Adds coverage for reserved-bit tolerance in control byte decoding. |
| test/message_test.exs | Adds end-to-end decode coverage for reserved-bit tolerance. |
| test/framing_test.exs | Adds coverage ensuring flush(:receive, ...) clears partial framing state. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
af559b4 to
268a84f
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def handle_call({:recv, timeout}, _, %{uart: uart} = s) do | ||
| case UART.read(uart, timeout) do | ||
| {:ok, <<>>} -> | ||
| # Flush the receive buffer so any partial frame bytes accumulated during |
There was a problem hiding this comment.
Another option would be to use the rx_framing_timeout option (see Circuits.UART.open and Circuits.UART.Framing.frame_timeout).
That would cause Circuits.UART.read to return {:ok, {:partial, buffer}} after a framing timeout, and you wouldn't have to flush manually.
I don't think it would make much of a difference here though, so this is purely just FYI.
29a5ee4 to
6d5083d
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
6d5083d to
6b35bed
Compare
|
Superseded by https://github.com/smartrent/cloud_bridge/pull/1204 |
Background
Investigated a field incident where a reader LED went green after an access event and stayed stuck for ~34 minutes before recovering. During that window the reader was unresponsive to keypad input, and on recovery a backlog of buffered keypresses was delivered all at once.
Root cause analysis identified two separate bugs in Jeff. Most likely trigger was EM interference from the strike causing garbled bits.
Bug 1: crash on non-zero reserved bits in OSDP control byte
The OSDP spec defines the upper 4 bits of the control info byte as reserved (should be 0), but some peripheral firmware sends non-zero values there. One specific case: a peripheral was sending
0x53— which happens to be the OSDP SOM byte — as its control byte.ControlInfo.decode/1was matching<<0::size(4), ...>>, which raised aMatchErroron any byte with non-zero reserved bits, crashing the ACU GenServer.Fix: Changed the match to
_::size(4)to ignore reserved bits per spec. Added a try/rescue inACU.handle_recv/2(extracted intodo_handle_recv/2) so that any future malformed frame logs a warning and returns{:error, :bad_frame}to the caller rather than crashing the process.Bug 2: partial frame persisted indefinitely across poll timeouts
When a poll times out,
Circuits.UART.read/2returns{:ok, <<>>}and leaves the framing state unchanged. Because Jeff never setsrx_framing_timeout(defaulting to 0 = wait forever), theframe_timeoutcallback that would reset framing state is never invoked.If a peripheral sends a garbled reply that produces a partial frame — for example, the SOM byte arriving without the rest of the message — that partial frame sits in the framing buffer indefinitely. Every subsequent poll times out (the framing module waits for bytes that complete the partial frame, not a new SOM), the sequence number never advances, and from the peripheral's perspective its replies are never acknowledged. It holds and retransmits unacknowledged replies, which explains the keypress backlog on recovery.
In the field incident, the ~34-minute window corresponded exactly to the time between the crash that created the bad state and the next crash that incidentally reset it.
Fix: Call
UART.flush(uart, :receive)on every poll timeout. The framing module'sflush(:receive, state)callback returns a clean%State{}, discarding any partial buffer. This ensures a corrupted framing state can never outlive a single poll cycle.Tests
control_info_test.exs: reserved bits in control byte are ignored (including the0x53case)message_test.exs: end-to-endMessage.decode/1tolerates non-zero reserved bitsframing_test.exs:flush(:receive, ...)discards partial buffer state