Skip to content

ENG: Fix bad state loops in OSDP#45

Closed
kybishop wants to merge 1 commit into
mainfrom
ENG/fix-bad-state-loop
Closed

ENG: Fix bad state loops in OSDP#45
kybishop wants to merge 1 commit into
mainfrom
ENG/fix-bad-state-loop

Conversation

@kybishop
Copy link
Copy Markdown

@kybishop kybishop commented Apr 21, 2026

Background

Investigated a field incident where a reader LED went green after an access event and stayed stuck for ~34 minutes before recovering. During that window the reader was unresponsive to keypad input, and on recovery a backlog of buffered keypresses was delivered all at once.

Root cause analysis identified two separate bugs in Jeff. Most likely trigger was EM interference from the strike causing garbled bits.

Bug 1: crash on non-zero reserved bits in OSDP control byte

The OSDP spec defines the upper 4 bits of the control info byte as reserved (should be 0), but some peripheral firmware sends non-zero values there. One specific case: a peripheral was sending 0x53 — which happens to be the OSDP SOM byte — as its control byte.

ControlInfo.decode/1 was matching <<0::size(4), ...>>, which raised a MatchError on any byte with non-zero reserved bits, crashing the ACU GenServer.

Fix: Changed the match to _::size(4) to ignore reserved bits per spec. Added a try/rescue in ACU.handle_recv/2 (extracted into do_handle_recv/2) so that any future malformed frame logs a warning and returns {:error, :bad_frame} to the caller rather than crashing the process.

Bug 2: partial frame persisted indefinitely across poll timeouts

When a poll times out, Circuits.UART.read/2 returns {:ok, <<>>} and leaves the framing state unchanged. Because Jeff never sets rx_framing_timeout (defaulting to 0 = wait forever), the frame_timeout callback that would reset framing state is never invoked.

If a peripheral sends a garbled reply that produces a partial frame — for example, the SOM byte arriving without the rest of the message — that partial frame sits in the framing buffer indefinitely. Every subsequent poll times out (the framing module waits for bytes that complete the partial frame, not a new SOM), the sequence number never advances, and from the peripheral's perspective its replies are never acknowledged. It holds and retransmits unacknowledged replies, which explains the keypress backlog on recovery.

In the field incident, the ~34-minute window corresponded exactly to the time between the crash that created the bad state and the next crash that incidentally reset it.

Fix: Call UART.flush(uart, :receive) on every poll timeout. The framing module's flush(:receive, state) callback returns a clean %State{}, discarding any partial buffer. This ensures a corrupted framing state can never outlive a single poll cycle.

Tests

  • control_info_test.exs: reserved bits in control byte are ignored (including the 0x53 case)
  • message_test.exs: end-to-end Message.decode/1 tolerates non-zero reserved bits
  • framing_test.exs: flush(:receive, ...) discards partial buffer state

Copilot AI review requested due to automatic review settings April 21, 2026 21:15
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses two OSDP failure modes that can leave the ACU/peripheral pair stuck: (1) crashes from non-zero reserved bits in the control byte, and (2) indefinitely persisted partial framing state across poll timeouts.

Changes:

  • Make ControlInfo.decode/1 ignore the reserved upper nibble per OSDP spec.
  • Add defensive handling in ACU to avoid crashing on malformed frames (log + return error instead).
  • Flush UART receive/framing state on poll timeouts to prevent partial-frame poisoning across cycles.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
lib/jeff/control_info.ex Ignores reserved bits when decoding the control info byte.
lib/jeff/transport.ex Flushes UART receive state on read timeouts to reset partial framing.
lib/jeff/acu.ex Wraps decode/recv handling to prevent GenServer crashes on malformed frames.
test/control_info_test.exs Adds coverage for reserved-bit tolerance in control byte decoding.
test/message_test.exs Adds end-to-end decode coverage for reserved-bit tolerance.
test/framing_test.exs Adds coverage ensuring flush(:receive, ...) clears partial framing state.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread lib/jeff/acu.ex Outdated
Comment thread test/message_test.exs Outdated
Comment thread test/framing_test.exs
Comment thread lib/jeff/transport.ex
Comment thread lib/jeff/acu.ex Outdated
@kybishop kybishop force-pushed the ENG/fix-bad-state-loop branch 2 times, most recently from af559b4 to 268a84f Compare April 21, 2026 21:41
Copilot AI review requested due to automatic review settings April 21, 2026 21:41
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread lib/jeff/transport.ex
def handle_call({:recv, timeout}, _, %{uart: uart} = s) do
case UART.read(uart, timeout) do
{:ok, <<>>} ->
# Flush the receive buffer so any partial frame bytes accumulated during
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option would be to use the rx_framing_timeout option (see Circuits.UART.open and Circuits.UART.Framing.frame_timeout).

That would cause Circuits.UART.read to return {:ok, {:partial, buffer}} after a framing timeout, and you wouldn't have to flush manually.

I don't think it would make much of a difference here though, so this is purely just FYI.

@kybishop kybishop force-pushed the ENG/fix-bad-state-loop branch 2 times, most recently from 29a5ee4 to 6d5083d Compare April 21, 2026 22:06
Copilot AI review requested due to automatic review settings April 21, 2026 22:06
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread test/message_test.exs
Comment thread lib/jeff/framing.ex Outdated
Comment thread lib/jeff/acu.ex
@kybishop kybishop force-pushed the ENG/fix-bad-state-loop branch from 6d5083d to 6b35bed Compare April 21, 2026 22:16
@kybishop
Copy link
Copy Markdown
Author

@kybishop kybishop closed this Apr 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants