Skip to content

fix(python): skip stale protocol v1 responses#6860

Merged
romanz merged 1 commit into
mainfrom
romanz/2604/stale-v1
Jun 12, 2026
Merged

fix(python): skip stale protocol v1 responses#6860
romanz merged 1 commit into
mainfrom
romanz/2604/stale-v1

Conversation

@romanz

@romanz romanz commented Apr 30, 2026

Copy link
Copy Markdown
Contributor

Fixes #6859.

  • add a device test for HW CI
  • update UI fixtures

Was implemented in 49c9ad0.

Note to QA:

See #6859 (comment) for reproduction.

@trezor-bot trezor-bot Bot added this to Firmware Apr 30, 2026
@github-project-automation github-project-automation Bot moved this to 🔎 Needs review in Firmware Apr 30, 2026
@romanz romanz marked this pull request as ready for review April 30, 2026 16:51
@romanz romanz requested a review from matejcik as a code owner April 30, 2026 16:51
@romanz romanz requested review from Thalarion and mmilata April 30, 2026 16:52
@github-actions

github-actions Bot commented Apr 30, 2026

Copy link
Copy Markdown

en main(all)

model device_test click_test persistence_test
T2T1 test(all) main(all) test(all) main(all) test(all) main(all)
T3B1 test(all) main(all) test(all) main(all) test(all) main(all)
T3T1 test(all) main(all) test(all) main(all) test(all) main(all)
T3W1 test(all) main(all) test(all) main(all) test(all) main(all)
Translations

cs main(all)

model device_test click_test
T2T1 test(all) main(all) test(all) main(all)
T3B1 test(all) main(all) test(all) main(all)
T3T1 test(all) main(all) test(all) main(all)
T3W1 test(all) main(all) test(all) main(all)

de main(all)

model device_test click_test
T2T1 test(all) main(all) test(all) main(all)
T3B1 test(all) main(all) test(all) main(all)
T3T1 test(all) main(all) test(all) main(all)
T3W1 test(all) main(all) test(all) main(all)

es main(all)

model device_test click_test
T2T1 test(all) main(all) test(all) main(all)
T3B1 test(all) main(all) test(all) main(all)
T3T1 test(all) main(all) test(all) main(all)
T3W1 test(all) main(all) test(all) main(all)

fr main(all)

model device_test click_test
T2T1 test(all) main(all) test(all) main(all)
T3B1 test(all) main(all) test(all) main(all)
T3T1 test(all) main(all) test(all) main(all)
T3W1 test(all) main(all) test(all) main(all)

pt main(all)

model device_test click_test
T2T1 test(all) main(all) test(all) main(all)
T3B1 test(all) main(all) test(all) main(all)
T3T1 test(all) main(all) test(all) main(all)
T3W1 test(all) main(all) test(all) main(all)

Latest CI run: 27411049669

@coderabbitai

coderabbitai Bot commented Apr 30, 2026

Copy link
Copy Markdown

Walkthrough

This pull request prevents protocol-v1 desynchronization by making probe call sync_responses after a successful probe result to drain any queued/stale responses. sync_responses gained a new _cancel: bool = True parameter; callers can set _cancel=False (as probe does) to avoid emitting a messages.Cancel before draining. A changelog note was added instructing the system to skip stale protocol v1 responses.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant ProtocolV1
    participant Device

    Client->>ProtocolV1: start probe()
    ProtocolV1->>Device: send Initialize
    Device-->>ProtocolV1: Features / Failure

    alt Received Features
        ProtocolV1->>ProtocolV1: probe success
        ProtocolV1->>ProtocolV1: sync_responses(_cancel=false)
        ProtocolV1->>Device: (no Cancel sent) read/drain queued responses
        Device-->>ProtocolV1: stale responses (drained)
    else Received Failure
        ProtocolV1->>Client: return Failure (abort)
    end
Loading

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Description check ❓ Inconclusive The PR description is minimal and lacks key information required by the template such as development status, project assignment, and post-merge QA details. Add sections indicating: draft/final status (In Progress or Needs Review), project assignment details, sprint target, and whether QA testing is needed with instructions if testable.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main fix: skipping stale protocol v1 responses to prevent client desynchronization.
Linked Issues check ✅ Passed The code changes address the core requirement from issue #6859 by modifying protocol_v1.py to skip stale responses during probe and sync operations, directly fixing the desync problem described.
Out of Scope Changes check ✅ Passed All changes are focused on fixing the stale protocol v1 response issue: the changelog entry and modifications to sync_responses/probe logic are directly related to the linked issue.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch romanz/2604/stale-v1

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@python/src/trezorlib/protocol_v1.py`:
- Around line 362-363: The call to sync_responses(transport, _cancel=False)
inside probe() is not passing the probe's mapping argument and therefore falls
back to mapping.DEFAULT_MAPPING; update the call to forward the caller's mapping
(use the probe parameter named mapping) into sync_responses so it uses the same
protobuf map for encoding/decoding the Cancel/Ping pair; ensure you reference
the probe(mapping=...) parameter and pass mapping=mapping to sync_responses.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a3d3633e-323b-43d0-9152-988eef2510d9

📥 Commits

Reviewing files that changed from the base of the PR and between 89f3301 and dada916.

📒 Files selected for processing (2)
  • python/.changelog.d/6859.fixed
  • python/src/trezorlib/protocol_v1.py

Comment thread python/src/trezorlib/protocol_v1.py Outdated
@romanz romanz force-pushed the romanz/2604/stale-v1 branch from dada916 to c7e6e10 Compare April 30, 2026 17:10

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
python/src/trezorlib/protocol_v1.py (1)

348-363: 🏗️ Heavy lift

Please add a hardware regression test for the canceled-ping desync path.

Given this fix targets a transport/session desync edge case, a HW CI device test covering “interrupt ping → next Initialize returns Features (not stale Failure(ActionCancelled))” would significantly reduce regression risk.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/src/trezorlib/protocol_v1.py` around lines 348 - 363, Add a hardware
regression test that reproduces the canceled-ping desync path: from a real
device transport simulate an "interrupt ping" by writing messages.Cancel (using
the same ProtobufMapping encoding as probe) and consuming any in-flight
responses, then immediately call the Initialize flow and assert that the
Initialize response decodes to messages.Features (not a stale messages.Failure
with code messages.FailureType.ActionCancelled). Target the same primitives used
in probe/sync_responses (probe, sync_responses, write, read, mapping) so the
test explicitly exercises the Cancel → read/encode/decode sequence and verifies
the next Initialize returns Features, not Failure(ActionCancelled).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@python/src/trezorlib/protocol_v1.py`:
- Around line 348-363: Add a hardware regression test that reproduces the
canceled-ping desync path: from a real device transport simulate an "interrupt
ping" by writing messages.Cancel (using the same ProtobufMapping encoding as
probe) and consuming any in-flight responses, then immediately call the
Initialize flow and assert that the Initialize response decodes to
messages.Features (not a stale messages.Failure with code
messages.FailureType.ActionCancelled). Target the same primitives used in
probe/sync_responses (probe, sync_responses, write, read, mapping) so the test
explicitly exercises the Cancel → read/encode/decode sequence and verifies the
next Initialize returns Features, not Failure(ActionCancelled).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 75a1b616-cd89-4fe8-94f9-fa3b7fcf0086

📥 Commits

Reviewing files that changed from the base of the PR and between dada916 and c7e6e10.

📒 Files selected for processing (2)
  • python/.changelog.d/6859.fixed
  • python/src/trezorlib/protocol_v1.py
✅ Files skipped from review due to trivial changes (1)
  • python/.changelog.d/6859.fixed

@romanz romanz marked this pull request as draft May 4, 2026 07:51
@romanz

romanz commented May 4, 2026

Copy link
Copy Markdown
Contributor Author

Marking as draft until device test is added.

@romanz romanz force-pushed the romanz/2604/stale-v1 branch 2 times, most recently from ad1f595 to 67c688d Compare June 3, 2026 17:20
@romanz romanz marked this pull request as ready for review June 3, 2026 17:21
@romanz romanz requested a review from obrusvit as a code owner June 3, 2026 17:21
@romanz

romanz commented Jun 3, 2026

Copy link
Copy Markdown
Contributor Author

This PR also resolves the de-synchronization scenario from #6651 (comment).

@romanz romanz added the translations Put this label on a PR to run tests in all languages label Jun 3, 2026
@romanz romanz marked this pull request as draft June 3, 2026 19:46
@romanz romanz force-pushed the romanz/2604/stale-v1 branch from 67c688d to 7807f2a Compare June 3, 2026 20:08
@romanz romanz marked this pull request as ready for review June 3, 2026 21:37
@romanz romanz force-pushed the romanz/2604/stale-v1 branch from 7807f2a to 58636d4 Compare June 8, 2026 12:33
@romanz romanz requested a review from M1nd3r June 8, 2026 12:47
@romanz

romanz commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

Without the fix, running test_desync_v1 on T3T1 fails with:

_____________________________________________________________________________ test_desync_v1 ______________________________________________________________________________

client = <trezorlib.debuglink.TrezorTestContext object at 0x75b382752660>

    @pytest.mark.protocol("v1")
    def test_desync_v1(client: Client):
        with client.get_session(passphrase=None) as session:
            resp = session.call_raw(messages.Ping(message="test", button_protection=True))
            messages.ButtonRequest.ensure_isinstance(resp)
            session.write(messages.ButtonAck())
            session.debug.press_no()
            # don't read the response - simulating host disconnection
    
        # Creating a new client fails without skipping stale responses
        # (see https://github.com/trezor/trezor-firmware/issues/6859)
>       get_client(client.app, client.transport).ping("reconnect")

tests/device_tests/test_basic.py:89: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
python/src/trezorlib/client.py:492: in ping
    resp = session.call(
python/src/trezorlib/client.py:115: in call
    return self.client._call(self, msg, expect=expect, timeout=timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
python/src/trezorlib/client.py:508: in _call
    resp = session.call_raw(msg, timeout=timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
python/src/trezorlib/client.py:123: in call_raw
    return self.client._call_raw(self, msg, timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
python/src/trezorlib/client.py:482: in _call_raw
    self._write(session, msg)
python/src/trezorlib/protocol_v1.py:316: in _write
    self._activate(session)
python/src/trezorlib/protocol_v1.py:285: in _activate
    session.initialize()
python/src/trezorlib/protocol_v1.py:137: in initialize
    features = messages.Features.ensure_isinstance(resp)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cls = <class 'trezorlib.messages.Features'>, msg = <Failure: {'code': <FailureType.ActionCancelled: 4>, 'message': 'Cancelled'}>

    @classmethod
    def ensure_isinstance(cls, msg: t.Any) -> tx.Self:
        """Ensure that the received `msg` is an instance of this class.
    
        If `msg` is not an instance of this class, raise an `UnexpectedMessageError`.
        otherwise, return it. This is useful for type-checking like so:
    
        >>> msg = client.call(SomeMessage())
        >>> if isinstance(msg, Foo):
        >>>     return msg.foo_attr  # attribute of Foo, type-checks OK
        >>> else:
        >>>     msg = Bar.ensure_isinstance(msg)  # raises if msg is something else
        >>>     return msg.bar_attr  # attribute of Bar, type-checks OK
    
        If there is just one expected message, you should use the `expect` parameter of
        `Client.call` instead.
        """
        if not isinstance(msg, cls):
>           raise UnexpectedMessageError(cls, msg)
E           trezorlib.exceptions.UnexpectedMessageError: Expected Features but Trezor sent <Failure: {'code': <FailureType.ActionCancelled: 4>, 'message': 'Cancelled'}>

python/src/trezorlib/protobuf.py:334: UnexpectedMessageError

With the fix, the test passes with:

[2026-06-08 14:51:36,276] trezorlib.protocol_v1 DEBUG: stale response: (17, b'\n\ttrezor.io\x10\x02\x18\x0c \x012\x18A7E2B2EFDB230D444169CDD38\x00@\x00J\x05en-USR\x04test`\x01j\x14\xbau\x8cR\r\xf2yV\x15\x01fA\xed\xab\x02\x9f^\xae\xa1\xaa\x80\x01\x01\x98\x01\x00\xa0\x01\x00\xaa\x01\x06Safe 5\xca\x01\x13UNSAFE, DO NOT USE!\xd8\x01\x00\xe0\x01\x00\xe8\x01\x00\xf0\x01\x01\xf0\x01\x02\xf0\x01\x03\xf0\x01\x04\xf0\x01\x05\xf0\x01\x07\xf0\x01\t\xf0\x01\x0b\xf0\x01\x0c\xf0\x01\r\xf0\x01\x18\xf0\x01\x0e\xf0\x01\x0f\xf0\x01\x10\xf0\x01\x11\xf0\x01\x12\xf0\x01\x13\xf0\x01\x15\xf0\x01\x14\xf8\x01\x00\x80\x02\x00\x88\x02\x00\x90\x02\x00\x9a\x02 \xb0P\xb9\xec\xce\x9c\x1a\x15|\xebZ*)\xc9:E\xb9\xa2\xfeoF\x83\x94\x04yw\xb2\xd7\x13:\xd2\xa9\xa0\x02\x00\xa8\x02\x00\xb0\x02\xc0\xcf$\xb8\x02\x00\xc0\x02\x00\xc8\x02\x00\xd0\x02\x02\xd8\x02\x00\xe2\x02\x04T3T1\xe8\x02\x02\xf0\x02\x00\xf8\x02\xf0\x01\x80\x03\xf0\x01\x88\x03\x00\x90\x03\x01\x98\x03\x01\xa0\x03\x01\xe8\x03\x00')

@mmilata mmilata left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Starting to look like sync_responses a bit. Should we limit the number of retries too?🤔

@romanz

romanz commented Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

Starting to look like sync_responses a bit.

Indeed, but here it's used to make sure the next messages.Cancel() write wouldn't deadlock (in case the device is also blocked sending a stale response) on USB.

In THP, the deadlock shouldn't happen since the host also is responsive for incoming messages during writes:

# wait and return after receiving an ACK, or raise in case of an unexpected message / retransmission timeout.
await race(_wait_for_ack(), _write_loop())

Should we limit the number of retries too?🤔

Sounds good, added in d028d42.

@romanz

romanz commented Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

Squashing and rebasing over main.

@romanz romanz force-pushed the romanz/2604/stale-v1 branch from d028d42 to 1edb54a Compare June 12, 2026 10:49
@romanz romanz merged commit 1edb54a into main Jun 12, 2026
170 checks passed
@romanz romanz deleted the romanz/2604/stale-v1 branch June 12, 2026 11:43
@trezor-bot trezor-bot Bot moved this from 🔎 Needs review to 🤝 Needs QA in Firmware Jun 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

translations Put this label on a PR to run tests in all languages

Projects

Status: 🤝 Needs QA

Development

Successfully merging this pull request may close these issues.

Desync after trezorctl ping cancellation

2 participants