bluetooth-fw/nimble: fix discovery stop race causing KernelBG hang #1274

Open
gmarull wants to merge 1 commit into coredevices:main from teslabs:analyze-FIRM-1895
gmarull (Member) commented May 12, 2026

`bt_driver_gatt_stop_discovery()` could block forever on `xSemaphoreTake(s_discovery_stopped, portMAX_DELAY)` when a discovery completed naturally between the in-progress check and the flag set. KernelBG would then miss its watchdog and the watch reset into PRF.

Race sequence:

  1. KernelBG enters stop, reads `s_discovery_in_progress = true`.
  2. NimBLE host task fires the last discovery callback. Its top check sees `s_stop_discovery_requested = false`, so it takes the natural completion path, sets `s_discovery_in_progress = false`, and returns without giving the semaphore.
  3. KernelBG resumes, sets `s_stop_discovery_requested = true`, and blocks on the semaphore forever.
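
The interleaving above can be replayed deterministically. The following is a minimal single-threaded sketch, not the driver code: the three state variables reuse the names from the description, the semaphore is modelled as a plain `bool`, and the `sim_*` functions (including the shape of the pre-fix stop path in the callback) are hypothetical stand-ins introduced only for illustration.

```c
#include <stdbool.h>

/* Simplified stand-ins for the driver state named in the description. */
static bool s_discovery_in_progress;
static bool s_stop_discovery_requested;
static bool s_discovery_stopped; /* models a binary semaphore: true = given */

/* Hypothetical model of the pre-fix final discovery callback. */
static void sim_old_final_callback(void) {
  if (s_stop_discovery_requested) {
    /* Assumed stop path: acknowledge the pending stop request. */
    s_discovery_in_progress = false;
    s_discovery_stopped = true;
    return;
  }
  /* Natural completion path: the semaphore is never given. */
  s_discovery_in_progress = false;
}

/* Replays steps 1-3 in order; returns true if stop would block forever. */
static bool sim_old_stop_races(void) {
  s_discovery_in_progress = true;
  s_stop_discovery_requested = false;
  s_discovery_stopped = false;

  bool saw_in_progress = s_discovery_in_progress; /* step 1: KernelBG check */
  sim_old_final_callback();                       /* step 2: host task runs */
  if (!saw_in_progress) {
    return false; /* nothing to wait for */
  }
  s_stop_discovery_requested = true;              /* step 3: flag set too late */
  /* A take with portMAX_DELAY would never return if nothing was given. */
  return !s_discovery_stopped;
}
```

In this model `sim_old_stop_races()` reports a hang because the flag is written only after the callback has already chosen the natural completion path.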

Fix by setting the stop flag before checking in-progress and by giving the semaphore from every discovery termination point (new `prv_signal_discovery_done()` helper) whenever a stop is pending. Drain any stale signal at the start of stop. Also reset the stop flag and set in-progress before issuing `ble_gattc_disc_all_svcs()` in `bt_driver_gatt_start_discovery_range()` so callbacks on the NimBLE host task cannot observe stale flags and silently abort the new discovery, and free the context on failure.
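
The reordered stop path can be checked with the same kind of single-threaded model. This is a sketch under stated assumptions rather than the actual patch: `prv_signal_discovery_done()` is named in the description but its body here is a guess, the semaphore is again a `bool`, and `sim_final_callback()` and `sim_fixed_stop_hangs()` are hypothetical helpers. The `callback_slot` parameter chooses where the final callback lands relative to the stop sequence.

```c
#include <stdbool.h>

static bool s_discovery_in_progress;
static bool s_stop_discovery_requested;
static bool s_discovery_stopped; /* models a binary semaphore: true = given */

/* Helper name taken from the description; the body is an assumption. */
static void prv_signal_discovery_done(void) {
  if (s_stop_discovery_requested) {
    s_discovery_stopped = true; /* would be a semaphore give in the driver */
  }
}

/* Hypothetical model of any callback that terminates discovery. */
static void sim_final_callback(void) {
  s_discovery_in_progress = false; /* clear first ... */
  prv_signal_discovery_done();     /* ... then signal if a stop is pending */
}

/*
 * Runs the fixed stop sequence with the final callback at one of three
 * points: 0 = before stop starts, 1 = after the flag is set,
 * 2 = after the in-progress check. Returns true if stop would hang.
 */
static bool sim_fixed_stop_hangs(int callback_slot) {
  s_discovery_in_progress = true;
  s_stop_discovery_requested = false;
  s_discovery_stopped = false;

  if (callback_slot == 0) sim_final_callback();
  s_discovery_stopped = false;        /* drain any stale signal */
  s_stop_discovery_requested = true;  /* flag set BEFORE the check */
  if (callback_slot == 1) sim_final_callback();
  bool must_wait = s_discovery_in_progress;
  if (callback_slot == 2) sim_final_callback();

  bool hangs = must_wait && !s_discovery_stopped;
  s_stop_discovery_requested = false;
  return hangs;
}
```

In slot 1 the callback gives the semaphore even though stop ends up not waiting on it, which is one way a stale signal can arise and why the model drains at the start of stop.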

Fixes FIRM-1895

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Gerard Marull-Paretas <gerard@teslabs.com>
gmarull requested a review from sjp4 May 12, 2026 16:13
gmarull requested a review from jplexer as a code owner May 12, 2026 16:13
gmarull (Member, Author) commented May 12, 2026

@sjp4 pls try

sjp4 (Member) commented May 12, 2026

> @sjp4 pls try

I tried and got some more crashes (see https://linear.app/core-dev/issue/MOB-6961/crashlooped-to-prf). I think it was on that version, but I'm not 100% sure.
