perf: 6 performance optimizations for STM32WB55 (Cortex-M4 @ 64MHz) #977
Draft
WonderMr wants to merge 12 commits into
1. Compiler: -Og → -Os (release builds)
Replace debug-level optimization with size-optimized -Os, which
enables most -O2 passes: function inlining, dead code elimination,
loop optimization, aggressive register allocation, tail call
optimization, and common subexpression elimination.
File: site_scons/firmwareopts.scons
2. Disable heap memset on free() in release
configHEAP_CLEAR_MEMORY_ON_FREE was always 1, causing FreeRTOS
Heap_4 to memset() every freed block to zero. Useful for catching
use-after-free in debug, but pure waste in release. Now conditional
on FURI_DEBUG. Saves ~500+ memset calls per second during active
GUI/protocol work.
File: targets/f7/inc/FreeRTOSConfig.h
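A minimal sketch of the conditional described in item 2, assuming FURI_DEBUG is a macro that is defined only in debug builds (the exact guard used in FreeRTOSConfig.h may differ). demo_release_block() is a hypothetical stand-in for the allocator's free path, not FreeRTOS code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: FURI_DEBUG is defined only for debug builds. */
#ifdef FURI_DEBUG
#define configHEAP_CLEAR_MEMORY_ON_FREE 1
#else
#define configHEAP_CLEAR_MEMORY_ON_FREE 0
#endif

/* Hypothetical stand-in for the allocator's free path: wipes the payload
 * only when the config macro asks for it. Free-list bookkeeping is omitted. */
static void demo_release_block(unsigned char* payload, size_t size) {
    assert(payload != NULL);
#if configHEAP_CLEAR_MEMORY_ON_FREE
    memset(payload, 0, size);
#else
    (void)size; /* release build: the wipe is skipped */
#endif
}
```

Built without FURI_DEBUG, the payload keeps its old contents after release, which is the saving the commit describes.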
3. SPI TX via DMA instead of busy-wait polling
furi_hal_spi_bus_tx() polled TXE flag byte-by-byte, keeping the
CPU in a tight loop for entire SPI transfers. Now delegates to
furi_hal_spi_bus_trx_dma() which uses DMA2_Channel7 and FreeRTOS
semaphore-based sleep, freeing CPU during display updates (~1KB/
frame @ 20fps) and radio TX operations.
File: targets/f7/furi_hal/furi_hal_spi.c
4. Fix realloc() to copy min(old_size, new_size)
Original realloc() copied `size` (new) bytes from old block via
memcpy, which reads past allocation when growing. Added
memmgr_heap_get_block_size() that reads usable size from Heap_4
BlockLink_t header. Now copies min(old_size, new_size) bytes,
fixing potential UB and reducing unnecessary copying.
Files: furi/core/memmgr.c, furi/core/memmgr_heap.c/.h,
targets/f7/api_symbols.csv
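The copy rule in item 4 can be sketched on a host with a toy allocator. Everything prefixed demo_ is hypothetical: the header union stands in for Heap_4's BlockLink_t, and the standard malloc()/free() stand in for pvPortMalloc()/vPortFree(). Unlike the firmware allocator, malloc() can return NULL, so the sketch keeps a NULL check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-block header; stands in for Heap_4's BlockLink_t.
 * The union keeps the payload aligned for any object type. */
typedef union {
    size_t usable_size;
    max_align_t align;
} demo_block_header_t;

static void* demo_malloc(size_t size) {
    demo_block_header_t* header = malloc(sizeof(*header) + size);
    if(header == NULL) return NULL;
    header->usable_size = size;
    return header + 1; /* payload starts right after the header */
}

static void demo_free(void* ptr) {
    if(ptr != NULL) free((demo_block_header_t*)ptr - 1);
}

/* Reads the usable size from the header that precedes the payload. */
static size_t demo_get_block_size(const void* ptr) {
    assert(ptr != NULL);
    return ((const demo_block_header_t*)ptr - 1)->usable_size;
}

static void* demo_realloc(void* ptr, size_t new_size) {
    if(new_size == 0) {
        demo_free(ptr);
        return NULL;
    }
    void* fresh = demo_malloc(new_size);
    if(fresh == NULL) return NULL; /* the old block stays valid */
    if(ptr != NULL) {
        size_t old_size = demo_get_block_size(ptr);
        /* Copy only what both blocks can hold, never past the old block. */
        memcpy(fresh, ptr, old_size < new_size ? old_size : new_size);
        demo_free(ptr);
    }
    return fresh;
}
```

Growing a 4-byte block to 16 bytes copies 4 bytes; shrinking it to 2 copies 2. The original behavior corresponds to always passing new_size to memcpy().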
5. Fix calloc() to explicitly zero memory
Original calloc() just called pvPortMalloc() without memset,
relying on configHEAP_CLEAR_MEMORY_ON_FREE=1 for zero-initialized
returns. With optimization #2 disabling that in release, calloc()
would return uninitialized memory. Added explicit memset(0).
File: furi/core/memmgr.c
6. Branch prediction hints on furi_check/assert/break
Added __builtin_expect(!(__e), 0) to all assertion macros. Tells
GCC that error path is cold: crash code moves to end of function,
hot path becomes fall-through (0 pipeline penalty on Cortex-M4
3-stage pipeline). Affects ~2300+ call sites across the firmware.
File: furi/core/check.h
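A rough sketch of the hint in item 6, using the GCC/Clang builtin __builtin_expect. The demo_check macro and the recording handler are hypothetical; the real furi_check crashes the firmware, while this handler only counts failures so the sketch can run on a host.

```c
#include <assert.h>
#include <stddef.h>

static unsigned demo_failure_count;
static const char* demo_last_failure;

/* Stand-in for the crash handler. Marked cold and noinline so the compiler
 * keeps it out of the hot path; real firmware would halt here. */
__attribute__((cold, noinline)) static void demo_on_check_failed(const char* expr) {
    assert(expr != NULL); /* a stringified expression is never NULL */
    demo_failure_count++;
    demo_last_failure = expr;
}

/* The failure branch is marked unlikely, so the success path can be laid
 * out as the fall-through. */
#define demo_check(expr)                                             \
    do {                                                             \
        if(__builtin_expect(!(expr), 0)) demo_on_check_failed(#expr); \
    } while(0)

static int demo_checked_div(int a, int b) {
    demo_check(b != 0);
    if(b == 0) return 0; /* only reachable because the demo handler returns */
    return a / b;
}
```

The hint changes code layout only; the check still runs on every call.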
Also: strncpy → strlcpy in subghz_scene_save_name.c (-Os exposed
-Werror=stringop-truncation warning).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
realloc: add NULL check on pvPortMalloc result to prevent crash and preserve original allocation on OOM per C standard. Use memmgr_heap_get_block_size() to copy min(old, new) bytes instead of reading past the old allocation boundary.
SPI DMA: TX-only path now sets up RX DMA channel draining into a dummy byte to prevent OVR accumulation on transfers >4 bytes. Pre-scheduler fallback correctly routes to furi_hal_spi_bus_tx() for TX-only ops.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…MHz)
Ported from WonderMr/unleashed-firmware feat/opus-optimised and feat/cortex-m4-micro-optimizations branches.
1. Compiler: -Og → -Os for release builds (firmwareopts.scons)
Enables -O2-level passes: inlining, dead code elimination, loop optimization, register allocation, tail call optimization, and CSE.
2. Disable heap memset on free() in release (FreeRTOSConfig.h)
configHEAP_CLEAR_MEMORY_ON_FREE now conditional on FURI_DEBUG. Saves ~500+ memset calls/sec during active GUI/protocol work.
3. Fix calloc() to explicitly zero memory (memmgr.c)
With optimization #2 disabling heap-clear in release, calloc() must memset(0) explicitly to guarantee zero-initialized returns.
4. Fix realloc() to copy min(old_size, new_size) bytes (memmgr.c, memmgr_heap.c/h, api_symbols.csv)
Added memmgr_heap_get_block_size() to read usable size from Heap_4 BlockLink_t header. Also added NULL-guard on pvPortMalloc result to preserve original allocation on OOM.
5. Branch prediction hints on furi_check/assert/break (check.h)
Added __builtin_expect(!(__e), 0) to all assertion macros. Crash code moves to end of function, hot path becomes fall-through. Affects ~2300+ call sites across the firmware.
6. SPI TX via DMA with RX drain (furi_hal_spi.c)
furi_hal_spi_bus_tx() now delegates to DMA when scheduler is running, freeing CPU during display updates and radio TX. RX DMA channel drains into dummy byte to prevent OVR accumulation.
7. __attribute__((flatten)) on furi_get_tick() (kernel.c)
Forces inlining of FreeRTOS wrappers at call sites, eliminating function call overhead on this very hot path.
8. __attribute__((flatten)) on hot thread functions (thread.c)
Applied to furi_thread_get_current_id(), furi_thread_get_current(), and furi_thread_flags_get().
9. In-place vprintf for furi_string_cat_vprintf() (string.c)
Formats directly into destination buffer at current offset instead of allocating a temporary FuriString. Eliminates malloc+format+memcpy+free per call.
10. Reduce configEXPECTED_IDLE_TIME_BEFORE_SLEEP 4 → 2 (FreeRTOSConfig.h)
Allows FreeRTOS tickless idle to enter STOP mode more aggressively (2ms threshold instead of 4ms). Reduces average power consumption.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pvPortMalloc() in furi/core/memmgr_heap.c already memsets the returned buffer to zero (xToWipe = xWantedSize, line 467) regardless of configHEAP_CLEAR_MEMORY_ON_FREE. Calling memset() again in calloc() was a no-op. Reported by @WillyJL in #4360. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
furi_hal_spi.c (TX-only DMA path): On timeout the cleanup unconditionally released spi_dma_completed while LL_DMA_DisableIT_TC was issued *after*. A late or pending DMA completion ISR would then call furi_semaphore_release() on an already full binary semaphore and crash furi_check. Disable TC IRQ and clear the pending TC flag before releasing the semaphore so the ISR cannot double-release.
memmgr_heap.c (memmgr_heap_get_block_size): Add heapVALIDATE_BLOCK_POINTER(pxLink) to match vPortFree(). Without it a caller passing an invalid pointer through this public API would read out of bounds before the configASSERT fires.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The TRX/RX branch (else of furi_hal_spi_bus_trx_dma) had the same race as the TX-only path fixed in b51e744a: on timeout the cleanup released spi_dma_completed before disabling LL_DMA_DisableIT_TC, so a late or pending DMA completion ISR would call furi_semaphore_release() on an already-full binary semaphore and crash furi_check. Pre-existing bug, not introduced by this PR — fixed for symmetry with the TX-only path now that the pattern is documented. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
furi/core/string.c (furi_string_cat_vprintf): The retry condition used >= which fired one extra vsnprintf when the formatted output fit exactly into the reserved capacity (NUL byte included). vsnprintf only truncates when size + 1 > buffer; change the predicate to match.
furi/core/memmgr.c (realloc): Drop the unreachable NULL-guard around the copy/free. pvPortMalloc() calls furi_check(pvReturn, ...) on OOM (memmgr_heap.c:466) and crashes before returning, so p cannot be NULL after the call. The guard was dead code; the "preserve allocation on OOM" behavior advertised in the original commit message never actually triggered.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The DMA path only reads from tx_buffer; nothing inside writes through the pointer. Mirrors the signature of furi_hal_spi_bus_trx() which already takes const uint8_t* tx_buffer. Drops the (uint8_t*) cast that furi_hal_spi_bus_tx() needed to call furi_hal_spi_bus_trx_dma() with its own const uint8_t* buffer parameter, and turns the (uint8_t*)&dma_dummy_u32 cast (the dummy buffer is itself const uint32_t) into a properly const-preserving (const uint8_t*) cast. api_symbols.csv updated to match. Existing in-tree callers (furi_hal_sd.c) pass non-const pointers and continue to compile without changes; out-of-tree callers passing const pointers no longer need to drop qualifiers. Reported by Copilot review on #4360. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
furi_hal_spi.c (TX-only and TRX/RX DMA paths, setup and cleanup): The TX channel TC flag (TC7) is set on transfer completion but its interrupt is not enabled or handled, so the flag was left latched. Cleared TC7 alongside the existing TC6 (RX) clear so the SPI DMA state is clean before/after each transfer, matching the pattern used by other DMA users in the codebase. Wrapped both clears in a single combined #if to keep the existing channel-mismatch guard.
FreeRTOSConfig.h: Added a brief comment next to configHEAP_CLEAR_MEMORY_ON_FREE documenting the rationale for disabling wipe-on-free in release: pvPortMalloc() already zeros every allocated buffer (memmgr_heap.c xToWipe), so the next allocation cannot see stale data. The narrow exposure window between free() and the next reuse is acceptable under Flipper's threat model; code holding secrets is expected to zero its buffers explicitly before free().
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The function only reads the block header before pv (xBlockSize) and does not modify either the block header or the pointed-to allocation. Switched the public API to const void* to match intent and to let callers pass const pointers without dropping qualifiers. Drop-in compatible: existing in-tree caller (memmgr.c realloc) passes a non-const void*, which converts implicitly. Reported by Copilot review on #4360. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
10 safe performance optimizations for STM32WB55 (Cortex-M4 @ 64MHz) + 1 Bugfix
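Compiler & build:
1. Compiler: -Og → -Os for release builds, which enables -O2-level passes such as inlining, dead code elimination, and CSE (firmwareopts.scons)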
Memory management (correctness + perf):
2. Disable heap memset on free() in release — configHEAP_CLEAR_MEMORY_ON_FREE is now conditional on FURI_DEBUG. Saves ~500+ memset calls/sec during active GUI/protocol work (FreeRTOSConfig.h)
3. Fix calloc() to explicitly zero memory - with heap-clear disabled in release, calloc() now does its own memset(0) to guarantee zero-initialized returns (memmgr.c)
4. Fix realloc() to copy min(old_size, new_size) bytes - the original copied size (the new size) bytes, reading past the old allocation when growing. Added memmgr_heap_get_block_size(), which reads the usable size from the Heap_4 BlockLink_t header. Also added heapVALIDATE_BLOCK_POINTER to the new public memmgr_heap_get_block_size() for parity with vPortFree() (memmgr.c, memmgr_heap.c/h, api_symbols.csv)
Branch prediction:
5. __builtin_expect hints on furi_check/assert/break — crash code moves to end of function, hot path becomes fall-through (0 pipeline penalty on Cortex-M4 3-stage pipeline). Affects ~2300+ call sites across the firmware (check.h)
DMA:
6. SPI TX via DMA with RX drain — furi_hal_spi_bus_tx() now delegates to DMA when scheduler is running, freeing the CPU during display updates (~1KB/frame @ 20fps) and radio TX. TX-only path sets up RX DMA channel draining into a dummy byte to prevent OVR accumulation. Polling fallback preserved for pre-scheduler context (furi_hal_spi.c)
Also fixes a pre-existing race in the cleanup path: LL_DMA_DisableIT_TC was issued after furi_semaphore_release, allowing a late ISR to crash furi_check on a double-release. Now disables TC IRQ and clears TC flag before releasing the semaphore
Hot function inlining:
7. __attribute__((flatten)) on furi_get_tick() - forces inlining of FreeRTOS wrappers at call sites (kernel.c)
8. __attribute__((flatten)) on hot thread functions - applied to furi_thread_get_current_id(), furi_thread_get_current(), furi_thread_flags_get() (thread.c)
String formatting:
9. In-place vprintf for furi_string_cat_vprintf() — formats directly into destination buffer at current offset, growing only if needed. Eliminates temporary FuriString allocation (malloc + format + memcpy + free) per call (string.c)
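The in-place approach in item 9 might look roughly like the sketch below. demo_string_t and its helpers are hypothetical and are not the FuriString API; the sketch assumes the string always keeps room for its NUL terminator. The retry predicate treats output as truncated only when the formatted length plus the NUL byte exceeds the free space, so an exact fit does not trigger a second vsnprintf().

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical growable string; invariant: cap > len and data[len] == '\0'. */
typedef struct {
    char* data;
    size_t len;
    size_t cap;
} demo_string_t;

static int demo_string_init(demo_string_t* s, size_t capacity) {
    assert(capacity > 0);
    s->data = malloc(capacity);
    if(s->data == NULL) return 0;
    s->data[0] = '\0';
    s->len = 0;
    s->cap = capacity;
    return 1;
}

static int demo_string_cat_vprintf(demo_string_t* s, const char* fmt, va_list args) {
    va_list retry;
    va_copy(retry, args); /* a va_list cannot be reused after vsnprintf */
    size_t avail = s->cap - s->len; /* free bytes, including room for NUL */

    /* First attempt: format straight into the free tail of the buffer. */
    int n = vsnprintf(s->data + s->len, avail, fmt, args);
    if(n >= 0 && (size_t)n + 1 > avail) {
        /* Truncated: grow to the exact size needed, then format once more. */
        size_t new_cap = s->len + (size_t)n + 1;
        char* grown = realloc(s->data, new_cap);
        if(grown == NULL) {
            s->data[s->len] = '\0'; /* drop the partial output */
            va_end(retry);
            return -1;
        }
        s->data = grown;
        s->cap = new_cap;
        n = vsnprintf(s->data + s->len, s->cap - s->len, fmt, retry);
    }
    va_end(retry);
    if(n < 0) return n;
    s->len += (size_t)n;
    return n;
}

static int demo_string_cat_printf(demo_string_t* s, const char* fmt, ...) {
    va_list args;
    va_start(args, fmt);
    int n = demo_string_cat_vprintf(s, fmt, args);
    va_end(args);
    return n;
}
```

When the output fits, the call does one vsnprintf() and no allocation; only a truncated first attempt pays for a realloc() and a second format pass.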
Power:
10. Reduce configEXPECTED_IDLE_TIME_BEFORE_SLEEP from 4 to 2 ticks — allows FreeRTOS tickless idle to enter STOP mode more aggressively (2ms threshold instead of 4ms). Reduces average power consumption (FreeRTOSConfig.h)
Bugfix:
11. Fix DMA timeout race in furi_hal_spi_bus_trx_dma() - on timeout the cleanup released spi_dma_completed while LL_DMA_DisableIT_TC was issued after. A late or pending DMA completion ISR would then call furi_semaphore_release() on an already-full binary semaphore and crash furi_check. Disabled the TC IRQ and cleared the pending TC flag before releasing the semaphore (spi_dma_isr gates its release on LL_DMA_IsEnabledIT_TC).
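The ordering problem in item 11 can be modelled on a host without any hardware. This is a simulation under stated assumptions, not the HAL code: the demo_ types are hypothetical, the semaphore is a plain counter, and a double release is counted instead of crashing. The ISR gates on the interrupt-enable bit, as the description says spi_dma_isr does.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical binary semaphore: count is 0 (taken) or 1 (available). */
typedef struct {
    int count;
} demo_semaphore_t;

typedef struct {
    bool tc_irq_enabled;  /* models the TC interrupt enable bit */
    bool tc_flag_pending; /* models the latched transfer-complete flag */
    demo_semaphore_t completed;
    int double_releases;  /* the firmware would crash in furi_check instead */
} demo_dma_t;

static void demo_release(demo_dma_t* d) {
    assert(d->completed.count == 0 || d->completed.count == 1);
    if(d->completed.count == 1) {
        d->double_releases++;
        return;
    }
    d->completed.count = 1;
}

static void demo_dma_start(demo_dma_t* d) {
    d->tc_irq_enabled = true;
    d->tc_flag_pending = false;
    d->completed.count = 0; /* taken until the transfer completes */
    d->double_releases = 0;
}

/* The hardware finishes late, after the waiter has already timed out. */
static void demo_dma_hw_complete(demo_dma_t* d) {
    d->tc_flag_pending = true;
}

/* The ISR releases only while the TC interrupt is still enabled. */
static void demo_dma_isr(demo_dma_t* d) {
    if(d->tc_irq_enabled && d->tc_flag_pending) {
        d->tc_flag_pending = false;
        demo_release(d);
    }
}

/* Old order: release first, disable the interrupt afterwards. */
static void demo_cleanup_racy(demo_dma_t* d) {
    demo_release(d);
    demo_dma_isr(d); /* models a pending interrupt serviced in the window */
    d->tc_irq_enabled = false;
}

/* Fixed order: disable the interrupt and clear the flag, then release. */
static void demo_cleanup_fixed(demo_dma_t* d) {
    d->tc_irq_enabled = false;
    d->tc_flag_pending = false;
    demo_release(d);
}
```

With the old order a late completion produces a second release. With the fixed order the ISR finds the interrupt disabled and does nothing.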
Also: strncpy → strlcpy in subghz_scene_save_name.c (-Os exposed -Werror=stringop-truncation warning).
Verification
Checklist (For Reviewer)