I believe this bug is causing, or is at least directly implicated in, the fault described in: commaai/openpilot#35788 (+ #102)
Bug Overview
The suspected bug is in how the driver describes the HTT TX fragment-descriptor pool to HELIUMPLUS. The host allocates the pool as separate page-sized DMA allocations in dma_pages[] (qdf_mem.c, htt_tx.c), but FRAG_DESC_BANK_CFG advertises one descriptor area starting at dma_pages[0], with one descriptor size and an ID range for the whole pool (htt_h2t.c, htt_h2t.c, htt_h2t.c).
The host finds descriptor id by page and offset, while the firmware API says HELIUMPLUS may derive the fragment descriptor from the configured base + the HTT descriptor ID (htt_tx.c, htt.h, htt.h).
Since each descriptor is 72 bytes, only 56 fit in a 4096-byte page; ID 56 is already the first point where the firmware-style address disagrees with the host’s real descriptor address. If firmware reads that wrong 72-byte area as an msdu_ext_desc_t, it can treat stale or unrelated bytes as packet fragment DMA addresses, leading the MAC DMA engine to read an unmapped IOVA and fault through the WLAN SMMU on SID 0x40.
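To make the two lookups concrete, here is a minimal sketch in C of both address computations. The names host_desc_addr and linear_desc_addr, and the plain array standing in for dma_pages[], are hypothetical and are not the driver's code. The target-style formula is the behavior this writeup suspects, not something confirmed from firmware source.

```c
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096u
#define DESC_SIZE       72u                            /* sizeof(struct msdu_ext_desc_t) */
#define ELEMS_PER_PAGE  (PAGE_SIZE_BYTES / DESC_SIZE)  /* 56 whole descriptors per page */

/* Host view: look up the page first, then the slot inside that page. */
static inline uint64_t host_desc_addr(const uint64_t *dma_pages, uint32_t id)
{
    return dma_pages[id / ELEMS_PER_PAGE] +
           (uint64_t)(id % ELEMS_PER_PAGE) * DESC_SIZE;
}

/* Target-style view (assumed): one packed bank starting at dma_pages[0]. */
static inline uint64_t linear_desc_addr(uint64_t bank_base, uint32_t id)
{
    return bank_base + (uint64_t)id * DESC_SIZE;
}
```

With two contiguous pages at 0xa0911000 and 0xa0912000 (the base seen in the natural crash capture), the two functions agree for IDs 0 through 55 and first disagree at ID 56: 0xa0912000 for the host against 0xa0911fc0 for the packed formula.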
HTT Frag Bank Fault Bug - Chain Of Events
I propose the following chain of events leading to the WLAN SMMU read fault. This writeup combines source analysis with information obtained at runtime. Runtime details are described in the investigation section towards the bottom.
-
WIFI2.0/HELIUMPLUS uses sizeof(struct msdu_ext_desc_t) for each frag descriptor. That structure carries six payload fragment entries; each entry stores a DMA address plus a length.
Source: htt_tx.c, htt_types.h, htt_types.h
-
The pool allocator does not allocate one packed DMA object. It allocates one coherent PAGE_SIZE object per page and keeps the real DMA address for each page in dma_pages[]. That matters because only dma_pages[0] is later advertised to firmware. It also means the bank model can be wrong even when the page IOVAs happen to be contiguous, because each page still has unused tail bytes after its last complete descriptor.
Source: qdf_mem.c, qdf_mem.c, htt_tx.c
-
FRAG_DESC_BANK_CFG is sent with one bank, global desc_size, bank base from dma_pages[0], min index 0, and max index pool_elems - 1.
Source: htt_h2t.c, htt_h2t.c, htt_h2t.c, htt_h2t.c, htt_h2t.c, htt_h2t.c
-
The firmware API documents the target-side lookup: the HTT TX descriptor id is used to calculate the fragmentation descriptor pointer from the configured base. The same API says WIFI2.0 hardware performs the mapping/translation instead of relying only on frags_desc_ptr.
Source: htt.h, htt.h, htt.h, htt.h, htt.h, htt.h
-
Pool setup uses the same loop index i for the HTT TX descriptor, the frag descriptor, and tx_desc.id. htt_tx_desc_init() then writes that same msdu_id into the HTT TX descriptor. In the inspected source, there is no remap between the host TX ID and the frag-desc pool index before TX publish.
Source: ol_txrx.c, ol_txrx.c, ol_txrx.c, ol_htt_tx_api.h, ol_htt_tx_api.h, htt_tx.c
-
The host finds the real descriptor through the page table: dma_pages[id / elems_per_page] + (id % elems_per_page) * desc_size. A target using the advertised single-bank contract would instead evaluate bank_base + id * desc_size.
Source: htt_tx.c, htt_tx.c, htt_tx.c, htt_tx.c
-
With desc_size = 0x48, a 4096-byte page holds 56 complete descriptors. The remaining 0x40 bytes at the end of each page are unused tail space, not a valid descriptor slot. The first descriptor-address mismatch is therefore ID 56: the target-style formula points to base + 0xfc0, while the host uses the first descriptor in dma_pages[1]. The natural crash boot showed page_linear=1 but desc_linear=0, so random page-to-page IOVA gaps are not required for the mismatch.
Source: qdf_mem.c, htt_tx.c, htt_tx.c, htt_tx.c, htt_tx.c, htt_tx.c
-
During TX preparation, the host clears the real msdu_ext_desc_t, terminates the fragment list, and fills payload fragment address/length entries from SKB fragments or TSO metadata.
Source: ol_tx.c, ol_tx.c, ol_tx.c, ol_tx.c, ol_tx.c, ol_tx.c, ol_htt_tx_api.h, ol_htt_tx_api.h
-
The explicit frags_desc_ptr is still written in the HTT descriptor, but the firmware API above says WIFI2.0 hardware may use the ID-to-bank mapping. That leaves the single-bank configuration relevant even though the host also writes the direct pointer.
Source: htt_tx.c, htt_tx.c, htt.h, htt.h, htt.h
-
For IDs at or above 56, an ID-based target lookup can read the unused tail bytes after page 0, shifted bytes from another descriptor, or a different mapped coherent page. On a page-linear boot, that wrong descriptor read may still hit mapped coherent memory and therefore not fault immediately. The diagnostic code computes both the host address and the target-style linear address, classifies where the linear address lands, and decodes any fragment address/length fields found there.
Source: htt_tx.c, htt_tx.c, htt_tx.c, htt_tx.c, htt_tx.c
-
If those wrong bytes decode as non-zero payload fragment entries, the MAC DMA has plausible source addresses and lengths to fetch. The diagnostic checks those decoded payload IOVAs against the WLAN SMMU domain and keeps recent decoded entries for later SMMU fault correlation.
Source: ol_htt_tx_api.h, ol_htt_tx_api.h, ol_htt_tx_api.h, htt_tx.c, htt_tx.c, htt_tx.c
-
TX completion releases payload DMA mappings. A delayed target read from a stale or shifted frag descriptor can therefore produce a WLAN SMMU read fault on a payload IOVA that was valid for an earlier packet but is no longer mapped.
Source: ol_tx_send.c, ol_tx_desc.c, ol_tx_desc.c, qdf_nbuf.c, qdf_nbuf.c
-
The observed +0x80 offset progression in FAR fits a payload-DMA fault. If it were a direct descriptor walk, you would expect the fault addresses to step by the 0x48 descriptor size or by the offsets of fields inside that descriptor.
- The proposed failure is two-stage instead: the bad ID-based lookup selects the wrong 0x48 bytes, those bytes decode as payload fragment address/length entries, and the MAC DMA then fetches from the decoded payload IOVA. At that point, the fault address follows payload-fetch behavior, not descriptor stride. With the default 128B MAC DMA burst setting, a fetch from an unmapped payload IOVA can fault at an initial packet offset such as ...c8 or ...d8, then advance to the next 128-byte boundary and continue in 0x80 steps.
Source: ol_htt_tx_api.h, ol_htt_tx_api.h, ol_htt_tx_api.h, target_if_def_config.h, target_if_def_config.h, wmi_unified.h, wmi_unified.h
- An unmapped WLAN IOVA reaches the ARM SMMU context-fault path. This path prints the fault IOVA, failed software translation, SID, and then BUGs for a fatal unhandled fault. In the natural crash capture, base=0xa0911000, pool_elems=3600, and desc_size=72, so the packed advertised bank would end near 0xa0950480; the fatal FAR 0xa58c00d8 is outside that range. That rules out a direct read of the advertised frag bank for that crash and points to a later address derived from bad descriptor contents.
Source: arm-smmu.c, arm-smmu.c, arm-smmu.c, arm-smmu.c, arm-smmu.c, arm-smmu.c
Natural crash capture (post_flash_20260501_131549/pstore/console-ramoops-0):
HTT FRAG BANK DESC_GAP: page=1 first_index=56 expected_desc=0xa0911fc0 host_desc=0xa0912000 desc_size=72 slack_per_page=64
HTT FRAG BANK SUMMARY: base=0xa0911000 pages=65 pool_elems=3600 desc_size=72 elems_per_page=56 slack_per_page=64 page_linear=1 desc_linear=0 first_page_gap_index=65535 first_desc_gap_index=56
arm-smmu 15000000.apps-smmu: Unhandled context fault: iova=0xa58c00d8, fsr=0x40000402, fsynr=0x360003, cb=5
arm-smmu 15000000.apps-smmu: FAR = 00000000a58c00d8
arm-smmu 15000000.apps-smmu: soft iova-to-phys=0x0000000000000000
arm-smmu 15000000.apps-smmu: SID=0x40
Fast-path runtime evidence (fastpath_shadow_20260501_133508/htt_after.txt):
first_desc_gap_index=56
msdu_id=1199
host_iova=0xa090f678
linear_iova=0xa090f138
linear_class=known_frag_page
SHADOW_HOST frag0=0xa1516c02/135
SHADOW_LINEAR frag0=0xffc319ee2150/65535 frag1=0xffc319ee2150/65535
SHADOW_LINEAR_MAP frag0=unmapped/0x0 frag1=unmapped/0x0
This shows that normal traffic can reach the bad ID range and that the target-style descriptor view can decode to unmapped payload IOVAs.
Things that led to this bug being implicated:
- In trying to find the fault causing the #35788 issue, I basically tested a ton of injectors, working backwards to try to find one that matched the SID and fault syndrome of the natural fault.
- I tried these injectors:
CE posted-buffer revoke, HTT RX revoke / live-capture / fresh-buffer, RX ring-base / ring-page / page-target unmap, CE SID probes, NBUF / payload-shape injectors, pair-arm / explicit offset-gap shape injectors, publish-shape / control-poison injectors, RX target-revoke-after-selection, RX stale-after-pop delayed reuse, boot-gated real RX ring shadow publish, same-slot stale RX replay, TX read revoke, TSO read revoke. All of these had one of two issues: either the SID was 0x41 instead of 0x40, or the fault syndrome was a write instead of a read.
- I then made a "shape matcher" of sorts for the fault offsets of the natural SMMU faults I observed. It walked live WLAN allocations and tracked nbuf DMA mappings, then searched for embedded DMA addresses whose spacing looked like the natural SMMU fault shapes.
- The shape matcher showed that fault-shaped IOVA pairs were present inside live WLAN DMA-visible metadata, especially qdf_mem_multi_pages_alloc() objects used around HTT attach. Additionally, several hits came from HTT attach multi-page allocations and had 0x48 spacing, which pointed me toward TX/frag descriptors.
- I finally found an injector that matched the actual SMMU fault.
- I found that a published HTT TX descriptor pointing at an unmapped HTT fragment-descriptor page can produce the same core WLAN SMMU fault class as the natural crashes. It matched the SID, FSR, FSYNR, and SMMU block of the actual fault.
- The only remaining discrepancy with the HTT TX descriptor injector was that it did not have a two-page fault shape (by design). That left one big question: what object chain could naturally make “first page near +0xc8/+0xd8, then jump a big gap 0x25a0/0x2600, then walk +0x80”?
Discoveries
the frag-descriptor geometry bug
HTT Frag-Descriptor Bank Geometry Bug
The Host
-
Uses: sizeof(struct msdu_ext_desc_t) for HELIUMPLUS/WIFI2.0 frag descriptors.
Source: htt_tx.c
-
Size: 0x48 / 72 bytes = qdf_tso_flags_t + six 8-byte frag entries.
Source: qdf_types.h, htt_types.h
-
Allocated: via qdf_mem_multi_pages_alloc() -> one coherent PAGE_SIZE DMA buffer per page, not one packed array.
Source: htt_tx.c, qdf_mem.c, qdf_mem.c
-
Layout: 56 descriptors/page, then 64 unused bytes: 56 * 72 = 4032, leaving 4096 - 4032 = 64.
-
Host address: dma_pages[id / 56] + (id % 56) * 72.
Source: htt_tx.c
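The size and per-page arithmetic above can be checked at compile time. The sketch below is a hypothetical mirror of the layout, not a copy of htt_types.h or qdf_types.h: the struct and field names are invented, and the 24-byte size of the TSO flags block is only inferred from 72 - 6 * 8.

```c
#include <stdint.h>

/* Hypothetical stand-ins; names and the opaque flags blob are assumptions. */
struct sketch_tso_flags { uint8_t raw[24]; };      /* inferred: 72 - 6 * 8 */
struct sketch_frag_entry { uint64_t addr_len; };   /* address + length in one 8-byte word */

struct sketch_msdu_ext_desc {
    struct sketch_tso_flags  tso_flags;
    struct sketch_frag_entry frags[6];
};

enum {
    SKETCH_PAGE_SIZE      = 4096,
    SKETCH_DESC_SIZE      = sizeof(struct sketch_msdu_ext_desc),
    SKETCH_ELEMS_PER_PAGE = SKETCH_PAGE_SIZE / SKETCH_DESC_SIZE,
    SKETCH_SLACK_PER_PAGE = SKETCH_PAGE_SIZE -
                            SKETCH_ELEMS_PER_PAGE * SKETCH_DESC_SIZE
};

/* These fail the build if the geometry differs from the writeup. */
_Static_assert(SKETCH_DESC_SIZE == 72, "descriptor should be 0x48 bytes");
_Static_assert(SKETCH_ELEMS_PER_PAGE == 56, "56 descriptors per 4K page");
_Static_assert(SKETCH_SLACK_PER_PAGE == 64, "64 unused tail bytes per page");
```

The three values match desc_size=72, elems_per_page=56, and slack_per_page=64 in the HTT FRAG BANK SUMMARY line from the natural crash capture.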
The Firmware
-
Told: one packed frag-descriptor bank.
-
Base: dma_pages[0].page_p_addr.
-
Range: 0..pool_elems - 1.
Source: htt_h2t.c
-
Format supports: bank bases + ID ranges.
Source: htt.h, htt.h
-
Firmware address: bank_base + id * 72.
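One way to state the mismatch is as a consistency check on what gets advertised. The sketch below is an illustration only: struct bank_desc and both helpers are hypothetical and do not reflect the HTT message layout or any existing driver function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of one advertised bank. */
struct bank_desc {
    uint64_t base;
    uint32_t min_id;
    uint32_t max_id;
    uint32_t desc_size;
};

/* Page-sized allocations the host needs to hold every ID in the bank. */
static inline uint32_t bank_pages_needed(const struct bank_desc *b,
                                         uint32_t page_size)
{
    uint32_t per_page = page_size / b->desc_size;
    uint32_t count = b->max_id - b->min_id + 1;
    return (count + per_page - 1) / per_page;
}

/* Necessary condition for "base + id * desc_size" to agree with a per-page
 * host layout: the bank fits in one page, or descriptors tile a page with no
 * tail slack. Page contiguity would still have to hold in the second case. */
static inline bool bank_could_match_paged_layout(const struct bank_desc *b,
                                                 uint32_t page_size)
{
    return bank_pages_needed(b, page_size) <= 1 ||
           page_size % b->desc_size == 0;
}
```

For the captured configuration (IDs 0..3599, desc_size 72, 4096-byte pages) this needs 65 pages and the check fails, while a bank limited to IDs 0..55 would pass.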
The Failure
- IDs 0-55: host and firmware agree.
- ID 56: firmware lands at bank_base + 0xfc0, the unused 64-byte tail of page 0.
- Real ID 56 descriptor: start of page 1.
- The target uses the TX descriptor id for frag-descriptor lookup, so msdu_id >= 56 can resolve to the wrong memory.
Source: htt.h
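The size of the disagreement grows with the ID. Assuming the pages are IOVA-contiguous (as on the page_linear=1 boot), the gap between the host address and the packed address is the page index times the per-page slack. A minimal sketch, with a hypothetical helper name:

```c
#include <stdint.h>

/* Byte gap between the host address and "base + id * desc_size" for one ID,
 * assuming the descriptor pages are contiguous in IOVA space. */
static inline uint32_t desc_drift_bytes(uint32_t id, uint32_t page_size,
                                        uint32_t desc_size)
{
    uint32_t per_page = page_size / desc_size;          /* 56 for 4096 / 72 */
    uint32_t slack = page_size - per_page * desc_size;  /* 64 for 4096 / 72 */
    return (id / per_page) * slack;
}
```

For ID 56 the gap is 64 bytes (0x1000 - 0xfc0). For msdu_id=1199 it is 21 * 64 = 0x540, which is consistent with the fast-path capture, where host_iova - linear_iova = 0xa090f678 - 0xa090f138 = 0x540. Because 0x540 is not a multiple of 72, the packed address in that case lands partway through some other descriptor.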
The 0x80 offset problem
If firmware were simply walking frag descriptors directly, I would expect fault addresses to move by 0x48, because msdu_ext_desc_t is 0x48 bytes. But the real SMMU FARs often did this:
...0xd8 -> ...0x100 -> later +0x80 -> +0x80 -> +0x80
A frag descriptor does not only contain metadata. It contains payload fragment address/length entries: each entry is a 48-bit physical address plus a 16-bit length. The TX path fills those entries from SKB fragment DMA addresses and lengths.
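As a sketch of how one such entry can be read, the helpers below pack and unpack an 8-byte word. The bit placement (address in bits 0..47, length in bits 48..63) is an assumption made for illustration, and struct frag_fields, frag_pack, and frag_unpack are hypothetical names, not driver APIs.

```c
#include <stdint.h>

#define FRAG_ADDR_MASK 0xFFFFFFFFFFFFull   /* low 48 bits */

/* Decoded view of one fragment entry. */
struct frag_fields {
    uint64_t addr;   /* 48-bit DMA address */
    uint16_t len;    /* 16-bit length */
};

/* Assumed packing: bits 0..47 = address, bits 48..63 = length. */
static inline uint64_t frag_pack(uint64_t addr, uint16_t len)
{
    return (addr & FRAG_ADDR_MASK) | ((uint64_t)len << 48);
}

static inline struct frag_fields frag_unpack(uint64_t word)
{
    struct frag_fields f = { word & FRAG_ADDR_MASK, (uint16_t)(word >> 48) };
    return f;
}
```

Under this assumed packing, the sane host entry frag0=0xa1516c02/135 round-trips cleanly, and the bogus entry frag0=0xffc319ee2150/65535 corresponds to the raw word 0xffffffc319ee2150. That bit pattern resembles a 64-bit kernel pointer more than a DMA address, though that reading is an inference from the value and not something the capture states.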
I looked all over for a 128-byte (0x80) TX descriptor/table in the host source and did not find a convincing one. I then found a config saying the target’s default MAC DMA burst size is 128 bytes.
/* MAC DMA burst size. 0: 128B - default, 1: 256B, 2: 64B */
#define TGT_DEFAULT_DMA_BURST_SIZE 0
So that means the 0x80 is likely the hardware repeatedly trying to read packet data in 128-byte chunks, not firmware walking a table whose entries are 128 bytes wide. Source: target_if_def_config.h
So the SMMU fault is one step downstream from the original bug. The original bug is that firmware finds the wrong fragment descriptor; the faulting address comes from the chain: wrong fragment descriptor -> wrong payload DMA address -> SMMU fault while fetching payload.
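The address progression this hypothesis predicts can be sketched as follows. The model is an assumption: it supposes the first fetch starts at the payload address itself and every later attempt starts on the next 128-byte boundary. How the MAC DMA engine really splits reads is not documented in the host source, and burst_fetch_addr is a hypothetical helper.

```c
#include <stdint.h>

#define BURST_BYTES 0x80u   /* assumed from TGT_DEFAULT_DMA_BURST_SIZE 0 => 128B */

/* Address of the n-th fetch attempt for a read that starts at `start`,
 * where attempts after the first begin on successive burst boundaries. */
static inline uint64_t burst_fetch_addr(uint64_t start, unsigned n)
{
    if (n == 0)
        return start;

    /* First boundary strictly above `start`. */
    uint64_t first_boundary =
        (start + BURST_BYTES) & ~(uint64_t)(BURST_BYTES - 1);
    return first_boundary + (uint64_t)(n - 1) * BURST_BYTES;
}
```

Starting from the fatal FAR 0xa58c00d8, this model gives 0xa58c0100, then 0xa58c0180, then 0xa58c0200, which has the same shape as the ...0xd8 -> ...0x100 -> +0x80 pattern described above. A start offset of ...c8 also reaches ...0x100 on the second attempt.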
Runtime Evidence
Runtime Frag-Bank Evidence (so far)
https://github.com/zappybiby/agnos-kernel-sdm845/tree/htt-frag-bank-layout-diag
post_flash_20260501_131549: shows the natural crash boot had the frag-bank bug active before the fault:
Note: Unfortunately, this natural-fault capture predates the diagnostic that matches the SMMU FAR against bogus payload addresses/ranges. That is now implemented in that branch.
post_flash_20260501_131549.zip
Main findings from post_flash dump:
- Pool was 3600 descriptors, 65 pages, desc_size=72, elems_per_page=56.
- page_linear=1 but desc_linear=0, so the pages were contiguous, but the descriptor stream was still broken because each 4K page has 64 unused tail bytes.
- First bad descriptor was ID 56.
- Fatal SMMU FAR was 0xa58c00d8, SID 0x40, unmapped.
- Since the advertised packed frag bank was around 0xa0911000 to 0xa0950480, the fatal FAR was not a direct frag-bank read.
- This shows that the geometry bug exists on a natural crash boot and does not depend on random non-contiguous page allocation, and that the fault likely comes from a later derived address, not from reading the frag descriptor page itself.
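The range argument in these findings is simple arithmetic and can be checked directly. A minimal sketch, with hypothetical helper names and the values taken from the HTT FRAG BANK SUMMARY line:

```c
#include <stdbool.h>
#include <stdint.h>

/* End (exclusive) of the packed bank that was advertised to firmware. */
static inline uint64_t packed_bank_end(uint64_t base, uint32_t pool_elems,
                                       uint32_t desc_size)
{
    return base + (uint64_t)pool_elems * desc_size;
}

/* True if addr falls inside [base, base + pool_elems * desc_size). */
static inline bool addr_in_packed_bank(uint64_t addr, uint64_t base,
                                       uint32_t pool_elems, uint32_t desc_size)
{
    return addr >= base &&
           addr < packed_bank_end(base, pool_elems, desc_size);
}
```

With base=0xa0911000, pool_elems=3600, and desc_size=72, the packed bank ends at 0xa0950480 (3600 * 72 = 0x3f480). The fatal FAR 0xa58c00d8 is well above that, and also above 0xa0952000, where the host's 65 contiguous pages would end.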
fastpath_shadow_20260501_133508: normal TX reaches bad IDs and the target-style view decodes bad payload IOVAs.
fastpath_shadow_20260501_133508.zip
The fastpath_shadow dump was made without the natural fault present; it mainly serves to show that ordinary TX actually enters the bad path:
- First observed fast-path publish was already msdu_id=1199, far beyond first bad ID 56.
- Host descriptor address was 0xa090f678; target-style linear address was 0xa090f138.
- Host view decoded sane payload metadata: frag0=0xa1516c02/135.
- Target-style linear view decoded bogus payload entries: frag0=0xffc319ee2150/65535, frag1=0xffc319ee2150/65535.
- Those decoded payload IOVAs were unmapped in the WLAN SMMU domain.
- This shows that normal traffic can produce exactly the kind of poisoned payload pointer metadata the theory needs, without a reset/SSR/teardown first.
- The diagnostic read the descriptor from the address firmware would likely calculate, and that address pointed at the wrong bytes. When those wrong bytes were interpreted as packet-fragment fields, they contained invalid DMA addresses that were not mapped in the WLAN SMMU.
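The FAR-to-payload correlation mentioned earlier can be pictured as a range search over recently decoded entries. This is a sketch of the idea only, not the code in the diagnostic branch: struct decoded_frag and match_far are hypothetical, and a real matcher would probably also need to widen each range to allow for burst alignment.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one decoded fragment entry. */
struct decoded_frag {
    uint32_t msdu_id;
    uint64_t iova;
    uint16_t len;
};

/* Return the index of the first entry whose [iova, iova + len) range
 * contains the fault address, or -1 if no entry matches. */
static inline int match_far(const struct decoded_frag *log, size_t n,
                            uint64_t far)
{
    for (size_t i = 0; i < n; i++) {
        if (far >= log[i].iova && far - log[i].iova < log[i].len)
            return (int)i;
    }
    return -1;
}
```

A hit would tie a specific SMMU fault to a specific msdu_id and to the bogus entry decoded from the target-style view, which is the link the natural crash capture could not yet provide.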
Reproducing
In my branch
Note: WLAN is DISABLED until one of these commands is run. This is so it takes effect before WLAN attaches (there is an easier way to handle WLAN config boot-persistence that avoids this, but I didn't want to build/flash userspace)
# repro mode
adb shell 'echo "freelist=-2 spacers=1 start" > /sys/kernel/boot_wlan/boot_wlan'
# normal/manual start
adb shell 'echo "start" > /sys/kernel/boot_wlan/boot_wlan'
Repro mode details:
- htt_tx_freelist_start=-2 changes which TX IDs get used first. Normally the driver’s TX freelist may start at low IDs, so early traffic might use ID 0, 1, 2, etc. With -2, it rotates that freelist so the first normal packets use IDs just past the first broken frag-descriptor ID. Since the first broken ID is 56, traffic starts around ID 64 instead of ID 0.
- htt_tx_frag_bank_spacers=1 changes how frag descriptor pages are laid out in DMA space. Spacer pages are allocated and then immediately freed. This is not required for the bug to exist; it is an active stress/reproducer knob that makes the bad bank-address assumption fail more obviously.
- Then start some WLAN traffic (upload, unmetered, etc.).
- Technically, this does NOT guarantee a crash but it does make it very likely. It should fault either instantly or within a few minutes. If it doesn't, just reboot and try again.