Commit 29c7749

robbederks and claude committed
usb4: restore full b8db CDR-margin loop + add patch_phylock.py stock PHY tracer
b8db: replace the simplified bit6-only early-exit with the full wrsweda2r-verified body (prologue early-returns + per-lane margin window + the CDR-margin SUBB compare on C2D2/C2D9/C2DA/C352/C359/C35A, e9e7 RxPLL-reset on any miss, bounded to 10 iterations). Faithful.

patch_phylock.py: change-gated stock code-cave (hooks the super-loop top at 0x2FC0) dumping
[P:<C8FF><E302><E762><SBa0><SBa1><0779><077A><E764><C2D0><C350>].
This is the host-vs-fw discriminator that produced the breakthrough below.

FINDINGS (stock vs handmade, via patch_phylock + the pll= diag added to u4lb_s5_diag):
- Stock REACHES THE GPU: [PcieTunnel-Enable] -> USB4 Gen3 x2 -> PCIE Gen04 x04 -> Bus#2D.
- The E762/RXPLL hypothesis was WRONG: stock also has E762=00 post-train (E762 bit5 is set only DURING the train). Not the gate.
- The real gate is the CDR lock: stock C2D0/C350 go E2 -> 64 (bit6 PLL-lock, post-train) -> F4 (bit4 added = full CDR lock) AS the lane comes up (SB[0xA1] 07->01). THEN 0779 populates with bit7-set CL responses (AD/A8/A0) and the lanes reach CL0 (02).
- Handmade is stuck at C2D0=E4 / C350=64 (bit6 set, bit4 CLEAR) -> CDR never fully locks -> lane never comes up -> 0779 stays 0 -> no CL0. Handmade C2D0=E4 also has bit7 SET, which stock's post-train 0x64 does NOT -> a residual STATE-4 PHY-config divergence (not b8db: restoring the full b8db margin loop left C2D0 unchanged at E4).

NEXT: instrument stock's C2D0/C350 through the state-4 cdc6/e305 PHY train to find where the handmade diverges (C2D0 bit7) -- the lane-up gate is upstream of b8db/the walker.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
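The bit-level reading of C2D0/C350 in the findings can be made concrete with a small host-side decoder. This is a minimal sketch that assumes only the bit meanings stated in this message (bit6 = PLL lock, bit4 = full CDR lock, bit7 = the bit that differs between handmade 0xE4 and stock's post-train 0x64). `decode_pll_status` is a hypothetical helper name and is not part of the repo.

```python
def decode_pll_status(value):
    """Split a C2D0/C350 status byte into the three bits discussed in the findings."""
    return {
        "bit7": bool(value & 0x80),      # set on handmade 0xE4, clear on stock's post-train 0x64
        "pll_lock": bool(value & 0x40),  # bit6: PLL lock
        "cdr_lock": bool(value & 0x10),  # bit4: full CDR lock
    }


# Values quoted in the commit message; the labels are shorthand only.
for label, value in [("stock E2", 0xE2), ("stock 64", 0x64),
                     ("stock F4", 0xF4), ("handmade E4", 0xE4)]:
    print(f"{label:12s} 0x{value:02X} {decode_pll_status(value)}")
```

Seen this way, handmade 0xE4 and stock 0x64 agree on bit6 and bit4 and differ on bit7, while only stock's 0xF4 has bit4 set.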
1 parent 9ec5078 commit 29c7749

2 files changed

Lines changed: 318 additions & 13 deletions

app/patch_phylock.py

Lines changed: 285 additions & 0 deletions
@@ -0,0 +1,285 @@
+#!/usr/bin/env python3
+"""
+Patch stock fw_tinygrad.bin to UART-dump a TIMELINE of the USB4 PHY RXPLL-lock /
+CL-snap state, sampled in the super-loop and logged ONLY ON CHANGE (no flood).
+
+WHY: the handmade fw reaches state-5 with E764=0x19 (PHY train sequence complete)
+but E762==0 throughout (E762.4 RXPLL-ready never holds) and the state-5 CL-walker's
+snap byte 0x0779+lane never gets bit7 set -> the host never posts the CL response and
+the lanes stay SB[0xA0]/[0xA1]==0x07. Hypothesis: stock HOLDS the RXPLL lock (E762.4)
+and so gets a populated 0x0779 (bit7 set), while the handmade does not (suspect the
+d436 lane-ramp PHY settles + b8db CDR-margin we simplified). This trace captures, on
+STOCK, the timeline of E762 (RXPLL) + 0x0779/0x077A (CL snap) + C2D0/C350 (PLL lock)
+alongside the rate/mode/lane state, to confirm whether stock holds E762.4 lock and
+populates 0x0779 -- the host-vs-fw discriminator before we invest in the d436/b8db
+faithful completion.
+
+HOOK SITE: identical to patch_lanetrace.py -- main_boot_and_superloop @0x2FB4 body
+top `mov dptr,#0x0AE2` (3 bytes) at 0x2FC0, right after `clr EA` (interrupts already
+disabled -> atomic DPX-paged reads). We replace it with `lcall CAVE`; the cave samples,
+prints only on a changed signature, then replays `mov dptr,#0x0AE2` and returns.
+
+CHANGE SIGNATURE (1 byte @ XDATA 0x0BFE, a free working-RAM byte):
+    C8FF ^ E302 ^ E762 ^ SB[0xA0] ^ SB[0xA1] ^ 0x0779 ^ 0x077A
+so a new line prints whenever the rate, link-mode, RXPLL-ready, lanes, OR the CL snap
+changes -> a compact timeline of the lock + snap evolution.
+
+STOCK'S OWN PRINTS ARE KEPT (only the super-loop top is touched), so [SB Con]/[SB P0x]/
+[PcieTunnel-*]/[*** USB4 Gen3 x2 ***] interleave with [P:] lines for time-alignment.
+
+OUTPUT LINE (hex, DPX-restored before each print):
+    [P:<C8FF><E302><E762><SBa0><SBa1><0779><077A><E764><C2D0><C350>]
+       rate  mode  rxpll lane  lane  snap0 snap1 train pll   pll
+Key reads: E762.4 (RXPLL-ready) should be SET if stock holds lock; 0779 bit7 SET +
+bit4 CLEAR is the CL-ready snap that drives the walker's ea7c CL0 emit; C2D0/C350
+bit6 are the b8db PLL-lock poll bits.
+
+Addressing: ghidra.c == this image. body = data[4:-6] (wrapped). Cave/prefix live in
+the low shared zero region (<0x8000, flat both banks). SB[off] = paged XDATA 0x2800+off
+(DPX=1); plain XDATA via DPX=0.
+
+Build:
+    python3 app/patch_phylock.py fw_tinygrad.bin /tmp/fw_phylock.bin
+"""
+
+import argparse
+import zlib
+from pathlib import Path
+
+PROJECT_ROOT = Path(__file__).resolve().parent.parent
+DEFAULT_IN = PROJECT_ROOT / "fw_tinygrad.bin"
+DEFAULT_OUT = PROJECT_ROOT / "fw_tinygrad_phylock.bin"
+
+# Stock UART helpers (low shared region, <0x8000 -> flat in both banks).
+UART_PUTS = 0x538D
+UART_PUTHEX = 0x51C7
+UART_TX = 0xC001
+
+# Super-loop body top: `mov dptr,#0x0AE2` (3 bytes) right after `clr EA`.
+HOOK_SITE_BODY = 0x2FC0
+HOOK_SITE_OLD = bytes.fromhex("900ae2")  # mov dptr,#0x0AE2
+
+# Caves in the low shared zero region (flat both banks), <0x8000.
+CAVE = 0x5E00
+PREFIX_ADDR = 0x5FC0
+PREFIX = b"[P:\x00"
+
+# Persistent change-signature byte (free working-RAM byte, plain XDATA DPX=0).
+SIG_ADDR = 0x0BFE
+
+# Plain-XDATA (DPX=0) registers logged, in print order.
+# C8FF lane-rate, E302 link-mode, E762 RXPLL-ready, then (SB lane bytes),
+# 0x0779/0x077A CL snap, E764 train, C2D0/C350 PLL-lock.
+PLAIN_PRE = [0xC8FF, 0xE302, 0xE762]  # before the SB lane bytes
+PLAIN_POST = [0x0779, 0x077A, 0xE764, 0xC2D0, 0xC350]
+
+# SB page-1 lane-state bytes (DPX=1, XDATA 0x2800+off).
+SB_DPX = 0x01
+SB_LANE = [0x28A0, 0x28A1]  # SB[0xA0], SB[0xA1]
+
+# XDATA addrs folded into the change-signature (plain DPX=0).
+SIG_PLAIN = [0xC8FF, 0xE302, 0xE762, 0x0779, 0x077A]
+
+
+def lcall(addr):
+    return bytes([0x12, (addr >> 8) & 0xFF, addr & 0xFF])
+
+
+def mov_dptr(addr):
+    return bytes([0x90, (addr >> 8) & 0xFF, addr & 0xFF])
+
+
+def mov_a_imm(val):
+    return bytes([0x74, val & 0xFF])
+
+
+def emit_char(ch):
+    if isinstance(ch, str):
+        ch = ord(ch)
+    return mov_dptr(UART_TX) + mov_a_imm(ch) + b"\xf0"
+
+
+def puthex_xdata(addr):
+    # DPX=0 plain XDATA read into R7, then print.
+    return mov_dptr(addr) + b"\xe0\xff" + lcall(UART_PUTHEX)
+
+
+def puthex_sb(addr):
+    # DPX=1 paged read, restore DPX=0, then print (the print's movx is DPX-sensitive).
+    return (
+        bytes([0x75, 0x93, SB_DPX])  # mov DPX,#1
+        + mov_dptr(addr) + b"\xe0\xff"  # movx a,@dptr ; mov r7,a
+        + bytes([0x75, 0x93, 0x00])  # mov DPX,#0
+        + lcall(UART_PUTHEX)
+    )
+
+
+def puts_code(addr):
+    return (
+        bytes([0x7B, 0xFF])
+        + bytes([0x7A, (addr >> 8) & 0xFF])
+        + bytes([0x79, addr & 0xFF])
+        + lcall(UART_PUTS)
+    )
+
+
+# Scratch IRAM bytes used only inside the hook (saved/restored).
+SIG_SCRATCH = 0x22
+TMP_SCRATCH = 0x23
+
+# Preserve ACC,B,DPL,DPH,PSW,R0-R7,DPX + the scratch bytes we touch.
+PRESERVE = (0xE0, 0xF0, 0x82, 0x83, 0xD0,
+            0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
+            0x93, SIG_SCRATCH, TMP_SCRATCH)
+
+
+def build_signature():
+    """sig = xor(SIG_PLAIN) ^ SB[0xA0] ^ SB[0xA1] into 0x22."""
+    code = bytearray()
+    first, rest = SIG_PLAIN[0], SIG_PLAIN[1:]
+    code += mov_dptr(first) + b"\xe0" + bytes([0xF5, SIG_SCRATCH])
+    for addr in rest:
+        code += mov_dptr(addr) + b"\xe0"  # movx a,@dptr
+        code += bytes([0x65, SIG_SCRATCH])  # xrl a,0x22
+        code += bytes([0xF5, SIG_SCRATCH])  # mov 0x22,a
+    for addr in SB_LANE:
+        code += bytes([0x75, 0x93, SB_DPX])  # mov DPX,#1
+        code += mov_dptr(addr) + b"\xe0"  # movx a,@dptr (paged)
+        code += bytes([0x75, 0x93, 0x00])  # mov DPX,#0
+        code += bytes([0x65, SIG_SCRATCH])  # xrl a,0x22
+        code += bytes([0xF5, SIG_SCRATCH])  # mov 0x22,a
+    return bytes(code)
+
+
+def build_print():
+    """Emit the [P:...] line: PLAIN_PRE, SB lane bytes, PLAIN_POST."""
+    code = bytearray()
+    code += puts_code(PREFIX_ADDR)
+    for addr in PLAIN_PRE:
+        code += puthex_xdata(addr)
+    for addr in SB_LANE:
+        code += puthex_sb(addr)
+    for addr in PLAIN_POST:
+        code += puthex_xdata(addr)
+    code += emit_char(']')
+    code += emit_char('\r')
+    code += emit_char('\n')
+    return bytes(code)
+
+
+def build_hook():
+    code = bytearray()
+    for direct in PRESERVE:
+        code += bytes([0xC0, direct])  # push direct
+
+    # DPX is preserved above; ensure DPX=0 for the plain reads.
+    code += bytes([0x75, 0x93, 0x00])  # mov DPX,#0
+
+    code += build_signature()  # -> sig in 0x22
+
+    # Compare sig (0x22) to stored 0x0BFE. If equal -> skip the print.
+    code += mov_dptr(SIG_ADDR) + b"\xe0"  # movx a,@dptr (old sig)
+    code += bytes([0xB5, SIG_SCRATCH, 0x00])  # cjne a,0x22,neq ; rel filled below
+    skip_branch = len(code) - 1  # location of the cjne rel byte
+
+    print_code = build_print()
+    store_sig = mov_dptr(SIG_ADDR) + bytes([0xE5, SIG_SCRATCH]) + b"\xf0"
+
+    eq_sjmp = b"\x80\x00"  # sjmp DONE (rel filled later)
+    neq = bytearray()
+    neq += store_sig
+    neq += print_code
+
+    code[skip_branch] = len(eq_sjmp) & 0xFF  # rel over the sjmp
+    code += eq_sjmp
+    eq_sjmp_pos = len(code) - 2
+    code += neq
+
+    done = len(code)
+    code[eq_sjmp_pos + 1] = (done - (eq_sjmp_pos + 2)) & 0xFF
+
+    for direct in reversed(PRESERVE):
+        code += bytes([0xD0, direct])  # pop direct
+
+    code += HOOK_SITE_OLD  # replay mov dptr,#0x0AE2
+    code += b"\x22"  # ret
+    return bytes(code)
+
+
+def wrap_body(body):
+    return (
+        len(body).to_bytes(4, "little")
+        + body
+        + bytes([0xA5, sum(body) & 0xFF])
+        + zlib.crc32(body).to_bytes(4, "little")
+    )
+
+
+def unwrap_image(data):
+    if len(data) >= 10:
+        body_len = int.from_bytes(data[:4], "little")
+        footer = 4 + body_len
+        if body_len + 10 == len(data) and data[footer] == 0xA5:
+            body = data[4:footer]
+            checksum = data[footer + 1]
+            crc = int.from_bytes(data[footer + 2:footer + 6], "little")
+            if checksum != (sum(body) & 0xFF):
+                raise ValueError("wrapped firmware checksum mismatch")
+            if crc != zlib.crc32(body):
+                raise ValueError("wrapped firmware crc mismatch")
+            return bytearray(body), True
+    return bytearray(data), False
+
+
+def write_cave(body, addr, data, name):
+    end = addr + len(data)
+    if body[addr:end] != bytes(len(data)):
+        raise ValueError(f"{name} cave at 0x{addr:04x} is not empty (len {len(data)})")
+    body[addr:end] = data
+    return end
+
+
+def apply_patch(body):
+    found = bytes(body[HOOK_SITE_BODY:HOOK_SITE_BODY + len(HOOK_SITE_OLD)])
+    if found != HOOK_SITE_OLD:
+        raise ValueError(
+            f"hook site mismatch at body 0x{HOOK_SITE_BODY:05x}: found {found.hex()}, "
+            f"expected {HOOK_SITE_OLD.hex()}"
+        )
+    hook = build_hook()
+    repl = bytearray(lcall(CAVE))
+    while len(repl) < len(HOOK_SITE_OLD):
+        repl += b"\x00"
+    if len(repl) != len(HOOK_SITE_OLD):
+        raise ValueError("replacement length mismatch")
+    if CAVE + len(hook) > PREFIX_ADDR:
+        raise ValueError(f"hook ({len(hook)} bytes) overruns PREFIX_ADDR 0x{PREFIX_ADDR:04x}")
+    body[HOOK_SITE_BODY:HOOK_SITE_BODY + len(HOOK_SITE_OLD)] = repl
+    write_cave(body, CAVE, hook, "phylock hook")
+    write_cave(body, PREFIX_ADDR, PREFIX, "phylock prefix")
+    return [("PHYLOCK", HOOK_SITE_BODY, CAVE, len(hook))]
+
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("input", nargs="?", type=Path, default=DEFAULT_IN)
+    ap.add_argument("output", nargs="?", type=Path, default=DEFAULT_OUT)
+    args = ap.parse_args()
+
+    data = args.input.read_bytes()
+    body, wrapped = unwrap_image(data)
+    info = apply_patch(body)
+    out = wrap_body(body) if wrapped else bytes(body)
+    args.output.write_bytes(out)
+
+    print(f"input: {args.input} ({len(data)} bytes, wrapped={wrapped})")
+    print(f"output: {args.output} ({len(out)} bytes)")
+    for name, site, cave, hlen in info:
+        print(f" {name}: site body 0x{site:05x} -> lcall 0x{cave:04x} "
+              f"(hook {hlen} bytes -> 0x{cave + hlen:04x})")
+    print(f" sig byte: XDATA 0x{SIG_ADDR:04x} (log-on-change)")
+    print(f" line: [P:<C8FF><E302><E762><SBa0><SBa1><0779><077A><E764><C2D0><C350>]")
+
+
+if __name__ == "__main__":
+    main()
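On the host side, the `[P:...]` lines interleaved with stock's own prints can be pulled out of a UART capture with a short parser. The sketch below assumes the stock hex-print routine at `UART_PUTHEX` emits exactly two hex digits per byte with no separator. That is an assumption about stock behavior, not something this diff shows. `parse_phylock_line` is a hypothetical helper, and the sample line is made up for illustration rather than taken from a real capture.

```python
import re

# Field order copied from the OUTPUT LINE description in patch_phylock.py.
FIELDS = ("C8FF", "E302", "E762", "SBa0", "SBa1",
          "0779", "077A", "E764", "C2D0", "C350")

# Assumption: two hex digits per byte, ten bytes, no separators.
LINE_RE = re.compile(r"\[P:((?:[0-9A-Fa-f]{2}){10})\]")


def parse_phylock_line(line):
    """Return {field: int} for a [P:...] line, or None if the line has no match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    return dict(zip(FIELDS, bytes.fromhex(match.group(1))))


# Illustrative input only (not a captured trace).
sample = "[SB Con] [P:0320000101ADA819F4F4]\r\n"
fields = parse_phylock_line(sample)
print({name: f"{value:02X}" for name, value in fields.items()})
```

Using `search` rather than `match` lets the parser tolerate stock prints sharing a line with the `[P:]` record.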

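The one-byte XOR change signature in `build_signature()` is cheap, but XOR has two properties worth keeping in mind when reading a trace. This host-side sketch models the signature over the same seven bytes (`SIG_PLAIN` plus the two SB lane bytes). The register values are invented for illustration, and `signature` is a hypothetical name.

```python
from functools import reduce
from operator import xor

# Same seven inputs the cave folds together: SIG_PLAIN, then SB[0xA0], SB[0xA1].
SIG_INPUTS = ("C8FF", "E302", "E762", "0779", "077A", "SBa0", "SBa1")


def signature(state):
    """XOR of the signature inputs, as a single byte."""
    return reduce(xor, (state[name] for name in SIG_INPUTS), 0) & 0xFF


# Invented starting state, for illustration only.
before = {"C8FF": 0x03, "E302": 0x20, "E762": 0x00, "0779": 0x00,
          "077A": 0x00, "SBa0": 0x07, "SBa1": 0x07, "C2D0": 0x64}

one_lane = dict(before, SBa1=0x01)              # one lane byte goes 07 -> 01
both_lanes = dict(before, SBa0=0x01, SBa1=0x01)  # both lane bytes go 07 -> 01
pll_only = dict(before, C2D0=0xF4)               # C2D0 is printed but not folded in

print(signature(before) != signature(one_lane))    # change detected
print(signature(before) == signature(both_lanes))  # equal deltas cancel under XOR
print(signature(before) == signature(pll_only))    # not part of the signature
```

If both lane bytes change by the same XOR delta between two samples, or if only E764, C2D0 or C350 changes, the signature is unchanged and no new line is printed. Adding those three registers to the signature would be one way to make a C2D0/C350 transition trigger its own line, at the cost of a longer cave.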
handmade/src/usb4_lanebond.h

Lines changed: 33 additions & 13 deletions
@@ -671,26 +671,44 @@ static void u4lb_c593(void) {
     P1_WR(0x1285, (uint8_t)((P1_RD(0x1285) & 0x0F) | 0x30)); /* 0x1285 = (0x1285&0x0F)|0x30 */
 }
 
-/* ---- b8db: CDR/PLL validate loop. CODE_BANK1::b8db (45152). Disasm-verified: a PROLOGUE picks an
- * early-return (stock does NOTHING -- no loop, no e9e7) in three cases, else a bounded retry loop that
- * polls the PLL-lock bit (c3a8 = bit6/0x40, NOT bit5) and fires e9e7 (RxPLL reset) on a miss. Caller
- * DISCARDS the return; the only hardware effect is the e9e7 set, so the early-returns matter (they stop
- * a spurious RxPLL reset right after the PHY trained). The per-iteration CDR-margin SUBB compare is
- * read-only and only gates loop-exit timing -- omitted here; we exit on PLL-lock. ---- */
+/* ---- b8db: CDR/PLL validate loop. CODE_BANK1::b8db (45152), full wrsweda2r disasm-verified body. A
+ * PROLOGUE picks an early-return (stock does NOTHING) in three cases + sets the per-lane margin window
+ * (lo/hi = CDR phase, lo52:lo54 / hi52:hi54 = a 16-bit eye margin), then a bounded (<=10) loop that on
+ * EACH pass polls bit6 PLL-lock (c3a8) AND the full CDR-margin SUBB compare on C2D2/C2D9/C2DA/C352/C359/
+ * C35A, and fires e9e7 (RxPLL reset) on ANY miss -- this is THE retry that drives C2D0/C350 from 0xE4 to
+ * 0xF4 (bit4 = full CDR lock; stock-[P:]-confirmed). Exiting early on bit6 alone (the prior simplified
+ * version) skipped those e9e7 retries -> the CDR never fully locked -> the lane never came up. Caller
+ * discards the return (538d success print dropped, cosmetic). Locals -> static/__xdata (DSEG is full). ---- */
+static __xdata uint8_t b8db_lo, b8db_hi, b8db_lo52, b8db_hi52, b8db_lo54, b8db_hi54;
 static void u4lb_b8db(void) {
-    uint8_t iter;
+    __xdata uint8_t s50, s51, s52, s53, s54, s55;
+    uint8_t cnt;
     if ((P1_RD(0x0000) & 0x02) == 0) { /* b8e4: plane-2 reg0.1 clear */
         if ((PR(0x92F8) & 0x0C) == 0) return; /* b953-b95f: nothing to validate */
+        b8db_lo = 0x01; b8db_hi = 0x28; /* b962-b965 */
+        b8db_lo52 = 0x01; b8db_hi52 = 0x3D; b8db_lo54 = 0x01; b8db_hi54 = 0x43; /* b968-b971 */
     } else if (PR(0x0750) == 1) { /* b8e7 */
         if (PR(0xC297) & 0x20) return; /* b8ef-b8fc: already locked */
+        b8db_lo = 0x01; b8db_hi = 0x28; /* b8ff-b902 */
+        if (PR(0x07BA) == 0) { b8db_lo52=0x01; b8db_hi52=0x47; b8db_lo54=0x01; b8db_hi54=0x4D; } /* b90d-b916 */
+        else { b8db_lo52=0x01; b8db_hi52=0x3D; b8db_lo54=0x01; b8db_hi54=0x43; } /* b90b->b968 */
     } else {
         if (PR(0xC2A7) & 0x20) return; /* b91b-b928: already locked */
+        b8db_lo = 0x01; b8db_hi = 0x20; /* b92b-b92e */
+        if (PR(0x07BA) != 0) { b8db_lo52=0x01; b8db_hi52=0x3E; b8db_lo54=0x01; b8db_hi54=0x42; } /* b937-b940 */
+        else { b8db_lo52=0x01; b8db_hi52=0x48; b8db_lo54=0x01; b8db_hi54=0x4C; } /* b945-b94e */
     }
-    for (iter = 0; iter < 10; iter++) {
-        (void)(PR(0xC2D2) & 0x3F); (void)PR(0xC2D9); (void)PR(0xC2DA); /* b977-b987: CDR shadow (read-only) */
-        (void)(PR(0xC352) & 0x3F); (void)PR(0xC359); (void)PR(0xC35A); /* b989-b999 */
-        if ((PR(0xC2D0) & 0x40) && (PR(0xC350) & 0x40)) break; /* c3a8: bit6 PLL lock */
-        u4lb_e9e7(); /* b9f3: RxPLL reset on miss */
+    for (cnt = 0; cnt < 10; cnt++) { /* b974 + ba01-ba08 bound */
+        s50 = PR(0xC2D2) & 0x3F; s52 = PR(0xC2D9); s53 = PR(0xC2DA); /* b977-b987 */
+        s51 = PR(0xC352) & 0x3F; s54 = PR(0xC359); s55 = PR(0xC35A); /* b989-b999 */
+        if ((PR(0xC2D0) & 0x40) && (PR(0xC350) & 0x40) && /* c3a8 bit6 PLL lock, both lanes */
+            s50 >= b8db_lo && s50 <= b8db_hi && s51 >= b8db_lo && s51 <= b8db_hi &&
+            (uint8_t)(b8db_lo52 - (s53 < b8db_lo54 ? 1 : 0)) <= s52 &&
+            s52 < (uint8_t)(b8db_hi52 - (s53 < (uint8_t)(b8db_hi54 + 1) ? 1 : 0)) &&
+            (uint8_t)(b8db_lo52 - (s55 < b8db_lo54 ? 1 : 0)) <= s54 &&
+            s54 < (uint8_t)(b8db_hi52 - (s55 < (uint8_t)(b8db_hi54 + 1) ? 1 : 0)))
+            return; /* b9f8: full margin passed -> CDR locked */
+        u4lb_e9e7(); /* b9f3: RxPLL reset on any poll/margin miss */
     }
 }

@@ -960,7 +978,9 @@ static void u4lb_s5_diag(void) {
     uart_puts(" 775="); uart_puthex(h);
     uart_puts(" E764="); uart_puthex(REG_PHY_TIMER_CTRL_E764); uart_puts(" E762="); uart_puthex(PR(0xE762));
     uart_puts(" ED="); uart_puthex(PR(0x06ED));
-    uart_puts(" snap="); uart_puthex(PR(0x0779)); uart_puthex(PR(0x077A)); uart_putc(']'); /* CL snap 0x0779/0x077A */
+    uart_puts(" snap="); uart_puthex(PR(0x0779)); uart_puthex(PR(0x077A)); /* CL snap 0x0779/0x077A */
+    uart_puts(" pll="); uart_puthex(PR(0xC2D0)); uart_puthex(PR(0xC350)); /* C2D0/C350 bit6 PLL-lock (stock=F4) */
+    uart_putc(']');
 }
 
 /* ---- 8501: e80a(R5R4=0x0065,R7=2) via trampoline 0x051b -- a banked SB-transport drain/poll. The FSM
