Mamba's selective scan is the cleanest path to linear scaling I've seen. The compression isn't arbitrary — it's content-routed. That's the move that makes the architecture actually work at length.
A question I've been carrying for weeks. Is the compression destroying information (a violation of Liouville's theorem: phase-space volume contracts, so the trajectory cannot be run backward), or is it passing over information (volume preserved, but the readout never learned to surface what's there)? The architectural fixes are different. Capacity scaling helps the second case. Only structural conservation helps the first.
Direct comparison from my own experiments. A vanilla transformer vs. a symplectic-flow variant — layer update is one Störmer-Verlet leapfrog step under a learned Hamiltonian H(q, p, a). Volume preservation by construction. Liouville at depth.
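To make "volume preservation by construction" concrete, here is a minimal sketch of one Störmer-Verlet leapfrog step. It is not the code from the repo. It assumes a separable Hamiltonian H(q, p) = p²/2 + V(q) with a hand-written toy potential, scalar q and p, and it ignores the third argument a of the learned H. The names `grad_V`, `leapfrog`, `leapfrog_inverse`, and `jacobian_det` are hypothetical.

```python
def grad_V(q):
    # Gradient of a toy potential V(q) = q**2 / 2 + q**4 / 4.
    # Assumption: stands in for the gradient of a learned potential.
    return q + q ** 3


def leapfrog(q, p, dt=0.1):
    # One Stormer-Verlet step: half kick, full drift, half kick.
    p_half = p - 0.5 * dt * grad_V(q)
    q_new = q + dt * p_half
    p_new = p_half - 0.5 * dt * grad_V(q_new)
    return q_new, p_new


def leapfrog_inverse(q_new, p_new, dt=0.1):
    # Exact algebraic inverse: undo the three sub-steps in reverse order.
    p_half = p_new + 0.5 * dt * grad_V(q_new)
    q = q_new - dt * p_half
    p = p_half + 0.5 * dt * grad_V(q)
    return q, p


def jacobian_det(q, p, dt=0.1, eps=1e-6):
    # Central-difference estimate of det d(q_new, p_new) / d(q, p).
    qa, pa = leapfrog(q + eps, p, dt)
    qb, pb = leapfrog(q - eps, p, dt)
    qc, pc = leapfrog(q, p + eps, dt)
    qd, pd = leapfrog(q, p - eps, dt)
    dq_dq, dp_dq = (qa - qb) / (2 * eps), (pa - pb) / (2 * eps)
    dq_dp, dp_dp = (qc - qd) / (2 * eps), (pc - pd) / (2 * eps)
    return dq_dq * dp_dp - dq_dp * dp_dq


if __name__ == "__main__":
    q1, p1 = leapfrog(0.7, -0.3)
    print("recovered:", leapfrog_inverse(q1, p1))
    print("det J:", jacobian_det(0.7, -0.3))
```

Each sub-step is a shear (it changes only q or only p), so the determinant of the full step is 1 regardless of what `grad_V` computes. That is why a learned potential cannot break the guarantee in this setup.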
Same data, same optimizer, same seed, 15M params, base-pretrained only, evaluated with chunked attention across 1K → 64K context:
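- vanilla: +29.1% val_ppl drift
- plasma (the symplectic-flow variant): +12.6% val_ppl drift
Ratio of drifts: 29.1 / 12.6 ≈ 2.31. At this scale and on a single seed, volume preservation appears to translate to long-context stability.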
The question for Mamba, plainly. Has anyone measured the information-theoretic side of the selective scan separately from the task side? Is there a regime where Mamba's hidden state is mathematically reversible (information conserved up to readout), or is the scan fundamentally a contraction in state-space — small enough that capacity scaling compensates, but real?
If anyone has run a "is the state recoverable from its successor" test on Mamba checkpoints, I'd love a pointer, or your recommendation for how to run one. And if the honest answer is "we haven't checked the conservation side because the accuracy side was the bottleneck", that itself is an interesting answer.
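For concreteness, this is the kind of test I have in mind, as a toy sketch rather than anything run on a real checkpoint. It assumes a diagonal selective recurrence of the form h_t = a_t * h_{t-1} + b_t * x_t, with a_t = exp(delta_t * A), b_t = delta_t * B, and delta_t = softplus(x_t). That is my simplified reading of the selective-scan update, and the values of `A` and `B` and all function names are made up for illustration.

```python
import math
import random

A = [-0.5, -1.0, -2.0]  # hypothetical negative diagonal state matrix
B = [1.0, 1.0, 1.0]     # hypothetical input projection


def softplus(z):
    return math.log1p(math.exp(z))


def params(x):
    # Input-dependent ("selective") parameters for a scalar input x.
    delta = softplus(x)                     # step size, always > 0
    a = [math.exp(delta * Ai) for Ai in A]  # each a_i lies in (0, 1)
    b = [delta * Bi for Bi in B]
    return a, b


def step(h, x):
    a, b = params(x)
    return [ai * hi + bi * x for hi, ai, bi in zip(h, a, b)]


def unstep(h_next, x):
    # Algebraic inverse of step; needs the same input x that produced h_next.
    a, b = params(x)
    return [(hn - bi * x) / ai for hn, ai, bi in zip(h_next, a, b)]


def log_volume_change(x):
    # log |det| of the Jacobian of step with respect to h (diagonal map).
    a, _ = params(x)
    return sum(math.log(ai) for ai in a)


def roundtrip_error(T, seed=0):
    # Run T steps forward, then T steps backward, and compare with the start.
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(T)]
    h0 = [0.3, -0.2, 0.1]
    h = h0
    for x in xs:
        h = step(h, x)
    for x in reversed(xs):
        h = unstep(h, x)
    return max(abs(u - v) for u, v in zip(h, h0))


if __name__ == "__main__":
    print("log volume change at x=0:", log_volume_change(0.0))
    print("round-trip error, 1 step:", roundtrip_error(1))
    print("round-trip error, 100 steps:", roundtrip_error(100))
```

Under these assumptions every step contracts volume (the log-determinant is negative), yet each step is still algebraically invertible if you also know the input. The backward pass divides by a_t < 1 at every step, so rounding error is amplified geometrically. In this toy, the gap between "invertible on paper" and "recoverable in floating point" is the thing a checkpoint-level test would need to measure.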
Repo: https://github.com/Prime-007-hash/phi-plasma-core
Compression is the right answer to "linear scaling." Conservation is a different answer to "what survives the depth."