Rtti optimization #25707
Similar performance benefits were observed with the C++ backend. Edit: the PR also still needs to be cleaned up.
I worry about how this interacts with the IC mechanism: you basically optimize these via full-program knowledge, which is something that bites IC.
Yeah, I was thinking about this today. In the current state of the PR, type ordinals are stable as long as the types are observed in the same order during sem. As I understand it, NIF should have all the information needed to build the tables. Right now I think this depends heavily on the backend artifacts being checked, since the tables themselves are not persisted. Further, there might be sensitivity problems, such as a change in import order, or a single change to an object hierarchy triggering recompilation of every module the type tree touches. I haven't tested it though, so I don't know how bad it would be. This is mostly an additive change, so maybe a compiler switch for the tradeoff?
Well, yes, in principle we have the full program as the set of NIF files, but we must ensure it keeps working and isn't too messy. Hard to test with the barely working IC impl, I know...
Drafted. Might need a compile switch. Did not increase compiler boot times.
Summary
This PR changes the tiny RTTI display encoding and adds C backend lowering for common object-dispatch chains.
Goal: make common runtime subtype disambiguation patterns generate C that is closer to enum-style dispatch, especially for sibling and same-depth chains.
What changed
RTTI display
TNimTypeV2.display now uses compiler-assigned counters instead of sparse hash-derived values.
Each display entry packs two 16-bit discriminants:
This keeps the current runtime footprint while letting:
C backend lowering
For same-selector object if/elif chains, the C backend now recognizes common dispatch shapes and emits a switch when possible.
Currently this is only enabled when tiny RTTI is enabled.
The useful fast paths are:
Mixed-generation chains still fall back to ordinary condition chains.
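To illustrate the general idea, here is a minimal hand-written sketch of the two shapes: a same-selector comparison chain and the equivalent switch over a dense ordinal. It is not the code this PR generates. `TypeInfo`, `Shape`, the `ORD_*` constants and both functions are hypothetical stand-ins, and the real display lookup is not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical dense ordinals for three sibling object types. */
enum { ORD_CIRCLE = 1, ORD_SQUARE = 2, ORD_TRIANGLE = 3 };

/* Stand-in for a runtime type descriptor; NOT the real TNimTypeV2 layout. */
typedef struct { uint16_t ordinal; } TypeInfo;
typedef struct { const TypeInfo *ti; } Shape;

/* Shape of an ordinary if/elif chain: every test reads the same selector. */
static int sides_chain(const Shape *s) {
    assert(s != NULL && s->ti != NULL);
    if (s->ti->ordinal == ORD_CIRCLE) return 0;
    else if (s->ti->ordinal == ORD_SQUARE) return 4;
    else if (s->ti->ordinal == ORD_TRIANGLE) return 3;
    return -1; /* no branch matched */
}

/* Same logic as a switch; dense case labels give the C compiler the
   option of a jump table or range check instead of serial compares. */
static int sides_switch(const Shape *s) {
    assert(s != NULL && s->ti != NULL);
    switch (s->ti->ordinal) {
    case ORD_CIRCLE:   return 0;
    case ORD_SQUARE:   return 4;
    case ORD_TRIANGLE: return 3;
    default:           return -1; /* no branch matched */
    }
}
```

Both functions return the same result for every input; only the shape presented to the C compiler differs.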
Why
The old display values were sparse and not especially helpful to backend optimization.
Limits and tradeoffs
This encoding packs two 16-bit lanes into each display slot, so it introduces a hard limit of high(uint16) for each packed discriminant space.
That means compilation fails if either of these exceeds 65535:
This is enforced with a compile-time error.
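The packing and its limit can be sketched in C as follows. This is only an illustration under assumptions: the 32-bit slot width, the choice of which lane sits in the high half, and the names `pack_slot`, `slot_hi` and `slot_lo` are all hypothetical, and the runtime check here merely stands in for the compile-time error raised by the compiler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Largest value that fits in one 16-bit lane, i.e. high(uint16) = 65535. */
#define LANE_MAX UINT16_MAX

/* Pack two counters into one 32-bit slot (assumed layout: hi lane in the
   upper 16 bits). Returns false if either counter overflows its lane. */
static bool pack_slot(uint32_t hi, uint32_t lo, uint32_t *out) {
    assert(out != NULL);
    if (hi > LANE_MAX || lo > LANE_MAX) return false;
    *out = (hi << 16) | lo;
    return true;
}

/* Extract each lane again. */
static uint16_t slot_hi(uint32_t slot) { return (uint16_t)(slot >> 16); }
static uint16_t slot_lo(uint32_t slot) { return (uint16_t)(slot & 0xFFFFu); }
```

Under this layout, `pack_slot(3, 7, &slot)` stores `0x00030007`, and any counter above 65535 is rejected rather than silently truncated.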
This PR also does not try to optimize every object-dispatch shape. The performance characteristics of mixed-generation chains are unaffected.
Benchmark notes
Two temporary benchmarks are included for investigation and review:
These are not intended to be permanent and will be removed later.
The benchmark is only a directional signal. It is small, shape-sensitive, and backend-sensitive. Similar trends were seen with Clang; the numbers below are from GCC on the C backend.
A few representative results from typedispatch_shapes.nim:
For reference, the corresponding kind-based baselines in the same benchmark were around 0.87-1.01 ns/op.
The main observation is: