
Rtti optimization #25707

Draft
Graveflo wants to merge 10 commits into nim-lang:devel from Graveflo:rtti-optimization

Conversation

Contributor

@Graveflo Graveflo commented Apr 6, 2026

Drafted. Might need a compile switch. Did not increase compiler boot times.

Summary

This PR changes the tiny RTTI display encoding and adds C backend lowering for common object-dispatch chains.

Goal: make common runtime subtype disambiguation patterns generate C that is closer to enum-style dispatch, especially for sibling and same-depth chains.

What changed

RTTI display

TNimTypeV2.display now uses compiler-assigned counters instead of sparse hash-derived values.

Each display entry packs two 16-bit discriminants:

  • low 16 bits: dense per-depth token
  • high 16 bits: dense sibling token under the immediate parent

This keeps the current runtime footprint while letting:

  • ordinary of checks use the per-depth lane
  • sibling-oriented dispatch use the sibling lane

C backend lowering

For same-selector object if/elif chains, the C backend now recognizes common dispatch shapes and emits a switch when possible.

Currently this is only enabled when tiny RTTI is enabled.

The useful fast paths are:

  • direct sibling chains
  • same-selector exact-type chains at one depth

Mixed-generation chains still fall back to ordinary condition chains.

Why

The old display values were sparse and not especially helpful to backend optimization.

  • the packed display scheme improves the discriminants used by runtime checks
  • the switch lowering is needed to achieve performance targets (particularly with GCC)

Limits and tradeoffs

This encoding packs two 16-bit lanes into each display slot, so it introduces a hard limit of high(uint16) for each packed discriminator space.

That means compilation fails if either of these exceeds 65535:

  • the number of types assigned in a per-depth bucket
  • the number of sibling ordinals assigned under one parent

This is enforced with a compile-time error.

This PR also does not try to optimize every object-dispatch shape. Mixed-generation chains have unaffected performance characteristics.

Benchmark notes

Two temporary benchmarks are included for investigation and review:

  • tests/benchmarks/typedispatch.nim
  • tests/benchmarks/typedispatch_shapes.nim

These are not intended to be permanent and will be removed later.

The benchmark is only a directional signal. It is small, shape-sensitive, and backend-sensitive. Similar trends were seen with Clang; the numbers below
are from GCC on the C backend.

A few representative results from typedispatch_shapes.nim:

  • baseline:
    • family root of: 5.139 ns/op
    • exact sibling of: 5.160 ns/op
    • exact root of: 6.661 ns/op
  • new, packed display + trivial switch lowering:
    • family root of: 0.913 ns/op
    • exact sibling of: 0.906 ns/op
    • exact root of: 7.113 ns/op
  • new, packed display + extended switch lowering:
    • family root of: 1.228 ns/op
    • exact sibling of: 1.161 ns/op
    • exact root of: 1.232 ns/op

For reference, the corresponding kind-based baselines in the same benchmark were around 0.87-1.01 ns/op.

The main observation is:

  • sibling and family dispatch become much closer to kind-based dispatch
  • same-depth exact-root chains also improve once lowered to switch
  • mixed-generation chains remain the slow path

@Graveflo Graveflo mentioned this pull request Apr 6, 2026
Contributor Author

Graveflo commented Apr 12, 2026

Similar performance benefits were observed using the C++ backend.

Edit: the PR still needs to be cleaned up.

Member

Araq commented Apr 12, 2026

I worry how this interacts with the IC mechanism; you basically optimize these via full-program knowledge -- something which bites IC.

@Graveflo
Contributor Author

> I worry how this interacts with the IC mechanism; you basically optimize these via full-program knowledge -- something which bites IC.

Yeah, I was thinking about this today. In the current state of the PR, type ordinals are stable as long as they are observed in the same order through sem. As I understand it, NIF should have all the information needed to build the tables. Right now I think this depends heavily on the backend artifacts being checked, since the tables themselves are not persisted. There might also be sensitivity problems, like changing the import order, or one change to an object hierarchy triggering a recompile of every module the type tree touches. I haven't tested it though, so I don't know how bad it would be.

This is mostly an additive change, so maybe a compiler switch for a tradeoff?

Member

Araq commented Apr 12, 2026

Well, yes, in principle we have the full program as the set of NIF files, but we must ensure it keeps working and isn't too messy. Hard to test with the hardly working IC impl, I know...
