
Commit 4e571e8

committed: WIP
1 parent 6550ee3 commit 4e571e8

File tree

3 files changed: +46 -9 lines changed


README.md

Lines changed: 10 additions & 5 deletions
@@ -93,15 +93,20 @@ The build process for `net8.0-macos` runs the ILLink first and then pipes the ou

### Windows

-TODO: .NET 9 Preview 5
+32-bit support for ILC targets was only added in .NET 9 Preview 5, as previously stated. We produce x86 binaries for a variety of reasons, the biggest one being the burden of distributing the Chromium Embedded Framework, which we use for displaying HTML content. Other apps may have similar requirements to produce 32-bit binaries in order to consume external libraries, or to build components that run inside 32-bit contexts.

-### ARM processors
+Unlike CoreCLR (as of .NET 9 Preview 5), the NativeAOT port for win-x86 uses funclet-based exception handling instead of the legacy SEH code. Funclet-based exception handling was easier to implement in the NativeAOT runtime because it shares code with most of the other targets.
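To make the funclet model a bit more concrete, here is a small hedged C# illustration (editorial, not code from the runtime sources): on funclet-based targets, each filter, catch, and finally handler is compiled as a separate small routine that the exception dispatcher invokes, instead of everything being handled through one chained SEH frame.

```csharp
using System;

static class FuncletIllustration
{
    // On funclet-based targets, the filter, the catch body, and the finally body below
    // are each emitted as a separate small routine (a "funclet") that the exception
    // dispatcher calls, rather than being part of one SEH-style frame handler.
    public static int Parse(string text)
    {
        try
        {
            return int.Parse(text);                       // main method body
        }
        catch (FormatException ex) when (text.Length > 0) // the filter becomes a funclet
        {
            Console.Error.WriteLine(ex.Message);          // the catch handler becomes a funclet
            return -1;
        }
        finally
        {
            Console.Error.WriteLine("Parse attempted");   // the finally handler becomes a funclet
        }
    }
}
```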
+
+As exception handling is being largely rewritten in CoreCLR for .NET 9 to bring it closer to NativeAOT and to improve performance, it's likely that this topic will be revisited in the future and the approaches unified. CoreCLR has historically offered interoperability with SEH exceptions thrown in interop scenarios, while NativeAOT never supported this. This may prove to be a challenge in any future unification endeavours.

-TODO: ARM[64], branch limits, linker
+### Undocumented limits

-### X86 oddities
+One thing of note is that non-X86/X64 platforms often come with limits imposed by the instruction set. While in most cases these are handled transparently to the developer, there are cases where one needs to be aware of them. Due to tooling bugs, these limits may result in user-visible errors, degraded performance, or a larger executable file size.

-TODO: X86 is odd, exception handling
+A good example of such a limit is relative branches in the generated code. Notably, ARM64 has a ±128 MiB limit for those branches and ARM32 has a ±16 MiB limit in the Thumb-2 instruction encoding. Any longer branch in the code needs to be encoded with a different set of instructions, or handled through a thunk. A thunk, in this case, is a piece of code placed within the relative branch range that contains the longer jump sequence. These thunks are typically produced by the platform linker, but in some cases the compiler can generate them too. ILC, the NativeAOT compiler, currently doesn't generate them, at the expense of producing more pessimistic code.
+
+Another set of limits is imposed by the object file format - e.g. section sizes or file offsets represented using 32-bit data types - limiting the output executable size to roughly 2 GiB. These limits differ by platform and the 2 GiB figure is not perfectly accurate, but it's a good ballpark.
+
+Lastly, one undocumented limit is the size of the unwinding section on Apple platforms (macOS/iOS/tvOS). The unwinding section is used for exception handling, for producing stack traces, and for garbage collection. We will talk about this particular limit later in the section dedicated to the [Object Writer](ilc-resource-usage.md#object-file-emitting).

## Main project

ilc-resource-usage.md

Lines changed: 28 additions & 3 deletions
@@ -59,7 +59,32 @@ The object writer supports three general object file formats - COFF (Windows), M
and ELF (Linux, FreeBSD). The general structure of the object file format is pretty similar
across the formats even though the on-disk structure is different.

-TODO: Unwind tables, compact unwind tables, 24M limitation, lack of compact unwinding on
-macOS and WinARM64
+## Unwind tables

-TODO: Experiment with scaling by parallelization of GetData
+One particular part of the object file writer emits the unwind tables. The unwind tables are used for walking the call stack during exception handling, garbage collection, and other operations. The format of these tables depends on the platform.
+
+Windows has its own table-based format that's described in the [PE file format specification](https://learn.microsoft.com/en-us/windows/win32/debug/pe-format). There's a PDATA table that maps code addresses to unwind data. The actual unwind data is in the XDATA table. The format of these tables differs slightly for each platform. Notably, ARM64 has a way to represent common function prologs/epilogs using compact codes. This is currently not used by ILC and it may be worth exploring to get some size benefits.
+
+macOS and Linux both historically use the same standardized format - DWARF. The format is a bit verbose but also incredibly powerful. In fact, it's so powerful that an [academic paper proved that it's Turing complete](https://static.usenix.org/event/woot11/tech/final_files/Oakley.pdf).
+
+Because DWARF itself is very generic and verbose, Apple came up with a proprietary solution that augments it - compact unwinding tables. The general concept is the same as on Windows - represent the common prologs/epilogs with a simple 32-bit code, and fall back to the DWARF format for everything more complex. Just like on Windows, the actual code format differs per platform. The compact unwinding tables serve a secondary purpose too. DWARF itself doesn't have an index for quickly locating the unwind information for a given code address. The solution that Apple devised is to use the compact unwinding tables as the index and reserve a single code with a 24-bit offset as a way to map a code address to a location inside the DWARF section. The downside of this approach is that the size of the DWARF section is limited to around 16 MiB (2^24 bytes).
+
+The .NET ARM64 JIT currently generates prologs that are incompatible with the Apple compact unwinding encoding. There's an [issue filed](https://github.com/dotnet/runtime/issues/76371) to investigate how we can improve that in the future. Notably, while evaluating the benefits of compact unwinding we found that typical programs contain roughly between 20% and 30% leaf methods (i.e. methods that don't call any other method) with a prolog that doesn't save any registers on the stack except for LR (Link Register) and FP (Frame Pointer). It turns out that this specific sequence can be represented by compact unwinding, so ILC detects it and emits the compact unwinding code there. The large proliferation of these methods [begs the question whether we can get rid of the prolog/epilog](https://github.com/dotnet/runtime/issues/88823) and get code size improvements in addition to the size optimization of the unwind tables.
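As a rough illustration of what counts as a leaf method (a made-up example; whether the JIT actually emits the minimal FP/LR-only prolog for any particular method depends on codegen):

```csharp
// A leaf method: it calls no other methods, so on ARM64 its prolog typically only
// needs to save the frame pointer (FP) and the link register (LR) - exactly the
// shape that the reserved compact unwinding code can describe.
static int SumOfDigits(int value)
{
    int sum = 0;
    while (value > 0)
    {
        sum += value % 10; // pure arithmetic, no calls
        value /= 10;
    }
    return sum;
}
```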
+
+The 16 MiB size limit for the DWARF section on Apple platforms is generally problematic. It places an additional limit on the executable code size that is difficult to express and document. Some versions of the Apple linker are known to silently produce corrupted output when this limit is hit, which is notoriously difficult to diagnose. We also cannot reliably detect it in the ILC compiler since the limit is imposed on the final executable, not on the object files. An error can be produced when the DWARF section size hits the limit in a generated object file, but that doesn't guarantee that the final executable is not broken.
+
+# Experiments
+
+When we embarked on the journey of profiling the ILC compiler, the expectation was that each of the above phases would be clearly visible in the profile - that we would see the memory usage increase steadily during the mark and compile phases and then quickly spike for a short period while the object file was emitted.
+
+Turns out, it's never so simple. WinForms in particular has rather deep object hierarchies which hit [unexpected bottlenecks in the marking phase](https://github.com/dotnet/runtime/issues/103034). There were a few other similar issues that were hit along the way:
+- [Hash codes that produced guaranteed collisions](https://github.com/dotnet/runtime/issues/103070)
+- [Type checks that unexpectedly dominate processing time](https://github.com/dotnet/runtime/pull/103199)
+- [Caches that perform worse than if the cache was not present](https://github.com/dotnet/runtime/issues/103285)
+- [Repeated dictionary lookups that can be optimized](https://github.com/dotnet/runtime/pull/103202)
+
+Most of these issues are clearly visible in the profiler when compiling a project of this size. They are often easy to fix with just a few lines of code. Ever wondered when your knowledge of algorithms and O-notation comes in useful in the real world? This is precisely the area. Simple changes may result in huge speed-ups. Fixing the above problems saves more than 3 minutes, or 30%, of the compilation time.
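To illustrate how small these fixes typically are, here is a hedged sketch of the repeated-lookup pattern behind the last bullet above (illustrative only, not the actual ILC change): a `ContainsKey` check followed by the indexer hashes the key twice, while a single `TryGetValue` does the same work once.

```csharp
using System.Collections.Generic;

static class LookupSketch
{
    // Before: two hash lookups per key (ContainsKey + indexer).
    static int CountBefore(Dictionary<string, int> hits, string key)
    {
        if (hits.ContainsKey(key))
            return hits[key];
        return 0;
    }

    // After: a single TryGetValue does the same work with one lookup.
    static int CountAfter(Dictionary<string, int> hits, string key)
    {
        return hits.TryGetValue(key, out int count) ? count : 0;
    }
}
```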
+
+Another notable insight is that DATAS reduced the peak memory usage from ~20 GiB to ~15 GiB (private working set).
+
+Lastly, there are some expensive `ObjectNode.GetData` calls that run in a single-threaded fashion in the object file emitting phase. This is an area for potential improvement in compile times. We implemented a [limited parallelization of the `GetData` calls](https://github.com/filipnavara/runtime/commit/44cc7e653300cce8db142c285ab0651c1635b06b). This resulted in a speed-up of 20+ seconds. Unfortunately, the prototype doesn't guarantee fully deterministic output as is. A more limited approach focusing on precalculating just the expensive data (e.g. virtual method slots in `EETypeNode`) may be a viable alternative to explore that should yield at least 60% of the time savings produced by the simple prototype.
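To sketch what "precalculating just the expensive data" could look like (purely hypothetical code - the `Node`, `ComputeExpensiveData`, and `Emit` names are invented and are not ILC APIs): the costly per-node work runs in parallel up front, while the emission loop that consumes it stays sequential, so the output order - and therefore determinism - is preserved.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

static class PrecomputeSketch
{
    // Hypothetical node type standing in for something like EETypeNode.
    sealed record Node(string Name);

    // Placeholder for the real expensive per-node computation.
    static byte[] ComputeExpensiveData(Node node) => new byte[node.Name.Length];

    static List<byte> Emit(IReadOnlyList<Node> nodes)
    {
        // Phase 1: compute the expensive per-node data in parallel up front.
        var cache = new ConcurrentDictionary<Node, byte[]>();
        Parallel.ForEach(nodes, node => cache[node] = ComputeExpensiveData(node));

        // Phase 2: consume the cache sequentially, in the original node order,
        // so the emitted output stays byte-for-byte deterministic.
        var output = new List<byte>();
        foreach (var node in nodes)
            output.AddRange(cache[node]); // stand-in for writing node data to the object file

        return output;
    }
}
```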

third-party-libraries.md

Lines changed: 8 additions & 1 deletion
@@ -75,7 +75,7 @@ extension methods also come with overloads that include the `JsonTypeInfo<T>` pa
### XML serialization and deserialization

There's currently no source generator for XML serializers like there's one for JSON. The
-options to handle the situation are limited. We employed two different approaches to handle
+options to handle the situation are limited. We employed <strike>two</strike>three different approaches to handle
the problem. In some cases the easiest way is to hand write the serialization using the
`XmlReader`/`XmlWriter` classes. In other cases we crafted a way to use the SGen tools to
generate the (de)serialization code automatically with minimal refactoring.
@@ -86,6 +86,7 @@ and `XmlSchema GetSchema()`. For our purposes the last method is irrelevant and
return `null`. The other two methods then implement strongly typed (de)serialization that
can be called instead of `XmlSerializer.[De]Serialize`.
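For readers unfamiliar with the pattern, here is a minimal hedged sketch of what such a hand-written implementation can look like (the `Person` class, the attribute names, and the exact member shape are invented for illustration; the real code in the project may differ):

```csharp
using System.Xml;
using System.Xml.Schema;
using System.Xml.Serialization;

// A minimal sketch of the hand-written approach, assuming an IXmlSerializable-style shape.
public sealed class Person : IXmlSerializable
{
    public string Name { get; set; }
    public int Age { get; set; }

    // Irrelevant for our purposes, so it simply returns null as described above.
    public XmlSchema GetSchema() => null;

    // Strongly typed deserialization, callable instead of XmlSerializer.Deserialize.
    public void ReadXml(XmlReader reader)
    {
        Name = reader.GetAttribute("name");
        Age = int.Parse(reader.GetAttribute("age") ?? "0");
        reader.Skip(); // consume the element
    }

    // Strongly typed serialization, callable instead of XmlSerializer.Serialize.
    public void WriteXml(XmlWriter writer)
    {
        writer.WriteAttributeString("name", Name);
        writer.WriteAttributeString("age", Age.ToString());
    }
}
```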

+<strike>
In cases where the amount of XML was too big for the hand-written approach we devised a
clever trick to use SGen to generate the code and embed it. For those of you not familiar
with SGen, it is a tool that has been available since .NET Framework to pregenerate serialization
@@ -114,6 +115,12 @@ then it gets included into a compile item for the main project. For example, if
`MyDataClass` class then you will get a `MyDataClassSerializer` generated class. The new
`MyDataClassSerializer` class can now be used in place of `XmlSerializer`. This will
still generate code that produces AOT warnings but those can be ignored with a local suppression.
+</strike>
+
+It turns out that the `XmlSerializer` still depends on reflection even for the source-generated
+mapping due to how it internally handles `Mode`. ILLink annotations may be the only way to
+work around it for now, and at that point there's very little benefit to using the pregenerated
+serializers.

## Alternative AOT-safe libraries
