Dyno: cache function signature instantiations, reduce cache impact of some queries #27082


Merged
DanilaFe merged 8 commits into chapel-lang:main on Apr 13, 2025

Conversation

@DanilaFe (Contributor) commented Apr 9, 2025

This PR aims to further improve Dyno's resolution performance.

I started by looking at the number of queries we call, in the hope of reducing the number of invocations. idToAst has historically been one of Dyno's hottest queries, and it remains so. However, unlike previous times, I found no opportunities to elide calls to this function. Instead, I took the following steps to reduce the amount of work done by Dyno:

  • I determined that scopeForId and its recursive self-invocations make up the bulk of calls to idToAst. I found no ways to reduce uses of idToAst within scopeForId, but I did find a way to reduce the number of scopeForId invocations. It turns out that GatherMentionedModules invoked scopeForId for every identifier (and scopeForId, in turn, drives idToAst). Many identifiers share scopes (they don't create their own!), so this caused redundant cache entries. Instead, this PR adjusts GatherMentionedModules to match other visitors (e.g., Resolver) and push scopes when a scope-creating AST node is entered. On its own, this actually increased the number of calls to scopeForId, because not all scopes contain identifiers; I thus further tweaked the visitor to invoke scopeForId lazily, only when it's needed. This had no noticeable runtime performance impact, but it reduced the number of queries executed by a large margin.
  • I observed that the returnType() query always invokes returnTypeWithoutIterable (aka yieldType()). As a result, there is always an equal number of cache entries for the two. Moreover, several places in the resolver use both queries, which means double the lookups. I fused the two queries into a single, tuple-returning query (sketched below). This halves the number of storage entries required for computing the return type and, where applicable, also halves the number of query cache lookups. I didn't measure any performance impact in release mode, but it did reduce the number of queries executed.

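To make the fusion concrete, here is a minimal sketch assuming hypothetical, simplified types; the real queries go through Dyno's query framework rather than a bare `std::map`:

```cpp
#include <map>
#include <tuple>

// Hypothetical, simplified stand-ins for Dyno's types.
using FunctionId = int;
struct QualifiedType {};

// Before: returnType() and returnTypeWithoutIterable() were separate
// queries, so every function had two cache entries, and callers needing
// both types paid for two lookups. After: one fused query returns both
// results as a tuple, halving entries and (where both are needed) lookups.
std::tuple<QualifiedType, QualifiedType> returnTypes(FunctionId fn) {
  static std::map<FunctionId, std::tuple<QualifiedType, QualifiedType>> cache;
  auto it = cache.find(fn);
  if (it != cache.end()) return it->second;

  QualifiedType yieldType;   // compute returnTypeWithoutIterable here
  QualifiedType returnType;  // wrap in an iterable type if fn is an iterator

  auto result = std::make_tuple(returnType, yieldType);
  cache.emplace(fn, result);
  return result;
}
```
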
While debugging the above, I noticed that newlines in param strings garbled the output of --dyno-enable-tracing. I adjusted the DETAIL logging of these strings to escape newlines so that tracing output stays well-formed.
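
A minimal sketch of the kind of escaping involved (the function name is hypothetical, not the actual Dyno helper):

```cpp
#include <string>

// Replace literal newlines (and carriage returns) with visible escape
// sequences so multi-line param strings stay on one trace line.
std::string escapeForTrace(const std::string& s) {
  std::string out;
  out.reserve(s.size());
  for (char c : s) {
    if (c == '\n')      out += "\\n";
    else if (c == '\r') out += "\\r";
    else                out += c;
  }
  return out;
}
```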

After this, I turned back to profiler output. I noticed that saveDependencyInParent contributes a large amount of overhead. I also noticed some oddities in debug-mode profiles: creating the start and end iterators for the recursion error set was taking a significant amount of time, even though there ought not to be any recursion errors at all. I guarded the recursion error insertion (which creates these iterators) behind a size check, and made other similar changes. This reduced the runtime overhead of saveDependencyInParent by 0.5 seconds in the debug build, though the change is within noise in release mode.
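
A sketch of the guard, with hypothetical names standing in for the actual bookkeeping in saveDependencyInParent:

```cpp
#include <set>

// In debug builds, even an empty-range insert pays for constructing the
// source set's begin/end iterators (checked iterators are expensive);
// testing emptiness first skips that work in the common no-errors case.
void saveRecursionErrors(std::set<int>& parentErrors,
                         const std::set<int>& childErrors) {
  if (!childErrors.empty()) {
    parentErrors.insert(childErrors.begin(), childErrors.end());
  }
}
```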

I also noticed that CHPL_ASSERT executes its body even in release mode. After checking with @arezaii, @dlongnecke-cray, and @mppf, it seems there's no reason to do so in release builds, so this PR removes that behavior.
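
A sketch of the intended behavior, assuming an NDEBUG-style release check; the real CHPL_ASSERT lives in the Chapel sources and may differ in detail:

```cpp
#include <cstdio>
#include <cstdlib>

// Debug builds evaluate and check the condition; release builds (NDEBUG)
// compile the assertion away entirely, so its body is never executed.
#ifdef NDEBUG
  #define CHPL_ASSERT_SKETCH(cond) ((void)0)
#else
  #define CHPL_ASSERT_SKETCH(cond)                              \
    do {                                                        \
      if (!(cond)) {                                            \
        std::fprintf(stderr, "assert failed: %s\n", #cond);     \
        std::abort();                                           \
      }                                                         \
    } while (0)
#endif
```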

Finally, following @benharsh's suggestion to investigate re-traversals, I discovered that the generated formals for the _range constructor were being re-traversed thousands of times. This was because calls to instantiateSignature were not cached, so each invocation of a generic constructor triggered re-resolution. I turned instantiateSignature into a query, winning roughly 10% in performance on my benchmark (still the sample program from https://github.com/Cray/chapel-private/issues/7139): ~3.5 seconds -> ~3.15 seconds. This narrows the gap between Dyno and production on this benchmark to ~20%.
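
A minimal sketch of the memoization this enables, with hypothetical key and result types; the real instantiateSignature is a query in Dyno's framework with much richer keys:

```cpp
#include <map>
#include <tuple>

// Hypothetical, simplified stand-ins for Dyno's types.
struct SigKey {
  int fnId;
  int typeArgs;  // stand-in for the instantiation's actual arguments
  bool operator<(const SigKey& o) const {
    return std::tie(fnId, typeArgs) < std::tie(o.fnId, o.typeArgs);
  }
};
struct TypedSignature {};

// Before: every call re-resolved the generic's formals, so e.g. the
// _range constructor's formals were traversed thousands of times.
// After: each instantiation is resolved once and reused from the cache.
TypedSignature instantiateSignature(const SigKey& key) {
  static std::map<SigKey, TypedSignature> cache;
  auto it = cache.find(key);
  if (it != cache.end()) return it->second;

  TypedSignature result;  // resolve the formals for this instantiation here
  cache.emplace(key, result);
  return result;
}
```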

Encouragingly, I'm seeing Dyno reach comparable performance while resolving other benchmarks. Comparing invocations of --dyno-resolve-only and --stop-after-pass=resolve, I saw the following results:

| Benchmark | Time (Dyno, main) | Time (Dyno, this PR) | Time (Production) | Relative Time (this PR vs Production, lower is better) |
|---|---|---|---|---|
| Issue 7139 (Motivator) | 3.497 s ± 0.079 s | 3.140 s ± 0.034 s | 2.545 s ± 0.038 s | 1.23 ± 0.02 |
| parIters.chpl primer | 2.483 s ± 0.072 s | 2.372 s ± 0.029 s | 2.314 s ± 0.023 s | 1.02 ± 0.02 |
| atomics.chpl primer | 2.314 s ± 0.028 s | 2.177 s ± 0.038 s | 2.277 s ± 0.040 s | 0.95 ± 0.03 |
| forallLoops.chpl primer | 2.571 s ± 0.037 s | 2.468 s ± 0.032 s | 2.358 s ± 0.041 s | 1.05 ± 0.02 |

Reviewed by @benharsh -- thanks!

Testing

  • dyno tests
  • paratest

DanilaFe added 8 commits April 7, 2025 14:37
One always calls the other, which means we are allocating twice the entries and doing twice the lookups in the query cache otherwise.
@benharsh benharsh self-requested a review April 9, 2025 22:17
@mppf (Member) commented Apr 10, 2025

Nice!

@DanilaFe DanilaFe merged commit 35e4f44 into chapel-lang:main Apr 13, 2025
10 checks passed
DanilaFe added a commit that referenced this pull request Apr 15, 2025
…ns (#27105)

Depends on #27082.

This PR is another in a series of efforts to improve Dyno's performance.
It has some minor changes, as well as a major one: enabling a
rudimentary caching of function instantiations. Specifically, this PR
enables caching and re-use of functions that did not use PoI (most of
them, it turns out!). This had a huge impact on resolution performance:
in my benchmarks, `chpl --dyno-resolve-only` is now __15-22% faster than
production__ with `--stop-after-pass=resolve`.

## Memory performance
In this PR, I spent a considerable amount of time looking at the memory
allocations in Dyno, motivated by the fact that, when comparing
performance, Dyno spent significantly more time in the "system", but a
comparable amount of time in "user" execution. I assumed that File IO
could not be the cause (presumably both Dyno and production are reading
the same files), which made me investigate allocations. Indeed, Dyno
makes a number of small allocations, and this is what I targeted first.

1. A global string, `<unknown library path>`, was being interned
numerous times. This PR makes it a global, pre-allocated string, which
greatly reduced the number of allocations while attempting to intern the
string.
2. ~~The `CheckedScopes` set was updated numerous times as part of scope
resolution, triggering small allocations. This PR switches to using
C++'s
[memory_resource](https://en.cppreference.com/w/cpp/header/memory_resource)
framework, which enables stack-allocating some storage for the
`CheckedScopes` set. This led to a modest performance improvement
(around 3%)~~
* Apparently, this C++17 stdlib feature is not available prior to GCC
9.1, which is above our minimum of 7.5. Since it's not as crucial to the
improvements in this PR, I've removed it to enable all builds to pass.
(A sketch of the removed approach appears after this list.)
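
For reference, the removed approach looked roughly like this sketch (the buffer size and element type are illustrative, not the real `CheckedScopes` entries):

```cpp
#include <cstddef>
#include <memory_resource>
#include <set>

// A monotonic_buffer_resource hands out memory from the stack buffer and
// only falls back to the heap (its upstream resource) once the buffer is
// exhausted, so typical small sets allocate nothing on the heap.
void scopeResolveOne() {
  std::byte buffer[4096];
  std::pmr::monotonic_buffer_resource arena(buffer, sizeof(buffer));
  std::pmr::set<int> checkedScopes(&arena);

  checkedScopes.insert(1);
  checkedScopes.insert(2);
}
```

Since a `monotonic_buffer_resource` never frees memory until it is destroyed, it suits short-lived, bounded sets like this one.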

## Instantiation Caching
I also tackled a direction suggested by @mppf: enabling re-use of
function instantiations like production does (#16261). As a specific
case: functions that resolve without PoI are guaranteed to resolve the
same way in any PoI scope (non-PoI candidates are always preferred), so
there's no need to re-resolve their bodies. This PR enables that
specific case by building on the `PoiInfo` framework already in place
through Michael's efforts.

Although it seems to me that the `PoiInfo` was intended to fit into the
`resolveFunctionByInfo` query (and enable other queries to use
"similar", but not equal, calls to the query), I found that it's most
efficient to introduce _another_ query
(`resolveFunctionByResolvedInfo`). The current scheme is as follows:

* `resolveFunctionByInfo` caches the results of resolving a function in
a particular PoI scope. This covers _any_ resolution, including
resolutions that ended up using PoI functions. It is useful when a
function that uses PoI (and is thus not covered by this PR's
optimization, which re-uses results of non-PoI instantiations) is
called multiple times in the same scope.
* `resolveFunctionByResolvedInfo` caches function instantiations along
with their "PoI traces" (which track all PoI calls). This cache is
queried based on `PoiInfo::canReuse`, which today simply checks that no
PoI function calls were made while resolving the body of the function.
This covers a very large number of generic functions in the standard
library. In the future, if `canReuse` is extended to handle more cases,
`resolveFunctionByResolvedInfo` will start caching those results too.

Importantly, "PoI traces" used as keys for the
`resolveFunctionByResolvedInfo` query can only be computed after a
function is resolved. This necessitates having `resolveFunctionByInfo`
set `resolveFunctionByResolvedInfo`. However, naively, setting queries
does not fit into the query system. As a result, this PR adjusts the
`QUERY_STORE_RESULT` to behave sanely, by invalidating results that were
"stored" into queries externally under recomputation.

## Results

In release mode, I've observed the following times:

| Test | Time to resolve `issue-7139.chpl` | Cumulative improvement |
|-----------------------|-----------------------------------|------------------------|
| reference | 3.15 s | - |
| memory optimizations | 3.05 s | 3% |
| instantiation caching | 2.0 s | 36% |

Comparisons to production are encouraging as well.

| Benchmark | Time (Dyno #27082) | Time (Dyno, this PR) | Time (Production) | Relative Time (Production vs Dyno, this PR, __higher is better__) |
|--------------------------|--------------------|----------------------|--------------------|-------------------------------------------------------------------|
| Issue 7139 (Motivator) | 3.140 s ± 0.034 s | 1.968 s ± 0.052 s | 2.364 s ± 0.020 s | 1.22 ± 0.01 |
| `parIters.chpl` primer | 2.372 s ± 0.029 s | 1.918 s ± 0.041 s | 2.181 s ± 0.027 s | 1.14 ± 0.03 |
| `atomics.chpl` primer | 2.177 s ± 0.038 s | 1.888 s ± 0.040 s | 2.148 s ± 0.022 s | 1.14 ± 0.03 |
| `forallLoops.chpl` primer | 2.468 s ± 0.032 s | 1.936 s ± 0.034 s | 2.209 s ± 0.037 s | 1.14 ± 0.03 |

**Above, we were 15-22% better than production.** 

> [!WARNING]  
> These performance results do not and will not represent Dyno's final
> performance. Dyno is not yet capable of resolving every primer, which
> suggests there may be features missing (and thus, not "costing time"),
> and prevents an accurate performance analysis on a fully representative
> sample of programs.
>
> Moreover, Dyno's performance often comes down to caching. This might
> mean that for large programs (which consume more cache space), the time
> savings will be smaller.

Reviewed by @dlongnecke-cray -- thanks!

## Testing
- performance tests (see above)
- dyno tests (4x faster now than they used to be before my work!)
- paratest