Dyno: cache function signature instantiations, reduce cache impact of some queries #27082
Merged: DanilaFe merged 8 commits into `chapel-lang:main` from `DanilaFe:further-further-perf-work` on Apr 13, 2025.
Conversation
Signed-off-by: Danila Fedorin <[email protected]>
One always calls the other, which otherwise means allocating twice the entries and doing twice the lookups in the query cache. Signed-off-by: Danila Fedorin <[email protected]>
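The fusion this commit describes can be sketched as a single memoized function that returns both results at once, so one cache entry and one lookup serve both callers. This is a minimal illustration with a plain map-based cache; the names and the derivation logic are stand-ins, not Dyno's actual query API:

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for a Dyno type result.
struct Type { std::string name; };

// One fused, memoized query returning both the return type and the
// yield type, instead of two queries that each cached separately.
static std::map<std::string, std::pair<Type, Type>> fusedCache;

std::pair<Type, Type> returnAndYieldType(const std::string& fnId) {
  auto it = fusedCache.find(fnId);
  if (it != fusedCache.end()) return it->second;  // one lookup serves both
  // Compute the yield type, then derive the return type from it, mirroring
  // how returnType() always invokes returnTypeWithoutIterable().
  Type yield{fnId + ".yield"};
  Type ret{yield.name + ".iterable"};
  fusedCache.emplace(fnId, std::make_pair(ret, yield));
  return {ret, yield};
}
```

Callers that previously issued two lookups (one per query) now issue one and discard the half of the pair they don't need.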
benharsh approved these changes on Apr 9, 2025: "Nice!"
DanilaFe added a commit that referenced this pull request on Apr 15, 2025:
…ns (#27105)

Depends on #27082. This PR is another in a series of efforts to improve Dyno's performance. It contains some minor changes as well as a major one: enabling rudimentary caching of function instantiations. Specifically, this PR enables caching and re-use of functions that did not use PoI (most of them, it turns out!). This had a huge impact on resolution performance: in my benchmarks, `chpl --dyno-resolve-only` is now __15-22% faster than production__ with `--stop-after-pass=resolve`.

## Memory performance

In this PR, I spent a considerable amount of time looking at memory allocations in Dyno, motivated by the fact that, when comparing performance, Dyno spent significantly more time in the "system" but a comparable amount of time in "user" execution. I assumed that file IO could not be the cause (presumably both Dyno and production read the same files), which led me to investigate allocations. Indeed, Dyno makes a number of small allocations, and this is what I targeted first.

1. A global string, `<unknown library path>`, was being interned numerous times. This PR makes it a global, pre-allocated string, which greatly reduced the number of allocations made while interning the string.
2. ~~The `CheckedScopes` set was updated numerous times as part of scope resolution, triggering small allocations. This PR switches to using C++'s [memory_resource](https://en.cppreference.com/w/cpp/header/memory_resource) framework, which enables stack-allocating some storage for the `CheckedScopes` set. This led to a modest performance improvement (around 3%).~~
   * This C++17 stdlib feature turns out not to be available prior to GCC 9.1, which is above our minimum of 7.5. Since it's not as crucial to the improvements in this PR, I've removed it so that all builds pass.

## Instantiation caching

I also tackled a direction suggested by @mppf: enabling re-use of function instantiations like production does (#16261).
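The pre-allocated string from the memory section above can be sketched roughly as follows, using a toy interner with illustrative names (not Dyno's actual interning API). The point is that the interned pointer is computed once and reused, so the hot path never rebuilds or re-hashes the string:

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>

// Toy interner; names are illustrative, not Dyno's actual API.
struct Interner {
  std::unordered_set<std::string> pool;
  std::size_t allocations = 0;
  const std::string* intern(const std::string& s) {
    auto [it, inserted] = pool.insert(s);
    if (inserted) allocations++;
    return &*it;
  }
};

// Before: every caller built a fresh std::string("<unknown library path>")
// and re-ran the intern lookup. After: the interned pointer is computed
// once, so repeated calls allocate nothing and skip the table entirely.
const std::string* unknownLibraryPath(Interner& in) {
  static const std::string* cached = in.intern("<unknown library path>");
  return cached;
}
```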
As a specific case: because functions that resolve without PoI are guaranteed to resolve the same way in any PoI scope (non-PoI candidates are always preferred), there is no need to re-resolve their bodies. This PR enables that specific case by building on the `PoiInfo` framework already in place through Michael's efforts. Although it seems to me that `PoiInfo` was intended to fit into the `resolveFunctionByInfo` query (and to enable other queries to use "similar", but not equal, calls to that query), I found it most efficient to introduce _another_ query, `resolveFunctionByResolvedInfo`. The current scheme is as follows:

* `resolveFunctionByInfo` caches the results of resolving a function in a particular PoI scope. This covers _any_ resolution, including resolutions that ended up using PoI functions. It is useful when a function that uses PoI (and thus is not covered by the optimization in this PR, which re-uses results of non-PoI instantiations) is called multiple times in the same scope.
* `resolveFunctionByResolvedInfo` caches function instantiations along with their "PoI traces" (which track all PoI calls). This cache is queried based on `PoiInfo::canReuse`, which today simply checks that no PoI function calls were made while resolving the body of the function. This covers a very large number of generic functions in the standard library. In the future, if `canReuse` is extended to check more cases, `resolveFunctionByResolvedInfo` will start caching those results too.

Importantly, the "PoI traces" used as keys for the `resolveFunctionByResolvedInfo` query can only be computed after a function is resolved. This necessitates having `resolveFunctionByInfo` set `resolveFunctionByResolvedInfo`. Naively, however, setting queries does not fit into the query system. As a result, this PR adjusts `QUERY_STORE_RESULT` to behave sanely, by invalidating results that were "stored" into queries externally when recomputation occurs.
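The reuse condition described above can be sketched as follows. This is a simplified illustration, not Dyno's real `PoiInfo` or resolver: a resolved function records its "PoI trace", and only trace-free results land in a cache that is keyed by the function alone (no PoI scope), which is sound precisely because such functions resolve the same way in any scope:

```cpp
#include <map>
#include <set>
#include <string>

// Illustrative stand-ins for Dyno's PoiInfo machinery; not the real API.
struct PoiInfo {
  std::set<std::string> poiCalls;  // the "PoI trace": PoI calls made in the body
  bool canReuse() const { return poiCalls.empty(); }  // today's simple check
};

struct ResolvedFunction {
  std::string id;
  PoiInfo trace;
};

// Cache keyed only by the function, not the PoI scope: valid precisely
// because a function with an empty trace resolves the same way everywhere.
static std::map<std::string, ResolvedFunction> reuseCache;
static int bodyResolutions = 0;

ResolvedFunction resolveFunction(const std::string& id, bool usesPoi) {
  auto it = reuseCache.find(id);
  if (it != reuseCache.end()) return it->second;  // reuse across PoI scopes
  bodyResolutions++;  // stands in for actually resolving the body
  ResolvedFunction rf{id, {}};
  if (usesPoi) rf.trace.poiCalls.insert("somePoiCall");
  if (rf.trace.canReuse()) reuseCache.emplace(id, rf);
  return rf;
}
```

A function whose body made a PoI call never enters the cache, so it is re-resolved per scope, matching the per-scope `resolveFunctionByInfo` behavior.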
## Results

In release mode, I've observed the following times:

| Test                  | Time to resolve `issue-7139.chpl` | Cumulative improvement |
|-----------------------|-----------------------------------|------------------------|
| reference             | 3.15s                             | -                      |
| memory optimizations  | 3.05s                             | 3%                     |
| instantiation caching | 2.0s                              | 36%                    |

Comparisons to production are encouraging as well.

|                          | Time (Dyno #27082) | Time (Dyno, this PR) | Time (Production) | Relative Time (Production vs Dyno, this PR, __higher is better__) |
|--------------------------|--------------------|----------------------|-------------------|-------------------------------------------------------------------|
| Issue 7139 (Motivator)   | 3.140 s ± 0.034 s  | 1.968 s ± 0.052 s    | 2.364 s ± 0.020 s | 1.22 ± 0.01                                                       |
| `parIters.chpl` primer   | 2.372 s ± 0.029 s  | 1.918 s ± 0.041 s    | 2.181 s ± 0.027 s | 1.14 ± 0.03                                                       |
| `atomics.chpl` primer    | 2.177 s ± 0.038 s  | 1.888 s ± 0.040 s    | 2.148 s ± 0.022 s | 1.14 ± 0.03                                                       |
| `forallLoops.chp` primer | 2.468 s ± 0.032 s  | 1.936 s ± 0.034 s    | 2.209 s ± 0.037 s | 1.14 ± 0.03                                                       |

**Above, we were 15-22% better than production.**

> [!WARNING]
> These performance results do not and will not represent Dyno's final performance. Dyno is not capable of resolving every primer, which indicates that there may be features missing (and thus not "costing time"), and means I cannot perform an accurate performance analysis on a fully representative sample of programs.
>
> Moreover, Dyno's performance often comes down to caching. This might mean that for large programs (which consume more cache space), the time savings will be smaller.

Reviewed by @dlongnecke-cray -- thanks!

## Testing

- performance tests (see above)
- dyno tests (4x faster now than they used to be before my work!)
- paratest
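The adjusted `QUERY_STORE_RESULT` semantics mentioned earlier can be sketched as follows. This is a hypothetical miniature of a query cache, not Dyno's actual query system: externally "stored" results are tagged with the revision at which they were stored and discarded once recomputation begins in a newer revision:

```cpp
#include <map>
#include <optional>
#include <string>

// Minimal sketch: results "stored" into a query from outside carry the
// revision at which they were stored, and are invalidated (rather than
// trusted) once the revision advances. Names are illustrative only.
struct StoredQuery {
  struct Entry { std::string value; int storedAtRevision; };
  std::map<std::string, Entry> cache;
  int currentRevision = 0;

  void storeResult(const std::string& key, const std::string& value) {
    cache[key] = {value, currentRevision};
  }

  // Returns the stored result only if it is still from the current
  // revision; otherwise the caller must recompute (and may re-store).
  std::optional<std::string> lookup(const std::string& key) {
    auto it = cache.find(key);
    if (it == cache.end()) return std::nullopt;
    if (it->second.storedAtRevision != currentRevision) {
      cache.erase(it);  // drop the externally-stored stale result
      return std::nullopt;
    }
    return it->second.value;
  }

  void bumpRevision() { currentRevision++; }
};
```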
This PR aims to further improve Dyno's resolution performance.
I started by looking at the number of queries we call, in hopes of reducing the number of invocations. In the past, `idToAst` has been one of Dyno's hottest queries, and this remains the case. However, unlike previous times, I found no opportunities to elide calls to it. Instead, I took the following steps to reduce the amount of work done by Dyno:

* `scopeForId` and its recursive self-invocations make up the bulk of calls to `idToAst`. I found no ways to reduce uses of `idToAst` in `scopeForId`, but I did find a way to reduce the number of `scopeForId` invocations. It turns out that the code in `GatherMentionedModules` runs `scopeForId` for every identifier. Many identifiers share scopes (they don't create their own!), so this creates redundant cache entries. Instead, this PR adjusts `GatherMentionedModules` to match other visitors (e.g., `Resolver`) and push scopes when a scope-creating AST node is entered. This actually increased the number of calls to `scopeForId`, because not all scopes contain identifiers, so I further tweaked the visitor to invoke `scopeForId` lazily, only when a scope is needed. This didn't have a noticeable impact on runtime performance, but it did reduce the number of queries executed by a large margin.
* The `returnType()` query always invokes `returnTypeWithoutIterable` (aka `yieldType()`). As a result, there is always an equal number of cache entries for these two. Moreover, several places in the resolver use both queries, which means double the lookups. I fused the two into a single, tuple-returning query. This halves the number of storage entries required for computing the return type and, where applicable, also halves the number of query cache lookups. I didn't measure any performance impact in release mode, but it did reduce the number of queries executed.

While debugging the above, I noticed issues with the output of `--dyno-enable-tracing` caused by newlines in `param` strings. I adjusted the DETAIL logging of these strings to escape newlines so that tracing output is unaffected.

After this, I turned back to profiler output.
`saveDependencyInParent` contributes a large amount of overhead. I also noticed some oddities in debug-mode profiles: creating the start and end iterators for the recursion-error set was taking a significant amount of time, even though there ought not be recursion errors at all! I guarded the recursion-error insertion (which creates these iterators) behind a size check and made other similar changes. This reduced the runtime overhead of `saveDependencyInParent` by 0.5 seconds in the debug build, though the change is within noise in release mode.

I also noticed that `CHPL_ASSERT` executes its body even in release mode. After checking with @arezaii, @dlongnecke-cray, and @mppf, it seems there's no reason to do so in release, so this PR removes that behavior.

Finally, following @benharsh's suggestion to investigate re-traversals, I discovered that the generated formals for the
`_range` constructor were being re-traversed thousands of times. This was because calls to `instantiateSignature` were not cached, which meant that each invocation of a generic constructor triggered re-resolution. I turned `instantiateSignature` into a query, winning roughly 10% in performance on my benchmark (still the sample program from https://github.com/Cray/chapel-private/issues/7139): ~3.5 seconds -> ~3.15 seconds. This narrows the gap between Dyno and production on this benchmark to ~20%.

Encouragingly, I'm seeing Dyno reach comparable performance while resolving other benchmarks. Comparing invocations of `--dyno-resolve-only` and `--stop-after-pass=resolve` against `main`, I saw comparable results on the `parIters.chpl`, `atomics.chpl`, and `forallLoops.chp` primers.

Reviewed by @benharsh -- thanks!
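The `instantiateSignature` change described above amounts to memoizing signature instantiation on its inputs, so repeated invocations of a generic constructor stop re-traversing the generic formals. A minimal sketch, assuming a map-based cache and stand-in names (this is not Dyno's actual query macro or signature representation):

```cpp
#include <map>
#include <string>
#include <utility>

// Illustrative memoization of signature instantiation, in the spirit of
// turning instantiateSignature into a query; names are stand-ins.
using Key = std::pair<std::string, std::string>;  // (generic fn, actual types)
static std::map<Key, std::string> sigCache;
static int traversals = 0;

std::string instantiateSignatureQuery(const std::string& fn,
                                      const std::string& actuals) {
  Key key{fn, actuals};
  auto it = sigCache.find(key);
  if (it != sigCache.end()) return it->second;  // formals not re-traversed
  traversals++;  // stands in for re-traversing the generic formals
  std::string sig = fn + "(" + actuals + ")";
  sigCache.emplace(key, sig);
  return sig;
}
```

With this shape, a constructor called thousands of times with the same actuals pays for one traversal plus cheap lookups, instead of one traversal per call.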
Testing