
Dyno: cache function signature instantiations, reduce cache impact of some queries #27082

Open · wants to merge 8 commits into main
Conversation

@DanilaFe (Contributor) commented Apr 9, 2025

This PR aims to further improve Dyno's resolution performance.

I started by looking at the number of queries we call, in the hope of reducing the number of invocations. In the past, idToAst has been one of Dyno's hottest queries, and this remains the case. However, unlike in previous rounds of optimization, I found no opportunities to elide calls to this function. Instead, I took the following steps to reduce the amount of work done by Dyno:

  • I determined that scopeForId and its recursive self-invocations make up the bulk of calls to idToAst. I found no way to reduce uses of idToAst within scopeForId, but I did find a way to reduce the number of scopeForId invocations. It turns out that GatherMentionedModules was calling scopeForId for every identifier. Many identifiers share scopes (they don't create their own), so this created redundant cache entries. This PR adjusts GatherMentionedModules to match other visitors (e.g., Resolver) and push scopes when a scope-creating AST node is entered. On its own, this actually increased the number of calls to scopeForId, because not all scopes contain identifiers; I therefore further tweaked it to invoke scopeForId lazily, only when a scope is actually needed. This didn't have a noticeable runtime performance impact, but it did substantially reduce the number of queries executed.
  • I observed that the returnType() query always invokes returnTypeWithoutIterable (a.k.a. yieldType()). As a result, there is always an equal number of cache entries for the two. Moreover, several places in the resolver use both queries, which doubles the lookups. I fused the two into a single, tuple-returning query. This halves the number of storage entries required for computing the return type and, where applicable, also halves the number of query cache lookups. I didn't measure any performance impact in release mode, but it did reduce the number of queries executed.
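The lazy invocation described in the first bullet can be sketched as a small wrapper that defers scope computation until an identifier actually asks for it. This is an illustrative stand-in, not Dyno's actual `Scope`/`scopeForId` API:

```cpp
#include <functional>
#include <optional>
#include <utility>

// Illustrative stand-ins for Dyno's types; not the real API.
struct Scope { int id; };

static int scopeComputations = 0;  // counts how often we actually compute

// Wraps a scope computation and runs it at most once, and only if asked.
class LazyScope {
  std::optional<Scope> cached_;
  std::function<Scope()> compute_;
public:
  explicit LazyScope(std::function<Scope()> f) : compute_(std::move(f)) {}
  const Scope& get() {
    if (!cached_) cached_ = compute_();  // computed lazily, at most once
    return *cached_;
  }
};
```

A visitor that pushes a `LazyScope` when entering a scope-creating node pays the `scopeForId` cost only for scopes that actually contain identifiers.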
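The query fusion in the second bullet can be sketched as a single memoized function whose cache entry holds both results; `returnAndYieldType` and the other names here are illustrative, not Dyno's actual query API:

```cpp
#include <map>
#include <string>
#include <utility>

// Illustrative stand-ins; not Dyno's real types or query machinery.
using FnId = int;
using Type = std::string;

static int computeCount = 0;  // counts how often the fused body runs

static std::map<FnId, std::pair<Type, Type>>& cache() {
  static std::map<FnId, std::pair<Type, Type>> c;
  return c;
}

// One cache entry and one lookup serve both the return and yield type.
std::pair<Type, Type> returnAndYieldType(FnId fn) {
  auto it = cache().find(fn);
  if (it != cache().end()) return it->second;
  ++computeCount;
  // Placeholder standing in for the real resolution work.
  std::pair<Type, Type> result = {"int", "int"};
  cache().emplace(fn, result);
  return result;
}

Type returnType(FnId fn) { return returnAndYieldType(fn).first; }
Type yieldType(FnId fn)  { return returnAndYieldType(fn).second; }
```

Callers that need both types pay for a single cache lookup, and the cache stores one entry per function instead of two.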

While debugging the above, I noticed that newlines in param strings were corrupting the output of --dyno-enable-tracing. I adjusted the DETAIL logging of these strings to escape newlines so that tracing output is unaffected.
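The escaping fix amounts to replacing literal newlines with their escape sequences before emitting a line-oriented trace record; `escapeForTrace` is a hypothetical helper, not the actual function name in the PR:

```cpp
#include <string>

// Hypothetical helper: render a param string on a single line so it
// cannot break line-oriented tracing output.
std::string escapeForTrace(const std::string& s) {
  std::string out;
  out.reserve(s.size());
  for (char c : s) {
    switch (c) {
      case '\n': out += "\\n"; break;   // newline -> two-char escape
      case '\r': out += "\\r"; break;
      case '\\': out += "\\\\"; break;  // keep escapes unambiguous
      default:   out += c;
    }
  }
  return out;
}
```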

After this, I turned back to the profiler output. I noticed that saveDependencyInParent contributes a large amount of overhead. I also noticed some oddities in debug-mode profiles: creating the start and end iterators for the recursion error set was taking a significant amount of time. There ought not to be any recursion errors at all! I guarded the recursion error insertion (which creates these iterators) behind a size check, and made other similar changes. This reduced the runtime overhead of saveDependencyInParent by 0.5 seconds in the debug build, though the change is within noise in release mode.
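The guard described above can be sketched as an early return on the common empty case, so the begin()/end() iterators are never constructed when there is nothing to insert. `mergeRecursionErrors` is an illustrative stand-in; saveDependencyInParent's real signature differs:

```cpp
#include <set>

static int insertionsAttempted = 0;  // counts how often we touch iterators

// Illustrative stand-in for the recursion-error bookkeeping in
// saveDependencyInParent; not the real Dyno code.
void mergeRecursionErrors(std::set<int>& parentErrors,
                          const std::set<int>& childErrors) {
  // Common case: no recursion errors. Return before constructing the
  // begin()/end() iterators that showed up in debug-mode profiles.
  if (childErrors.empty()) return;
  ++insertionsAttempted;
  parentErrors.insert(childErrors.begin(), childErrors.end());
}
```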

I also noticed that CHPL_ASSERT executes its body even in release mode. After checking with @arezaii, @dlongnecke-cray, and @mppf, it seems there's no reason to do so in release builds, so this PR removes that behavior.
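The distinction can be illustrated with a pair of macros: the debug variant evaluates its condition, while the release variant compiles the body away entirely, so the condition's cost (and any side effects) vanish. CHPL_ASSERT's real definition lives in the Chapel sources; these names are stand-ins:

```cpp
#include <cstdlib>

// Debug variant: the condition is evaluated and checked.
#define ASSERT_DEBUG(cond)   do { if (!(cond)) std::abort(); } while (0)
// Release variant: the condition is never evaluated at all.
#define ASSERT_RELEASE(cond) do { (void)0; } while (0)

static int evaluations = 0;  // counts condition evaluations

bool trackedCheck() { ++evaluations; return true; }
```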

Finally, following @benharsh's suggestion to investigate re-traversals, I discovered that the generated formals of the _range constructor were being re-traversed thousands of times. This was because calls to instantiateSignature were not cached, so each invocation of a generic constructor triggered re-resolution. I turned instantiateSignature into a query, winning roughly 10% in performance on my benchmark (still the sample program from https://github.com/Cray/chapel-private/issues/7139): ~3.5 seconds -> ~3.15 seconds. This narrows the gap between Dyno and production on this benchmark to ~20%.
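Turning a routine into a query essentially means memoizing it on its arguments, so each (generic, instantiation-arguments) pair is resolved once. A minimal sketch, with illustrative names and a stand-in for the real resolution work:

```cpp
#include <map>
#include <utility>

// Key identifying one instantiation: (generic fn id, type-arg id).
// Illustrative stand-in for Dyno's actual query key.
using SigKey = std::pair<int, int>;

static int resolutions = 0;  // counts actual (uncached) resolutions

int instantiateSignature(int fnId, int typeArgId) {
  static std::map<SigKey, int> cache;
  SigKey key{fnId, typeArgId};
  auto it = cache.find(key);
  if (it != cache.end()) return it->second;  // cache hit: no re-traversal
  ++resolutions;
  int resolved = fnId * 1000 + typeArgId;    // stand-in for real work
  cache.emplace(key, resolved);
  return resolved;
}
```

With this in place, repeated invocations of the same generic constructor hit the cache instead of re-resolving the signature each time.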

Encouragingly, I'm seeing Dyno reach comparable performance while resolving other benchmarks. Comparing invocations of --dyno-resolve-only and --stop-after-pass=resolve, I saw the following results:

| Benchmark | Time (Dyno main) | Time (Dyno, this PR) | Time (Production) | Relative Time (this PR vs Production, lower is better) |
|---|---|---|---|---|
| Issue 7139 (Motivator) | 3.497 s ± 0.079 s | 3.140 s ± 0.034 s | 2.545 s ± 0.038 s | 1.23 ± 0.02 |
| parIters.chpl primer | 2.483 s ± 0.072 s | 2.372 s ± 0.029 s | 2.314 s ± 0.023 s | 1.02 ± 0.02 |
| atomics.chpl primer | 2.314 s ± 0.028 s | 2.177 s ± 0.038 s | 2.277 s ± 0.040 s | 0.95 ± 0.03 |
| forallLoops.chpl primer | 2.571 s ± 0.037 s | 2.468 s ± 0.032 s | 2.358 s ± 0.041 s | 1.05 ± 0.02 |

DanilaFe added 8 commits April 7, 2025 14:37
Signed-off-by: Danila Fedorin <[email protected]>

One commit message, on fusing the return/yield type queries, notes:
"One always calls the other, which means we are allocating
twice the entries and doing twice the lookups in the query
cache otherwise."
@benharsh benharsh self-requested a review April 9, 2025 22:17
@mppf (Member) commented Apr 10, 2025

Nice!
