Dyno: cache function signature instantiations, reduce cache impact of some queries #27082


Merged
DanilaFe merged 8 commits into chapel-lang:main on Apr 13, 2025

Conversation

@DanilaFe (Contributor) commented Apr 9, 2025

This PR aims to further improve Dyno's resolution performance.

I started by looking at the number of queries we call, in the hope of reducing the number of invocations. idToAst has historically been one of Dyno's hottest queries, and it remains so. However, unlike previous times, I found no opportunities to elide calls to this function. Instead, I took the following steps to reduce the amount of work done by Dyno:

  • I determined that scopeForId and its recursive self-invocations make up the bulk of calls to idToAst. I found no ways to reduce uses of idToAst within scopeForId, but I did find a way to reduce the number of scopeForId invocations. It turns out that GatherMentionedModules invoked scopeForId for every identifier (and scopeForId, in turn, drives idToAst). Many identifiers share scopes (they don't create their own!), so this caused redundant cache entries. Instead, this PR adjusts GatherMentionedModules to match other visitors (e.g., Resolver) and push scopes when a scope-creating AST node is entered. On its own, this actually increased the number of calls to scopeForId, because not all scopes contain identifiers; I thus further tweaked the visitor to invoke scopeForId lazily, only when it's needed. This had no noticeable runtime performance impact, but it reduced the number of queries executed by a large margin.
  • I observed that the returnType() query always invokes returnTypeWithoutIterable (aka yieldType()). As a result, there is always an equal number of cache entries for the two. Moreover, several places in the resolver use both queries, which means double the lookups. I fused the two queries into a single, tuple-returning query (sketched below). This halves the number of storage entries required for computing the return type and, where applicable, also halves the number of query cache lookups. I didn't measure any performance impact in release mode, but it did reduce the number of queries executed.

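To make the fusion concrete, here is a minimal sketch assuming hypothetical, simplified types; the real queries go through Dyno's query framework rather than a bare `std::map`:

```cpp
#include <map>
#include <tuple>

// Hypothetical, simplified stand-ins for Dyno's types.
using FunctionId = int;
struct QualifiedType {};

// Before: returnType() and returnTypeWithoutIterable() were separate
// queries, so every function had two cache entries, and callers needing
// both types paid for two lookups. After: one fused query returns both
// results as a tuple, halving entries and (where both are needed) lookups.
std::tuple<QualifiedType, QualifiedType> returnTypes(FunctionId fn) {
  static std::map<FunctionId, std::tuple<QualifiedType, QualifiedType>> cache;
  auto it = cache.find(fn);
  if (it != cache.end()) return it->second;

  QualifiedType yieldType;   // compute returnTypeWithoutIterable here
  QualifiedType returnType;  // wrap in an iterable type if fn is an iterator

  auto result = std::make_tuple(returnType, yieldType);
  cache.emplace(fn, result);
  return result;
}
```
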
While debugging the above, I noticed that newlines in param strings garbled the output of --dyno-enable-tracing. I adjusted the DETAIL logging of these strings to escape newlines so that tracing output stays well-formed.
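
A minimal sketch of the kind of escaping involved (the function name is hypothetical, not the actual Dyno helper):

```cpp
#include <string>

// Replace literal newlines (and carriage returns) with visible escape
// sequences so multi-line param strings stay on one trace line.
std::string escapeForTrace(const std::string& s) {
  std::string out;
  out.reserve(s.size());
  for (char c : s) {
    if (c == '\n')      out += "\\n";
    else if (c == '\r') out += "\\r";
    else                out += c;
  }
  return out;
}
```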

After this, I turned back to profiler output. I noticed that saveDependencyInParent contributes a large amount of overhead. I also noticed some oddities in debug-mode profiles: creating the start and end iterators for the recursion error set was taking a significant amount of time, even though there ought not to be any recursion errors at all. I guarded the recursion error insertion (which creates these iterators) behind a size check, and made other similar changes. This reduced the runtime overhead of saveDependencyInParent by 0.5 seconds in the debug build, though the change is within noise in release mode.
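
A sketch of the guard, with hypothetical names standing in for the actual bookkeeping in saveDependencyInParent:

```cpp
#include <set>

// In debug builds, even an empty-range insert pays for constructing the
// source set's begin/end iterators (checked iterators are expensive);
// testing emptiness first skips that work in the common no-errors case.
void saveRecursionErrors(std::set<int>& parentErrors,
                         const std::set<int>& childErrors) {
  if (!childErrors.empty()) {
    parentErrors.insert(childErrors.begin(), childErrors.end());
  }
}
```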

I also noticed that CHPL_ASSERT executes its body even in release mode. After checking with @arezaii, @dlongnecke-cray, and @mppf, it seems there's no reason to do so in release builds, so this PR removes that behavior.
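
A sketch of the intended behavior, assuming an NDEBUG-style release check; the real CHPL_ASSERT lives in the Chapel sources and may differ in detail:

```cpp
#include <cstdio>
#include <cstdlib>

// Debug builds evaluate and check the condition; release builds (NDEBUG)
// compile the assertion away entirely, so its body is never executed.
#ifdef NDEBUG
  #define CHPL_ASSERT_SKETCH(cond) ((void)0)
#else
  #define CHPL_ASSERT_SKETCH(cond)                              \
    do {                                                        \
      if (!(cond)) {                                            \
        std::fprintf(stderr, "assert failed: %s\n", #cond);     \
        std::abort();                                           \
      }                                                         \
    } while (0)
#endif
```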

Finally, following @benharsh's suggestion to investigate re-traversals, I discovered that the generated formals for the _range constructor were being re-traversed thousands of times. This was because calls to instantiateSignature were not cached, so each invocation of a generic constructor triggered re-resolution. I turned instantiateSignature into a query, winning roughly 10% in performance on my benchmark (still the sample program from https://github.com/Cray/chapel-private/issues/7139): ~3.5 seconds -> ~3.15 seconds. This narrows the gap between Dyno and production on this benchmark to ~20%.
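
A minimal sketch of the memoization this enables, with hypothetical key and result types; the real instantiateSignature is a query in Dyno's framework with much richer keys:

```cpp
#include <map>
#include <tuple>

// Hypothetical, simplified stand-ins for Dyno's types.
struct SigKey {
  int fnId;
  int typeArgs;  // stand-in for the instantiation's actual arguments
  bool operator<(const SigKey& o) const {
    return std::tie(fnId, typeArgs) < std::tie(o.fnId, o.typeArgs);
  }
};
struct TypedSignature {};

// Before: every call re-resolved the generic's formals, so e.g. the
// _range constructor's formals were traversed thousands of times.
// After: each instantiation is resolved once and reused from the cache.
TypedSignature instantiateSignature(const SigKey& key) {
  static std::map<SigKey, TypedSignature> cache;
  auto it = cache.find(key);
  if (it != cache.end()) return it->second;

  TypedSignature result;  // resolve the formals for this instantiation here
  cache.emplace(key, result);
  return result;
}
```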

Encouragingly, I'm seeing Dyno reach comparable performance while resolving other benchmarks. Comparing invocations of --dyno-resolve-only and --stop-after-pass=resolve, I saw the following results:

| Benchmark | Time (Dyno, main) | Time (Dyno, this PR) | Time (Production) | Relative Time (this PR vs Production, lower is better) |
|---|---|---|---|---|
| Issue 7139 (Motivator) | 3.497 s ± 0.079 s | 3.140 s ± 0.034 s | 2.545 s ± 0.038 s | 1.23 ± 0.02 |
| parIters.chpl primer | 2.483 s ± 0.072 s | 2.372 s ± 0.029 s | 2.314 s ± 0.023 s | 1.02 ± 0.02 |
| atomics.chpl primer | 2.314 s ± 0.028 s | 2.177 s ± 0.038 s | 2.277 s ± 0.040 s | 0.95 ± 0.03 |
| forallLoops.chpl primer | 2.571 s ± 0.037 s | 2.468 s ± 0.032 s | 2.358 s ± 0.041 s | 1.05 ± 0.02 |

Reviewed by @benharsh -- thanks!

Testing

  • dyno tests
  • paratest

DanilaFe added 8 commits April 7, 2025 14:37
One always calls the other, which means we are allocating twice the entries and doing twice the lookups in the query cache otherwise.
@benharsh benharsh self-requested a review April 9, 2025 22:17
@mppf (Member) commented Apr 10, 2025

Nice!

@DanilaFe DanilaFe merged commit 35e4f44 into chapel-lang:main Apr 13, 2025
10 checks passed
DanilaFe added a commit that referenced this pull request Apr 15, 2025
…ns (#27105)

Depends on #27082.

This PR is another in a series of efforts to improve Dyno's performance.
It has some minor changes, as well as a major one: enabling a
rudimentary caching of function instantiations. Specifically, this PR
enables caching and re-use of functions that did not use PoI (most of
them, it turns out!). This had a huge impact on resolution performance:
in my benchmarks, `chpl --dyno-resolve-only` is now __15-22% faster than
production__ with `--stop-after-pass=resolve`.

## Memory performance
In this PR, I spent a considerable amount of time looking at the memory
allocations in Dyno, motivated by the fact that, when comparing
performance, Dyno spent significantly more time in the "system", but a
comparable amount of time in "user" execution. I assumed that File IO
could not be the cause (presumably both Dyno and production are reading
the same files), which made me investigate allocations. Indeed, Dyno
makes a number of small allocations, and this is what I targeted first.

1. A global string, `<unknown library path>`, was being interned
numerous times. This PR makes it a global, pre-allocated string, which
greatly reduced the number of allocations while attempting to intern the
string.
2. ~~The `CheckedScopes` set was updated numerous times as part of scope
resolution, triggering small allocations. This PR switches to using
C++'s
[memory_resource](https://en.cppreference.com/w/cpp/header/memory_resource)
framework, which enables stack-allocating some storage for the
`CheckedScopes` set. This led to a modest performance improvement
(around 3%)~~
* Apparently, this C++17 stdlib feature is not available prior to GCC
9.1, which is above our minimum of 7.5. Since it's not as crucial to the
improvements in this PR, I've removed it to enable all builds to pass.
(A sketch of the removed approach appears after this list.)
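
For reference, the removed approach looked roughly like this sketch (the buffer size and element type are illustrative, not the real `CheckedScopes` entries):

```cpp
#include <cstddef>
#include <memory_resource>
#include <set>

// A monotonic_buffer_resource hands out memory from the stack buffer and
// only falls back to the heap (its upstream resource) once the buffer is
// exhausted, so typical small sets allocate nothing on the heap.
void scopeResolveOne() {
  std::byte buffer[4096];
  std::pmr::monotonic_buffer_resource arena(buffer, sizeof(buffer));
  std::pmr::set<int> checkedScopes(&arena);

  checkedScopes.insert(1);
  checkedScopes.insert(2);
}
```

Since a `monotonic_buffer_resource` never frees memory until it is destroyed, it suits short-lived, bounded sets like this one.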

## Instantiation Caching
I also tackled a direction suggested by @mppf: enabling re-use of
function instantiations like production does (#16261). As a specific
case: functions that resolve without PoI are guaranteed to resolve the
same way in any PoI scope (non-PoI candidates are always preferred), so
there's no need to re-resolve their bodies. This PR enables that
specific case by building on the `PoiInfo` framework already in place
through Michael's efforts.

Although it seems to me that the `PoiInfo` was intended to fit into the
`resolveFunctionByInfo` query (and enable other queries to use
"similar", but not equal, calls to the query), I found that it's most
efficient to introduce _another_ query
(`resolveFunctionByResolvedInfo`). The current scheme is as follows:

* `resolveFunctionByInfo` caches the results of resolving a function in
a particular PoI scope. This covers _any_ resolution, including
resolutions that ended up using PoI functions. It is useful when a
function that uses PoI (and is thus not covered by this PR's
optimization, which re-uses results of non-PoI instantiations) is
called multiple times in the same scope.
* `resolveFunctionByResolvedInfo` caches function instantiations along
with their "PoI traces" (which track all PoI calls). This cache is
queried based on `PoiInfo::canReuse`, which today simply checks that no
PoI function calls were made while resolving the body of the function.
This covers a very large number of generic functions in the standard
library. In the future, if `canReuse` is extended to handle more cases,
`resolveFunctionByResolvedInfo` will start caching those results too.

Importantly, "PoI traces" used as keys for the
`resolveFunctionByResolvedInfo` query can only be computed after a
function is resolved. This necessitates having `resolveFunctionByInfo`
set `resolveFunctionByResolvedInfo`. However, naively, setting queries
does not fit into the query system. As a result, this PR adjusts the
`QUERY_STORE_RESULT` to behave sanely, by invalidating results that were
"stored" into queries externally under recomputation.

## Results

In release mode, I've observed the following times:

| Test | Time to resolve `issue-7139.chpl` | Cumulative improvement |
|-----------------------|-----------------------------------|------------------------|
| reference | 3.15 s | - |
| memory optimizations | 3.05 s | 3% |
| instantiation caching | 2.0 s | 36% |

Comparisons to production are encouraging as well.

| Benchmark | Time (Dyno #27082) | Time (Dyno, this PR) | Time (Production) | Relative Time (Production vs Dyno, this PR, __higher is better__) |
|--------------------------|--------------------|----------------------|--------------------|-------------------------------------------------------------------|
| Issue 7139 (Motivator) | 3.140 s ± 0.034 s | 1.968 s ± 0.052 s | 2.364 s ± 0.020 s | 1.22 ± 0.01 |
| `parIters.chpl` primer | 2.372 s ± 0.029 s | 1.918 s ± 0.041 s | 2.181 s ± 0.027 s | 1.14 ± 0.03 |
| `atomics.chpl` primer | 2.177 s ± 0.038 s | 1.888 s ± 0.040 s | 2.148 s ± 0.022 s | 1.14 ± 0.03 |
| `forallLoops.chpl` primer | 2.468 s ± 0.032 s | 1.936 s ± 0.034 s | 2.209 s ± 0.037 s | 1.14 ± 0.03 |

**Above, we were 15-22% better than production.** 

> [!WARNING]  
> These performance results do not and will not represent Dyno's final
> performance. Dyno is not yet capable of resolving every primer, which
> suggests there may be features missing (and thus, not "costing time"),
> and prevents an accurate performance analysis on a fully representative
> sample of programs.
>
> Moreover, Dyno's performance often comes down to caching. This might
> mean that for large programs (which consume more cache space), the time
> savings will be smaller.

Reviewed by @dlongnecke-cray -- thanks!

## Testing
- performance tests (see above)
- dyno tests (4x faster now than they used to be before my work!)
- paratest