Memory Disambiguation on Skylake

travisdowns edited this page Feb 28, 2018 · 30 revisions

Here's a description of observed memory disambiguation behavior on Skylake. Although this has "Skylake" in the title, the same or similar behavior is likely shared among nearby-in-time Intel micro-architectures, though I haven't tested them. You'll probably want to know a bit of x86 assembly to get full value, but the prose should still be understandable even if the assembly isn't.

The Basics

I won't describe memory disambiguation in great detail, but here's a brief background on the topic.

First, the naming: this general topic may also be known as store forward prediction, memory aliasing prediction, memory dependence speculation, or various other terms. The basic idea is that in an out-of-order architecture, a load that follows an earlier store to the same (or overlapping) address needs to be satisfied (perhaps partially) from the most recent such store, and not from stale data in the L1 cache or from some store before that. Such loads are said to alias the earlier store, and in high-performance implementations the store data is typically forwarded to the load directly from the store buffer.

In a simple implementation with a store buffer that holds stores before they are committed to the L1 cache, this means that loads cannot execute before all prior store addresses are known, since it isn't otherwise possible to determine which, if any, prior in-progress store aliases the load. In high-performance implementations, this is a significant limitation for some code, since hoisting loads above address-unknown stores may provide a large speedup. So it is common for implementations to have machinery to speculate that some load doesn't alias any in-progress store and to hoist it above earlier address-unknown stores. The speculation is checked before the load retires, and if it turns out to be wrong, execution is typically rolled back to the load and replayed from that point, at a cost somewhat similar to other types of speculation failure such as branch mis-predictions.
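The difference between waiting and speculating can be sketched as a toy software model of a store buffer lookup. This is only an illustration of the idea described above, not a description of any real hardware: the names `store_entry`, `load_result` and `resolve_load` are made up for this sketch, and it assumes whole-address matches only (no partial overlap).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One in-flight store (hypothetical model). addr_known stays false
 * until the store's address has been computed. */
typedef struct {
    bool     addr_known;
    uint64_t addr;
    uint64_t data;
} store_entry;

typedef enum {
    LOAD_FORWARDED,   /* value taken from the most recent matching store */
    LOAD_FROM_CACHE,  /* no known in-flight store matches: read the L1 */
    LOAD_MUST_WAIT    /* blocked behind an address-unknown store */
} load_result;

/* stores[0..n) are the in-flight stores older than the load, oldest first.
 * With speculate == false, the load waits behind any address-unknown store.
 * With speculate == true, such stores are assumed not to alias; real
 * hardware would have to verify that guess before the load retires. */
load_result resolve_load(const store_entry *stores, size_t n,
                         uint64_t load_addr, bool speculate,
                         uint64_t *value_out)
{
    assert(stores != NULL || n == 0);

    /* Scan youngest to oldest so the most recent matching store wins. */
    for (size_t i = n; i-- > 0; ) {
        const store_entry *s = &stores[i];
        if (!s->addr_known) {
            if (!speculate)
                return LOAD_MUST_WAIT;
            continue; /* speculate: assume this store does not alias */
        }
        if (s->addr == load_addr) {
            *value_out = s->data;
            return LOAD_FORWARDED;
        }
    }
    return LOAD_FROM_CACHE;
}
```

Note that in the non-speculative mode a load can still be forwarded if the youngest matching store is newer than every address-unknown store; it only has to wait when an address-unknown store is encountered first.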

I highly recommend this article by Henry Wong for a deeper look at this specific topic and measurements on various architectures of steady state behavior for non-aliasing, partially-aliasing and fully-aliasing cases, as well as "fast data" and "fast address" cases. Here's a brief description of store buffers.

Prediction

Background, Patents

The basic idea of prediction is to identify loads that with high likelihood don't alias an earlier address-unknown store, so that they can be hoisted. This prediction should probably be conservative (erring on the side of not hoisting loads), since even the occasional mis-prediction is very costly (typically 10-20 cycles on modern Intel). The approach is pretty simple and follows the same pattern as branch prediction: track the past behavior of loads based on IP and only hoist loads that have a pattern of not aliasing. The details are important though, to the point that WARF has variously sued over and licensed their '752 patent on this topic for probably somewhere around 1e9 dollars (my own rough estimate). That patent, sometimes called the Moshovos patent, also makes good reading on basic predictor designs.

Skylake

Let's try to figure out how Skylake actually works. In fact, I'll jump straight to the conclusion:

Skylake appears to use a hashed per-PC predictor without exact-PC confirmation, in combination with a global watchdog predictor that can override the per-PC predictor when it is "active".

Now we can work backwards to explain, piece by piece, what that abomination of a sentence means. In my own investigation, I only very partially reverse engineered the behavior, but that allowed me to Google with better terms, at which point I came across the '263 patent from Intel. That accelerated the process, since I was now just checking whether the Skylake implementation lined up with the patent (hint: it does, mostly) and validating various parameters. So you can probably learn almost as much by reading the patent, but frankly that's not fun, and the below is at least backed up by real-world tests.

Attempt 1

Let's try to write some code that will trigger store-forwarding and hence possibly trigger a store-forwarding mis-speculation. That's easy: just read from a location that was recently written to:
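As a stand-in written in C rather than in the assembly used for the actual tests, a minimal sketch of a store immediately followed by a load from the same location might look like the following. The function name `store_then_load` and the use of `volatile` (to keep the compiler from eliminating the memory accesses) are assumptions for illustration only; whether such a loop causes any mis-speculation is exactly the kind of thing that has to be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Repeatedly writes to *cell and immediately reads it back.
 * volatile forces a real store and a real load on every iteration,
 * so each load aliases the store just before it. */
uint64_t store_then_load(volatile uint64_t *cell, size_t iters)
{
    assert(cell != NULL);
    uint64_t sum = 0;
    for (size_t i = 0; i < iters; i++) {
        *cell = i;    /* store */
        sum += *cell; /* load from the address just written */
    }
    return sum;
}
```

Every load here must observe the value written by the store directly before it, so the returned sum is simply 0 + 1 + ... + (iters - 1).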