
Memory Disambiguation on Skylake


Here's a description of observed memory disambiguation behavior on Skylake. Although this has "Skylake" in the title, it is likely that the same or similar behavior is shared among nearby-in-time Intel micro-architectures, but I haven't tested them. You'll probably want to know a bit of x86 assembly to get full value, but the prose should still be understandable even if the assembly isn't.

The Basics

I won't describe memory disambiguation in great detail, but here's a brief background on the topic.

First, the naming: this general topic may also be known as store forwarding prediction, or memory aliasing prediction, or memory dependence speculation, or various other terms. The basic idea is that in an out-of-order architecture, a load that follows an earlier store to the same (or overlapping) address needs to be satisfied (perhaps partially) from the youngest such store, and not from stale data in the L1 cache or from some store before that. Such loads are said to alias the earlier store, and in high-performance implementations the store is typically forwarded to the load directly from the store buffer.

In a simple implementation with a store buffer that holds stores before they are committed to the L1 cache, this means that loads cannot execute before all prior store addresses are known, since it isn't possible to determine which, if any, prior in-progress store aliases the load. In high performance implementations this is a significant limitation for some code, since hoisting loads above address-unknown stores may provide a large speedup. So it is common for implementations to have machinery to speculate that some load doesn't alias any in-progress store and to hoist it above earlier address-unknown stores. The speculation is checked before the load retires and if it turns out wrong, execution is typically rolled back to the load and replayed from that point, at a cost somewhat similar to other types of speculation failure such as branch mis-predictions.
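Here is a minimal C sketch of the decision that motivates all of this, under heavily simplified assumptions (the structure, names and the omission of any real overlap check are mine, not a description of any actual hardware): a load can only execute early if every older store address is already known, unless the machine is willing to speculate.

#include <stdint.h>

/* One older, still-uncommitted store. The address field would be used for the
 * forwarding/overlap check, which this sketch omits entirely. */
struct pending_store { uint64_t addr; int addr_known; };

enum load_action { WAIT, EXECUTE_SPECULATIVELY, EXECUTE_SAFELY };

/* Decide whether a load may execute now, given the older stores still in the
 * store buffer and a "no alias" prediction for this load. */
enum load_action may_execute_load(const struct pending_store *older, int n,
                                  int predicted_no_alias) {
    for (int i = 0; i < n; i++) {
        if (!older[i].addr_known) {
            /* Some older store address is unresolved: without speculation the
             * load must wait; with a "no alias" prediction it may be hoisted,
             * to be verified (and possibly flushed) once the address resolves. */
            return predicted_no_alias ? EXECUTE_SPECULATIVELY : WAIT;
        }
    }
    /* All older store addresses are known: forwarding vs. reading the cache
     * can be decided exactly, so no speculation is needed. */
    return EXECUTE_SAFELY;
}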

I highly recommend this article by Henry Wong for a deeper look at this specific topic and measurements on various architectures of steady state behavior for non-aliasing, partially-aliasing and fully-aliasing cases, as well as "fast data" and "fast address" cases. Here's a brief description of store buffers.

Prediction

Background, Patents

The basic idea of prediction is to identify loads that with high likelihood don't alias an earlier address-unknown store, so that they can be hoisted. This prediction should probably be conservative (erring on the side of not hoisting loads), since even the occasional mis-prediction is very costly (typically 10-20 cycles on modern Intel). The basic idea is pretty simple and follows the same pattern as branch prediction: track the past behavior of loads based on IP and only hoist loads that have a pattern of not aliasing. The details are important though, to the point that WARF has variously sued over and licensed its '752 patent on this topic for probably somewhere around 1e9 dollars (my own rough estimate). That patent, sometimes called the Moshovos patent, also makes good reading on basic predictor designs.
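As a concrete (but entirely hypothetical) illustration of that IP-indexed, track-the-past-behavior idea, here's a minimal C sketch of a saturating-counter predictor. The table size, hash function, counter width and threshold are invented for illustration and are not Skylake's real parameters.

#include <stdint.h>

#define TABLE_SIZE 256        /* hypothetical number of predictor entries */
#define SATURATE   15         /* hypothetical 4-bit saturating counter */
#define HOIST_AT   SATURATE   /* only hoist when fully confident */

static uint8_t counters[TABLE_SIZE];

/* Toy hash of the load's instruction pointer into the table. */
static unsigned index_of(uint64_t load_ip) {
    return (unsigned)(load_ip ^ (load_ip >> 8)) % TABLE_SIZE;
}

/* Predict: hoist the load above address-unknown stores only if this load's
 * counter is saturated, i.e. it has a long history of not aliasing. */
int predict_no_alias(uint64_t load_ip) {
    return counters[index_of(load_ip)] >= HOIST_AT;
}

/* Update once the truth is known: non-aliasing executions slowly build
 * confidence, while a single aliasing execution destroys it (conservative). */
void update(uint64_t load_ip, int did_alias) {
    uint8_t *c = &counters[index_of(load_ip)];
    if (did_alias)
        *c = 0;
    else if (*c < SATURATE)
        (*c)++;
}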

Skylake

Let's try to figure out how Skylake actually works. In fact, I'll jump straight to the conclusion:

Skylake appears to use a hashed per-PC predictor without exact-PC confirmation, in combination with a global watchdog predictor that can override the per-PC predictor when it is "active".

Now we can work backwards to explain, piece by piece, what that abomination of a sentence means. In my own investigation, I very partially reverse engineered the behavior, which allowed me to Google with better terms, at which point I came across the '263 patent from Intel. That accelerated the process, since I was now just checking whether the Skylake implementation lined up with the patent (hint: it does, mostly) and validating various parameters. So you can probably learn almost as much by reading the patent, but frankly that's not fun and the below is at least backed up by real-world tests.

Attempt 1

Let's try to write some code that will trigger store-forwarding and hence possibly trigger a store-forwarding mis-speculation. That's easy: just read from a location that was recently written to:

mov     DWORD [rsi], 0
mov     eax, DWORD [rsi]
mov     DWORD [rsi], 0
mov     eax, DWORD [rsi]
...

Here rsi points to a page-aligned, heap-allocated buffer (we only ever touch the first DWORD of this buffer).

With this pair of instructions repeated 100 times and run in oneshot mode in uarch-bench using:

./uarch-bench.sh --timer=libpfc --test-name=memory/store-fwd-try/* --extra-events=MACHINE_CLEARS.MEMORY_ORDERING

... we get the following results1:

stfwd-try1 @ 0x0x485000
                     Benchmark   Sample   Cycles   MACHIN
                    stfwd-try1        1   267.00     0.00
                    stfwd-try1        2    98.00     0.00
                    stfwd-try1        3    98.00     0.00
                    stfwd-try1        4    98.00     0.00
                    stfwd-try1        5    98.00     0.00
                    stfwd-try1        6    98.00     0.00
                    stfwd-try1        7    98.00     0.00
                    stfwd-try1        8    98.00     0.00
                    stfwd-try1        9    98.00     0.00
                    stfwd-try1       10    98.00     0.00
                    stfwd-try1       11    98.00     0.00
                    stfwd-try1       12    98.00     0.00
                    stfwd-try1       13    98.00     0.00
                    stfwd-try1       14    98.00     0.00
                    stfwd-try1       15    98.00     0.00
                    stfwd-try1       16    98.00     0.00
                    stfwd-try1       17    98.00     0.00
                    stfwd-try1       18    98.00     0.00
                    stfwd-try1       19    98.00     0.00
                    stfwd-try1       20    98.00     0.00

The first execution of the code shows a large number of cycles, probably due to missing the L1I cache and other effects of executing cold code. The remaining 19 samples all execute in exactly 98 cycles, reflecting the fact that stores can occur at one per cycle2, and that store-forwarding can happen in parallel with a throughput of at least 1 per cycle. More importantly, we see zero memory ordering clears caused by mis-predicted store forwarding, even on the first run, where the predictors are necessarily cold for this code.
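As a rough sanity check on the 98-cycle figure (the full reconciliation is in footnote 2): 100 stores at one per cycle is about 100 cycles of real work, and the harness subtracts an empty-function baseline of roughly 4 cycles, part of which overlaps with the real work, so a measured value a little under 100 is about what we'd expect.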

Now perhaps this isn't entirely surprising. The store address in the code above will generally always be available at the moment the store is executed, so we can expect the store data to be available pretty much right away to the following load. In other words, it isn't likely the CPU is even going to hoist the load above the store before its address is known, since the address is always known immediately - there is no need to speculate at all here.

So let's slow down the store address, but not the load address so that we might actually see loads being hoisted. We'll do that like this:

; setup: copy rdx into rsi so we can have independent-but-equal store and load addresses
mov     rdx, rsi 

imul    rdx, 1
mov     DWORD [rdx], 0
mov     eax, DWORD [rsi]
imul    rdx, 1 
mov     DWORD [rdx], 0
mov     eax, DWORD [rsi]
... ; repeat the load/store pair 98 more times

It's similar to the first test, but we use rdx as the store address, and in between each store we multiply rdx by 1: this is a logical no-op, but it adds 3 cycles of latency to the calculation of the address for the next store, so the store address's readiness will always lag the load address's readiness, which is computed identically but uses rsi, which isn't part of the long imul dependency chain.
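Before looking at the results, we can also guess the timing: the imul chain serializes the store addresses at roughly one every 3 cycles, so the 100 pairs should take on the order of 100 × 3 = 300 cycles, rather than ~100 as before.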

Here's a typical result:

stfwd-try2 @ 0x0x485400
                     Benchmark   Sample   Cycles   MACHIN
                    stfwd-try2        1   447.00     4.00
                    stfwd-try2        2   304.00     0.00
                    stfwd-try2        3   304.00     0.00
                    stfwd-try2        4   304.00     0.00
                    stfwd-try2        5   304.00     0.00
                    stfwd-try2        6   304.00     0.00
                    stfwd-try2        7   304.00     0.00
                    stfwd-try2        8   303.00     0.00
                    stfwd-try2        9   304.00     0.00
                    stfwd-try2       10   304.00     0.00
                    stfwd-try2       11   304.00     0.00
                    stfwd-try2       12   304.00     0.00
                    stfwd-try2       13   304.00     0.00
                    stfwd-try2       14   303.00     0.00
                    stfwd-try2       15   304.00     0.00
                    stfwd-try2       16   304.00     0.00
                    stfwd-try2       17   304.00     0.00
                    stfwd-try2       18   304.00     0.00
                    stfwd-try2       19   304.00     0.00
                    stfwd-try2       20   303.00     0.00

Finally we are getting some memory speculation failures, even if it's only 4. That result is the most common, but it isn't the only one: on repeated runs this result was also common:

stfwd-try2 @ 0x0x485400
                     Benchmark   Sample   Cycles   MACHIN
                    stfwd-try2        1   395.00     2.00
                    stfwd-try2        2   404.00     3.00
                    stfwd-try2        3   304.00     0.00
                    stfwd-try2        4   304.00     0.00
                    stfwd-try2        5   303.00     0.00
                    stfwd-try2        6   304.00     0.00
                    stfwd-try2        7   304.00     0.00
... snip

Similarly, runs with 3 and then 1 mis-predictions in the first and second samples (followed as usual by all zeros) were also common.

The overall behavior was fairly consistent though: you'd get a few mis-speculations, usually limited to the first or second pass through the code, and then zero. In particular, you never see more than 4 mis-speculations in one iteration. The fact that we get zero mis-speculations on most subsequent iterations makes sense: at this point the CPU has seen the loads multiple times, and knows they do alias earlier stores3 and so the loads won't be hoisted.

The big question is: why only 4 (or sometimes 2 or 3) mis-speculations on the first iteration? Our trick of delaying the store address to force mis-speculations has pretty clearly worked (virtually every run shows mis-speculations), but I would expect every load to mis-speculate here, since we are delaying every store. Sure, we are training the predictors to predict "does alias" for future iterations, but all our 100 loads live at separate PCs and should have separate predictors4.

Perhaps every mis-prediction causes a small window where no more mis-predictions happen, so we only get a maximum of 1 mis-prediction every 25 loads or so? Let's test that by varying the number of load/store pairs.

So I tried this with 20 load/store pairs instead of 100. The results were very similar to the prior case: many runs with 4 mis-predictions followed by all-zeros, some with the 3, 2, 0, 0, ... pattern. Some start out with zero for the first sample, but have 3 or 4 mis-predictions in a later sample. In general though we see about the same number of mis-predictions as the case with 100 loads. Only when we drop the number of load/store pairs down very low, e.g., to 4 loads, do we see a reduction in mis-predictions (typically 2). On the other hand, greatly increasing the number of load/store pairs to 1,000 shows again the same 4, 0, 0, ... pattern (although more consistently: there are fewer 2, 3, ... and 3, 1, ... patterns in this case). You can try these variants yourself with --test-name */stfwd-try2-* on the command line.

One could reasonably conclude that the mis-predictions are thus not spread out among all the load/store pairs, but largely occur at the start, in the first few loads, and then stop occurring altogether. It seems there is some global mechanism whereby mis-speculations earlier in the instruction stream can affect the behavior of later, different loads that have never been executed before. We also find that 4 continues to be a magic number: we never see samples with more than 4 mis-speculations.

In fact, Intel describes exactly such a global mechanism in the '263 patent, which they call a watchdog. From the first claim:

first logic to predict a load operation will not conflict with older pending store operations if a saturation counter circuit corresponding to the load operation is at least at a threshold value and a maximum rate of mispredictions has not occurred, wherein the saturation counter circuit is to be incremented if the load operation does not conflict with the older pending store operations;

a watchdog flush counter circuit to be decremented if a prediction by the first logic is not correct; and

a watchdog disambiguation counter circuit to be incremented if the prediction by the first logic is correct, wherein the first logic is to be disabled if the ratio of the disambiguation counter circuit value and the flush counter circuit value reaches a minimum value.

If you read the rest of the patent, it becomes clear that the first logic they describe above is the regular PC-based predictor tied to a specific load (let's call this the fine-grained predictor from now on). Then the watchdog described in the last paragraph is the global mechanism we are seeing: if the watchdog is active, loads will be predicted to alias (i.e., will not be hoisted), regardless of whatever prediction is or would be made by the fine-grained predictor.
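Pulling the two mechanisms together, here's a tiny C sketch of how the final hoist decision might be made, following the patent's description rather than any confirmed Skylake circuitry. It reuses the hypothetical predict_no_alias() from the earlier sketch; the watchdog state itself is sketched a bit further down.

#include <stdint.h>

extern int predict_no_alias(uint64_t load_ip); /* hashed per-PC predictor, sketched earlier */
extern int watchdog_active(void);              /* global watchdog state, sketched below */

/* Final decision: the watchdog, when active, overrides the per-PC predictor. */
int may_hoist_load(uint64_t load_ip) {
    if (watchdog_active())
        return 0;   /* conservative mode: never hoist, regardless of per-PC state */
    return predict_no_alias(load_ip);
}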

In fact, we find the likely origin of the magic number 4 in the section describing the detailed operation of the watchdog counter:

Similarly, the prediction mechanism may be disabled after a desired ratio of successful to unsuccessful predictions is met. For example, in one embodiment the prediction mechanism is disabled if 4 (corresponding to a 2 bit flush counter, for example) or more mispredictions occurs for every 1024 (corresponding to a 16 bit disambiguation counter, for example) successful predictions. In other embodiments, other techniques may be used to track the success rate of predictions, such as time-dependent counters, in order to determine when to enable or disable the prediction mechanism.

In one embodiment they use a mechanism where 4 mis-predictions in a row5 will enable the watchdog, which basically means the CPU is running in a conservative mode where loads are never hoisted. That would exactly explain the behavior above, where we never see more than 4 mispredictions. It also explains the various behaviors where we don't see 4 mis-predictions in the first run: the counters described start in some semi-random state, and if they are already partway to the threshold it may take fewer than 4 mis-predictions to enable the watchdog. The other patterns like 3, 1, 0, 0, ... or 2, 3, 0, 0, ... are consistent with the same mechanism: the watchdog simply trips after a different number of mis-predictions, split across the first couple of samples, depending on where the counters started.
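To make the counter arithmetic concrete, here's a hedged C sketch of watchdog bookkeeping along the lines the patent describes (a small flush counter, a large disambiguation counter, a 4:1024 ratio). The counting direction, reset policy and function names are my own simplifications; the patent actually describes decrementing a flush counter, and the real Skylake parameters aren't confirmed beyond the apparent threshold of 4.

static unsigned flush_count;    /* mis-predictions; a 2-bit counter in the patent's example */
static unsigned success_count;  /* correct predictions; a 16-bit counter in the patent's example */

/* The watchdog is active (speculation disabled) once mis-predictions reach 4
 * before successes reach 1024, i.e. the ratio is worse than 4:1024. */
int watchdog_active(void) {
    return flush_count >= 4;
}

void on_prediction_outcome(int was_correct) {
    if (was_correct) {
        if (++success_count >= 1024) {
            /* One plausible reset policy: enough successes forgive past flushes. */
            success_count = 0;
            flush_count = 0;
        }
    } else {
        ++flush_count;
    }
}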


Footnotes

1 These results are fairly representative of repeated runs of this benchmark, although you can see the occasional outlier perhaps reflecting an interrupt or other external event. Some runs have a mix of 98 cycles and 99 cycles, rather than all 98 cycles - there are a lot of micro-architectural details that can make a 1-cycle difference!

2 The reason this can be less than 100, despite stores executing at a maximum rate of 1 per cycle, is itself somewhat interesting: the results are calculated by subtracting measurement overhead based on running the identical measurement code against an empty function. A function call has a minimum latency of about 4 cycles (i.e., back-to-back function calls take at least 4 cycles), but other work can happen in parallel with that cost. The upshot is that an empty function may take the same time as a function with 1-4 cycles of "real work", since the real work cost is hidden by the minimum function call latency. A function that does, say, 5 cycles of real work will show up as taking 5 - 4 = 1 cycle of total work, since most of the cost is hidden by the function call latency. There are a couple of solutions to this problem, but I'll leave that to another day.

3 I'm speaking a bit loosely here when I say the predictor knows they alias: I should really say that if the predictor had earlier thought they didn't alias, it would have corrected itself by now. In this case, however, for the vast majority of loads (96 out of 100, usually) the predictor never allowed the load to be hoisted in the first place, and we aren't sure if it's even tracking the aliasing state for loads that it didn't hoist (although we'll investigate it).

4 Granted, the number of predictors (or more precisely the size of the predictor state) isn't unlimited, so we might expect distinct loads to collide in the predictor tables, but in general these tables are much larger than 4 entries.

5 Of course, the mispredictions don't need to be in a row: the basic idea is that if the mispredict:predict ratio is worse than 4:1024 the watchdog will eventually become enabled - but 4 in a row should always be enough to trigger the watchdog from any starting state.
