Performance Speed Limits | Performance Matters

With that background out of the way, let’s look at the various OoO limits next. Most of these limits have the same effect which is to limit the available out-of-order window, stalling issue until a resource becomes available. They differ mostly in what they count, and how many of that thing can be buffered.

First, here’s a big table of all the resource sizes31 we’ll talk about the following sections.

Reorder Buffer Size

The ROB is the largest and most general out of order buffer: all uops, even those that don’t execute such as nop or zeroing idioms, take a slot in the ROB44. This structure holds instructions from the point at which they are allocated (issued, in Intel speak) until they retire. It puts a hard upper limit on the OoO window as measured from the oldest un-retired instruction to the youngest instruction that can be issued. On Intel, the ROB holds micro-fused ops, so the size is measured in the fused-domain.

As an example, a load instruction takes a cache miss which means it cannot retire until the miss is complete. Let’s say the load takes 300 cycles to finish, which is a typical latency. Then, on an Haswell machine with a ROB size of 192, at most 191 additional instructions can execute while waiting for the load: at that point the ROB window is exhausted and the core stalls. This puts an upper bound on the maximum IPC of the region of 192 / 300 = 0.64. It also puts a bound on the maximum MLP achievable, since only loads that appear in the next 191 instructions can (potentially) execute in parallel with the original miss. In fact, this behavior is used by Henry Wong’s robsize tool to measure the ROB size and other OoO buffer sizes, using a missed load followed by a series of filler instructions and finally another load miss. By varying the number of filler instructions and checking whether the loads executed in parallel or serially, the ROB size can be determined experimentally.

Remedies

If you are hitting the ROB size limit, you should switch from optimizing the code for the usual metrics and instead try to reduce the number of uops. For example, a slower (longer latency, less throughput) instruction can be used to replace two instructions which would otherwise be faster. Similarly, micro-fusion helps because the ROB limit counts in the fused domain.

Reorganizing the instruction stream can help too: if you hit the ROB limit after a specific long-latency instruction (usually a load miss) you may want to move expensive instructions into the shadow of that instruction so they can execute while the long latency instruction executes. In this way, there will be less work to do when the instruction completes. Similarly, you may want to “jam” loads that miss together: rather than spreading them out where they would naturally occur, putting them close together allows more of them to fit in the ROB window.

In the specific case of load misses, software prefetching can help a lot: it enables you to start a load early, but prefetches can retire before the load completes, so there is no stalling. For example, if you issue the prefetch 200 instructions before the demand load instruction, you have essentially broadened the ROB by 200 instructions as it applies to that load.

Load Buffer

Every load operation, needs a load buffer entry. This means the total OoO window is limited by the number loads appearing in the window. Typical load buffer sizes (72 on SKL) seem to be about one third of the ROB size, so if more than about one out of three operations is a load, you are more likely to be limited by the load buffer than the ROB.

Gathers need as many entries as there are loaded elements to load in the gather. Sometimes loads are hidden - remember that things like pop involve a load: in general anything that executes an op on p2 or p3 which is not a store (i.e., does not execute anything on p4) needs an entry in the load buffer.

Remedies

First, you should evaluate whether getting under this limit will be helpful: it may be that you will almost immediately hit another OoO limit, and it also may be that increasing the OoO window isn’t that useful if the extra included instructions can’t execute or aren’t a bottleneck.

In any case, the remedy is to use fewer loads, or in some cases to reorganize loads relative to other instructions so that the window implied by the full load buffer contains the most useful instructions (in particular, contains long latency instructions like load misses). You can try to combine narrower loads into wider ones. You can ensure you keep values in registers as much as possible, and inline functions that would otherwise pass arguments through memory (e.g., certain structures) to avoid pointless loads. If you need to spill some registers, consider spilling registers to xmm or ymm vector registers rather than the stack.

Store Buffer

Similarly to the load buffer, the store buffer is required for every operation that involves a store. In fact, filling up the store buffer is pretty much the only way stores can bottleneck performance. Unlike loads, nobody is waiting for a store to complete, except in the case of store-to-load forwarding - but there, by definition, the value is sitting inside the store queue ready to use, so there is no equivalent of the long load miss which blocks dependent operations. You can have long store misses, but they happen after the store has already retired and is sitting in the store buffer (or write-combining buffer). So stores primarily cause a problem if there are enough of them such that the store buffer fill up.