
Some time ago I was looking at a hot section in our code and I saw this:
if (debug) {
log("...");
}
This got me thinking. This code sits in a performance-critical loop, and it looks like a waste - we never run with the "debug" flag enabled[1]. Is it OK to have if clauses that will basically never be taken? Surely, there must be some performance cost to that...
How expensive are if statements?
Back in the day, the general rule was: a fully predictable branch has close to zero CPU cost.
To what extent is this true? If one branch is fine, then how about ten? A hundred? A thousand? When does adding one more if statement become a bad idea?
At some point the negligible cost of simple branch instructions surely adds up to a significant amount. As another example, a colleague of mine found this snippet in our production code:
const char *getCountry(int cc) {
if(cc == 1) return "A1";
if(cc == 2) return "A2";
if(cc == 3) return "O1";
if(cc == 4) return "AD";
if(cc == 5) return "AE";
if(cc == 6) return "AF";
if(cc == 7) return "AG";
if(cc == 1) return "AI";
...
if(cc == 252) return "YT";
if(cc == 253) return "ZA";
if(cc == 254) return "ZM";
if(cc == 255) return "ZW";
if(cc == 256) return "XK";
if(cc == 257) return "T1";
return "UNKNOWN";
}
Obviously, this code could be improved[2]. But when I thought about it more, I wondered: should it be improved? Is there an actual performance hit for code that consists of a long series of simple branches?
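For reference, the obvious improvement is a flat lookup table. This is just a sketch - it assumes the codes really are dense integers from 1 to 257 as above, and the elided middle of the table would need to be filled in:

static const char *country_codes[] = {
    "UNKNOWN", /* index 0: not a valid code */
    "A1", "A2", "O1", "AD", "AE", "AF", "AG", "AI",
    /* ... entries for codes 9 through 251 ... */
    "YT", "ZA", "ZM", "ZW", "XK", "T1",
};

const char *getCountry(int cc) {
    /* One bounds check and one memory load, instead of up to 257 branches. */
    if (cc < 1 || cc > 257)
        return "UNKNOWN";
    return country_codes[cc];
}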
We must start our journey with a bit of theory. We want to figure out whether the CPU cost of a branch increases as we add more of them. As it turns out, assessing the cost of a branch is not trivial: on modern processors it takes between one and twenty CPU cycles. There are at least four categories of control flow instructions[3]: unconditional branch (jmp on x86), call/return, conditional branch (e.g. je on x86) that is taken, and conditional branch that is not taken. Taken branches are especially problematic: without special care they are inherently costly - we'll explain why in the following section. To bring the cost down, modern CPUs try to predict the future and figure out the branch target before the branch is actually fully executed! This is done in a special part of the processor called the branch predictor unit (BPU).
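To make these categories concrete, here's a small C sketch annotated with the kind of x86 control flow each construct typically compiles to (the exact instructions depend on the compiler and optimization level):

static void do_work(void) { /* stand-in for real work */ }

int categories(int x) {
    if (x == 0) {    /* conditional branch (e.g. je/jne): either taken or not taken */
        do_work();   /* call - plus the matching ret inside do_work() */
        x = -1;
    }
    return x;        /* ret; the compiler may also emit unconditional jmp
                        instructions to stitch basic blocks like these together */
}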
The branch predictor attempts to figure out the destination of a branching instruction very early and with very little context. This magic happens before the "decoder" pipeline stage, so the predictor has very limited data available: only some past history and the address of the current instruction. If you think about it, this is super powerful: given only the current instruction pointer, it can assess, with very high confidence, where the target of the jump will be.

[Figure: schematic of a branch predictor unit. Source: https://en.wikipedia.org/wiki/Branch_predictor]
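To build intuition about "some past history plus the instruction address", here's a toy software model of a classic 2-bit saturating counter predictor. This is purely illustrative - the table size and indexing scheme are made up, and real hardware is far more sophisticated:

#include <stdint.h>
#include <stdbool.h>

#define COUNTERS 1024 /* hypothetical table size */

/* One 2-bit counter per slot: 0-1 means "predict not taken",
   2-3 means "predict taken". */
static uint8_t counter[COUNTERS];

/* Prediction uses only the branch's address - at this point the
   condition hasn't even been decoded yet. */
static bool predict(uintptr_t branch_addr) {
    return counter[branch_addr % COUNTERS] >= 2;
}

/* Once the branch resolves, nudge the counter towards the real outcome. */
static void train(uintptr_t branch_addr, bool taken) {
    uint8_t *c = &counter[branch_addr % COUNTERS];
    if (taken) {
        if (*c < 3) (*c)++;
    } else {
        if (*c > 0) (*c)--;
    }
}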
The BPU maintains a couple of data structures, but today we'll focus on the Branch Target Buffer (BTB). It's the place where the BPU remembers the target instruction pointer of previously taken branches. The whole mechanism is much more complex - take a look at Vladimir Uzelac's Master's thesis for details about branch prediction on CPUs of the 2008 era.

For the scope of this article we'll simplify and focus on the BTB only. We'll try to show how large it is and how it behaves under different conditions.
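As a mental model, the BTB can be pictured as a small hash table keyed by the branch instruction's address. The sketch below is an assumption-laden simplification - real BTBs are set-associative, use partial addresses as tags, and their exact geometry is rarely documented:

#include <stdint.h>

#define BTB_ENTRIES 4096 /* hypothetical size; real capacities vary by microarchitecture */

struct btb_entry {
    uintptr_t branch_addr; /* address of the branch instruction (used as a tag) */
    uintptr_t target_addr; /* where that branch jumped to last time it was taken */
};

static struct btb_entry btb[BTB_ENTRIES];

/* Look up a predicted target for the instruction at 'addr'.
   Returns 0 if no taken branch at this address has been seen. */
static uintptr_t btb_lookup(uintptr_t addr) {
    struct btb_entry *e = &btb[addr % BTB_ENTRIES];
    return e->branch_addr == addr ? e->target_addr : 0;
}

/* Record a taken branch so its target can be predicted next time. */
static void btb_record(uintptr_t addr, uintptr_t target) {
    struct btb_entry *e = &btb[addr % BTB_ENTRIES];
    e->branch_addr = addr;
    e->target_addr = target;
}

Notice that two branches whose addresses collide in the table evict each other's targets - exactly the kind of capacity behavior we'll try to measure.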
But first, why is branch prediction used at all? In order to get the best performance, the CPU pipeline must be fed a constant flow of instructions. Consider what happens to a multi-stage CPU pipeline when it hits a branch instruction. To illustrate, let's look at the following ARM program: