Recently, I came across a not-so-efficient implementation of an isEven function (from r/programminghorror).

bool isEven(int number)
{
    int numberCompare = 0;
    bool even = true;

    while (number != numberCompare)
    {
        even = !even;
        numberCompare++;
    }
    return even;
}

The code is in C++, but the essence of the algorithm is an iterative ascent from the base case 0 (which is even) up to the input number, flipping the boolean result at each iteration. It works (at least for non-negative inputs), but you get a linear-time O(n) isEven function instead of the obvious constant-time O(1) modulo algorithm.
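For comparison, the constant-time version is a one-liner (isEvenFast is my name for it, not from the original thread):

```cpp
// Constant-time O(1) parity check using modulo. This also handles
// negative inputs: in C++, (-4) % 2 == 0 and (-3) % 2 == -1 (non-zero),
// so the == 0 test gives the right answer for every int.
bool isEvenFast(int number)
{
    return number % 2 == 0;
}
```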

Surprisingly, Clang/LLVM is able to optimize the iterative algorithm down to the constant-time one (GCC fails on this one). With Clang 10 at full optimization, this code compiles down to:

; Function Attrs: norecurse nounwind readnone ssp uwtable
define zeroext i1 @_Z6isEveni(i32 %0) local_unnamed_addr #0 {
  %2 = and i32 %0, 1
  %3 = icmp eq i32 %2, 0
  ret i1 %3
}

The first instruction is a bitwise AND with 1, which keeps only the least significant bit (a very fast way to compute number % 2), and the second instruction compares the result with 0.
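To see why the two instructions match the source loop's result, here is a small sketch (the function names are mine): the low bit of a two's-complement integer determines its parity, so testing (n & 1) == 0 agrees with testing n % 2 == 0 for every int, negatives included.

```cpp
// Parity via modulo versus via bitwise AND. For a negative odd n,
// n % 2 evaluates to -1 while n & 1 evaluates to 1; both are non-zero,
// so the two "== 0" tests always agree.
bool evenByMod(int n) { return n % 2 == 0; }
bool evenByAnd(int n) { return (n & 1) == 0; }
```

This is exactly the shape of the optimized IR above: an `and i32 %0, 1` followed by an `icmp eq` against 0.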

I've decided to explore how Clang was able to do this. As you may know, an optimizer like LLVM (Clang's backend) is designed as a collection of specific passes which transform the code (in the form of LLVM IR) into other (preferably more optimized) code with the same semantics. LLVM has around 50 different optimization passes, each targeting a specific code pattern. When you give an optimization level to your compiler (such as -O2), a fixed list of optimization passes is executed on your code (the list has been defined for you by the compiler writers)1.

Original code

Clang compiles the C++ isEven function to straightforward assembly-like code (in the form of LLVM IR) if you don't apply any optimization passes (-O0).

; Function Attrs: noinline nounwind ssp uwtable
define zeroext i1 @_Z6isEveni(i32 %0) #0 {
    %2 = alloca i32, align 4 ; number
    %3 = alloca i32, align 4 ; numberCompare
    %4 = alloca i8, align 1 ; even
    store i32 %0, i32* %2, align 4 ; store function argument in number
    store i32 0, i32* %3, align 4 ; store 0 in numberCompare
    store i8 1, i8* %4, align 1 ; store 1 (true) in even
    br label %5
5:                                                ; preds = %9, %1
    %6 = load i32, i32* %2, align 4
    %7 = load i32, i32* %3, align 4
    %8 = icmp ne i32 %6, %7
    br i1 %8, label %9, label %16
9:                                                ; preds = %5
    %10 = load i8, i8* %4, align 1
    %11 = trunc i8 %10 to i1
    %12 = xor i1 %11, true
    %13 = zext i1 %12 to i8
    store i8 %13, i8* %4, align 1
    %14 = load i32, i32* %3, align 4
    %15 = add nsw i32 %14, 1
    store i32 %15, i32* %3, align 4
    br label %5
16:                                               ; preds = %5
    %17 = load i8, i8* %4, align 1
    %18 = trunc i8 %17 to i1
    ret i1 %18
}

LLVM IR is a low-level (close to the machine) intermediate representation, with the particularity of being in Static Single Assignment form (SSA). Each instruction produces a new value (written like %3), and never reassigns a previous one.
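A quick way to picture SSA, sketched in C++ terms rather than IR (the n0/n1/n2 names are purely illustrative): every mutation of a variable becomes a brand-new, write-once value.

```cpp
// A mutable C++ variable, reassigned twice...
int mutableVersion()
{
    int n = 5;
    n = n + 1;   // reassigns n
    n = n * 2;   // reassigns n again
    return n;
}

// ...and its SSA-style counterpart: each assignment introduces a fresh,
// write-once value, the way %1, %2, ... do in the IR above.
int ssaStyleVersion()
{
    const int n0 = 5;
    const int n1 = n0 + 1;  // like %1 = add i32 %0, 1
    const int n2 = n1 * 2;  // like %2 = mul i32 %1, 2
    return n2;
}
```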

A graphical version of the same IR is the Control Flow Graph (CFG): a graph whose vertices are basic blocks (sequences of instructions which always execute entirely, from top to bottom) and whose edges represent the possible flows of the program. For our code, the CFG looks like this:

Optimizing the code

At this point, the code is simply an SSA assembly version of our original C++ code. Let's run some LLVM optimization passes on it and see how it evolves.