The Alder Lake anomaly, explained

Part 1 · Tavian Barnes

A couple of days ago, I posted about an anomaly in the latency of the SHLX instruction on Alder Lake performance cores. Pete Cawley and others on Twitter and Hacker News pointed me in the right direction, and I now have a better understanding of when the anomaly happens and what causes it. Not a complete understanding of why it exists in the first place, but until I get an electron microscope, this might be the most I can hope for :)

The anomaly is related to register renaming, an optimization performed by CPUs that maps a limited set of architectural registers (rax, rbx, ...) to a larger set of physical registers. The CPU uses something like Tomasulo's algorithm to convert machine code like this:

        mov  rax, rdi ; rax = rdi
        add  rax, 1   ; rax = rax + 1
        imul rax, rdi ; rax = rax * rdi
        shr  rax      ; rax = rax >> 1

into micro-operations that look more like this:

                           ; (initially)      (rdi -> pr1)
                           ; (rename)         (rax -> pr1)
        add  pr2, pr1, 1   ; pr2 = pr1 + 1    (rax -> pr2)
        imul pr3, pr2, pr1 ; pr3 = pr2 * pr1  (rax -> pr3)
        shr  pr4, pr3      ; pr4 = pr3 >> 1   (rax -> pr4)

Renaming has eliminated the mov rax, rdi instruction by pointing both rax and rdi to the same physical register, pr1. In a sense, mov is "free": it has zero latency since no µop is even issued for it.
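To make the bookkeeping concrete, here's a tiny Python sketch of that rename-table trick. Everything here is invented for illustration (the real renamer is hardware, not a dict), but it captures why the mov costs nothing:

```python
# Conceptual sketch only: the rename table as a dict mapping
# architectural registers to physical ones ("pr1" etc. are made up).
rename = {"rdi": "pr1"}

def mov(dst, src):
    # A register-to-register mov just copies the table entry;
    # no µop is issued and no value actually moves anywhere.
    rename[dst] = rename[src]

mov("rax", "rdi")
print(rename)  # {'rdi': 'pr1', 'rax': 'pr1'}
```

After the mov, both names alias the same physical register, which is exactly the "free" elimination described above.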

In 2021, Andreas Abel (creator of uops.info) noticed that the Alder Lake renamer is even more powerful than this. It can also eliminate the add rax, 1 instruction! You can observe this by measuring the throughput of repeated adds:

        add rax, 1
        add rax, 1
        add rax, 1
        ...

Normally this would execute at 1 instruction per cycle, since each add needs to wait 1 cycle for the previous add to complete. But on Alder Lake P-cores, this runs at 5 instructions per cycle! It appears that the renamer doesn't just support simple mappings like rax -> pr1, but also more complicated ones like rax -> (pr1 + 1). The instruction stream gets translated to something like

        ; initially (rax -> pr1)
        ; rename    (rax -> pr1 + 1)
        ; rename    (rax -> pr1 + 2)
        ; rename    (rax -> pr1 + 3)
        ...

Almost no µops are actually being issued, so this basically runs as fast as the frontend can decode and rename instructions. The offsets seem to be limited to 11 bits, restricting them to the range [-1024, 1023]. That means that every once in a while, the offset overflows and the renamer has to give up and actually issue an add µop. If you make the immediate larger, this will happen more often, and you can observe reduced throughput. For example, repeated add rax, 256 is handled like this:

        ; initially (rax -> pr1)
        ; rename    (rax -> pr1 + 256)
        ; rename    (rax -> pr1 + 512)
        ; rename    (rax -> pr1 + 768)
        add pr2, (pr1 + 768), 256      ; renaming would overflow
        ; rename    (rax -> pr2 + 256)
        ; rename    (rax -> pr2 + 512)
        ; rename    (rax -> pr2 + 768)
        add pr3, (pr2 + 768), 256
        ...
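The overflow cadence in that trace is easy to check numerically. This small Python helper is my own invention, just arithmetic on the 11-bit limit: it counts how many consecutive adds the renamer can absorb, starting from a zero offset, before the accumulated offset would overflow:

```python
LIMIT = 1023  # largest positive 11-bit signed offset

def renames_before_overflow(imm):
    # How many consecutive `add rax, imm` fit before the running
    # offset would exceed LIMIT and force a real add µop.
    return LIMIT // imm

print(renames_before_overflow(256))  # 3 (256, 512, 768 fit; 1024 does not)
print(renames_before_overflow(1))    # 1023
```

So with an immediate of 256, three out of every four adds are eliminated, matching the trace.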

This runs at only 2.5 instructions per cycle, because not all the adds are eliminated. It's not just add that gets optimized like this; all of these are supported:

        mov rax, imm11         ; rax -> (pr0* +  imm11)  (*the zero register)
        add rax, imm11         ; rax -> (prax +  imm11)
        sub rax, imm11         ; rax -> (prax + -imm11)
        inc rax                ; rax -> (prax +  1)
        dec rax                ; rax -> (prax + -1)
        lea rax, [rbx + imm11] ; rax -> (prbx +  imm11)

Only 64-bit operand sizes are supported, so mov eax, 1 doesn't trigger it despite having the same semantics as mov rax, 1.
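Putting the pieces together, here's a toy Python model of the renamer as I understand it: a register-to-register mov copies a table entry, small immediate adds accumulate into the offset, and an overflow forces a real µop. The Renamer class and its API are invented for illustration, not anything Intel documents:

```python
OFFSET_MIN, OFFSET_MAX = -1024, 1023  # 11-bit signed offset range

class Renamer:
    def __init__(self):
        self.next_pr = 0
        self.table = {}   # arch register -> (physical register, offset)
        self.issued = 0   # µops that actually had to be issued

    def fresh_pr(self):
        self.next_pr += 1
        return f"pr{self.next_pr}"

    def read(self, reg):
        if reg not in self.table:
            self.table[reg] = (self.fresh_pr(), 0)
        return self.table[reg]

    def mov_reg(self, dst, src):
        # mov dst, src: share the physical register, issue nothing
        self.table[dst] = self.read(src)

    def add_imm(self, dst, imm):
        # add dst, imm (sub/inc/dec are the same via the sign of imm)
        pr, off = self.read(dst)
        if OFFSET_MIN <= off + imm <= OFFSET_MAX:
            self.table[dst] = (pr, off + imm)  # eliminated by renaming
        else:
            # offset would overflow: issue a real add µop that
            # produces a fresh physical register with a zero offset
            self.issued += 1
            self.table[dst] = (self.fresh_pr(), 0)

r = Renamer()
for _ in range(1000):
    r.add_imm("rax", 1)
print(r.issued)  # 0: every add was eliminated

r = Renamer()
for _ in range(1000):
    r.add_imm("rax", 256)
print(r.issued)  # 250: one real add per three renames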

Something remarkable about this optimization is that every µop needs to be able to handle operands like (pr2 + 768), rather than just plain registers like pr2. Almost all of them seem to do this with no performance penalty. (I suspect the value of an operand like (pr2 + 768) is computed somewhere between reading the physical register file and forwarding the value to the execution port.) That is, every µop except shifts.

For some reason, if the shift count register for any dynamic shift is renamed to one of these special (prX + Y) operands, the µop will take 3 cycles instead of 1. This includes all the normal shift instructions (shl/shr/sar) and rotate instructions (rol/ror/rcl/rcr), as well as the BMI2 ones (shlx/shrx/sarx). We just noticed the BMI2 ones first because uops.info happened to trigger this anomaly when measuring them.

All input operands of every shift instruction are affected by this anomaly. (An earlier version of this post claimed otherwise, but that was a mistake in my benchmarking code.) This is true for both dynamic and static shift counts:

        mov  eax, 1
        shl  rax    ; 1 cycle
        mov  rax, 1
        shl  rax    ; 3 cycles

        mov  eax, 1
        shl  rax, 2 ; 1 cycle
        mov  rax, 1
        shl  rax, 2 ; 3 cycles

        mov  eax, 1
        mov  ecx, 1
        shl  rax, cl ; 1 cycle
        mov  rax, 1
        mov  ecx, 1
        shl  rax, cl ; 3 cycles
        mov  eax, 1
        mov  rcx, 1
        shl  rax, cl ; 3 cycles

        mov  ebx, 1
        mov  ecx, 1
        shlx rax, rbx, rcx ; 1 cycle
        mov  rbx, 1
        mov  ecx, 1
        shlx rax, rbx, rcx ; 3 cycles
        mov  ebx, 1
        mov  rcx, 1
        shlx rax, rbx, rcx ; 3 cycles

As for why this happens, your guess is as good as mine. It's possible that dynamic shifts are already so expensive that the extra addition pushes them over 1 cycle. It's possible that a bug in the hardware forced these shifts to become more expensive as a workaround. Or maybe Intel was running out of transistors after supporting every other opcode and decided shifts weren't important enough. I have no idea, but if anyone does I'd love to hear about it.