The Alder Lake shlx anomaly
At the end of 2024, Harold Aptroot posted this:
Apparently shlx is a "medium latency" (3 cycles) instruction on Alder Lake. My disappointment is immeasurable, and my day is ruined.
I was immediately nerd sniped because I am into low-level performance analysis, and I happen to own an Alder Lake laptop.
A bit of background: Alder Lake is the 12th generation of Intel Core processors.
It's the first generation with a "hybrid architecture," containing both performance (P) and efficiency (E) cores.
shlx is a left-shift instruction introduced in the BMI2 instruction set.
The main difference from shl is that shlx doesn't affect the flags register.
It's also a 3-operand instruction:
shl  rax, cl          # rax = rax << cl  (only cl allowed as shift count)
shlx rax, rbx, rdx    # rax = rbx << rdx (any register allowed as shift count)
Left-shift is one of the simplest things to implement in hardware, so it's quite surprising that it should take 3 whole CPU cycles. It's been 1 cycle on every other CPU I'm aware of. It's even 1 cycle on Alder Lake's efficiency cores! Only the performance cores have this particular performance problem.
The 3-cycle figure Harold cited comes from uops.info. They even document the exact instruction sequence used in their benchmark that measured the 3-cycle latency, with a sample nanoBench command to reproduce it. Running that command on my laptop indeed measures 3 cycles of latency.
On the other hand, other sources (like Intel and InstLatX64) claim the latency is 1 cycle. What gives? I decided to write my own benchmark to try to understand the discrepancy.
.intel_syntax noprefix
.globl main
main:
        mov rdx, 10000          # rdx = 10000 (outer loop counter)
        xor rax, rax            # rax = 0
.LOOP:
        mov rcx, 1              # rcx = 1 (shift count)
        .rept 10000
        shlx rax, rax, rcx      # rax = rax << rcx (repeated 10,000 times)
        .endr
        dec rdx
        jnz .LOOP               # loop 10,000 times
        xor eax, eax
        ret                     # return 0
This code contains an outer loop with 10,000 iterations.
Inside the loop, we initialize rcx to 1, then run shlx rax, rax, rcx 10,000 times.
In total, we run shlx 100,000,000 times, so all the other instructions (including the ones that run before main()) are negligible.
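The instruction count can be checked by hand. The sketch below is only back-of-the-envelope arithmetic in Python; the variable names are illustrative and not part of the benchmark. It counts the instructions executed inside main and shows how small the loop overhead is next to the shlx chain.

```python
# Back-of-the-envelope count of the instructions executed inside main.
OUTER_ITERATIONS = 10_000     # mov rdx, 10000 ... dec rdx / jnz .LOOP
SHLX_PER_ITERATION = 10_000   # .rept 10000

shlx_count = OUTER_ITERATIONS * SHLX_PER_ITERATION

# Once per outer iteration: mov rcx, 1 + dec rdx + jnz .LOOP
loop_overhead = OUTER_ITERATIONS * 3

# Executed once: mov rdx, 10000 + xor rax, rax + xor eax, eax + ret
setup_and_exit = 4

total = shlx_count + loop_overhead + setup_and_exit
overhead_fraction = (loop_overhead + setup_and_exit) / total

print(f"shlx instructions:  {shlx_count:,}")
print(f"other instructions: {loop_overhead + setup_and_exit:,}")
print(f"overhead fraction:  {overhead_fraction:.4%}")
```

Whatever perf reports beyond this total comes from code that runs outside main, such as process startup.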
I used taskset -c 0 to pin it to a P core, and perf for measurement:
$ gcc shlx.s -o shlx
$ taskset -c 0 perf stat --cputype=core -e 'cycles,instructions' ./shlx
Performance counter stats for './shlx':
301,614,809 cpu_core/cycles:u/
100,155,910 cpu_core/instructions:u/ # 0.33 insn per cycle
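Here we see 0.33 instructions per cycle. Since each shlx depends on the result of the previous one, the chain can't overlap, so that corresponds to a 3-cycle latency.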
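Converting perf's counters into a latency estimate is a single division. Here is a minimal sketch in Python using the two numbers from the output above; the helper name cycles_per_instruction is made up for illustration. It only approximates latency because the shlx instructions form one serial dependency chain on rax.

```python
def cycles_per_instruction(cycles: int, instructions: int) -> float:
    """Average cycles per retired instruction (the inverse of IPC)."""
    return cycles / instructions

# Counter values copied from the perf stat output above.
cycles = 301_614_809
instructions = 100_155_910

cpi = cycles_per_instruction(cycles, instructions)

# For a serial dependency chain, cycles per instruction is roughly the
# latency of one link in the chain.
print(f"{cpi:.2f} cycles per instruction")
print(f"{1 / cpi:.2f} instructions per cycle")
```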
Let's try initializing rcx differently:
.LOOP:
- mov rcx, 1
+ mov ecx, 1
ecx is the 32-bit low half of the 64-bit rcx register.
On x86-64, writing a 32-bit register implicitly sets the upper half of the corresponding 64-bit register to zero.
So these two instructions should behave identically.
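To make the zero-extension rule concrete, here is a toy Python model of the two register writes. It models only the architectural value left in rcx, not timing, and the function names write_r64 and write_r32 are made up for illustration.

```python
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1

def write_r64(old_value: int, new_value: int) -> int:
    """Model of a 64-bit register write (e.g. mov rcx, 1): all 64 bits are replaced."""
    return new_value & MASK64

def write_r32(old_value: int, new_value: int) -> int:
    """Model of a 32-bit register write on x86-64 (e.g. mov ecx, 1):
    the low 32 bits are replaced and the upper 32 bits are zeroed.
    Note that old_value is ignored: no stale upper bits survive."""
    return new_value & MASK32

# Start from a register full of ones to show that nothing is left over.
stale = MASK64

rcx_after_mov_rcx = write_r64(stale, 1)   # mov rcx, 1
rcx_after_mov_ecx = write_r32(stale, 1)   # mov ecx, 1

print(hex(rcx_after_mov_rcx), hex(rcx_after_mov_ecx))
```

Both writes leave the same 64-bit value in rcx, which is why the two versions of the loop are architecturally equivalent.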
And yet:
Performance counter stats for './shlx':
100,321,870 cpu_core/cycles:u/
100,155,867 cpu_core/instructions:u/ # 1.00 insn per cycle
It seems like shlx performs differently depending on how the shift count register is initialized.
If the count is written by a 64-bit instruction with an immediate operand, shlx is slow.
The same goes for instructions like inc (which behaves much like add with an immediate of 1).
.LOOP:
- mov rcx, 1
+ xor rcx, rcx
+ inc rcx
Performance counter stats for './shlx':
300,138,108 cpu_core/cycles:u/
100,165,881 cpu_core/instructions:u/ # 0.33 insn per cycle
On the other hand, 32-bit instructions, and 64-bit instructions without immediates (even no-op ones), make it fast. All of these ways to initialize rcx lead to 1-cycle latency:
.LOOP:
        mov ecx, 1

.LOOP:
        xor rcx, rcx

.LOOP:
        mov rcx, 1
        mov rcx, rcx

.LOOP:
        mov rcx, 1
        push rcx
        pop rcx
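Trying this many variants by hand gets tedious, so here is a sketch of how the comparison could be scripted in Python. The variant labels, the helper names make_source and measure, and the file naming are all hypothetical. The gcc, taskset, and perf invocations are the same ones used earlier in this post, so it assumes those tools are installed and that CPU 0 is a P core.

```python
import subprocess
from pathlib import Path

# Benchmark skeleton from this post; {init} is replaced by the instructions
# that set up the shift count in rcx.
TEMPLATE = """\
.intel_syntax noprefix
.globl main
main:
        mov rdx, 10000
        xor rax, rax
.LOOP:
{init}
        .rept 10000
        shlx rax, rax, rcx
        .endr
        dec rdx
        jnz .LOOP
        xor eax, eax
        ret
"""

# Hypothetical labels for the initialization sequences tried above.
VARIANTS = {
    "mov_rcx_imm": ["mov rcx, 1"],
    "mov_ecx_imm": ["mov ecx, 1"],
    "xor_inc": ["xor rcx, rcx", "inc rcx"],
    "xor_only": ["xor rcx, rcx"],
    "mov_then_mov_rr": ["mov rcx, 1", "mov rcx, rcx"],
    "mov_then_push_pop": ["mov rcx, 1", "push rcx", "pop rcx"],
}

def make_source(init_lines):
    """Return the assembly source for one variant."""
    init = "\n".join(f"        {line}" for line in init_lines)
    return TEMPLATE.format(init=init)

def measure(name, workdir="."):
    """Assemble one variant, run it under perf, and return perf's report."""
    src = (Path(workdir) / f"shlx_{name}.s").resolve()
    exe = (Path(workdir) / f"shlx_{name}").resolve()
    src.write_text(make_source(VARIANTS[name]))
    subprocess.run(["gcc", str(src), "-o", str(exe)], check=True)
    perf = subprocess.run(
        ["taskset", "-c", "0", "perf", "stat", "--cputype=core",
         "-e", "cycles,instructions", str(exe)],
        capture_output=True, text=True, check=True,
    )
    return perf.stderr  # perf stat prints its summary to stderr
```

Looping over VARIANTS and printing measure(name) for each key would produce one perf summary per initialization sequence.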
It is very strange to me that the instruction used to set the shift count register can make the shlx instruction 3× slower.
The 32-bit vs. 64-bit operand size distinction is especially surprising to me as shlx only looks at the bottom 6 bits of the shift count.
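To illustrate the bottom-6-bits point, here is a small Python model of the architectural result of a 64-bit shlx. The function name shlx64 is made up for illustration, and the model says nothing about timing.

```python
MASK64 = (1 << 64) - 1

def shlx64(src: int, count: int) -> int:
    """Architectural result of a 64-bit shlx: only bits 5:0 of count are used."""
    return (src << (count & 0x3F)) & MASK64

src = 0x0123_4567_89AB_CDEF
baseline = shlx64(src, 1)

# Counts that differ from 1 only above bit 5 all produce the same result.
same_low_bits = (1 + 64, 1 | (1 << 32), 1 | (0xFFFF_FFFF << 32))
results = [shlx64(src, count) for count in same_low_bits]

print(hex(baseline))
print(all(result == baseline for result in results))
```

As far as the result is concerned, the upper 58 bits of the count register are irrelevant.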
I do not have a good explanation for this yet, but I will update this page if I ever figure it out.