Analysing a Benchmark on LLVM RISC-V
This was my attempt at improving a benchmark on LLVM RISCV. Unfortunately, the solution to this problem is kind of out of my league. But the attempt does make for a good blog post.
Introduction
Recently, I found this blog post by Luke Lau, in which he shows how he made improvements to the middle-end (and ultimately the generated code) in LLVM, resulting in better performance for RISC-V targets. He does this by comparing the assembly produced by LLVM and GCC on the LLVM test suite and analysing the cases where LLVM falls behind.
I found this whole process of looking at assembly and figuring out the root issue very interesting, and this blog goes over my first thorough attempt at it.
I should also note that the company Igalia hosts these benchmarks and runs periodic updates. You can take a look and see how LLVM compares to GCC on certain RISC-V targets.
Code Analysis
This link will take you to the benchmark, but I have pasted the relevant assembly in the sections below, as well as the relevant source code. Another thing to note is that the CPU used in this benchmark is the Banana Pi F3, a 64-bit in-order CPU with RISC-V's vector extension, RVV.
I will go into a level of detail that's probably not necessary, but as I'm still new to assembly, I found understanding what each instruction does to be helpful. If you're already familiar with assembly, or not really interested in the nitty-gritty, feel free to skim this section.
for (k=0; k<1000; k++) {
for (i = n-1; i >= 0; i--) {
y[i] += x[i];
}
}
Note
The inner loop here does not have any dependency on the outer loop (that is, k is not used as a subscript in the inner loop).
These are the results of the benchmark.

LLVM
BB1:
li a3, 0x0
mv a4, s0
BB2:
vsetvli a5, a4, e32, m2, ta, ma
not a2, a3
add a2, a2, s0
sub a1, t0, a5
sh2add a0, a2, s2
sh2add a2, a2, s1
sh2add a0, a1, a0
sh2add a1, a1, a2
vle32.v v8, (a0)
vle32.v v10, (a1)
sub a4, a4, a5
vadd.vv v8, v10, v8
vse32.v v8, (a1)
add a3, a3, a5
bnez a4, 0x7b2 <main+0x7a>
BB3:
addiw a7, a7, 0x1
bne a7, a6, 0x7ae <main+0x76>
graph TD
BB1 --> BB2
BB2 --> BB3
BB3 --> BB2
BB2 --> BB2
Info
A Basic Block is a sequence of instructions in a program that executes linearly, meaning it has exactly one entry point at the start and one exit point at the end. Because there are no internal jumps or branches, once the first instruction is triggered, the entire block is guaranteed to execute in order.
This is a bit of a doozy, but it’s not too bad when you match it with the source code. The middle basic block is basically doing the third line of the code snippet provided.
y[i] += x[i];
The four sh2add instructions are computing the addresses of y[i] and x[i]. They perform the following logic:
Dest = (Op1 << 2) + Op2
In this case, Op2 is the base address and Op1 is the index i, or induction variable, and multiplying it by four (an integer is 4 bytes) gets us the correct address. We see four of these instructions since there are two arrays and the inner for loop is decrementing in value.
Notice how the third sh2add writes its result into a0, and how the result of the second sh2add is only needed as an operand for the fourth sh2add. Essentially, the registers to take note of are a0 and a1, as these are where the addresses are stored.
After that, we can see the two vector load instructions, vle32.v, and later a vector store instruction, vse32.v.
At this point, you may be able to surmise that LLVM was able to vectorize this loop, as in it was able to use the RVV extension of the BPI-F3 CPU to operate on multiple elements per iteration. Pretty cool!
GCC
BB1:
lw a3,0(a1)
lw a4,0(a2)
li a5,500
slliw a3,a3,0x1
BB2:
addiw a5,a5,-1
addw a4,a4,a3
bnez a5,1064e <main+0x8e>
BB3:
sw a4,0(a2)
addi a2,a2,-4
addi a1,a1,-4
bne a2,a0,10642 <main+0x82>
graph TD
BB1 --> BB2
BB2 --> BB3
BB3 --> BB2
BB2 --> BB2
What GCC did here was very clever. Essentially, GCC's middle-end performed what's called a loop interchange: it swapped the inner loop with the outer loop, resulting in something that would look like the following in code.
for (i = n-1; i >= 0; i--) {
for (k=0; k<1000; k++) {
y[i] += x[i];
}
}
Why is this better? Take a look at the CFG and assembly and notice how the second basic block, BB2, points to itself in both the GCC and LLVM assembly. It's more obvious if you look at the website, as it clearly shows those basic blocks as hotspots, meaning the program spends most of its cycles there. In this case, both programs spend 99% of their cycles in BB2.
By doing the interchange, GCC has only add operations in BB2, while LLVM has two loads and a store, meaning that when the Banana Pi runs the code produced by Clang, it will hit a lot of cache misses and stalls. If you look at the image of the benchmark results, you can see that despite LLVM issuing fewer instructions, it still takes several more cycles. This is because the CPU has a limited-size cache, and if the memory it's looking for is not in cache, it has to go down the cache hierarchy to find it, and each tier of memory is significantly slower.
I found it pretty cool to visually see the direct impact your code can have on performance, and how much compilers can improve it. In this case, just reordering the loops makes a dramatic difference.
Loop Interchange
Ok great, so the solution is to simply run loop interchange on LLVM as well. Unfortunately, the loop interchange pass is not enabled by default. I won't even bother to summarize why, as it seems quite complicated, but take a look at the discourse if you're curious. Suffice it to say, it seems like there's still work to be done on the pass.
But I was curious - I wanted to take a look at the LLVM IR of the interchanged loop. I ran the following command to emit the LLVM IR before the Loop Vectorizer pass.
../bin/clang --target=riscv64-linux-gnu \
-march=rva22u64_v -mcpu=spacemit-x60 -O3 \
--sysroot=/usr/riscv64-linux-gnu \
-mllvm -print-before=loop-vectorize \
-mllvm -print-module-scope \
-S -emit-llvm ary3.c -o ary3_prevec.ll
Then I ran the following command to get the LLVM IR, as well as the debug output, of the loop interchange pass and of DependenceAnalysis, which the pass relies on.
../bin/opt -passes='loop-interchange' \
-mtriple=riscv64 -mattr=+v \
-debug-only=loop-interchange,da \
-S ary3_prevec.ll -o ary3_interchange.ll 2> interchange_log.txt
This ultimately did not change the LLVM IR, indicating that loop interchange did not occur.
Troubleshooting
Taking a look at the debug output, I noticed this excerpt, which proved to be key in understanding why it did not work. By the way, for.body14.us is the name of the basic block for the inner for loop, equivalent to BB2 above.
Dependency matrix before interchange:
* =
...
Failed interchange InnerLoopId = 1 and OuterLoopId = 0 due to dependence
Cannot prove legality, not interchanging loops 'for.cond11.preheader.us' and 'for.body14.us'
I've listed some code from LoopInterchange.cpp that should explain why the pass failed to run.
for (unsigned II = 1; II <= Levels; ++II) {
// `DVEntry::LE` is converted to `*`. This is because `LE` means `<`
// or `=`, for which we don't have an equivalent representation, so
// that the conservative approximation is necessary. The same goes for
// `DVEntry::GE`.
// TODO: Use of fine-grained expressions allows for more accurate
// analysis.
unsigned Dir = D->getDirection(II);
if (Dir == Dependence::DVEntry::LT)
Direction = '<';
else if (Dir == Dependence::DVEntry::GT)
Direction = '>';
else if (Dir == Dependence::DVEntry::EQ)
Direction = '=';
else
Direction = '*';
Dep.push_back(Direction);
}
And this code, which is called during the legality check.
static std::optional<bool>
isLexicographicallyPositive(ArrayRef<char> DV, unsigned Begin, unsigned End) {
for (unsigned char Direction : DV.slice(Begin, End - Begin)) {
if (Direction == '<')
return true;
if (Direction == '>' || Direction == '*')
return false;
}
return std::nullopt;
}
Basically, the nested for loop in the source code was assigned a dependency matrix of * =, with each character representing the dependency direction of each loop level. So the outer loop has a dependency direction of * and the inner loop has a dependency direction of =. I still don't fully understand all the terminology used in this pass. Take a look at my discourse post, where a user, @kasuga-fj, was kind enough to break this down for me.
I’ve not checked the details, but at a glance, I think the dependency direction for the outer loop should be DVEntry::ALL. The direction represents the possible relationships between the values of k when the same memory location is accessed. More precisely, in this case, it answers this question: when regarding the access to y as a function f(k, i), what are the possible relations between k0 and k1 when f(k0, i) = f(k1, i) for some i?
Here, f(k, i) = i. So, for example, f(0, 0) = f(1, 0) = … = f(999, 0). Thus the outer loop has a dependency with all directions.
…
I believe the right direction is to allow swapping the adjacent loops if the direction vector is * = or = *. But this is also non-trivial in terms of soundness, and we need to think this through a bit more…
His proposed solution is to allow the interchange when the dependency matrix is * =. As shown in isLexicographicallyPositive, this check currently fails because the function only returns true on a < character; the leftmost * makes it return false immediately. I'm guessing this hasn't been implemented yet since it would probably cause a lot of regressions and/or would require more discussion among the community.
Conclusion
All in all, this process was pretty rewarding. I couldn't land a change to improve or fix this, nor did I discover a missing case, but I do think this project was still worth it.