Analysing a Benchmark on LLVM RISC-V
This was my attempt at improving a benchmark on LLVM RISCV. Unfortunately, the solution to this problem is kind of out of my league. But the attempt does make for a good blog post.
Introduction
Recently, I found this blog post by Luke Lau, in which he shows how he made improvements to the middle-end (and ultimately the generated code) in LLVM, resulting in better performance for RISC-V targets. He does this by comparing the assembly produced by LLVM and GCC on the LLVM test suite and analysing the cases where LLVM falls behind.
I found this whole process of looking at assembly and figuring out the root issue very interesting, and this blog goes over my first thorough attempt at it.
I should also note that the company Igalia hosts these benchmarks and runs periodic updates. You can take a look and see how LLVM compares to GCC on certain RISC-V targets.
Code Analysis
This link will take you to the benchmark, but I have pasted the relevant assembly in the sections below, as well as the relevant source code. Another thing to note is that the CPU used in this benchmark is the Banana Pi F3, a 64-bit in-order CPU with RISC-V's vector extension, RVV.
I will go into a level of detail that's probably not necessary, but as I'm still new to assembly, I found understanding what each instruction does to be helpful. If you're already familiar with assembly, or not really interested in the nitty-gritty, feel free to skim this section.
for (k=0; k<1000; k++) {
for (i = n-1; i >= 0; i--) {
y[i] += x[i];
}
}
Note
The inner loop here does not have any dependency on the outer loop (that is, k is not used as a subscript in the inner loop).
These are the results of the benchmark.

LLVM
BB1:
li a3, 0x0
mv a4, s0
BB2:
vsetvli a5, a4, e32, m2, ta, ma
not a2, a3
add a2, a2, s0
sub a1, t0, a5
sh2add a0, a2, s2
sh2add a2, a2, s1
sh2add a0, a1, a0
sh2add a1, a1, a2
vle32.v v8, (a0)
vle32.v v10, (a1)
sub a4, a4, a5
vadd.vv v8, v10, v8
vse32.v v8, (a1)
add a3, a3, a5
bnez a4, 0x7b2 <main+0x7a>
BB3:
addiw a7, a7, 0x1
bne a7, a6, 0x7ae <main+0x76>
graph TD
BB1 --> BB2
BB2 --> BB3
BB3 --> BB2
BB2 --> BB2
Info
A Basic Block is a sequence of instructions in a program that executes linearly, meaning it has exactly one entry point at the start and one exit point at the end. Because there are no internal jumps or branches, once the first instruction is triggered, the entire block is guaranteed to execute in order.
This is a bit of a doozy, but it’s not too bad when you match it with the source code. The middle basic block is basically doing the third line of the code snippet provided.
y[i] += x[i];
The four sh2add instructions are computing the addresses of y[i] and x[i]. They perform the following logic:
Dest = (Op1 << 2) + Op2
In this case, Op2 is the base address and Op1 is the index i, or induction variable, and multiplying it by four (an integer is 4 bytes) gets us the correct address. We see four of these instructions since there are two arrays and the inner for loop is decrementing in value.
Notice how the third sh2add writes its result into a0, and how the result of the second sh2add is only needed as an operand for the fourth sh2add. Essentially, the registers to take note of are a0 and a1, as these are where the addresses are stored.
After that, we can see the two vector load instructions, vle32.v, and later a vector store instruction, vse32.v.
At this point, you may be able to surmise that LLVM was able to vectorize this loop, as in it was able to use the RVV extension of the BPI-F3 CPU to operate on multiple elements per iteration. Pretty cool!
GCC
BB1:
lw a3,0(a1)
lw a4,0(a2)
li a5,500
slliw a3,a3,0x1
BB2:
addiw a5,a5,-1
addw a4,a4,a3
bnez a5,1064e <main+0x8e>
BB3:
sw a4,0(a2)
addi a2,a2,-4
addi a1,a1,-4
bne a2,a0,10642 <main+0x82>
graph TD
BB1 --> BB2
BB2 --> BB3
BB3 --> BB2
BB2 --> BB2
What GCC did here was very clever. Essentially, GCC's middle-end performed what's called a loop interchange: it swapped the inner loop with the outer loop, resulting in something that would look like the following in code.
for (i = n-1; i >= 0; i--) {
for (k=0; k<1000; k++) {
y[i] += x[i];
}
}
Why is this better? Take a look at the CFG and assembly and notice how the second basic block, BB2, points to itself in both the GCC and LLVM assembly. It's more obvious if you look at the website, as it clearly shows those basic blocks as hotspots, meaning the program spends most of its cycles there. In this case, both programs spend 99% of their cycles in BB2.
By doing the interchange, GCC has only add operations in BB2, while LLVM has two loads and a store, meaning that when the Banana Pi runs the code produced by Clang, it will hit a lot of cache misses and stalls. If you look at the image of the benchmark results, you can see that despite LLVM issuing fewer instructions, it still takes several more cycles. This is because the CPU has a limited-size cache, and if the memory it's looking for is not in cache, it has to go down the cache hierarchy to find it, and each tier of memory is significantly slower.
I found it pretty cool to visually see the direct impact your code can have on performance, and how much compilers can improve it. In this case, just reordering the loops makes a dramatic difference.
Loop Interchange
Ok great, so the solution is to simply run loop interchange on LLVM as well. Unfortunately, the loop interchange pass is not enabled by default. I won't even bother to summarize why, as it seems quite complicated, but take a look at the discourse if you're curious. Suffice it to say, it seems like there's still work to be done on the pass.
But I was curious - I wanted to take a look at the LLVM IR of the interchanged loop. I ran the following command to emit the LLVM IR before the Loop Vectorizer pass.
../bin/clang --target=riscv64-linux-gnu \
-march=rva22u64_v -mcpu=spacemit-x60 -O3 \
--sysroot=/usr/riscv64-linux-gnu \
-mllvm -print-before=loop-vectorize \
-mllvm -print-module-scope \
-S -emit-llvm ary3.c -o ary3_prevec.ll
Then I ran the following command to get the LLVM IR, as well as the debug output, of the loop interchange pass and of DependenceAnalysis, which the pass relies on.
../bin/opt -passes='loop-interchange' \
-mtriple=riscv64 -mattr=+v \
-debug-only=loop-interchange,da \
-S ary3_prevec.ll -o ary3_interchange.ll 2> interchange_log.txt
This ultimately did not change the LLVM IR, indicating that loop interchange did not occur.
Troubleshooting
Taking a look at the debug output, I noticed this excerpt, which proved to be key in understanding why it did not work. By the way, for.body14.us is the name of the basic block for the inner for loop, equivalent to BB2 above.
Dependency matrix before interchange:
* =
...
Failed interchange InnerLoopId = 1 and OuterLoopId = 0 due to dependence
Cannot prove legality, not interchanging loops 'for.cond11.preheader.us' and 'for.body14.us'
I've listed some code from LoopInterchange.cpp that should explain why the pass failed to run.
for (unsigned II = 1; II <= Levels; ++II) {
// `DVEntry::LE` is converted to `*`. This is because `LE` means `<`
// or `=`, for which we don't have an equivalent representation, so
// that the conservative approximation is necessary. The same goes for
// `DVEntry::GE`.
// TODO: Use of fine-grained expressions allows for more accurate
// analysis.
unsigned Dir = D->getDirection(II);
if (Dir == Dependence::DVEntry::LT)
Direction = '<';
else if (Dir == Dependence::DVEntry::GT)
Direction = '>';
else if (Dir == Dependence::DVEntry::EQ)
Direction = '=';
else
Direction = '*';
Dep.push_back(Direction);
}
And this code, which is called during the legality check.
static std::optional<bool>
isLexicographicallyPositive(ArrayRef<char> DV, unsigned Begin, unsigned End) {
for (unsigned char Direction : DV.slice(Begin, End - Begin)) {
if (Direction == '<')
return true;
if (Direction == '>' || Direction == '*')
return false;
}
return std::nullopt;
}
Basically, the nested for loop in the source code was assigned a dependency matrix of * =, with each character representing the dependency direction of each loop level. So the outer loop has a dependency direction of * and the inner loop has a dependency direction of =. I still don't fully understand all the terminology used in this pass. Take a look at my discourse post, where a user, @kasuga-fj, was kind enough to break this down for me.
I’ve not checked the details, but at a glance, I think the dependency direction for the outer loop should be DVEntry::ALL. The direction represents the possible relationships between the values of k when the same memory location is accessed. More precisely, in this case, it answers this question: when regarding the access to y as a function f(k, i), what are the possible relations between k0 and k1 when f(k0, i) = f(k1, i) for some i?
Here, f(k, i) = i. So, for example, f(0, 0) = f(1, 0) = … = f(999, 0). Thus the outer loop has a dependency with all directions.
…
I believe the right direction is to allow swapping the adjacent loops if the direction vector is * = or = *. But this is also non-trivial in terms of soundness, and we need to think this through a bit more…
His proposed solution is to allow the interchange when the dependency matrix is * =. As shown in isLexicographicallyPositive, this check currently fails because the function only returns true on a < character; the leftmost * makes it return false immediately. I'm guessing this hasn't been implemented yet since it would probably cause a lot of regressions and/or would require more discussion among the community.
Conclusion
All in all, this process was pretty rewarding. I couldn't land a change to improve or fix this, nor did I discover a missing case, but I do think this project was still worth it.