Tracking down a 25% Regression on LLVM RISC-V
Similar to the previous post, this post covers my analysis of a benchmark on RISC-V targets. Unlike the previous post, I was able to land a patch to eliminate the performance gap to GCC (for this benchmark)!
Analysis
I was looking at Igalia’s site comparing the performance of LLVM to GCC on RISC-V targets, and I noticed this particular benchmark.
![]()
As shown in the image above, LLVM requires roughly 8% more cycles than GCC for this specific benchmark on the SiFive P550 CPU.
I have included snippets of the relevant basic block assembly. Practically all the cycles were spent on the assembly below.
LLVM
![]()
GCC
![]()
From the two assembly snippets, it wasn’t immediately obvious to me why GCC was doing better. They seemed almost identical, and if anything, LLVM was able to optimize the branch logic here. The big difference I did notice was that LLVM was doing an fdiv.d, a division on double-precision floating point (an f64).
This did seem promising, so I decided to run llvm-mca on the generated assembly to get a better idea of what was happening. Note that I am using an llvm-mca built from upstream source that was a couple of days old. From the analysis done by llvm-mca, I noticed that the fdiv.d instruction made no appearance in the loop. Although I did not show it above, both GCC and LLVM contained an fdiv.d instruction in a later basic block, but this was outside the main loop and therefore not relevant to the performance difference.
Info
llvm-mca is a tool within the LLVM suite that can be used to statically estimate the performance of machine code for a specific CPU.
$LLVM_BUILD_DIR/bin/llvm-mca -mtriple=riscv64 -mcpu=sifive-p550 pi.s
![]()
Around the area where there should have been an fdiv.d, there were two fdiv.s instructions, just like GCC. From this, I concluded that the fdiv.d instruction in the loop must be a recent regression: LLVM used to be able to narrow the double to a float, but the latest builds no longer do.
I confirmed this was indeed a regression by comparing LLVM to a prior build for the same CPU and benchmark.
https://cc-perf.igalia.com/db_default/v4/nts/profile/260/406/4
![]()
Below is the assembly generated by the prior build of LLVM.
![]()
No fdiv.d instruction on the prior build. In its place, an fdiv.s is used.
Out-of-Order Execution
If you are wondering why the ordering of the assembly above differs from the assembly of the new LLVM build, it’s because the target CPU, the SiFive P550, is an out-of-order CPU. Unlike the CPU in the Banana Pi mentioned in the previous post, which is an in-order core, this target can execute instructions in an order that produces higher throughput.
I’m not exactly sure why the fdiv.d is placed higher up, but my suspicion is that since the double division has a significantly higher latency, the CPU is trying to dispatch other instructions to ‘hide’ this latency. The fcvt.s.d instruction that consumes the result of the fdiv.d, ft1, would need to wait 33 cycles, so perhaps instructions were scheduled between them.
Where is this happening?
At this point we don’t know why this happened, but a step in the right direction is to figure out where it happens. Perhaps it was some change in the RISC-V backend? The command below lists the recent commits related to RISC-V.
git log --after="2026-04-01 00:11" --before="2026-04-04 00:10" | grep -E "RISC"
[RISCV] Select add(vec, splat(scalar)) to PADD_*S for P extension (#190303)
[RISCV] Allow coalesceVSETVLIs to move an LI if it allows a vsetvli to be mutated. (#190287)
[RISCV][TTI] Update cost and prevent exceed m8 for vector.extract.last.active (#188160)
[RISCV] Check EnsureWholeVectorRegisterMoveValidVTYPE in RISCVInsertVSETVLI::transferBefore. (#190022)
[RISCV] Remove codegen for vp_ctlz, vp_cttz, vp_ctpop (#189904)
Part of the work to remove trivial VP intrinsics from the RISC-V
[RISCV] Move unpaired instruction back in RISCVLoadStoreOptimizer (#189912)
`RISCVLoadStoreOptimizer` moves the instruction adjacent to the other
[RISCV] Fix stackmap shadow trimming NOP size for compressed targets (#189774)
[RISCV] Relax VL constraint in convertSameMaskVMergeToVMv (#189797)
[RISCV] Add SATI_RV64/USATI_RV64 to RISCVOptWInstrs. (#190030)
and the RISCVISD::SATI encoding uses the type width minus one.
[RISCV][MCA] Update sifive-p670 tests to consume input files instead (#189785)
[RISCV] Remove codegen for vp_minnum, vp_maxnum (#189899)
Part of the work to remove trivial VP intrinsics from the RISC-V
[RISCV] Add RISCVISD::USATI/SATI to computeKnownBitsForTargetNode/ComputeNumSignBitsForTargetNode. (#189702)
[RISCV] Add assertions to VSETVLIInfo::hasSEWLMULRatioOnly(). NFC (#189799)
[RISCV] combine-is_fpclass.ll - add initial tests showing failure to constant fold ISD::IS_FPCLASS nodes (#189940)
[RISCV] Remove codegen for VP float rounding intrinsics (#189896)
Part of the work to remove trivial VP intrinsics from the RISC-V
[RISCV] Remove codegen for vp_lrint, vp_llrint (#189714)
Part of the work to remove trivial VP intrinsics from the RISC-V
[RISCV] Add codegen support for SATI and USATI. (#189532)
None of these commits immediately stood out to me, so I put them off. Investigating the middle-end, I decided to look at the final LLVM IR produced by opt with my local LLVM version that was a few days old. Remember that my build is the ‘working’ build. Here is how to get the LLVM IR at the beginning of the pipeline (right after clang), and at the end.
Gets the LLVM IR at the beginning of the pipeline, right after clang:
$LLVM_BUILD_DIR/bin/clang -O3 \
--target=riscv64-unknown-linux-gnu \
-march=rv64gc_zba_zbb \
--sysroot=/usr/riscv64-linux-gnu \
-Xclang -disable-llvm-passes \
-S -emit-llvm pi.c -o pi_raw_old.ll
Gets the LLVM IR at the end of the optimization pipeline:
$LLVM_BUILD_DIR/bin/clang -O3 \
--target=riscv64-unknown-linux-gnu \
-march=rv64gc_zba_zbb \
--sysroot=/usr/riscv64-linux-gnu \
-S -emit-llvm pi.c -o pi.ll
graph LR
%% Main Compiler Flow
Source[Source Code<br/><b>pi.c</b>] --> FE[<b>Front-End</b><br/><i>Clang/Lexer/Parser</i>]
subgraph MiddleEnd ["<b>Middle-End</b> (Optimization Pipeline)"]
direction LR
ME_Start(<b>Beginning of Middle-End</b><br/><i>Unoptimized IR</i>) --> Opts[<i>InstCombine, LoopUnroll, GVN, etc.</i>]
Opts --> ME_End(<b>End of Middle-End</b><br/><i>Optimized IR</i>)
end
FE --> ME_Start
ME_End --> BE[<b>Backend</b><br/><i>Code Generator/Instruction Selection</i>]
BE --> Asm(<b>Assembly</b><br/><i>RISC-V</i>)
%% Styling Main Pipeline
style Source fill:none,stroke:#888,stroke-width:1px
style FE fill:none,stroke:#888,stroke-width:1px
style ME_Start fill:none,stroke:#1e88e5,stroke-width:2px
style ME_End fill:none,stroke:#1e88e5,stroke-width:2px
style Opts fill:none,stroke:#888,stroke-width:1px,stroke-dasharray: 5 5
style BE fill:none,stroke:#888,stroke-width:1px
style Asm fill:none,stroke:#888,stroke-width:1px
style MiddleEnd fill:none,stroke:#444,stroke-width:1px,stroke-dasharray: 3 3
%% Compact Command Boxes
%% We use <div> and <code> to keep things tight and prevent block-level gaps
Cmd_Raw["<div style='line-height:1.2;'><b>Command 1: Raw IR</b><br/><code style='background:none;color:inherit;'>-Xclang -disable-llvm-passes</code><br/><small>Produces <b>pi_raw_old.ll</b></small></div>"]
Cmd_Opt["<div style='line-height:1.2;'><b>Command 2: Optimized IR</b><br/><code style='background:none;color:inherit;'>-O3 -S -emit-llvm</code><br/><small>Produces <b>pi.ll</b></small></div>"]
%% Connections
Cmd_Raw -.-> ME_Start
Cmd_Opt -.-> ME_End
%% Style Command Notes
style Cmd_Raw fill:#fff3e011,stroke:#f57c00,stroke-width:1px,rx:10,ry:10
style Cmd_Opt fill:#fff3e011,stroke:#f57c00,stroke-width:1px,rx:10,ry:10
If you are confused about what I am doing, hopefully the diagram above illustrates it better. If the middle-end is responsible for narrowing the double to a float, I am trying to get an idea of which optimizations applied to the IR cause this.
This is the relevant snippet of the IR from the beginning of the optimization pipeline.
%conv = sitofp i64 %5 to float ; int -> float
%conv2 = fpext float %conv to double ; float -> double
%div3 = fdiv double %conv2, 7.438300e+04 ; fdiv.d
%conv4 = fptrunc double %div3 to float ; double -> float
Which was converted into the following by the end of the pipeline.
%conv = uitofp nneg i64 %0 to float ; int -> float
%conv4 = fdiv float %conv, 7.438300e+04 ; fdiv.s
We can see that by the end of the middle-end, the double was narrowed to a float. I hope these snippets make it clear that it was the middle-end that is responsible for narrowing this initial double value to a float.
If you’re confused as to why these seemingly redundant cast operations are produced in the first place, taking a cursory glance at the source code can help.
int main(int argc, char *argv[]) {
float ztot, yran, ymult, ymod, x, y, z, pi, prod;
long int low, ixran, itot, j, iprod;
...
for(j=1; j<=itot; j++) {
iprod = 27611 * ixran;
ixran = iprod - 74383*(long int)(iprod/74383);
x = (float)ixran / 74383.0;
...
}
...
}
This specific line shows that 74383.0 is a double in the source code, but it can fit within a float.
x = (float)ixran / 74383.0;
This is why the LLVM IR has an fpext converting the float to a double, and then an fptrunc of the double back to a float.
This may be obvious, but I’ll state it regardless. If the literal above had initially been a float, like below, the fdiv.d would never have been produced.
x = (float)ixran / 74383.0f;
Given a sufficient optimization level, the compiler should still be able to catch something like this, but it’s still cool seeing how such a little change in code can make such a huge difference. In this case, over +19% in cycles!
The table below shows the mapping of LLVM IR to the C source code, as well as a brief note on what the specified IR does.
| LLVM IR | C Source | Notes |
|---|---|---|
| `%mul = mul nuw nsw i64 %ixran.053, 27611` | `iprod = 27611 * ixran` | integer multiply |
| `%0 = urem i64 %mul, 74383` | `ixran = iprod - 74383*(long int)(iprod/74383)` | compiler optimized the modulo into a `urem` |
| `%conv2 = uitofp nneg i64 %0 to double` | `(float)ixran` | cast to double first due to `74383.0` being a double literal |
| `%div3 = fdiv double %conv2, 7.438300e+04` | `/ 74383.0` | division in double precision because `74383.0` is a double literal in C |
| `%conv4 = fptrunc double %div3 to float` | `x = (float)...` | explicit `(float)` cast truncates result back to float |
The LLVM build at the time was producing the following.
%mul = mul nuw nsw i64 %ixran.053, 27611
%0 = urem i64 %mul, 74383
%conv2 = uitofp nneg i64 %0 to double
%div3 = fdiv double %conv2, 7.438300e+04
%conv4 = fptrunc double %div3 to float
We can see that the operands of the %div3 operation are %conv2 and a constant. %conv2 is the result of a cast converting %0 into a double, but the maximum value of %0 is 74,382 (it is the result of a urem by 74383), which fits exactly within a float.
Looking at the result from llvm-mca, we can see the following for fdiv.d.
1 33 32.00 fdiv.d ft1, ft1, fa3
This shows that fdiv.d has a latency of 33 cycles, significantly longer than fdiv.s’s 19 cycles. In our performance comparison, the older LLVM build reported a Reciprocal Throughput (RThroughput) of 86.0, whereas the newer build increased to 100.0. RThroughput represents the number of clock cycles the processor must wait before it can start executing another instruction of the same type, so the lower, the better.
This shows how a double division is pretty expensive compared to a float division.
From the list of commits in those past few days, the one below immediately stood out to me.
[InstCombine] Use ComputeNumSignBits in isKnownExactCastIntToFP (#190235)
For signed int-to-FP casts, ComputeNumSignBits can prove exactness where
computeKnownBits cannot -- e.g. through ashr(shl x, a), b where sign propagation is
tracked precisely but individual known bits are all unknown.
Think of a rewrite like x = x * 2 to x = x << 1.
Given that the newest build could no longer narrow that integer-to-float cast, this commit message mentioning int-to-FP casts (sitofp/uitofp) immediately roused my suspicions, and they were confirmed when I built LLVM before and after that commit. Before this change, everything worked; afterward, fdiv.d instructions appeared in the assembly.
I should also note that this patch is an improvement - the InstCombine pass now has more information available to it - but sometimes improvements can cause regressions elsewhere in unforeseeable ways. I would posit this creates opportunities for folks like me to contribute 😅
Why is this happening?
Looking at the diff of this commit, I can see that the author of the patch added sign-bit analysis to the function isKnownExactCastIntToFP.
static bool isKnownExactCastIntToFP(CastInst &I, InstCombinerImpl &IC) {
...
// For sitofp, the sign maps to the FP sign bit, so only magnitude bits
// (BitWidth - NumSignBits) consume mantissa.
if (IsSigned) {
SigBits =
(int)SrcTy->getScalarSizeInBits() - IC.ComputeNumSignBits(Src, &I);
if (SigBits <= DestNumSigBits)
return true;
}
return false;
}
You don’t need an in-depth understanding of what this change did, but we can infer that it causes isKnownExactCastIntToFP to return true where it previously returned false. So I decided to look at all the call sites of isKnownExactCastIntToFP to gain a better understanding of how this could lead to a regression.
The following function calls isKnownExactCastIntToFP. It is called to reduce an fpext instruction when a preceding instruction casts an integer to FP, like so: itofp i64 x to float followed by fpext float to double. This can just be reduced to an itofp i64 x to double instead.
Instruction *InstCombinerImpl::visitFPExt(CastInst &FPExt) {
// If the source operand is a cast from integer to FP and known exact, then
// cast the integer operand directly to the destination type.
Type *Ty = FPExt.getType();
Value *Src = FPExt.getOperand(0);
if (isa<SIToFPInst>(Src) || isa<UIToFPInst>(Src)) {
auto *FPCast = cast<CastInst>(Src);
if (isKnownExactCastIntToFP(*FPCast))
return CastInst::Create(FPCast->getOpcode(), FPCast->getOperand(0), Ty);
}
return commonCastTransforms(FPExt);
}
Now let's put it all together. First, recall the LLVM IR that was generated at the beginning of the pipeline.
%conv = sitofp i64 %5 to float ; int -> float
%conv2 = fpext float %conv to double ; float -> double
%div3 = fdiv double %conv2, 7.438300e+04 ; fdiv.d
%conv4 = fptrunc double %div3 to float ; double -> floatAnd the LLVM IR at the end of the pipeline.
%conv2 = uitofp nneg i64 %0 to double
%div3 = fdiv double %conv2, 7.438300e+04
%conv4 = fptrunc double %div3 to float
graph TD
%% Node Definitions
N1["%conv = sitofp i64 %5 to float<br/><i>int -> float</i>"]
N2["%conv2 = fpext float %conv to double<br/><i>float -> double</i>"]
N3["%div3 = fdiv double %conv2, 7.438300e+04<br/><i>fdiv.d</i>"]
N4["%conv4 = fptrunc double %div3 to float<br/><i>double -> float</i>"]
%% Data Flow
N1 --> N2
N2 --> N3
N3 --> N4
%% Subgraph
subgraph CastGroup ["Initial Casts"]
direction TB
N1
N2
end
%% Annotation Node
AnnotateNode["<b>InstCombinerCast</b><br/>Reduces this pattern to %conv2 = uitofp nneg i64 %0 to double"]
%% Styling - Using neutral strokes that work in both modes
style AnnotateNode fill:#0288d122,stroke:#0288d1,stroke-width:2px,rx:10,ry:10
style CastGroup fill:none,stroke:#888,stroke-dasharray: 5 5
%% Invisible edges for layout
AnnotateNode -.-> N1
AnnotateNode -.-> N2
The InstCombiner pass, after that patch, is able to optimize the sitofp i64 to float followed by the fpext float to double, reducing them to a single uitofp nneg i64 %0 to double.
However, visitFPTrunc optimized the following pattern as commented in the code:
// If we have fptrunc(OpI (fpextend x), (fpextend y)), we would like to
// simplify this expression to avoid one or more of the trunc/extend
// operations if we can do so without changing the numerical results.
//
// The exact manner in which the widths of the operands interact to limit
// what we can and cannot do safely varies from operation to operation, and
// is explained below in the various case statements.
The optimized LLVM IR now no longer has that fpext instruction, so the InstCombiner pass and specifically visitFPTrunc can no longer narrow the double to a float.
Landing a solution
So we diagnosed the issue: the recent patch to InstCombine, which improved the pass's logic, leaves it unable to perform another optimization it used to do. We need to teach the InstCombiner to narrow the uitofp/sitofp operation earlier when there's an fptrunc later anyway.
graph TD
%% Define the Instruction Spine
subgraph Flow ["LLVM Data Flow"]
direction TB
N0["%0 = urem i64 %mul, 74383<br/><i>Range Provider: 0 to 74,382</i>"]
N1["%conv2 = uitofp nneg i64 %0 to double<br/><i>Integer to Double</i>"]
N2["%div3 = fdiv double %conv2, 7.438300e+04<br/><i>Double Division</i>"]
N3["%conv4 = fptrunc double %div3 to float<br/><i>Double to Float</i>"]
%% Define the vertical chain
N0 --> N1
N1 --> N2
N2 --> N3
end
%% Define the Optimizer Node on the side
IC["<b>InstCombiner</b><br/>Reduces to float-precision<br/>math if inputs fit."]
%% Styling: Using #AARRGGBB or #RRGGBBAA logic for Hextra
%% fill:#1b5e2022 is a very transparent green (approx 13% opacity)
style IC fill:#1b5e2022,stroke:#4caf50,stroke-width:2px,rx:10,ry:10
%% Subgraph styling to keep it neutral
style Flow fill:none,stroke:#888888,stroke-dasharray: 5 5
%% Dotted arrow connections
IC -.-> N0
IC -.-> N1
IC -.-> N3
I first raised an issue on GitHub. While I was confident that this was a valid issue, I wanted to double-check that it warranted a fix. I do want to thank the author of the patch I mentioned earlier, @SavchenkoValeriy, who offered me guidance on solving this issue as well as reviews on my PR. My initial solution would’ve been pretty convoluted, but he was able to offer a much simpler approach.
GitHub Issue: https://github.com/llvm/llvm-project/issues/190503
PR: https://github.com/llvm/llvm-project/pull/190550
As pointed out by my reviewer, the issue is that isKnownExactCastIntToFP only checks exactness against the destination type of the casting instruction (the CastInst) itself.
%0 = urem i64 %mul, 74383
%conv2 = uitofp nneg i64 %0 to double
%div3 = fdiv double %conv2, 7.438300e+04
%conv4 = fptrunc double %div3 to float
The code below is the visitFPTrunc function. You can see the switch statement over the different operations, including FDiv, which corresponds to the fdiv.d/fdiv.s. FPT corresponds to the fptrunc instruction, and BO corresponds to FPT.getOperand(0), so it refers to %div3. We need to see if we can instead convert this uitofp to produce a float, so we need to modify getMinimumFPType to check for int-to-FP cast operations as well as fpext instructions.
Instruction *InstCombinerImpl::visitFPTrunc(FPTruncInst &FPT) {
if (Instruction *I = commonCastTransforms(FPT))
return I;
...
Type *Ty = FPT.getType();
auto *BO = dyn_cast<BinaryOperator>(FPT.getOperand(0));
if (BO && BO->hasOneUse()) {
Type *LHSMinType = getMinimumFPType(BO->getOperand(0), PreferBFloat);
Type *RHSMinType = getMinimumFPType(BO->getOperand(1), PreferBFloat);
switch (BO->getOpcode()) {
default: break;
case Instruction::FAdd:
case Instruction::FSub:
...
...
}
One of my initial ideas was to modify isKnownExactCastIntToFP to accept a parameter with a different Type (f32 in my case), defaulting to nullptr. This would allow us to modify just the header declaration and its implementation. My reviewer instead proposed making a variant of isKnownExactCastIntToFP, canBeCastedExactlyIntToFP, which performs the actual analysis with the type given to it, with isKnownExactCastIntToFP calling it. I mention this to show that it’s good to interact with the community and seek ideas from others; they can come up with approaches that, for a variety of reasons, are better.
Below is the final git diff. We split isKnownExactCastIntToFP, creating canBeCastedExactlyIntToFP to perform the actual analysis and having isKnownExactCastIntToFP call it. Then we have getMinimumFPType call canBeCastedExactlyIntToFP.
index bc52bf1168d4..2688891c1509 100644
--- a/llvm/include/llvm/Transforms/InstCombine/InstCombiner.h
+++ b/llvm/include/llvm/Transforms/InstCombine/InstCombiner.h
@@ -481,6 +481,8 @@ public:
/// Return true if the cast from integer to FP can be proven to be exact
/// for all possible inputs (the conversion does not lose any precision).
bool isKnownExactCastIntToFP(CastInst &I) const;
+ bool canBeCastedExactlyIntToFP(Value *V, Type *FPTy, bool IsSigned,
+ const Instruction *CxtI = nullptr) const;
OverflowResult computeOverflowForUnsignedMul(const Value *LHS,
const Value *RHS,
diff --git a/llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp b/llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp
index e3c39a3c193e..0cd035bd1413 100644
--- a/llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp
+++ b/llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp
@@ -2039,10 +2039,17 @@ static Type *shrinkFPConstantVector(Value *V, bool PreferBFloat) {
}
/// Find the minimum FP type we can safely truncate to.
-static Type *getMinimumFPType(Value *V, bool PreferBFloat) {
+static Type *getMinimumFPType(Value *V, Type *PreferredTy, InstCombiner &IC) {
if (auto *FPExt = dyn_cast<FPExtInst>(V))
return FPExt->getOperand(0)->getType();
+ Value *Src;
+ if (match(V, m_IToFP(m_Value(Src))) &&
+ IC.canBeCastedExactlyIntToFP(Src, PreferredTy, isa<SIToFPInst>(V),
+ cast<Instruction>(V)))
+ return PreferredTy;
+
+ bool PreferBFloat = PreferredTy->getScalarType()->isBFloatTy();
Notice how we now call canBeCastedExactlyIntToFP with the type of the fptrunc instruction passed in. Src in this case is the input of BO->getOperand(0), and BO is the 0th operand of the fptrunc instruction. Remember the LLVM IR from earlier:
%0 = urem i64 %mul, 74383
%conv2 = uitofp nneg i64 %0 to double
%div3 = fdiv double %conv2, 7.438300e+04
%conv4 = fptrunc double %div3 to float
The fptrunc converts its input into a float, so the type Ty is f32. BO is %div3, BO->getOperand(0) is %conv2, and m_IToFP(m_Value(Src)) puts the input of BO->getOperand(0) into Src. The table below shows the mapping of the visitFPTrunc code to the LLVM IR variables.
| visitFPTrunc Variables | LLVM IR Variables | Value |
|---|---|---|
| `BO` | `%div3` | `fdiv double %conv2, 7.438300e+04` |
| `BO->getOperand(0)` | `%conv2` | `uitofp nneg i64 %0 to double` |
| `BO->getOperand(1)` | `7.438300e+04` | double constant |
| `Src` | `%0` | `urem i64 %mul, 74383` |
Result
Did this work? By running the command below we can see the LLVM IR after my patch.
$LLVM_BUILD_DIR/bin/clang -O3 \
--target=riscv64-unknown-linux-gnu \
-march=rv64gc_zba_zbb \
--sysroot=/usr/riscv64-linux-gnu \
-S -emit-llvm pi.c -o pi_fixed.ll
And this is the relevant LLVM IR.
%mul = mul nuw nsw i64 %ixran.053, 27611
%0 = urem i64 %mul, 74383
%1 = uitofp nneg i64 %0 to float
%conv4 = fdiv float %1, 7.438300e+04
It worked! 🥹
The fptrunc is gone, and the cast operation uitofp now converts it into a float.
Shown below are the results after my patch was merged.
![]()