My First LLVM PR
Honestly, it’s actually my third PR, but I think it’s worth writing a blog post about, and the title sounds better than ‘My Third LLVM PR’.
It’s also a revival of work by @huhu233, who implemented this same feature for the ARM backend. I also received a lot of help from the PR’s reviewer in getting the code up to LLVM’s standards.
What & Why LLVM?
It’s essentially a framework for building programming languages. It’s used by Clang (C/C++), Rust, and Zig (though I think Zig is migrating away). It’s also common in GPU computing software, like CUDA and HIP. It’s fair to say it’s quite commonplace in the compiler world.
At a bird’s-eye level, LLVM converts the front-end language (C++, Rust, CUDA, etc.) into LLVM IR (Intermediate Representation). LLVM IR is sort of its own language, but understanding it isn’t necessary to follow the work I’ve done, so I won’t try to explain it here. The magic of having an intermediate representation is that it allows for language-agnostic optimizations that benefit every front-end.
And the great thing about the LLVM community is that they’re quite friendly to newcomers. Most importantly, there’s a regular supply of issues labeled “good first issue” on llvm-project, which made it easy for me to start contributing.
Brief SelectionDAG Overview
I think this is a great resource for beginners to understand how SelectionDAG works. I’ll try my best to give a five-minute crash course below so you can get a better grasp of what my PR does.
graph LR
B[Frontend] -->|LLVM IR| D[opt]
D -->|LLVM IR| F[llc]
F --> G[Assembly]
H[SelectionDAG] -.-> F
From the diagram above, you can see that SelectionDAG is used when converting LLVM IR to the final assembly. It ensures the LLVM IR generated from opt is suitable for the target used. For example, SelectionDAG will perform Type Legalization, where it ensures the types used in the LLVM IR are actually supported on the target, such as an i24 being promoted to an i32 for X86. This is just one example, and the video linked before goes into much more detail on the other things SelectionDAG does to ensure legal instructions are generated for the target.
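To make the i24 example concrete, here’s a tiny Python sketch (not LLVM code; the function name is made up) of what integer promotion means: the add runs in the wider legal type, and the result is truncated back to the original 24-bit width.

```python
# Hypothetical model of type legalization promoting an illegal i24 add
# to a legal i32 add: compute in the wider type, truncate the result.
I24_MASK = (1 << 24) - 1  # keep only the low 24 bits

def add_i24_promoted(a: int, b: int) -> int:
    # The hardware add happens at 32 bits (or wider); the extra bits
    # are masked off to preserve i24 wraparound semantics.
    return (a + b) & I24_MASK

print(add_i24_promoted(0xFFFFFF, 1))  # i24 max + 1 wraps to 0
```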
SDValue
The SelectionDAG data structure is composed of nodes which represent the actual operations in the IR, such as ADD, LOAD, or STORE. SDValue represents the outputs and inputs of these nodes.
MVT
MVT is the union of types supported by the targets that use SelectionDAG. So an MVT could be an i32, f16, v2f32, and the list goes on. You may also encounter an EVT, which is the union of the MVT types and all integer, float, and vector types supported by LLVM IR.
FLDEXP & SCALEF
ldexp is a function in C++ which multiplies a floating-point number by an integral power of two.
It allows for precise manipulation of floating-point numbers by directly changing the exponent without touching the mantissa. This avoids rounding errors from repeated multiplication by 2 and ensures exact power-of-two scaling in binary floating-point.
import math
math.ldexp(0.75, 2) # 0.75 * 2^2 = 3.0
Mantissa & Exponent
A floating-point number represents a value as:
value = mantissa × 2^exponent
- Mantissa (significand): holds the significant digits
- Exponent: scales the mantissa by a power of 2
Example:
Decimal 6.5 → Binary 110.1₂ → 1.101 × 2²
Mantissa: 1.101 (stored bits: 101, after the implicit leading 1), Exponent: 2
Key point: Efficiently represents very large or very small numbers.
Functions like ldexp(x, n) scale the number precisely by powers of 2.
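Python’s math module makes this split concrete: frexp decomposes a float into mantissa and exponent, and ldexp rebuilds it. A quick sketch:

```python
import math

# frexp returns (m, e) with value == m * 2**e and m normalized to [0.5, 1).
m, e = math.frexp(6.5)
print(m, e)  # 0.8125 3, since 6.5 == 0.8125 * 2**3

# ldexp is the exact inverse: it only adjusts the exponent field,
# so the round trip loses nothing.
assert math.ldexp(m, e) == 6.5

# Scaling by a power of two is exact: the mantissa bits never change.
assert math.ldexp(0.75, 2) == 3.0
```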
Normally, when such a function gets lowered into assembly, it generates a lot of instructions since there is no direct mapping. However, recent X86 CPUs have an extension for SIMD vector operations called AVX-512. Within the AVX-512 extension is an instruction called VSCALEF that essentially performs ldexp on multiple floating-point values with a single instruction. As the name implies, the extension comes with 512-bit registers, allowing the CPU to operate on multiple 64/32-bit elements at once.
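A rough Python model of VSCALEF’s packed behavior (the function name is mine, and this ignores details like the real instruction taking a floating-point exponent and applying floor, plus rounding modes, NaN handling, and mask registers): one ldexp per vector lane.

```python
import math

# Hypothetical sketch of packed scale-by-power-of-two, one lane at a time.
# Hardware does all lanes in a single instruction.
def vscalefps(xs, exps):
    return [math.ldexp(x, int(e)) for x, e in zip(xs, exps)]

print(vscalefps([0.75, 1.5, -2.0, 0.5], [2, 1, 3, -1]))
# [3.0, 3.0, -16.0, 0.25]
```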
AVX512 Feature Subsets
AVX-512 includes a collection of feature subsets that a target can implement.
- AVX512F (Foundation): The base AVX-512 instruction set. Includes 512-bit operations for 32-bit and 64-bit floats/integers, mask registers (k0-k7), and basic operations like VSCALEFPS/VSCALEFPD for 512-bit vectors.
- AVX512VL (Vector Length Extensions): Enables AVX-512 instructions on smaller vector sizes (128-bit and 256-bit). Without VLX, AVX-512 instructions only work on 512-bit (ZMM) registers. With VLX, you can use AVX-512 features on XMM (128-bit) and YMM (256-bit) registers.
- AVX512FP16 (Half Precision): Adds native support for 16-bit floating-point operations, including VSCALEFPH/VSCALEFSH for half-precision SCALEF operations.
Without VLX, 128-bit and 256-bit vector operations must be widened to 512 bits before using VSCALEF, and without FP16, f16 elements must be extended to f32.
| Instruction | Packed/Scalar | Element Type | Widths Supported | Required Features |
|---|---|---|---|---|
| VSCALEFPD | Packed | FP64 | 128 / 256 / 512 | AVX512F, AVX512VL (for 128/256) |
| VSCALEFPH | Packed | FP16 | 128 / 256 / 512 | AVX512FP16, AVX512VL (for 128/256) |
| VSCALEFPS | Packed | FP32 | 128 / 256 / 512 | AVX512F, AVX512VL (for 128/256) |
| VSCALEFSD | Scalar | FP64 | 128 | AVX512F |
| VSCALEFSH | Scalar | FP16 | 128 | AVX512FP16 |
| VSCALEFSS | Scalar | FP32 | 128 | AVX512F |
LowerFLDEXP
Now to get onto what my PR actually does. On the X86 backend, when the target has the AVX-512 extension, ldexp should get lowered to VSCALEF. As mentioned before, AVX-512 adds 512-bit SIMD registers and vector instructions, but there are also subsets or extensions to AVX-512. The ones relevant to this PR are AVX-512VL and AVX-512FP16. With AVX-512VL, instructions can operate on 128-bit and 256-bit vectors instead of just 512-bit vectors, and AVX-512FP16 allows instructions to operate on 16-bit floating-point vectors. The table below displays the vector types this function will lower and their requirements.
| Types | Bit Width | Requirements / Notes | Instruction |
|---|---|---|---|
| v16f32 | 512-bit | Full 512-bit native vector; no extensions required | VSCALEFPS |
| v8f64 | 512-bit | Full 512-bit native vector; no extensions required | VSCALEFPD |
| v8f32 / v4f64 | 256-bit | Requires AVX512VL (VLX), or widening to 512-bit | VSCALEFPS / VSCALEFPD |
| v4f32 / v2f64 | 128-bit | Requires AVX512VL (VLX), or widening to 512-bit | VSCALEFPS / VSCALEFPD |
| v32f16 | 512-bit | Requires AVX512FP16 | VSCALEFPH |
| v16f16 / v8f16 | 128–256-bit | FP16 + VLX; or FP16 + widening to 512-bit (no VLX); or extend f16→f32 and i16→i32 + widening (no FP16 & no VLX) | VSCALEFPH (if FP16) |
| Scalar f64 | 64-bit | Insert into 128-bit vector | VSCALEFSD |
| Scalar f32 | 32-bit | Insert into 128-bit vector | VSCALEFSS |
| Scalar f16 | 16-bit | Requires AVX512FP16, or extend f16→f32 and use the scalar f32 path | VSCALEFSH (if FP16) |
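The widening fallback in that table can be sketched in Python (the function name is made up, and this models only the data movement, not the real DAG nodes): pad a short vector out to the 512-bit lane count, do the packed scale, then pull the original lanes back out.

```python
import math

# Hypothetical model of the no-VLX fallback for a v4f32 ldexp:
# widen both operands to 16 lanes (a 512-bit vector of f32),
# run the packed scale, then extract the low subvector.
def ldexp_v4f32_widened(xs, exps):
    pad = 16 - len(xs)
    wide_x = xs + [0.0] * pad        # stands in for widenSubVector
    wide_e = exps + [0] * pad
    wide_result = [math.ldexp(x, e) for x, e in zip(wide_x, wide_e)]
    return wide_result[:len(xs)]     # stands in for getExtractSubvector

print(ldexp_v4f32_widened([0.75, 1.0, 2.0, 3.0], [2, 1, 0, -1]))
# [3.0, 2.0, 2.0, 1.5]
```

The padding lanes are dead values; the hardware computes them too, but only the extracted low subvector survives, which mirrors what the real lowering does.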
How to call vscalef*
LLVM’s X86 backend uses SelectionDAG nodes to represent target-specific operations during instruction selection. X86ISD::SCALEF and X86ISD::SCALEFS are the X86 backend’s internal opcodes for the vscalef and vscalefs (scalar) instructions.
In the code below you will see occurrences of ISD::*, which represent generic DAG opcodes, such as ISD::ADD. X86ISD::* represents DAG nodes specific to the X86 backend.
Difficulties
The biggest hurdle I faced implementing this PR was correctly handling FP16 vectors (v8f16, v16f16) when the AVX512FP16 extension is not available.
The operation requires converting the integer exponent vector (vNi16) into a floating-point exponent vector (vNf16). If the FP16 instruction set isn’t supported, the target simply has no instruction to convert an i16 integer into an f16 floating-point number.
Honestly, the solution was pretty simple: instead of converting the vNi16 to a vNf16, I sign-extend the Exp vector to an i32 vector.
graph TD
subgraph "Before"
A[v8f16 / v16f16<br/>FP16 Vector] --> B[Extract i16 Exp]
B --> C[Convert i16 → fp16<br/>❌ NOT SUPPORTED]
C --> D[fp16 Exponent]
D --> E[ldexp operation]
end
graph TD
subgraph "After"
A[v8f16 / v16f16<br/>FP16 Vector] --> B[Widen to v8f32 / v16f32]
A --> D[Widen to i32 Exp]
B --> E[Widen v8f32 → v16f32]
D --> F[Widen i32 Exp → v16i32<br/>512-bit vector]
E --> G[ldexp operation<br/>✓ FP32 supported]
F --> G
end
As shown below, when the target has VLX and FP16 support, SCALEF can be called directly on v8f16/v16f16 vectors. Without VLX (but with FP16), LowerFLDEXP will widen the vectors to 512 bits (v32f16). Without either, I have to extend both X and Exp to 32 bits, then widen to 512 bits (v16f32).
case MVT::v8f16:
case MVT::v16f16:
if (Subtarget.hasFP16()) {
if (Subtarget.hasVLX()) {
Exp = DAG.getNode(ISD::SINT_TO_FP, DL, XTy, Exp);
return DAG.getNode(X86ISD::SCALEF, DL, XTy, X, Exp);
}
break;
}
X = DAG.getFPExtendOrRound(X, DL, XTy.changeVectorElementType(MVT::f32));
Exp = DAG.getSExtOrTrunc(Exp, DL,
X.getSimpleValueType().changeTypeToInteger());
  break;
For v32f16, no prep is needed if the target has FP16 support. Otherwise, I have to split the v32f16 vector into two v16f16 vectors, recursively calling LowerFLDEXP on each and concatenating the results.
case MVT::v32f16:
if (Subtarget.hasFP16()) {
Exp = DAG.getNode(ISD::SINT_TO_FP, DL, XTy, Exp);
return DAG.getNode(X86ISD::SCALEF, DL, XTy, X, Exp);
}
SDValue Low = DAG.getExtractSubvector(DL, MVT::v16f16, X, 0);
SDValue High = DAG.getExtractSubvector(DL, MVT::v16f16, X, 16);
SDValue ExpLow = DAG.getExtractSubvector(DL, MVT::v16i16, Exp, 0);
SDValue ExpHigh = DAG.getExtractSubvector(DL, MVT::v16i16, Exp, 16);
SDValue OpLow = DAG.getNode(ISD::FLDEXP, DL, MVT::v16f16, Low, ExpLow);
SDValue OpHigh = DAG.getNode(ISD::FLDEXP, DL, MVT::v16f16, High, ExpHigh);
SDValue ScaledLow = LowerFLDEXP(OpLow, Subtarget, DAG);
SDValue ScaledHigh = LowerFLDEXP(OpHigh, Subtarget, DAG);
return DAG.getNode(ISD::CONCAT_VECTORS, DL, MVT::v32f16, ScaledLow,
                     ScaledHigh);
As the PR’s reviewer pointed out, the splitting and concatenating can be reduced to the single helper call below.
return splitVectorOp(Op, DAG, DL);
The code below handles the cases where VLX is not supported, in which the vectors need to be widened to 512 bits.
SDValue WideX = widenSubVector(X, true, Subtarget, DAG, DL, 512);
SDValue WideExp = widenSubVector(Exp, true, Subtarget, DAG, DL, 512);
WideExp = DAG.getNode(ISD::SINT_TO_FP, DL, WideX.getSimpleValueType(),
                      WideExp);
SDValue Scalef =
DAG.getNode(X86ISD::SCALEF, DL, WideX.getValueType(), WideX, WideExp);
SDValue Final =
DAG.getExtractSubvector(DL, X.getSimpleValueType(), Scalef, 0);
return DAG.getFPExtendOrRound(Final, DL, XTy);
Conclusion
Overall, that was a pretty fun PR. I’m very grateful to my reviewer for his patience. It definitely pushed me to write code in a more optimal manner. It was also satisfying to find ways to reduce the number of lines I introduce, which is something I’ll keep focusing on as I continue contributing. Looking forward to my next PR.
Resources
- https://cdrdv2.intel.com/v1/dl/getContent/671200
- Intel’s manual, which contains documentation on the AVX-512 instructions, including VSCALEF
- https://llvm.org/doxygen/classllvm_1_1SelectionDAG.html