Retargeting LLVM to Stack Machines: A WebAssembly Backend Deep Dive

Introduction:

When retargeting compiler backends, especially in LLVM, we are usually discussing native chips like CPUs or GPUs. While the underlying architecture of each vendor differs—cores and memory models vary wildly—they share a high-level abstraction: they are register machines.

LLVM was primarily designed to target register machines. This makes sense, given that LLVM IR is an SSA (Static Single Assignment) IR, which models an infinite virtual register set.

However, stack machines are a different beast. While stack machines are rare in physical hardware today, they are the standard for virtual machine code (CLR/.NET, JVM, WASM). These sophisticated runtimes often have their own JIT compilers, but if the compiler targeting them generates unoptimized bytecode, performance and/or size may still suffer.

The Challenge:

You might ask: "If LLVM is fundamentally built for register machines, how do we target a stack-based architecture?"

The simple answer is: we don't fight the design; we go with the flow. (I will explain what that means later).

To understand this, we need to look at the standard LLVM backend pipeline. Most backends share these core passes:

Instruction Selection (Mandatory & Optimizing)
Instruction Scheduling (Optional but usually necessary)
Register Allocation (Mandatory & Optimizing)
Object File Emission (Fundamental)

The conflict begins at step one: Instruction Selection.

To select instructions, LLVM builds a Directed Acyclic Graph called the ISelDAG. In this graph:

Vertices represent operations (ADD, LOAD, STORE, etc.). These can be LLVM built-ins or target-specific operations.
Edges represent dependencies. These can be data dependencies (black), chain dependencies for ordering (blue), or glue dependencies to keep nodes together (red).

LLVM can generate a visual representation of the ISelDAG. Consider this simple C++ loop:

extern void do_work();

// This will require BLOCK, END_BLOCK and LOOP, END_LOOP markers.
void test_loop(int n) {
  for (int i = 0; i < n; i++) {
    do_work();
  }
}

When compiled to LLVM IR clang++ --target=wasm32 -S -emit-llvm -O3 -o debug_sample.ll debug_sample.cpp, it looks like this:

; ModuleID = 'debug_sample.cpp'
source_filename = "debug_sample.cpp"
target datalayout = "e-m:e-p:32:32-p10:8:8-p20:8:8-i64:64-i128:128-n32:64-S128-ni:1:10:20"
target triple = "wasm32"

; Function Attrs: mustprogress
define hidden void @_Z9test_loopi(i32 noundef %0) local_unnamed_addr #0 {
  %2 = icmp sgt i32 %0, 0
  br i1 %2, label %4, label %3

3:                                                ; preds = %4, %1
  ret void

4:                                                ; preds = %1, %4
  %5 = phi i32 [ %6, %4 ], [ 0, %1 ]
  tail call void @_Z7do_workv()
  %6 = add nuw nsw i32 %5, 1
  %7 = icmp eq i32 %6, %0
  br i1 %7, label %3, label %4, !llvm.loop !2
}

declare void @_Z7do_workv() local_unnamed_addr #1

Figure 1: Selection DAG of sample C++ code

And finally, here is the machine function (the output of instruction selection in LLVM):

bb.0 (%ir-block.1):
  successors: %bb.1(0x50000000), %bb.2(0x30000000); %bb.1(62.50%), %bb.2(37.50%)
  liveins: $arguments
  %3:i32 = CONST_I32 1, implicit-def dead $arguments
  %2:i32 = ARGUMENT_i32 0, implicit $arguments
  %4:i32 = LT_S_I32 %2:i32, killed %3:i32, implicit-def dead $arguments
  BR_IF %bb.2, killed %4:i32, implicit-def dead $arguments
  BR %bb.1, implicit-def dead $arguments

bb.1..preheader:
; predecessors: %bb.0
  successors: %bb.3(0x80000000); %bb.3(100.00%)

  BR %bb.3, implicit-def dead $arguments

bb.2 (%ir-block.3):
; predecessors: %bb.0, %bb.3

  RETURN implicit-def dead $arguments

bb.3 (%ir-block.4):
; predecessors: %bb.1, %bb.3
  successors: %bb.2(0x04000000), %bb.3(0x7c000000); %bb.2(3.12%), %bb.3(96.88%)

  %0:i32 = PHI %2:i32, %bb.1, %1:i32, %bb.3
  %5:i32 = CONST_I32 -1, implicit-def dead $arguments
  %1:i32 = ADD_I32 %0:i32, killed %5:i32, implicit-def dead $arguments
  CALL_PARAMS @_Z7do_workv, implicit-def dead $arguments, implicit $sp32, implicit $sp64
  CALL_RESULTS implicit-def dead $arguments, implicit $sp32, implicit $sp64
  BR_UNLESS %bb.2, %1:i32, implicit-def dead $arguments
  BR %bb.3, implicit-def dead $arguments

Here is the problem:

The Instruction Selection DAG assumes a model where every operation explicitly names its referenced operands (virtual registers, physical registers, etc.). The entire backend pipeline relies on this register-machine-like model to reason about data flow.

Stack machines, by contrast, have implicit operands. Instructions simply push or pop values from the evaluation stack. Because TableGen, when modeling the ISA, and the ISelDAG matcher expect explicit inputs and outputs (if any), the stack machine model does not fit naturally into LLVM's selection process.

The Two Flavors ISA:

ISA TableGen Definition:

LLVM WASM developers handled the misfit above elegantly, making WASM fit LLVM by introducing a two-flavored ISA:

Register Flavor: This is the flavor used throughout most of the instruction's life in the backend. Each WASM instruction is represented as if it were a register machine instruction, thus making it possible to use the instruction selector and other backend analyzers.
Stack Flavor: For every register-flavored instruction, there is a corresponding stack-flavored version. The backend exits the register form into the stack one very late in the backend pipeline or the code emission phase! (In LLVM words, we do this when lowering from MachineInstr to MCInst).

The backend handles both WASM32 (32-bit) and WASM64 (64-bit). Now, here is how the WASM instruction format is modeled using TableGen (you can check WebAssemblyInstrFormat.td):


class WebAssemblyInst<bits<32> inst, string asmstr, bit stack, bit is64>
  : StackRel, RegisterRel, Wasm64Rel, Instruction {
  bits<32> Inst = inst; // Instruction encoding.
  bit StackBased = stack;
  string BaseName = NAME;
  bit IsWasm64 = is64;
  string Wasm32Name = !subst("_A64", "_A32", NAME);
  let Namespace   = "WebAssembly";
  let Pattern     = [];
  let AsmString   = asmstr;
  bit IsCanonical = 0;
}

// Normal instructions. Default instantiation of a WebAssemblyInst.
class NI<dag oops, dag iops, list<dag> pattern, bit stack,
          string asmstr = "", bits<32> inst = -1, bit is64 = false>
    : WebAssemblyInst<inst, asmstr, stack, is64> {
  dag OutOperandList = oops;
  dag InOperandList  = iops;
  let Pattern        = pattern;
  let Defs           = [ARGUMENTS];
}

multiclass I<dag oops_r, dag iops_r, dag oops_s, dag iops_s,
             list<dag> pattern_r, string asmstr_r = "", string asmstr_s = "",
             bits<32> inst = -1, bit is64 = false> {
  let isCodeGenOnly = 1 in
  def "" : NI<oops_r, iops_r, pattern_r, false, asmstr_r, inst, is64>;
  let BaseName = NAME in
  def _S : NI<oops_s, iops_s, [], true, asmstr_s, inst, is64>;
}

// For instructions that have no register ops, so both sets are the same.
multiclass NRI<dag oops, dag iops, list<dag> pattern, string asmstr = "",
               bits<32> inst = -1> {
  defm "": I<oops, iops, oops, iops, pattern, asmstr, asmstr, inst>;
}

This is the root of the “magic spell,“ but let me explain it:

WebAssemblyInst: This acts as a template (think of it as a C++ class template); each instruction is an instantiation of it, whether it’s register or stack flavored.
NI: A default instantiation of WebAssemblyInst to avoid lots of boilerplate when modeling the real instructions later. All fields are overrides of LLVM built-in TableGen class Instruction.
- OutOperandList: Specifies the output operands (results) that this instruction defines/produces.
- InOperandList: Specifies the input operands that this instruction reads/uses.
- Patterns: Represent a list of selection patterns that, when the matcher matches against it, will be replaced with such an instruction.
- Defs: Represent the set of registers that the instruction defines. For example: %verg=Instr … //%vreg is said to be defined here. The ARGUMENTS inside the brackets is a special opaque virtual register used to preserve the ordering of stack frame codegen-only setup instructions (ARGUMENTS_<…>) relative to instructions below it. ARGUMENTS_<…> use/read the ARGUMENTS register while other instructions define it, creating a data dependency. e.g.: ;loading first 32bit int argument (represented by index 0) while implicitly using $arguments register %6:i32 = ARGUMENT_i32 0, implicit $arguments
I: This is a meta-building facility to create the instantiations of both register and stack-flavored instructions. The template arguments with _r will be used to instantiate the register-flavored instructions, and _s for the stack ones. Stack instructions are all postfixed with _S, while register ones will use the instruction name as-is. You can see that the fourth argument of the instantiation is set to false with def ““, setting the isStack bit to 0, and vice-versa for the stack one.
NRI: If an instruction doesn’t require any operand in either flavor, then the output and input DAG of both forms are the same! This class is mainly for readability and to avoid boilerplate by writing redundant DAGs.

Integer Arithmetic Instructions Example:

Let’s examine the definition of integer arithmetic instructions (you can find it under WebAssemblyInstrInteger.td).

....
multiclass BinaryInt<SDNode node, string name, bits<32> i32Inst,
                     bits<32> i64Inst> {
  defm _I32 : I<(outs I32:$dst), (ins I32:$lhs, I32:$rhs), (outs), (ins),
                [(set I32:$dst, (node I32:$lhs, I32:$rhs))],
                !strconcat("i32.", !strconcat(name, "\t$dst, $lhs, $rhs")),
                !strconcat("i32.", name), i32Inst>;
  defm _I64 : I<(outs I64:$dst), (ins I64:$lhs, I64:$rhs), (outs), (ins),
                [(set I64:$dst, (node I64:$lhs, I64:$rhs))],
                !strconcat("i64.", !strconcat(name, "\t$dst, $lhs, $rhs")),
                !strconcat("i64.", name), i64Inst>;
}
.....

let isCommutable = 1 in
defm ADD : BinaryInt<add, "add ", 0x6a, 0x7c>;
defm SUB : BinaryInt<sub, "sub ", 0x6b, 0x7d>;

Here BinaryInt defines a 32-bit and 64-bit variant of both register and stack-flavored binary arithmetic instructions. Let me explain what is going on here briefly (if you know TableGen already, you can skip this part).

_I32 and _I64: The I here refers to integers and the 32 to 32-bit integers (_I64 follows the same pattern). The resulting instruction of such a definition is INSTRNAME_I32 and INSTRNAME_I64, so for the ADD instruction, you will get ADD_I32 and ADD_I64 also ADD_I32_S and ADD_I64_S
Register Form: The out DAG is a single I32 operand (represented by $dst), and the input DAG is two I32 operands (represented by $lhs and $rhs). This is represented by: (outs I32:$dst), (ins I32:$lhs, I32:$rhs).
Stack Form: There are no output or input DAGs here because ADD in WASM implicitly pops the last two operands from the stack and uses them implicitly (typical for a stack machine). This is represented by: (outs), (ins).
Patterns: Here we have a single inlined pattern. The pattern is simple, and in plain English, it says: “Hey instruction selector, if you matched against SDNode (aka operation) of kind node that has an edge defining a virtual register of type I32 and two edges using two operands of type I32, then select this binary arithmetic instruction“.
- This is represented by: [(set I32:$dst, (node I32:$lhs, I32:$rhs))].
- Example: Check Figure 1's graph, specifically the ADD node; such a subgraph will get selected as ADD_I32, for example.
Assembler And Disassembler Strings: This is for the assembly parser and disassembler to know what the string representation of the instruction is in both flavors.
- This is represented by: !strconcat("i32.", !strconcat(name, "\t$dst, $lhs, $rhs")), !strconcat("i32.", name),
- Example: For the WASM add instruction, if the assembler or disassembler is asked for the human-readable text of the instruction, you will get this, for example: Register Form: i32.add %vreg0,%vreg1,42 Stack Form: i32.add
Parameters i32Inst And i64Inst: Those are WASM instruction-encoded binary opcodes.
- Example: For instruction ADD it’s represented by 0x6a and 0x7c, respectively.

Opaque Registers And Register Units:

The WASM backend models registers and register units. There are a couple of them, mainly used to preserve ordering. Some model WASM stack manipulation globals like __stack_pointer, and some are used just for type tagging in instruction selection.

Here is how registers are modeled in the WASM backend (comments below are part of the source code):

class WebAssemblyReg<string n> : Register<n> {
  let Namespace = "WebAssembly";
}

class WebAssemblyRegClass<list<ValueType> regTypes, int alignment, dag regList>
     : RegisterClass<"WebAssembly", regTypes, alignment, regList>;

//===----------------------------------------------------------------------===//
// Registers
//===----------------------------------------------------------------------===//

// Special registers used as the frame and stack pointer.
//
// WebAssembly may someday supports mixed 32-bit and 64-bit heaps in the same
// application, which requires separate width FP and SP.
def FP32 : WebAssemblyReg<"%FP32">;
def FP64 : WebAssemblyReg<"%FP64">;
def SP32 : WebAssemblyReg<"%SP32">;
def SP64 : WebAssemblyReg<"%SP64">;

// The register allocation framework requires register classes have at least
// one register, so we define a few for the integer / floating point register
// classes since we otherwise don't need a physical register in those classes.
// These are also used a "types" in the generated assembly matcher.
def I32_0 : WebAssemblyReg<"%i32.0">;
def I64_0 : WebAssemblyReg<"%i64.0">;
def F32_0 : WebAssemblyReg<"%f32.0">;
def F64_0 : WebAssemblyReg<"%f64.0">;

def V128_0: WebAssemblyReg<"%v128">;
//more registers for external functions and symbols reference 

/ The value stack "register". This is an opaque entity which serves to order
// uses and defs that must remain in LIFO order.
def VALUE_STACK : WebAssemblyReg<"STACK">;

// The incoming arguments "register". This is an opaque entity which serves to
// order the ARGUMENT instructions that are emulating live-in registers and
// must not be scheduled below other instructions.
def ARGUMENTS : WebAssemblyReg<"ARGUMENTS">;

//===----------------------------------------------------------------------===//
//  Register classes
//===----------------------------------------------------------------------===//

def I32 : WebAssemblyRegClass<[i32], 32, (add FP32, SP32, I32_0)>;
def I64 : WebAssemblyRegClass<[i64], 64, (add FP64, SP64, I64_0)>;
def F32 : WebAssemblyRegClass<[f32], 32, (add F32_0)>;
def F64 : WebAssemblyRegClass<[f64], 64, (add F64_0)>;
def V128 : WebAssemblyRegClass<[v8f16, v4f32, v2f64, v2i64, v4i32, v16i8,
                                v8i16],
                               128, (add V128_0)>;

//more register classes for the external ref ones mentioned before

In LLVM, each target must at least have one register unit, and each register unit must have at least one register. The WASM backend satisfies this by creating some units with dummy registers like I32_0 or physical registers like SP32/SP64. Needless to say, WASM doesn’t have any (real) physical registers. We create SP and FP mainly because WASM uses __stack_pointer when a stack slot memory location/address is required (pointer or reference to a local) or when allocating space on the stack (e.g., char arr[1024]), among a number of other use cases.

In unoptimized builds, LLVM will generate __stack_pointer and use it to allocate stack memory in WASM linear memory and use i32-based memory instructions to load or store data instead of using local-based memory instructions, which is more expensive. Additionally, __stack_pointer will get fetched before loading or writing to the stack; however, it makes codegen simpler since accessing stack slots is similar to conventional architectures like X86 or ARM, where you simply use stack pointer or base pointer registers to access stack slots (variables, constants, etc.).

Registers:

Let's go through the important registers:

SP: I have explained this already; this is meant to map to the WASM __stack_pointer global, and it’s not always generated. It depends on what the function is doing and if the build is optimized or not.
FP: Frame/Base pointer. This gets created just like any other local variable, but it will get used for relative memory access to stack slot N. Like SP, it’s not always generated; in fact, the backend generates it when SP needs to be used dynamically (C VLAs, for example) and the compiler needs to keep a reference to the old SP value when setting up the frame (same usage as usual native CPU architecture base pointer registers, e.g., RBP in X64).
VALUE_STACK: This register's sole purpose is to maintain LIFO order between instructions in trees (a tree here is an instruction that defines a virtual register, and all uses of it come just after it, forming a single-use tree or list of instructions). In such a case, we strictly don’t want any instruction reordering to happen, avoiding breaking such a usage pattern. (This register is implicitly defined and used when running a pass called ‘Register Stackifying,’ which is a size optimization that reorders instructions, putting virtual register defs and uses in an order where the tree forms a ‘single use.’ When a register fits such a form, you no longer need to load it explicitly; instead, it will be pushed and popped to the evaluation stack and used as such—aka, it will be an unnamed value).
ARGUMENTS: When passing arguments to a function and when doing formal arguments lowering, all used or alive function parameters will be loaded using a high-level codegen-only operation/instruction called ARGUMENT. All functions initially start with N number of ARGUMENT instructions (a pass called wasm-argument-move guarantees this as it runs directly after ISelDAG scheduler or linearizer), where N is the number of function parameters. Instruction ARGUMENT implicitly reads or uses ARGUMENTS, and all other instructions define it implicitly! Why is that, you may be asking? Simply so no pass that reorders instructions will move any ARGUMENT instruction below other instructions. This is to avoid having any instruction later that may reference arguments through their index instead of the virtual register, where a load afterward will violate program semantics.
I32_0 & I64_0: Placeholder integer registers just to get added to register classes (like I32 and I64) because it’s mandatory. Of course, SP32/64 and FP32/F64 are special registers and cannot be used for general-purpose usage.
F32_0 & F64_0: Same story as the integer ones but for floating-points.

Register Classes:

In LLVM, register classes refer to different register files. For example, in X86, the general-purpose registers like %EAX or %EDI are part of the GPR (general purpose) register files, and %ymm1, %ymm2, etc., belong to AVX register files, and so on. It’s no different here in WASM, but again, those are opaque entities unlike ARM, X86, etc. We use the register classes mainly for typing, so the instruction selector and later passes are aware that an operation against 32-bit operands is different than 64-bit operands (recheck the patterns above from Integer Arithmetic Instructions Example). I guess it’s safe to say the register classes are self-explanatory! Each class here passes the following to WebAssemblyRegClass:

The width of the register: It’s in bits, so 32 bits for 4 bytes like I32.
The alignment: It’s for register allocator stack spilling alignment (unused in the case of WASM because LLVM physical register allocators like Greedy, PBQP, etc., are not used).
Register set: The set of registers that belongs to this class.

CFG Stackification:

If you are here just to check how you can model your stack machine backend to retarget LLVM, you can jump directly to “Exiting To Stack Form”. The rest of the paragraphs focus on WASM specifics and implementation details. If your VM has structured control flow, then I recommend checking this paragraph.

In conventional architectures like X86 or ARM and code VMs like the CLR, control flow is unstructured, meaning you can jump to addresses/labels in a function with minor to no restrictions. On the other hand, WASM has structured control flow, meaning that jumps are restricted to certain control flow structures like if/else or loops; you cannot just do a goto/jmp to an arbitrary label anywhere in your function.

The Problem:

LLVM models control flow as a cyclic graph (nodes and edges). WebAssembly models control flow effectively as a tree (parent nodes containing child nodes). Mapping a graph with cycles and cross-edges onto a strict tree structure requires complex reconstruction.

You can think of LLVM CFGs as spaghetti and WASM CFGs as Lasagna ! (I am genuinely hungry right now 😄😋)

Figure 2: LLVM CFG VS WASM CFG

The control flow between the two models is quite different (but of course, semantically it’s the same). In WASM, the control flow needs to jump either to the end of the block and fall through to the next block or jump to the start of the block (think of them as break and continue in C, C++, or C#), to avoid drowning this post, I would recommend checking out what block and loop instructions in WASM do to have an idea why the CFG looks how it is.

So how do we go from LLVM unstructured CFG to WASM structured CFG!? There are number of passes involved mainly here. I will go through them briefly, except for the CFG stackifier aka wasm-cfg-stackify, which I will explain its code and how it works in detail:

wasm-fix-irreducible-control-flow: This pass simply transforms irreducible CFG into a reducible one so loop body is dominated by it’s header.
wasm-cfg-sort: It will topologically sort CFG basic blocks, ignoring backward edges. This will help in linearizing the CFG, making it fit for inserting the structured control flow markers or instructions. Reordering nodes here happens to CFG basic block numbers, but it doesn’t mutate the CFG itself.
wasm-cfg-stackify: The pass that will insert LOOP\END_LOOP, BLOCK\END_BLOCK, and exception handling markers like TRY, TRY_TABLE in the right basic blocks in the CFG so that the CFG really models WASM control flow and has the right instructions in place.

The Core Stackification:

Before starting, when I say machine function, I am referring to its control flow representation aka the CFG, and vice versa unless specified otherwise.

To navigate through the CFG stackification algorithm, I will show how the machine function looks before CFG stackification and after it. The code below is the corresponding code of the C++ loop sample above.

Before:

# *** IR Dump Before WebAssembly CFG Stackify (wasm-cfg-stackify) ***:
# Machine code for function _Z9test_loopi: NoPHIs, TiedOpsRewritten
Function Live Ins: $arguments, $value_stack

bb.0 (%ir-block.1):
  successors: %bb.1(0x50000000), %bb.3(0x30000000); %bb.1(62.50%), %bb.3(37.50%)

  %6:i32 = ARGUMENT_i32 0, implicit $arguments
  %3:i32 = CONST_I32 1, implicit-def dead $arguments, implicit-def $value_stack, implicit $value_stack
  %4:i32 = LT_S_I32 %6:i32, %3:i32, implicit-def dead $arguments, implicit-def $value_stack, implicit $value_stack
  BR_IF %bb.3, %4:i32, implicit-def $arguments, implicit-def $value_stack, implicit $value_stack

bb.1:
; predecessors: %bb.0
  successors: %bb.2(0x80000000); %bb.2(100.00%)


bb.2 (%ir-block.4):
; predecessors: %bb.2, %bb.1
  successors: %bb.3(0x04000000), %bb.2(0x7c000000); %bb.3(3.12%), %bb.2(96.88%)

  CALL @_Z7do_workv, implicit-def dead $arguments, implicit $sp32, implicit $sp64, implicit-def dead $arguments, implicit $sp32, implicit $sp64
  %9:i32 = CONST_I32 -1, implicit-def dead $arguments, implicit-def $value_stack, implicit $value_stack
  %8:i32 = ADD_I32 %6:i32, %9:i32, implicit-def dead $arguments, implicit-def $value_stack, implicit $value_stack
  %7:i32, %6:i32 = TEE_I32 %8:i32, implicit-def $arguments, implicit-def $value_stack, implicit $value_stack
  BR_IF %bb.2, %7:i32, implicit-def $arguments, implicit-def $value_stack, implicit $value_stack

bb.3 (%ir-block.3):
; predecessors: %bb.0, %bb.2

  RETURN implicit-def dead $arguments

# End machine code for function _Z9test_loopi.

After:

# *** IR Dump After WebAssembly CFG Stackify (wasm-cfg-stackify) ***:
# Machine code for function _Z9test_loopi: NoPHIs, TiedOpsRewritten
Function Live Ins: $arguments, $value_stack

bb.0 (%ir-block.1):
  successors: %bb.1(0x50000000), %bb.3(0x30000000); %bb.1(62.50%), %bb.3(37.50%)

  %6:i32 = ARGUMENT_i32 0, implicit $arguments
  BLOCK 64, implicit-def $value_stack, implicit $value_stack
  %3:i32 = CONST_I32 1, implicit-def dead $arguments, implicit-def $value_stack, implicit $value_stack
  %4:i32 = LT_S_I32 %6:i32, %3:i32, implicit-def dead $arguments, implicit-def $value_stack, implicit $value_stack
  BR_IF 0, %4:i32, implicit-def $arguments, implicit-def $value_stack, implicit $value_stack

bb.1:
; predecessors: %bb.0
  successors: %bb.2(0x80000000); %bb.2(100.00%)


bb.2 (%ir-block.4):
; predecessors: %bb.2, %bb.1
  successors: %bb.3(0x04000000), %bb.2(0x7c000000); %bb.3(3.12%), %bb.2(96.88%)

  LOOP 64, implicit-def $value_stack, implicit $value_stack
  CALL @_Z7do_workv, implicit-def dead $arguments, implicit $sp32, implicit $sp64, implicit-def dead $arguments, implicit $sp32, implicit $sp64
  %9:i32 = CONST_I32 -1, implicit-def dead $arguments, implicit-def $value_stack, implicit $value_stack
  %8:i32 = ADD_I32 %6:i32, %9:i32, implicit-def dead $arguments, implicit-def $value_stack, implicit $value_stack
  %7:i32, %6:i32 = TEE_I32 %8:i32, implicit-def $arguments, implicit-def $value_stack, implicit $value_stack
  BR_IF 0, %7:i32, implicit-def $arguments, implicit-def $value_stack, implicit $value_stack

bb.3 (%ir-block.3):
; predecessors: %bb.0, %bb.2

  END_LOOP implicit-def $value_stack, implicit $value_stack
  END_BLOCK implicit-def $value_stack, implicit $value_stack
  RETURN implicit-def dead $arguments
  END_FUNCTION implicit-def $value_stack, implicit $value_stack

# End machine code for function _Z9test_loopi.

Notice the loop and block markers (LOOP <-> END_LOOP and BLOCK <-> END_BLOCK) that got inserted after stackification. Here is how it happens:

Under WebAssemblyCFGStackify, there is one function in particular called placeMarker which will insert the BLOCK/LOOP/TRY/TRY_TABLE markers. Here is how placeMarker looks:

void WebAssemblyCFGStackify::placeMarkers(MachineFunction &MF) {
//..... 
  for (auto &MBB : MF)
    placeLoopMarker(MBB);

  const MCAsmInfo *MCAI = MF.getTarget().getMCAsmInfo();
  for (auto &MBB : MF) {
    if (MBB.isEHPad()) {
      // Place the TRY/TRY_TABLE for MBB if MBB is the EH pad of an exception.
      if (MCAI->getExceptionHandlingType() == ExceptionHandling::Wasm &&
          MF.getFunction().hasPersonalityFn()) {
        if (WebAssembly::WasmEnableExnref)
          placeTryTableMarker(MBB);
        else
          placeTryMarker(MBB);
      }
    } else {
      // Place the BLOCK for MBB if MBB is branched to from above.
      placeBlockMarker(MBB);
    }
  }


//..........

As you can see, the first loop inserts loop markers first, and the second loop inserts EH and BLOCK markers. I will not talk about EH markers, but only LOOP and BLOCK.

placeLoopMarker:

We can break down the placeLoopMarker logic into distinct phases, using our running example of the C++ loop where bb.2 is the loop body.

Phase 1: Identifying the Loop and Bottom

First, we identify the loop header. The pass uses LLVM loop analyzers to extract loops from the CFG (which internally uses SCCA—Strongly Connected Components Analysis). This is represented by the call MLI.getLoopFor(&MBB). In our example code, this block would be bb.2.

Next, it checks for the proper block to insert LOOP_END by calling SRI.getBottom(Loop). This function visits the extracted loop starting from its header to find the bottom—the last node in the loop ordered after all others.

In our case, the bottom is the same block bb.2 because it is the only block in the loop.
The code auto Iter = std::next(Bottom->getIterator()); gets the next node (in our example, bb.3). If this were null, a new "appendix" block would be inserted. In any case, AfterLoop points to this block.

void WebAssemblyCFGStackify::placeLoopMarker(MachineBasicBlock &MBB) {
  MachineLoop *Loop = MLI.getLoopFor(&MBB);
  if (!Loop || Loop->getHeader() != &MBB)
    return;

  // The operand of a LOOP is the first block after the loop. If the loop is the
  // bottom of the function, insert a dummy block at the end.
  MachineBasicBlock *Bottom = SRI.getBottom(Loop);
  auto Iter = std::next(Bottom->getIterator());
  if (Iter == MF.end()) {
    getAppendixBlock(MF);
    Iter = std::next(Bottom->getIterator());
  }
  MachineBasicBlock *AfterLoop = &*Iter;

Phase 2: Determining the Insertion Point (Header)

Before inserting the LOOP marker, the pass must ensure the insertion position is correct and that it doesn’t interleave with the ends of other loops (END_LOOP).

The BeforeSet is used to register all loop end markers (END_LOOP) in the current block. We then use this set to find the earliest valid location to insert the new LOOP marker. In other words, the new marker should be placed after any END_LOOP marker in the same basic block to handle nesting correctly. (The AfterSet here is primarily for verification during development).

// Decide where in Header to put the LOOP.
  SmallPtrSet<const MachineInstr *, 4> BeforeSet;
  SmallPtrSet<const MachineInstr *, 4> AfterSet;
  for (const auto &MI : MBB) {
    // LOOP marker should be after any existing loop that ends here. Otherwise
    // we assume the instruction belongs to the loop.
    if (MI.getOpcode() == WebAssembly::END_LOOP)
      BeforeSet.insert(&MI);
  }

Phase 3: Inserting the Loop Marker

Since there are no other loops in our sample, the LOOP marker will be inserted at the beginning of the basic block (bb.2).

This happens because BeforeSet is empty. The call getEarliestInsertPos(&MBB, BeforeSet, AfterSet); visits instructions in the basic block bottom-up. If an instruction is not in the set, the iterator decrements (moves one instruction up). If it finds a match, it stops. Since no match is found, it returns the start of the block.

Viola! The pass can now insert the LOOP marker using BuildMI.

auto InsertPos = getEarliestInsertPos(&MBB, BeforeSet, AfterSet);
  MachineInstr *Begin = BuildMI(MBB, InsertPos, MBB.findDebugLoc(InsertPos),
                                TII.get(WebAssembly::LOOP))
                            .addImm(int64_t(WebAssembly::BlockType::Void));

Phase 4: Closing the Loop

Inserting the END_LOOP follows the same pattern. We need to place it at the start of the AfterLoop block (which is bb.3 in our example).

We call getEarliestInsertPos(AfterLoop, BeforeSet, AfterSet); to find the insertion point. (The reason for using getEarliestInsertPos rather than a simple AfterLoop->begin() is to support development-time verification).

// Decide where in MBB to put the END_LOOP.
  BeforeSet.clear();
  AfterSet.clear();

  // Mark the end of the loop (using arbitrary debug location that branched to
  // the loop end as its location).
  InsertPos = getEarliestInsertPos(AfterLoop, BeforeSet, AfterSet);
  DebugLoc EndDL = AfterLoop->pred_empty()
                         ? DebugLoc()
                         : (*AfterLoop->pred_rbegin())->findBranchDebugLoc();
  MachineInstr *End =
      BuildMI(*AfterLoop, InsertPos, EndDL, TII.get(WebAssembly::END_LOOP));

  registerScope(Begin, End);
  updateScopeTops(&MBB, AfterLoop);
}

placeBlockMarker:

While placeLoopMarker deals with natural loops, placeBlockMarker handles control flow joins (like if statements or switch cases). The logic here is more complex because BLOCK scopes are synthesized based on the graph's dominance properties.

Just like placeLoopMarker, we can break the implementation into distinct phases: Header Search, Scope Adjustment, Dependency Analysis, and Insertion.

Phase 1: Identifying the Header

First, we need to determine where the BLOCK should start. The start of a BLOCK (the header) must be a "Nearest Common Dominator" for all paths reaching the current block (MBB).

We iterate through all predecessors. If a predecessor dominates the current block and explicitly branches to it, it is a candidate for the header.

In our example, bb.3 has two predecessors: bb.0 (the entry) and bb.2 (the loop back-edge).

bb.0 dominates bb.3.
bb.0 explicitly branches to bb.3. Therefore, bb.0 becomes the candidate Header.

void WebAssemblyCFGStackify::placeBlockMarker(MachineBasicBlock &MBB) {
  MachineFunction &MF = *MBB.getParent();
  MachineBasicBlock *Header = nullptr;
  bool IsBranchedTo = false;
  int MBBNumber = MBB.getNumber();

  // 1. Find the nearest common dominator of all predecessors
  for (MachineBasicBlock *Pred : MBB.predecessors()) {
    if (Pred->getNumber() < MBBNumber) {
      Header = Header ? MDT->findNearestCommonDominator(Header, Pred) : Pred;
      if (explicitlyBranchesTo(Pred, &MBB))
        IsBranchedTo = true;
    }
  }

  // If there is no valid header or no explicit branch, we don't need a BLOCK.
  if (!Header) return;
  if (!IsBranchedTo) return;

Phase 2: Scope Adjustment

Finding the dominator is not enough. The common dominator might be located inside a "more deeply nested context" (like a loop) that does not contain our target MBB.

If we blindly placed the BLOCK there, we would cross scope boundaries illegally. We must "walk up" the scope tree (ScopeTops) to find the nearest valid scope level that encloses both the header and our target block.

In our example, bb.0 is the function entry and isn't nested inside any other scope, so this loop simply confirms bb.0 is valid.

MachineBasicBlock *LayoutPred = MBB.getPrevNode();

  // 2. Walk out of nested scopes if necessary
  // If the nearest common dominator is inside a more deeply nested context,
  // walk out to the nearest scope which isn't more deeply nested.
  for (MachineFunction::iterator I(LayoutPred), E(Header); I != E; --I) {
    if (MachineBasicBlock *ScopeTop = ScopeTops[I->getNumber()]) {
      if (ScopeTop->getNumber() > Header->getNumber()) {
        // Skip over an intervening scope.
        I = std::next(ScopeTop->getIterator());
      } else {
        // We found a scope level at an appropriate depth.
        Header = ScopeTop;
        break;
      }
    }
  }

Phase 3: Constraint Analysis (The Header)

Now that we have the correct Header block, we must decide where inside that block to insert the BLOCK instruction. We cannot just insert it at the end; we must respect the nesting of other markers.

We populate two sets:

BeforeSet: Instructions that must stay before our new BLOCK.
AfterSet: Instructions that must be inside (after) our new BLOCK.

The logic checks for existing LOOP, BLOCK, and TRY markers to ensure we wrap them correctly; it’s similar in concept to the decision-making code in placeLoopMarkers.

// 3. Decide where in Header to put the BLOCK.
  SmallPtrSet<const MachineInstr *, 4> BeforeSet;
  SmallPtrSet<const MachineInstr *, 4> AfterSet;

  for (const auto &MI : *Header) {
    // CONSTRAINT: Loop Interaction
    // If there is a previously placed LOOP marker and the bottom block of the
    // loop is above MBB, it should be after the BLOCK.
    if (MI.getOpcode() == WebAssembly::LOOP) {
      auto *LoopBottom = BeginToEnd[&MI]->getParent()->getPrevNode();
      if (MBB.getNumber() > LoopBottom->getNumber())
        AfterSet.insert(&MI);
    }

    // CONSTRAINT: Block/Try Interaction
    // If a previous marker ends before our current BLOCK ends, 
    // it should be placed inside (after) this BLOCK marker.
    if (MI.getOpcode() == WebAssembly::BLOCK ||
        MI.getOpcode() == WebAssembly::TRY ||
        MI.getOpcode() == WebAssembly::TRY_TABLE) {
      if (BeginToEnd[&MI]->getParent()->getNumber() <= MBB.getNumber())
        AfterSet.insert(&MI);
    }

    // Terminators (Branches) should go after the BLOCK to be contained by it.
    if (MI.isTerminator())
      AfterSet.insert(&MI);
  }

Phase 4: Insertion

With the constraints calculated, we can safely insert the BLOCK marker in the header. Then, we repeat the constraint analysis for the END_BLOCK marker in the current basic block (MBB).

Note how the code checks EndToBegin maps to ensure that if we are closing a BLOCK, we don't accidentally split a LOOP that should contain it.

// 4. Insert the BLOCK marker
  WebAssembly::BlockType ReturnType = WebAssembly::BlockType::Void;
  auto InsertPos = getLatestInsertPos(Header, BeforeSet, AfterSet);
  MachineInstr *Begin =
      BuildMI(*Header, InsertPos, Header->findDebugLoc(InsertPos),
              TII.get(WebAssembly::BLOCK))
          .addImm(int64_t(ReturnType));

  // 5. Analyze and Insert the END_BLOCK marker
  // (We reuse the sets but clear them for the new context)
  BeforeSet.clear();
  AfterSet.clear();

  for (auto &MI : MBB) {
    // If there is a previously placed END_LOOP marker and the header of the
    // loop is above this block's header, the END_LOOP should be placed after
    // the END_BLOCK.
    if (MI.getOpcode() == WebAssembly::END_LOOP ||
        MI.getOpcode() == WebAssembly::END_TRY) {
      if (EndToBegin[&MI]->getParent()->getNumber() >= Header->getNumber())
        BeforeSet.insert(&MI);
    }
  }

  // Mark the end of the block.
  InsertPos = getEarliestInsertPos(&MBB, BeforeSet, AfterSet);
  MachineInstr *End = BuildMI(MBB, InsertPos, MBB.findPrevDebugLoc(InsertPos),
                              TII.get(WebAssembly::END_BLOCK));

  // Register the new scope in the backend maps
  registerScope(Begin, End);
}

Explicit Locals Pass:

I will go briefly over this , but the pass main job is to insert WASM local.get and local.set instructions in places where non-stackified virtual registers are defined or used.

stackified registers are virtual registers that represent WASM which you can deal with implicitly when push/pop the evaluation stack without loading or storing them from/to WASM ‘locals‘ memory

Here is an example of transformation:

Before:

# *** IR Dump Before WebAssembly Explicit Locals (wasm-explicit-locals) ***:
# Machine code for function _Z9test_loopi: NoPHIs, TiedOpsRewritten
Function Live Ins: $arguments, $value_stack

bb.0 (%ir-block.1):
  successors: %bb.1(0x50000000), %bb.3(0x30000000); %bb.1(62.50%), %bb.3(37.50%)

  %6:i32 = ARGUMENT_i32 0, implicit $arguments
  BLOCK 64, implicit-def $value_stack, implicit $value_stack
  %3:i32 = CONST_I32 1, implicit-def dead $arguments, implicit-def $value_stack, implicit $value_stack
  %4:i32 = LT_S_I32 %6:i32, %3:i32, implicit-def dead $arguments, implicit-def $value_stack, implicit $value_stack
  BR_IF 0, %4:i32, implicit-def $arguments, implicit-def $value_stack, implicit $value_stack

bb.1:
; predecessors: %bb.0
  successors: %bb.2(0x80000000); %bb.2(100.00%)


bb.2 (%ir-block.4):
; predecessors: %bb.2, %bb.1
  successors: %bb.3(0x04000000), %bb.2(0x7c000000); %bb.3(3.12%), %bb.2(96.88%)

  LOOP 64, implicit-def $value_stack, implicit $value_stack
  CALL @_Z7do_workv, implicit-def dead $arguments, implicit $sp32, implicit $sp64, implicit-def dead $arguments, implici
t $sp32, implicit $sp64
  %9:i32 = CONST_I32 -1, implicit-def dead $arguments, implicit-def $value_stack, implicit $value_stack
  %8:i32 = ADD_I32 %6:i32, %9:i32, implicit-def dead $arguments, implicit-def $value_stack, implicit $value_stack
  %7:i32, %6:i32 = TEE_I32 %8:i32, implicit-def $arguments, implicit-def $value_stack, implicit $value_stack
  BR_IF 0, %7:i32, implicit-def $arguments, implicit-def $value_stack, implicit $value_stack

bb.3 (%ir-block.3):
; predecessors: %bb.0, %bb.2

  END_LOOP implicit-def $value_stack, implicit $value_stack
  END_BLOCK implicit-def $value_stack, implicit $value_stack
  RETURN implicit-def dead $arguments
  END_FUNCTION implicit-def $value_stack, implicit $value_stack

# End machine code for function _Z9test_loopi.

After:

# *** IR Dump After WebAssembly Explicit Locals (wasm-explicit-locals) ***:
# Machine code for function _Z9test_loopi: NoPHIs, TiedOpsRewritten
Function Live Ins: $arguments, $value_stack

bb.0 (%ir-block.1):
  successors: %bb.1(0x50000000), %bb.3(0x30000000); %bb.1(62.50%), %bb.3(37.50%)

  BLOCK 64, implicit-def $value_stack, implicit $value_stack
  %10:i32 = LOCAL_GET_I32 0, implicit-def $arguments
  %3:i32 = CONST_I32 1, implicit-def dead $arguments, implicit-def $value_stack, implicit $value_stack
  %4:i32 = LT_S_I32 %10:i32, %3:i32, implicit-def dead $arguments, implicit-def $value_stack, implicit $value_stack
  BR_IF 0, %4:i32, implicit-def $arguments, implicit-def $value_stack, implicit $value_stack

bb.1:
; predecessors: %bb.0
  successors: %bb.2(0x80000000); %bb.2(100.00%)


bb.2 (%ir-block.4):
; predecessors: %bb.2, %bb.1
  successors: %bb.3(0x04000000), %bb.2(0x7c000000); %bb.3(3.12%), %bb.2(96.88%)

  LOOP 64, implicit-def $value_stack, implicit $value_stack
  CALL @_Z7do_workv, implicit-def dead $arguments, implicit $sp32, implicit $sp64, implicit-def dead $arguments, implici
t $sp32, implicit $sp64
  %11:i32 = LOCAL_GET_I32 0, implicit-def $arguments
  %9:i32 = CONST_I32 -1, implicit-def dead $arguments, implicit-def $value_stack, implicit $value_stack
  %8:i32 = ADD_I32 %11:i32, %9:i32, implicit-def dead $arguments, implicit-def $value_stack, implicit $value_stack
  %7:i32 = LOCAL_TEE_I32 0, %8:i32, implicit-def $arguments
  BR_IF 0, %7:i32, implicit-def $arguments, implicit-def $value_stack, implicit $value_stack

bb.3 (%ir-block.3):
; predecessors: %bb.0, %bb.2

  END_LOOP implicit-def $value_stack, implicit $value_stack
  END_BLOCK implicit-def $value_stack, implicit $value_stack
  RETURN implicit-def dead $arguments
  END_FUNCTION implicit-def $value_stack, implicit $value_stack

# End machine code for function _Z9test_loopi.

Notice how virtual register %6 at instruction %6:i32 = ARGUMENT_i32 0 ... was transformed to %10:i32 = LOCAL_GET_I32 0 .....

also you can see that %8:i32 = ADD_I32 %6:i32, %9:i32 is %8:i32 = ADD_I32 %11:i32, %9:i32 now, because in old code virtual register %6 was used by the instruction and %6 is not stackified register (as we described previously) so a memory load get inserted %11:i32 = LOCAL_GET_I32 0, implicit-def $arguments

Exiting To Stack Form:

The final pass where the backend will exit from register flavor to stack flavor , this is done as part of pass WebAssemblyMCInstLower.

Let’s check the code:

static void removeRegisterOperands(const MachineInstr *MI, MCInst &OutMI) {
.......
  // Transform to _S instruction.
  auto RegOpcode = OutMI.getOpcode();
  auto StackOpcode = WebAssembly::getStackOpcode(RegOpcode);
  assert(StackOpcode != -1 && "Failed to stackify instruction");
  OutMI.setOpcode(StackOpcode);

  // Remove register operands.
  for (auto I = OutMI.getNumOperands(); I; --I) {
    auto &MO = OutMI.getOperand(I - 1);
    if (MO.isReg()) {
      OutMI.erase(&MO);
    }
  }
}

function removeRegisterOperands is straightforward it takes WASM instruction in it’s register form (MI) and the low level instruction (OutMI) which will get transformed to stack form to be used later for code emission.

First it gets the stack form opcode by passing the register form opcode to a map defined through TableGen

//===----------------------------------------------------------------------===//
  // WebAssembly Register to Stack instruction mapping
  //===----------------------------------------------------------------------===//

  class StackRel;
  def getStackOpcode : InstrMapping {
    let FilterClass = "StackRel";
    let RowFields = ["BaseName"];
    let ColFields = ["StackBased"];
    let KeyCol = ["0"];
    let ValueCols = [["1"]];
  }

  //===----------------------------------------------------------------------===//
  // WebAssembly Stack to Register instruction mapping
  //===----------------------------------------------------------------------===//

  class RegisterRel;
  def getRegisterOpcode : InstrMapping {
    let FilterClass = "RegisterRel";
    let RowFields = ["BaseName"];
    let ColFields = ["StackBased"];
    let KeyCol = ["1"];
    let ValueCols = [["0"]];
  }

  //===----------------------------------------------------------------------===//
  // WebAssembly 32 to 64-bit instruction mapping
  //===----------------------------------------------------------------------===//

  class Wasm64Rel;
  def getWasm64Opcode : InstrMapping {
    let FilterClass = "Wasm64Rel";
    let RowFields = ["Wasm32Name"];
    let ColFields = ["IsWasm64"];
    let KeyCol = ["0"];
    let ValueCols = [["1"]];
  }

If you are new to LLVM then be aware that OutMI.getOpcode(); doesn’t return WASM ISA binary encoding opcode but it returns a one special to the target where each instruction (e.g. ARGUMENT_I32 , ADD_I64, etc...) opcode is represented by it’s order in the enum generated by TableGen for WASM backend and because register form ISA is different than that of stack form then we have two set of opcodes , the real instruction binary opcode is used later by the emitter MCCodeEmitter when emitting .wasm file

In the loop if will iterate through registers operands and remove them

Final Words:

If you have read the post till here then you are awesome, I know there is a lot to read and understand but for me if someone is curious about certain topic then at least give them a detailed answer that fulfills their curiosity and help them connect the pieces thoroughly.

Thank you & Cheers 🥂.

SSA to Stack: Retargeting LLVM to Stack Machines—A Deep Dive Through the WebAssembly Backend

Introduction:

The Challenge:

The Two Flavors ISA:

ISA TableGen Definition:

Integer Arithmetic Instructions Example:

Opaque Registers And Register Units:

Registers:

Register Classes:

CFG Stackification:

The Problem:

The Core Stackification:

Phase 1: Identifying the Loop and Bottom

Phase 2: Determining the Insertion Point (Header)

Phase 3: Inserting the Loop Marker

Phase 4: Closing the Loop

Phase 1: Identifying the Header

Phase 2: Scope Adjustment

Phase 3: Constraint Analysis (The Header)

Phase 4: Insertion

Explicit Locals Pass:

Exiting To Stack Form:

Final Words:

Comments

More from this blog

LLVM-Clang Static Analyzers And SMT Solvers

Implementing C# consteval compile time evaluation — A Deep Dive into Roslyn's Compiler Pipeline

Inside Clang C++ Compile Time Evaluators : AST Evaluator & Bytecode Interpreter

The Chords The Colors The Registers The SSA Optimality

Command Palette

Introduction:

The Challenge:

The Two Flavors ISA:

ISA TableGen Definition:

Integer Arithmetic Instructions Example:

Opaque Registers And Register Units:

Registers:

Register Classes:

CFG Stackification:

The Problem:

The Core Stackification:

Phase 1: Identifying the Loop and Bottom

Phase 2: Determining the Insertion Point (Header)

Phase 3: Inserting the Loop Marker

Phase 4: Closing the Loop

Phase 1: Identifying the Header

Phase 2: Scope Adjustment

Phase 3: Constraint Analysis (The Header)

Phase 4: Insertion

Explicit Locals Pass:

Exiting To Stack Form:

Final Words:

Comments

More from this blog