SSA Register Allocation

Introduction

In the compiler world, one of the hardest problems to solve is register allocation, and it's no coincidence! But what is register allocation? Register allocation is the problem of assigning ℕ variables, temporaries, etc... to a limited 𝐾 number of physical registers.

In 1970, W.H.E. Day from IBM published their paper Compiler assignment of data items to registers, which was the paper that identified register allocation as a value assignment problem and solved it as an ILP (Integer Linear Programming) problem, which is an NP-Complete problem. Not long after, in 1971, IBM researcher John Cocke (the father of RISC) found that register allocation can be modeled as a graph coloring problem (an NP-Complete problem), but a real-world or practical solution was not found yet. Then in 1981, another IBM researcher, Gregory Chaitin, and his colleagues (John Cocke, Martin Hopkins & Marc Auslander) released one of the most influential papers in CS history: Register allocation and spilling via graph coloring, giving us the Chaitin register allocator (a.k.a. the Chaitin algorithm), which introduced the quadratic register allocator with linear graph coloring/assignment heuristic that suits real-world compilers.

The Problem

As you can see from the introduction, register allocation is an NP-Complete problem, making it hard, and the result is only as good as the heuristic used. In some modern real-world compilers, a variant of the Chaitin algorithm is used, called the Chaitin-Briggs algorithm. The algorithm yields better code than the original algorithm, and one such example is GCC IRA (integrated register allocator), which uses an algorithm based on Chaitin-Briggs for register allocation. Anyway, Chaitin-Briggs algorithm is still solving an NP-Complete problem, with the only difference being that total heuristic efficiency is significantly better.

There are a number of other algorithms used for register allocation other than graph coloring which are NOT NP-Complete, like linear-scan, which is primarily used in JIT compilers (CLR, V8, JVM, etc.) due to how fast it is. Unlike general graph coloring, linear scan operates over a graph called an interval graph, more specifically a linear interval graph, which you can think of as a single-dimension or linear graph where variables are represented as lines/intervals. Theoretically, coloring an pre-ordered interval graph can be done in O(N), but generated code in most cases is worse than that of coloring-based allocators like the Chaitin algorithm, meaning that it spills to memory more often.

As we know, solving an NP-Complete problem optimally is widely believed to be impossible in polynomial time. Therefore, an optimal algorithm typically requires exponential time **(**e.g. 𝑂(2^𝑁) ) to find an optimal solution, as it’s the case with graph coloring register allocators, which is infeasible for real-world compilers. Now the problem originates from the fact that most graph coloring solvers treat the interference graph as a general graph! But in the real world, most compilers represent programs in SSA form. You may be asking, "how does representing a program in SSA form make a difference?" Actually, in 2006, a computer scientist called Sebastien Hack (the founder of libFirm compiler backend) and his colleagues released a revolutionary paper Optimal register allocation for SSA-form programs in polynomial time, which proves that an interference graph built from a strict SSA program is a chordal graph and can be colored optimally in polynomial time with complexity of 𝑂(𝑁+𝑀).

Interference Graph

In compilers, an interference graph is an undirected graph that represents variables that are alive at the same time, where each vertex represents a variable and each edge represents a mutual interval point between variables (i.e., both variables are alive at the same time).

Building the interference graph is primarily done using liveness analysis on the whole graph, but there are smarter, more efficient ways to do so when the program is represented in SSA form (Sparse Liveness & Dominators).

Before comparing SSA chordal interference graphs to general arbitrary interference graphs, how about a code sample to work on for better visualization and understanding of the problem:

void example() {
    int a = 0; 
    int b = 0;
    int c = 0; 
    int d = 0;

    while (...) { 
        // the initial value of 'a' is used for the last time here. 'c' is born.
        // They do NOT overlap    
        c = a + 1; 
        d = b + 4;
        // a is rewritten since it's mutable
        // last use of 'c' here so there is NO interference between 'a' and 'c'
        a = c + 5;
        b = d + 2;
    }
}

General

Here we'll talk about a program that is NOT in SSA form, so the original program above will remain as-is.

Figure 1: Non-chordal interference graph

[a] interferes with [b] and [d]:
If you look at the loop in the code, based on liveness analysis, both definitions of a and b from a = c + 5 and b = d + 2 are alive at the next loop iteration, so both a and b are alive at the start of the loop and accordingly we say that they interfere.

Also notice that at one point in the program's life, a from a = c + 5 is alive at the same time as d from d = b + 4, so a and d are also alive at the same time and they interfere. Notice how a and c do not interfere with each other because they are never alive at the same time—whenever a is used, c is also defined, so there is no point in the program where both variables are used at the same time by any instruction or instructions.
[c] interferes with [b] and [d]:
Same logic here. Because the definition of b from b = d + 2 is alive/used after where c is defined (at c = a + 1, exactly at d = b + 4), thus c interferes with b. Also, c from the same point c = a + 1 is alive at the same time as d from d = b + 4, since c's new definition was used later at a = c + 5; and d's new definition was used later at b = d + 2. There was a point where two variables were alive at the same time, so both c and d interfere.

Figure 2: Sequence diagram representing interval timeline of interference graph

A careful reader would notice something crucial from the analysis above: for interference graph computation, all we care about is which definition of a variable reaches which point, which is a property that can be deduced sparsely and easily when the program is represented in SSA form due to one of its core properties (single reaching definition), which I am going to demonstrate next.

SSA-Chordal

I am going to use LLVM IR and LLVM optimizer to get the SSA representation of the previous C++ program.
The output of running opt -passes=mem2reg:

define dso_local void @example() {
entry:
  br label %while.cond

while.cond:
  %b.0 = phi i32 [ 0, %entry ], [ %b.1, %while.body ]
  %a.0 = phi i32 [ 0, %entry ], [ %a.1, %while.body ]
  %cmp = icmp slt i32 ......
  br i1 %cmp, label %while.body, label %while.end

while.body:
  %c.1 = add nsw i32 %a.0, 1
  %d.1 = add nsw i32 %b.0, 4
  %a.1 = add nsw i32 %c.1, 5
  %b.1 = add nsw i32 %d.1, 2
  br label %while.cond

while.end:
  ret void
}

Before heading to explaining what is going on with the interference graph of the SSA code, let me first explain what a chordal graph is!

A chordal graph, also known as triangulated, is an undirected graph where any subgraph that forms cycles of 4 vertices or more has a chord. In other words, only cycles of three vertices (triangles) can exist! The chord in the graph is what connects non-adjacent nodes, forming a triangle.

The graph below is a chordal graph! Notice that there are no induced cycles longer than three (i.e., ≥ 4). The chords here are the edges between n5 and n4, n5 and n2.

Induced cycles are cycles where no two non-adjacent vertices have an edge between them (imagine an empty square as an example).

Notice how the graph looks like triangles connected to each other.

Figure 3: Arbitrary Chordal Graph

An interference graph built from a strict SSA program is always a chordal graph. If you look at the interference graph of the SSA code built from the C++ sample code above, you can see how the interference graph here is chordal since it's linear and doesn't have any cycles of 4 or more.

Figure 4: Interference graph of SSA code

The interference graph here is simple and explains itself, but one thing I want to clarify about Phi nodes/instructions to avoid any confusion! In the SSA code, there is the loop header which naturally contains phi nodes since data is flowing from code falling through to the loop header and from the loop latch (the back edge):

while.cond:
  %b.0 = phi i32 [ 0, %entry ], [ %b.1, %while.body ]
  %a.0 = phi i32 [ 0, %entry ], [ %a.1, %while.body ]

Some readers might ask: "The definition of %b.1 is alive at the same time as %b.0 (since one flows into the other), so shouldn't they interfere?"

This is where the magic of SSA Dominance comes in.

In SSA form, a value created by a Phi node represents a definition that dominates future usages until it hits its dominance frontier. In our example, %b.0 dominates the loop body. This means whenever b is used inside the loop (before re-assignment), the compiler uses %b.0 because it is the single reachable definition.

When b is reassigned in the loop (%b.1 = ...), this creates a totally new definition. From that point on, %b.1 becomes the dominating definition.

However, at the loop header, we have a merge point. We have two reaching definitions: 0 (from entry) and %b.1 (from the back-edge). The Phi node inserts a new definition, %b.0, to unify them. Because %b.0 is the new dominator for the loop, it effectively supersedes %b.1 at the loop entry. The live range of %b.1 effectively ends where %b.0 begins, preventing the kind of overlapping "interference cycles" we see in non-SSA graphs.

Figure 5: Sequence diagram representing interval timeline of SSA interference graph

Register Allocation And Graph Coloring

As mentioned in the introduction, register allocation is the problem of assigning variables or expressions ℕ to a limited number of registers 𝐾.
Note: from now on I am going to use RA as shortcut to register allocation

The definition above is the formal definition of the problem, but many compiler developers like to see or define it as "the problem of reducing spilling data to memory" since advanced register allocators are not just about assigning N to K but rather about optimizing or minimizing the cost of spilling and keeping hot data in registers with number of other parameters to consider

And what about graph coloring?
Graph coloring is the problem of assigning ℕ colors to 𝐾 vertices in graph 𝐺 where no two or more connected vertices have the same color ℕ.

If we map graph coloring to RA, you can see how RA just fits the problem:

Colors are equivalent to registers
Vertices are equal to variables and/or expressions
Vertices connected to each other cannot have the same color—i.e., they are alive at the same time and cannot be assigned to the same register

Canonical Stages of RA

RA operates over the interference graph and, as mentioned above in the quote, real-world register allocators don't just assign variables to registers—they go through multiple stages (either independently or iteratively) to achieve the best possible practical result (optimality is constrained by NP-Completeness for non-SSA/Chordal graphs).

The following stages are common to most RA architectures, regardless of whether they use graph coloring:

Build Interference Graph: The allocator will build the interference graph based on liveness analysis
Live Range Splitting: This one specifically is not really mandatory but rather a critical optimization for generating the best possible code. The allocator will split the live interval of a single variable into multiples, which gives the allocator the chance to allocate different registers to the same variable at multiple points in the program. For example, a variable called V can be split into v1 and v2 at different points of the program, allowing the allocator to assign different registers to each version.
Coalescing: The compiler will remove unnecessary register copies like mov R1,R2 or similar. For example, the compiler can check that a copy from R2 to R1 is redundant since R2 holds the same value as R1 throughout the program's life, and accordingly we can remove mov R1,R2 and replace all usages of R2 with R1.
There is multiple coalescing strategies like (Aggressive, Conservative, Iterative, Optimistic, Biased Coloring) which decides what/when/how to do the coalescing some like Biased Coloring don’t merge nodes even

Note: In order for Coalescing to work with the previous example, R2 must not interfere with R1—i.e., R2 must not be alive at the same time as R1, because if that's the case, then the definition of both registers is used by other instructions.
Spilling: is the act of loading and storing from/to memory instead of registers. This happens when the number of simultaneously alive variables is greater than the number of available registers.
Assignment: is when RA assigns register 𝐾 to variable ℕ

Mind you that some compilers like LLVM have more optimizations involved, like Dead Code Elimination and Rematerialization.

One register allocator that LLVM has is a custom priority queue based register allocator (Greedy Allocator) that operates over OUT-OF SSA MIR

Iterative RA

In such register allocators, the steps or stages above are part of a single pass, since in iterative RA the allocator may rebuild the graph after spilling and restart all over again, since a spilled node/variable is no longer colorable and accordingly we have a new interference graph.
One example of such RA is Chaitin-Briggs.

Pipelined RA

Formally, in academia there is no such classification as "pipelined RA," but the distinction is legitimate. Such register allocators execute the steps above one after another in a pipelined or standalone fashion. Such allocators don't backtrack like iterative RA—once a step is executed, it's done—but one step may iteratively and without backtracking try to preprocess the graph, like the Hack SSA allocator fixing Local Pressure Spike in the program. Allocators of such design or style operate over SSA programs, like the Hack SSA allocator.

Chaitin-Briggs Allocator

The de facto graph coloring register allocator. The allocator can work against arbitrary graphs (chordal or not) and has a heuristic of 𝑂(𝑁²), but in practice the allocator rarely hits the worst case. And as I have mentioned already, the allocator solves an NP-Complete problem even if the graph is chordal, since it's not aware of its properties—again, the allocator is a general allocator.

I will go through the steps of the allocator and show an example of how it works. In this example, we will be dealing with a CPU with two registers {R1,R2}:

int x = 1; 
int y = 2;
int t = 0;

while(true) {
    t = x + y;  // t uses x and y
    x = y;      // x is updated
    y = t;      // y is updated
}

Renumbering: A.k.a. liveness analysis—it will analyze and collect live intervals of variables and temporaries.
Build: Using liveness analysis info, the allocator will build the interference graph. In our example, all variables/nodes are connected to each other, forming a triangle.
Coalesce (Optimization): If there are any unnecessary copies like (x=y), nodes x and y in the interference graph get merged and the copy instruction will be eliminated. The allocator will do so to help optimize code performance by removing redundant or unnecessary copies and reducing register pressure.
In our example, there are no such copies.
Simplify: This step builds the coloring stack.
First it checks all nodes with degrees < K where K is the number of registers or colors. If a node's degree is less than K, then it will get pushed to the stack, which means that it's colorable because such a variable/node interferes with fewer variables than available registers. When a node gets pushed to the stack, it also gets removed from the graph, and accordingly the degree of nodes connected to that node will get reduced. The simplify step repeats this until the graph is empty and the stack is full.

In our example, we have two registers {R1,R2} so K = 2, and all variables/nodes (x, y, t) have degree of 2, which means K == node degree (2 == 2) for all nodes, so no node will get pushed to the stack and the graph remains as is.
Spill: In Chaitin-Briggs, spilling is optimistic, which means when a candidate for spilling is found, we push it to the coloring stack instead of spilling it directly, in hope that when coloring happens later there will be an available color/register to assign to such variable. Candidate priority is determined by the following formula:

$$\text{Priority} = \frac{\text{Spill Cost}}{\text{Degree}}$$

where spill cost depends on the depth of the loop that the variable is being used in, and candidates with lower priority will get spilled first.

In our example, all nodes are theoretically equal, so I am going to choose t. Because of that, variable t will get removed from the graph and pushed to the stack, and variables x and y will have degree 1 since t is no longer connected to x and y. Now the spiller has variables with degree less than K, so degree < K (1 < 2). Let's assume x will get pushed to the stack as being colorable, and now we are left with y with degree 0, so (0 < 2), and it will also get pushed to the stack. Now the graph is empty and the stack is full.

The stack (Top-Bottom) looks like: [y, x, t]
Select: A.k.a. coloring. Here the allocator will pop variables from the stack and assign registers or colors to them. In case of variables annotated as possible spill, either the allocator successfully colors them, or otherwise a spill needs to take place by inserting loads and stores in place of defs and uses of that variable.

In our example, we have stack [y, x, t(possible spill)]:
1. Pop (y): y interferes with x and t, but both neighbors have no colors, so we assign y to R1
2. Pop (x): x interferes with y and t, but here y is already assigned to register R1, so the allocator chooses another available color/register (R2)
3. Pop (t): t is adjacent to both x and y, and unfortunately the allocator doesn't have any further registers to assign to it, so a spill is needed—LOAD t and STORE t will get inserted for each use/def of t

If step Select spilled any variable (like t), then the whole allocation process starts all over again.

To avoid any sort of bias to any approach, a smart Chaitin-Briggs will involve a live range splitting step, which can reduce register pressure and accordingly improve coloring.

Hack SSA Based Allocator

Hack SSA allocator is a global register allocator that tries to solve register allocation for a special kind of graph (as you may have guessed): chordal graphs. Unlike the Chaitin-Briggs allocator, which handles any arbitrary graph, the power of the Hack SSA allocator comes from the fact that it utilizes chordal graph properties (the graph's chromatic number equals the maximum clique size) and SSA form properties (Single Reachable Dominance).
A real-world implementation of such a register allocator can be found under the libFirm project, which was founded by Sebastien Hack, the main author of Optimal register allocation for SSA-form programs in polynomial time.

So how does it work? I will use the same sample from Chaitin-Briggs and see how it looks in SSA form, then how the allocator works against it:

loop_header:

    x0 = phi(init_x, x1) 
    y0 = phi(init_y, y1)

    t0 = x0 + y0    // x0 dies here. t0 is born.
    x1 = y0         // y0 dies here. x1 is born.
    y1 = t0         // t0 dies here. y1 is born.

    jump loop_header // Back-edge carries x1, y1

Now, before starting with the main steps, I need to clarify some things about liveness analysis and interference graphs when dealing with SSA form programs:

Liveness: The allocator can deduce liveness of a variable from the dominance tree. The dominance tree is one of the core data structures for one of the most classic and foundational SSA build algorithms (Cytron algorithm), and even if you didn't use it for building an SSA program, it can easily be deduced from an SSA program. The crucial property of the dominator tree and dominance here is that when the allocator wants to check if variables 𝑨 and 𝐵 interfere, the allocator checks their definitions' dominance relationship (specifically, if Def(𝑨) dominates Def(𝐵) and Use(𝑨) is still live at Def(𝐵), then 𝑨 and 𝐵 interfere). These two lemmas are called Budimlić lemmas.
The answer to those two queries (classically) can be done using:
1. Dominance: by Dominator tree traversal
2. Reachability: by Iterative Dataflow analysis

But because the program is in SSA form, we can do Sparse Liveness Analysis and do Fast Liveness Analysis, which was released in the paper Fast Liveness Checking for SSA-Form Programs. I will not go through paper details, but to compute dominance and liveness the following preprocessing happens:

DFS (pre and post) Numbering: The compiler will give pre/enter and post/exit numbers for each node in depth-first fashion, so when the first query "Def(A) dominates Def(B)" needs to be answered, we use the formula below:

$$\text{pre}[A] \le \text{pre}[B] \quad \text{AND} \quad \text{post}[B] \le \text{post}[A]$$
Loop Forests: Detect loops in the CFG and mark such loops. This is important to know where the loops start (header) and flow back (latch) to know if variable 𝑨 is used in a loop and that it flows back to the header. If that's the case, then we say 𝑨 is alive for the whole loop. This helps answer "does Def(𝐵) reach Use(𝑨)" in loops easily.
Reduced Graph Reachability: The reduced graph will be the full CFG excluding back edges (a DAG basically):

$$G_{red} = \text{CFG} - \text{BackEdges}.$$

It answers "does Def(𝐵) reach Use(𝑨)" without considering back edges.

Numbering is used to answer the query "Def(𝑨) dominates Def(𝐵)" and (2 & 3) to answer "Use(𝑨) is still live at Def(𝐵)".

Interference Graph: With SSA, you don't really have to build an interference graph explicitly since it's implicit! Now you may have the following questions:
1. Q: Why liveness analysis in the first place?
  A: As mentioned early in this post, liveness analysis is mainly for checking interference and later building an interference graph. But unlike allocators like Chaitin-Briggs which builds a whole new graph, the liveness info from this pass is to answer future queries when Spilling and Coloring passes run, because at each instruction the allocator will need to ask which variables interfere and which are dead—both of which can be answered in O(1) due to the preprocessed data structures built before allocation (The Reduced Graph, Loop Forests, etc.).
2. Q: What do you mean by implicit?
  A: If you take the dominator tree of any SSA program, some node N has a definition of V which resembles the birth of V. Any later node that is dominated by N and uses V represents an extension of the interval. The subtree starting from N as root and all nodes that use V represent the interval of V. The same concept applies to any other variable like X. If you take the intersection of X's interval and V's interval, which are subtrees in the dominator tree, then the resulting graph is a chordal graph (will explain the proof later).
3. Q: The chordal properties of SSA are kind of clear, but how is the interference graph really computed here?
  A: By visiting the dominator tree in pre-order fashion, because when doing so we get PEO (perfect elimination ordering) of the graph, and because it's PEO the graph is chordal, and that graph is the interference graph.
  
  On a side note, if you want to compute the PEO of graph G to check that it's chordal (since by definition a graph is chordal if it has a PEO), the algorithm is called maximum cardinality search, which does exactly that.

Now the official steps:

Spilling: Because we know the SSA graph is optimally colorable (because chordal graphs' chromatic number equals maximum clique), the allocator can walk the CFG knowing that at each point/instruction, if the number of simultaneously alive variables is greater than chromatic number K, the allocator needs to spill. So it measures register pressure at each instruction to know if it should spill or not by building a MaxLive set (register pressure set) at each point in the program. In the Hack allocator, the spilling candidate is determined by heuristics (like furthest-next-use or Belady's MIN), which accordingly cause load and store instructions to be inserted.

In our example: We start with an empty MaxLive set (let's call it MLS) which represents register pressure, where the size of the set represents register pressure at any given point.
- at x0 = phi(init_x,x1), since only one variable was defined starting from this point, MLS={x0}, which also means that register pressure equals 1 here.
- at y0 = phi(init_y,y1), y0 is born here, so MLS={x0,y0}
- at t0 = x0 + y0, variable t0 is born and it's x0's last use, so t0 can take x0's place in the set: MLS={t0,y0}
- same logic applies to the rest of the instructions; here register pressure never exceeds the threshold of K=2 at any point
Coloring: Now that the graph is guaranteed to be K-colorable (thanks to the Spilling pass), this pass assigns registers/colors to each variable. It does so by walking the DOM tree in pre-order fashion, which yields a PEO! Now simply, the allocator, for each instruction, will assign each variable a register or color. The allocated register set starts full, and whenever a variable is defined, it gets assigned a color. The allocator will check if any of the current instruction's operands/variables end here so it frees a register. The set propagates in parent-child fashion obviously, and that's it. Also notice how the heuristic is linear 𝑂(|𝑉| + |𝐸|).

In our example, based on data built by the Spilling pass:
- at x0 = phi(init_x,x1) where MLS={x0}, the allocator will arbitrarily assign x0 to R1
- at y0 = phi(init_y,y1) where MLS={x0,y0}, the allocator will assign y0 to R2 since it's the last available register
- at t0 = x0 + y0 where MLS={t0,y0}, the allocator will assign t0 to R1 since it's x0's last use
- same logic applies to the rest of the program

Coalescing: It will remove any unnecessary copies like (x = y). This is often done by Biased Coloring during the Coloring pass instead of node merging before coloring (with biased coloring, x and y will be assigned the same color if and only if they have affinity/bias edge like phi or copy between them e.g. mov a b ). Coalescing is really hard with SSA programs if done the classical way since it may break chordality, and also coalescing is an NP-Complete problem even with SSA.

In our example, there's no coalescing needed.
SSA Destruction And Phi Elimination: Traditional allocators destroy SSA before allocation, which destroys the "chordal" property and makes coloring hard. Hack destroys SSA after allocation. Phi nodes either get replaced by parallel copies inserted at the end of phi node predecessor basic blocks, or get replaced by a swap/exchange instruction.

In our example, if you look at the colored graph above from the Coloring pass, you can see that:
- x0 is alive in R1
- x1 is alive in R2
- y0 is alive in R2
- y1 is alive in R1

For sake of simplicity, let's assume that init_x and init_y are literals. We can see that x0 has an incoming value from x1, so R1 depends on R2, meaning R1=R2. The opposite dependency is happening with y0 and y1, so R2=R1. What the allocator can do is one of two options:

The backend has a swap/exchange instruction (X86): the allocator will insert an XCHG instruction before jumping backward (at jump loop_header a.k.a. JMP loop_header in x86) when recognizing such a case, for example:

        SWAP R1 R2 
        JMP loop_header

2. Parallel Copy: swapping of registers will happen by spilling one of the registers to memory/stack and then swapping the values:

        STORE R1 -> [temp_stack]    // Save y1 (currently in R1)
        MOV R1, R2                  // Move x1 (from R2) into R1.
        LOAD R2 <- [temp_stack]     // Load y1 into R2.
        JUMP loop_header

the stack slot here is called a scratch some architectures like RISC-V has dedicated scratch registers for such case but X86 doesn’t.

Notice how because the program got out of SSA and is no longer chordal, all previous hypotheses are broken and the allocator had to spill to memory.

One can notice a couple of things:

In the Hack SSA sample, the chromatic number K is equal to 2 since the largest clique consists of two vertices at most (no vertex is connected to more than two other vertices).
Hack SSA allocator didn't spill (unless parallel copy with no scratch register was used) while doing spill checks and assigning registers, unlike Chaitin-Briggs. But here I want to clarify one thing so there is no bias! If Chaitin-Briggs ran against the same SSA graph, it would not have spilled, but this is not always the case! Hack SSA allocator is aware of different SSA and chordal graph properties, unlike Chaitin-Briggs which handles any graph as a general graph.
Hack SSA allocator solved the most challenging aspects of register allocation (spilling and reg assignment) in SSA (optimally) and left the handling of phi node destruction and lowering till the end which by then is a non-SSA, non-chordal graph—a small sacrifice relative to doing full register allocation out of SSA.

That being said it doesn’t mean the SSA based register allocators are always better than general ones ! actually let me briefly mention some of the challenges with SSA based register allocators

SSA RA’s sometimes tends to spill more due to late phi elimination.
Pre Coloring: ABI enforces certain register usage like caller/callee saved registers and function argument passing , so pre coloring is required for that which can break the chordal property because it introduces fixed nodes that cannot be trivially colored in PEO order.

Why does it work?

We have talked about how SSA interference graphs are chordal, but why is it like that? Sebastien Hack proved this in his paper, which I am going to talk about as a developer and in developer fashion, not mathematically (which isn't my strong suit).

The SSA Dominator Proof

Hack proved that the SSA interference graph is chordal by contradiction and used the empty square or diamond graph for that.

Let's assume that the SSA interference graph is not chordal and a square can be shaped:

For the sake of proof, let's create the dominator tree of CFG 𝑮, where A, B, C are basic blocks:

In Strict SSA form, the live range of a variable is contiguous in the dominator tree, so the live range of a variable 𝑋 is a subtree in the dominator tree. In order for the SSA interference graph to not be chordal, there must be a way to connect/intersect the live ranges like this: 𝑉₁→𝑉₂→𝑉₃→𝑉₄→𝑉₁ in the DOM tree in order for the SSA interference graph to be not chordal.

Let's reconstruct the square interference graph now with SSA dominance properties in mind. We start by adding 𝑉₁ and 𝑉₃ live ranges:

𝑉₁ → {A} : 𝑉₁ is only alive in basic block A
𝑉₃ → {C} : 𝑉₃ is only alive in basic block C

Now based on the original square interference graph, 𝑉₂ interferes with 𝑉₁ and 𝑉₃, which means that the live range of 𝑉₂ extends the whole DOM tree from A to C:

𝑉₂ → {A, C}

Again based on the original square interference graph, 𝑉₄ interferes with 𝑉₁ and 𝑉₃, which means that the live range of 𝑉₄ also extends the whole DOM tree from A to C, so now we have 𝑉₄ and 𝑉₂ having similar live intervals:

𝑉₂ → {A, C} : we have calculated this above
𝑉₄ → {A, C}

Based on all of the above, we know the following:

𝑉₂ interferes with 𝑉₁ and 𝑉₃
𝑉₄ interferes with 𝑉₁ and 𝑉₃
𝑉₂ has the same live interval as 𝑉₄

Because 𝑉₂ live interval overlap with 𝑉₄ since they are the same , that means 𝑉₂ interferes with 𝑉₄, which creates a chord (the blue edge) 😊👌

End of Story

I hope the article gave you a deep look into SSA register allocation and its power. I know the article was dense, but I think it's worth it (hopefully). Thank you for reading this and please if you noticed any technical error in the post let me know that in comments.

The Chords The Colors The Registers The SSA Optimality

Introduction

The Problem

Interference Graph

General

SSA-Chordal

Register Allocation And Graph Coloring

Canonical Stages of RA

Iterative RA

Pipelined RA

Chaitin-Briggs Allocator

Hack SSA Based Allocator

Why does it work?

The SSA Dominator Proof

End of Story

Comments

More from this blog

LLVM-Clang Static Analyzers And SMT Solvers

Implementing C# consteval compile time evaluation — A Deep Dive into Roslyn's Compiler Pipeline

Inside Clang C++ Compile Time Evaluators : AST Evaluator & Bytecode Interpreter

SSA to Stack: Retargeting LLVM to Stack Machines—A Deep Dive Through the WebAssembly Backend

Command Palette

Introduction

The Problem

Interference Graph

General

SSA-Chordal

Register Allocation And Graph Coloring

Canonical Stages of RA

Iterative RA

Pipelined RA

Chaitin-Briggs Allocator

Hack SSA Based Allocator

Why does it work?

The SSA Dominator Proof

End of Story

Comments

More from this blog