Compiler Optimization Strategies and Implementation

Introduction

This is a compiler optimization report for my course project at JHU, for which I got the highest grade in the class and received very positive feedback from Dr. David Hovemeyer. Feel free to check out the source code here.

image-20240222102942257

image-20240222102958520

Summary

After implementing several optimization passes, including high-level optimizations (local value numbering, local copy propagation, local constant folding, dead store elimination, and local register allocation) and a low-level peephole optimization, we observed a substantial improvement in performance.

Here, we benchmarked the compiler using example29.c and example31.c, which are relatively intricate programs involving millions of computations.

Runtime result for example29.c

The execution time was reduced from 3.67 seconds to 0.74 seconds, which represents a performance improvement of approximately 396%.

Version | Average running time (over 3 runs)
--- | ---
Unoptimized | 3.67s
Optimized (local value numbering, local copy propagation, local constant folding, dead store elimination, local register allocation) | 1.40s
Optimized (all high-level optimizations + peephole) | 0.74s
GCC (-O2) | 0.47s

Runtime result for example31.c

The execution time for example31.c decreased from 3.49 seconds to 0.86 seconds, indicating a performance increase of about 306%.

Version | Average running time (over 3 runs)
--- | ---
Unoptimized | 3.49s
Optimized (local value numbering, local copy propagation, local constant folding, dead store elimination, local register allocation) | 1.33s
Optimized (all high-level optimizations + peephole) | 0.86s
GCC (-O2) | 0.48s

In subsequent chapters, I will provide a detailed explanation of these improvements, the optimizations used, and their effectiveness.

Optimizations

There are a variety of optimization techniques available to enhance the run-time efficiency of the generated code. These techniques can be applied to either high-level or low-level code, and in either a local or a global scope. A thoroughly optimized compiler such as GCC represents years of dedicated research by hundreds of contributors. Given our limited time this semester, our strategy was to select optimization algorithms that offer significant performance improvements and are feasible to implement. Following Dr. Hovemeyer's advice, my optimization strategies are outlined below.

High-Level Optimization

During the high-level code optimization process, I implemented the following strategies:

  • Local Value Numbering: This process also includes Local Constant Folding, Dead Store Elimination, and Local Copy Propagation.
  • Local Register Allocation

Local Optimization

As you can see, many of the optimization schemes are performed in a local scope: local value numbering, local constant folding, local copy propagation, and local register allocation. So what does 'local' mean?

Here, 'local' means that optimization is conducted within basic blocks, the units of a control-flow graph. A basic block has a single entry point, and control can leave it only through its last instruction, typically a branch to another block.

image-20231103202153757

[Image: Dr. Hovemeyer, Compilers and Interpreters, Fall 2023]

We conduct local optimization on each basic block of the program's control-flow graph. Performing optimization at a global scope would generally produce better results, but it requires more effort and is considerably more complex.
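
To make the notion of a basic block concrete, below is a minimal C++ sketch (illustrative only; the Instruction type and the splitting rules are simplified assumptions, not the project's actual classes) that partitions a linear instruction sequence into basic blocks by treating labels as block leaders and control-transfer instructions as block terminators.

#include <string>
#include <vector>

// Hypothetical, simplified instruction record: just its text plus two flags.
struct Instruction {
    std::string text;
    bool is_label = false;   // e.g. ".L3:", a possible jump target
    bool is_branch = false;  // e.g. an unconditional/conditional jump or return
};

using BasicBlock = std::vector<Instruction>;

// Split a linear instruction list into basic blocks:
//  - a label starts a new block (control may enter here from elsewhere)
//  - a branch ends the current block (control leaves only at the last instruction)
std::vector<BasicBlock> build_basic_blocks(const std::vector<Instruction>& code) {
    std::vector<BasicBlock> blocks;
    BasicBlock current;
    for (const Instruction& ins : code) {
        if (ins.is_label && !current.empty()) {  // leader: close the previous block
            blocks.push_back(current);
            current.clear();
        }
        current.push_back(ins);
        if (ins.is_branch) {                     // terminator: close this block
            blocks.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) blocks.push_back(current);  // trailing fall-through block
    return blocks;
}

Each block in the resulting list can then be optimized independently, which is exactly the scope in which the local passes below operate.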

Low-Level Optimization

In low-level code optimization, I apply the following strategy:

  • Peephole optimization

High-Level Optimization

Instead of diving directly into optimizations on the low-level assembly code, we can take advantage of intermediate representations of source programs. This approach allows us to perform optimizations progressively, moving from shallower to deeper levels. It provides a more natural workflow and prevents confusion that might arise from working directly with intricate low-level code, which can be challenging to debug. A best practice for developing complex projects, like a compiler, is to segment the overall goal into smaller, manageable tasks and tackle them individually.

Thanks to Dr. Hovemeyer's contributions, we can utilize intermediate representations of the source code, which enable us to optimize the code at the high level first and then seamlessly translate these optimizations into the low-level code.

My strategy begins with conducting local value numbering on the high-level code.

Local Value Numbering

Local value numbering is a strategy for eliminating repeated computations in the code. The high-level code was generated by traversing each AST node with limited context, so it is hard to eliminate repeated computations during that process. With the generated high-level code, we can analyze the flow of values in a broader context, annotate each operand with a value number, and record the computations performed on each value number. If we later encounter a computation that has already been performed, we can simply reuse the previously computed result for the destination operand instead of computing it again. The following code snippet from example29.c illustrates this concept.

image-20231210223309156

For instance, the instruction localaddr vr42, $4000000 was computed before the instruction localaddr vr57, $4000000, and we assume that the value in vr42 remains unchanged. In this case, localaddr vr57, $4000000 is a redundant computation. We can avoid this redundancy by moving the value from vr42 to vr57.

To achieve this, we allocate a Value Number to every register and record every computation in a mapping structure {opcode, value number of source operand1, value number of source operand2 (if any)} -> value number of the destination operand. In the instruction localaddr vr42, $4000000, vr42 is assigned value number A, and $4000000 is assigned value number B. The mapping would be {localaddr, B} -> A.

When encountering the instruction localaddr vr57, $4000000, we check whether each operand has already been assigned a value number. If not, we assign a new one. In this case, vr57 receives a new value number C, and $4000000 retains its value number B. Now we have a key for the current computation, {localaddr, B}. Upon checking the mappings, we find an existing key for the same computation, indicating redundancy. Therefore, we change vr57's value number from C to A and eliminate the redundant computation by replacing the instruction with mov_q vr57, vr42.
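
In implementation terms, this bookkeeping is essentially a hash table keyed on the opcode and the value numbers of the source operands. Below is a simplified C++ sketch of that idea; the class and method names are invented for illustration and are not the project's actual code.

#include <map>
#include <optional>
#include <string>
#include <tuple>
#include <vector>

// Key for a computation: the opcode plus the value numbers of its sources.
struct LVNKey {
    std::string opcode;
    std::vector<int> src_value_numbers;
    bool operator<(const LVNKey& o) const {
        return std::tie(opcode, src_value_numbers) <
               std::tie(o.opcode, o.src_value_numbers);
    }
};

class LocalValueNumbering {
public:
    // Get (or create) the value number for an operand such as "vr42" or "$4000000".
    int value_number_of(const std::string& operand) {
        auto it = operand_to_vn_.find(operand);
        if (it != operand_to_vn_.end()) return it->second;
        int vn = next_vn_++;
        operand_to_vn_[operand] = vn;
        return vn;
    }

    // Record one computation. If an identical computation was already seen in
    // this basic block, return its value number so the caller can replace the
    // instruction with a simple register copy; otherwise remember it.
    std::optional<int> record(const std::string& opcode,
                              const std::vector<std::string>& sources,
                              const std::string& dest) {
        LVNKey key{opcode, {}};
        for (const auto& s : sources)
            key.src_value_numbers.push_back(value_number_of(s));
        auto it = computations_.find(key);
        if (it != computations_.end()) {
            operand_to_vn_[dest] = it->second;  // dest now shares the old value
            return it->second;                  // redundant computation detected
        }
        int dest_vn = next_vn_++;               // fresh value number: new definition
        operand_to_vn_[dest] = dest_vn;
        computations_[key] = dest_vn;
        return std::nullopt;
    }

private:
    std::map<std::string, int> operand_to_vn_;  // operand -> value number
    std::map<LVNKey, int> computations_;        // (opcode, source VNs) -> dest VN
    int next_vn_ = 0;
};

For the snippet above, record("localaddr", {"$4000000"}, "vr57") finds the entry created when vr42 was defined, so the instruction can be rewritten as mov_q vr57, vr42.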

The same process applies to the instruction mul_l vr43, vr13, vr10.

image-20231210231706428

The elimination process of Local Value Numbering typically occurs in several rounds since each round may create opportunities for further optimization. After several iterations, the result is as follows:

image-20231210234811672

After conducting Local Value Numbering, each basic block contains only the computations it needs, without redundancy. Our next goal is to eliminate unnecessary virtual registers. Many registers serve only as temporary holders for values during computations and become dead (no longer used) afterward. We leverage Copy Propagation and Dead Store Elimination to remove most of these redundant registers. Concurrently, we perform Constant Folding, computing constant expressions at compile time. This straightforward approach, combined with Copy Propagation, helps improve the performance of the generated code by reducing the number of computational instructions.

Improvement of Performance (LVN)

Runtime result for example29.c

The execution time was reduced from 3.67 seconds to 3.06 seconds, which represents a performance improvement of approximately 20%.

Version | Average running time (over 3 runs)
--- | ---
Unoptimized | 3.67s
Optimized (local value numbering) | 3.06s

Runtime result for example31.c

The execution time for example31.c was similarly reduced, from 3.49 seconds to 3.06 seconds, a performance improvement of approximately 14%.

Version | Average running time (over 3 runs)
--- | ---
Unoptimized | 3.49s
Optimized (local value numbering) | 3.06s

Local Copy Propagation

Copy propagation is a strategy to reduce the number of registers in use. After performing local value numbering, each register is assigned a value number, and often many registers share the same value number.

The central idea of this strategy is to unify all registers with identical value numbers into a single register according to specific rules. This unification creates opportunities to eliminate redundant registers. In practice, for registers sharing the same value number, we sort them by register number and pick the smallest as the unified register. We then replace each register that shares this value number with the unified register. If a value number corresponds to an immediate value, we simply replace uses of the register with that immediate.
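
A simplified C++ sketch of the substitution step, assuming the value-numbering pass has already recorded which virtual register carries which value number (the map layout here is an assumption made for illustration):

#include <map>

// After local value numbering, several virtual registers may share a value
// number. For each value number we pick the register with the smallest number
// as the canonical copy, then rewrite every use of the others to use it.
// (If a value number is known to hold an immediate, the use is replaced by the
// constant itself instead; that case is omitted here.)
std::map<int, int> build_replacement_map(
        const std::map<int, int>& reg_to_value_number) {  // vreg -> value number
    std::map<int, int> canonical;                         // value number -> smallest vreg
    for (const auto& [vreg, vn] : reg_to_value_number) {
        auto it = canonical.find(vn);
        if (it == canonical.end() || vreg < it->second) canonical[vn] = vreg;
    }
    std::map<int, int> replacement;                       // vreg -> canonical vreg
    for (const auto& [vreg, vn] : reg_to_value_number)
        replacement[vreg] = canonical.at(vn);
    return replacement;
}

// Usage: if vr42 and vr57 share a value number, replacement[57] == 42, so an
// instruction such as "mul_l vr60, vr57, vr10" becomes "mul_l vr60, vr42, vr10".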

Below are code snippets from example29.c, illustrating how copy propagation works in practice:

image-20231211010825890

This approach streamlines the code by efficiently managing register assignments, which contributes to the overall performance optimization of the compiled program.

Improvement of Performance (LVN, Local Copy Propagation)

Runtime result for example29.c

The execution time was reduced from 3.67 seconds to 3.08 seconds, which represents a performance improvement of approximately 19.2%.

Version | Average running time (over 3 runs)
--- | ---
Unoptimized | 3.67s
Optimized (local value numbering) | 3.06s
Optimized (local value numbering, local copy propagation) | 3.08s

Runtime result for example31.c

The execution time for example31.c was similarly reduced, from 3.49 seconds to 3.06 seconds, a performance improvement of approximately 14%.

Version | Average running time (over 3 runs)
--- | ---
Unoptimized | 3.49s
Optimized (local value numbering) | 3.06s
Optimized (local value numbering, local copy propagation) | 3.06s

Local Constant Folding

Performing constant computations at compile time is an effective way to simplify a portion of a program's calculations. By resolving these computations early, we reduce the computational load in the generated code, thereby enhancing the program's performance.
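
A minimal sketch of the folding step for binary integer instructions whose source operands are both immediates. The add_l and mul_l opcode names follow the high-level instructions shown elsewhere in this report; the other opcodes and the overall function shape are assumptions made for illustration.

#include <cstdint>
#include <optional>
#include <string>

// Fold a binary ALU instruction whose two source operands are compile-time
// constants. Returns the folded value, or nothing for opcodes not handled here
// (division is deliberately left out to avoid folding a divide-by-zero).
std::optional<std::int64_t> fold_constant(const std::string& opcode,
                                          std::int64_t lhs, std::int64_t rhs) {
    if (opcode == "add_l" || opcode == "add_q") return lhs + rhs;
    if (opcode == "sub_l" || opcode == "sub_q") return lhs - rhs;
    if (opcode == "mul_l" || opcode == "mul_q") return lhs * rhs;
    return std::nullopt;
}

// Example: an instruction like "add_l vr20, $6, $10" can be rewritten as
// "mov_l vr20, $16"; a later copy-propagation round may then replace uses of
// vr20 with the constant $16 directly.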

The following image illustrates an example of Local Constant Folding in action within example05.c:

image-20231211133815405

Improvement of Performance (LVN, Local Copy Propagation, Local Constant Folding)

Runtime result for example29.c

The execution time was reduced from 3.67 seconds to 3.02 seconds, which represents a performance improvement of approximately 21.5%.

Version | Average running time (over 3 runs)
--- | ---
Unoptimized | 3.67s
Optimized (local value numbering) | 3.06s
Optimized (local value numbering, local copy propagation) | 3.08s
Optimized (local value numbering, local copy propagation, local constant folding) | 3.02s

Runtime result for example31.c

The execution time for example31.c was similarly reduced, from 3.49 seconds to 3.07 seconds, a performance improvement of approximately 13.7%.

Version | Average running time (over 3 runs)
--- | ---
Unoptimized | 3.49s
Optimized (local value numbering) | 3.06s
Optimized (local value numbering, local copy propagation) | 3.06s
Optimized (local value numbering, local copy propagation, local constant folding) | 3.07s

Dead Store Elimination

After rounds of copy propagation and constant folding, we unify all registers with the same value number into a single register. This unification creates opportunities to eliminate dead registers.

A register is 'alive' from its point of definition until its last use. Conversely, a register is 'dead' at a point where its current value will not be used again before the register is redefined.

For example, consider the following code segment: vr12 serves as a temporary register holding the immediate value 5 for the addition. After that computation, vr12 is not used again until it is redefined by the last instruction, so it is dead in between and becomes alive again once it receives its new value.

mov_l vr12, $5
mov_l vr13, $16
add_l vr14, vr12, vr13
mov_l vr15, vr14
mov_l vr12, $17

If a register is dead as a destination operand in an instruction, that instruction can be eliminated, as the value in the register will no longer be used. By integrating dead store elimination with copy propagation, we can significantly reduce the use of temporary registers.
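
Given per-instruction liveness information, the elimination itself is a single pass over each basic block. The sketch below assumes the liveness sets have already been computed; the HLInstruction type is a simplified stand-in for the real IR, not the project's actual class.

#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Simplified stand-in for a high-level IR instruction.
struct HLInstruction {
    std::string opcode;
    int dest_vreg = -1;             // destination virtual register, -1 if none
    bool has_side_effects = false;  // stores through memory, calls, ...
};

// Drop instructions whose destination register is dead right after the
// instruction; live_after[i] is the set of vregs live after instruction i.
std::vector<HLInstruction> eliminate_dead_stores(
        const std::vector<HLInstruction>& block,
        const std::vector<std::set<int>>& live_after) {
    std::vector<HLInstruction> result;
    for (std::size_t i = 0; i < block.size(); ++i) {
        const HLInstruction& ins = block[i];
        bool dest_dead = ins.dest_vreg >= 0 &&
                         live_after[i].count(ins.dest_vreg) == 0;
        // Keep the instruction if its result is still needed, or if it has a
        // side effect our register liveness cannot see (e.g. mov_q (vr16), vr0
        // writes memory, so it must be preserved even if vr16 is dead).
        if (!dest_dead || ins.has_side_effects) result.push_back(ins);
    }
    return result;
}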

image-20231211143032497

The code snippet from example29.c demonstrates an instance of dead store elimination. The comments on the right side show the liveness analysis results, obtained through global data flow analysis. This analysis provides a comprehensive view of whether a value is alive or dead at any given point in the code.

A notable case here is that vr16 becomes dead after the mov_q (vr16), vr0 instruction. However, we do not treat (vr16) as a dead store, because the value is written to memory rather than to the register, and our liveness analysis does not extend to memory values.
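
For completeness, the liveness facts used above come from a standard backward data-flow analysis: a register is live-out of a block if it is live-in to any successor, and live-in if it is used before being redefined in the block. A compact sketch of the fixed-point iteration (the BlockFacts type is a simplified placeholder, not the project's actual data structure):

#include <set>
#include <utility>
#include <vector>

// Per-block facts for liveness. use/def and the successor edges are computed
// once from the instructions; live_in/live_out are filled in below.
struct BlockFacts {
    std::set<int> use;               // vregs read before any write in the block
    std::set<int> def;               // vregs written in the block
    std::vector<int> successors;     // indices of successor blocks in the CFG
    std::set<int> live_in, live_out;
};

// Classic backward liveness, iterated to a fixed point:
//   live_out[B] = union of live_in[S] over all successors S of B
//   live_in[B]  = use[B] union (live_out[B] minus def[B])
void compute_liveness(std::vector<BlockFacts>& blocks) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (auto& b : blocks) {
            std::set<int> out;
            for (int s : b.successors)
                out.insert(blocks[s].live_in.begin(), blocks[s].live_in.end());
            std::set<int> in = b.use;
            for (int v : out)
                if (b.def.count(v) == 0) in.insert(v);
            if (in != b.live_in || out != b.live_out) {
                b.live_in = std::move(in);
                b.live_out = std::move(out);
                changed = true;
            }
        }
    }
}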

image-20231211144034407

Improvement of Performance (LVN, Local Copy Propagation, Local Constant Folding, Dead Store Elimination)

Runtime result for example29.c

The execution time was reduced from 3.67 seconds to 2.54 seconds, which represents a performance improvement of approximately 44.5%.

Version | Average running time (over 3 runs)
--- | ---
Unoptimized | 3.67s
Optimized (local value numbering) | 3.06s
Optimized (local value numbering, local copy propagation) | 3.08s
Optimized (local value numbering, local copy propagation, local constant folding) | 3.02s
Optimized (local value numbering, local copy propagation, local constant folding, dead store elimination) | 2.54s

Runtime result for example31.c

The execution time for example31.c was similarly reduced, from 3.49 seconds to 2.37 seconds, a performance improvement of approximately 47.3%.

Version | Average running time (over 3 runs)
--- | ---
Unoptimized | 3.49s
Optimized (local value numbering) | 3.06s
Optimized (local value numbering, local copy propagation) | 3.06s
Optimized (local value numbering, local copy propagation, local constant folding) | 3.07s
Optimized (local value numbering, local copy propagation, local constant folding, dead store elimination) | 2.37s

Local Register Allocation

After completing local value numbering, local copy propagation, local constant folding, and dead store elimination, we have much cleaner high-level code: every register has a clear responsibility and redundant computations have been eliminated. Our next step is to allocate machine registers to the virtual registers, since the baseline low-level code generator keeps virtual registers in memory, and memory operands are considerably slower than machine registers, which significantly impacts the performance of the generated code.

Strategies

Our allocation strategy is divided into two parts: allocation for local variables and allocation for temporary values.

What Are Temporary Values?

Temporary values are stored in temporary registers for specific computations and typically have a short lifespan, often becoming obsolete immediately after the computation. Essentially, they act as computation helpers, temporarily holding values necessary to complete a computation. This is particularly important in low-level assembly code, which requires more operations to implement a computation than high-level program code.

Consider, for example, a computation in a C program such as c = a * 8 + b. In this scenario, a, b, and c are local variables allocated to fixed virtual registers vr10, vr11, and vr12. However, to translate such a computation into low-level code, a temporary value, like vr20, is needed to hold intermediate results. Given that an ALU operation in the low-level code takes at most two operands, the computation cannot be completed without temporary registers.

mov_l $8, vr20
mul_l vr10, vr20
add_l vr11, vr20
mov_l vr20, vr12

What Are Local Variables?

Local variables are variables defined within a function in the source program. Our high-level register allocator assigns a fixed virtual register number to each local variable.

image-20231211213847125

Machine Registers Allocation

According to the x86-64 Linux register use conventions, there are two categories of registers: caller-saved and callee-saved registers.

caller-saved registers: %r10, %r11, %rdi, %rsi, %rdx, %rcx, %r8, %r9

callee-saved registers: %rbx, %rbp, %r12, %r13, %r14, %r15

We allocate caller-saved registers for temporary values and callee-saved registers for local variables.

Allocation for Temporary Values

For virtual registers not representing local variables, we allocate caller-saved registers if they have not been previously allocated. To achieve this, we maintain a register pool containing caller-saved registers. When a temporary virtual register is unallocated, we assign it a caller-saved machine register from this pool.

image-20231211221159109

However, the number of available caller-saved registers is finite. To allocate as many caller-saved machine registers as possible for temporary values, we need to find a way to reuse some of the caller-saved registers. This involves conducting liveness analysis again.

By leveraging liveness analysis, we can return machine registers that were allocated for a temporary value back to the machine register pool when the temporary value becomes dead.
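
A simplified sketch of such a caller-saved register pool. The register names match the caller-saved set listed above; the interface and the spill behavior are illustrative assumptions rather than the project's actual allocator.

#include <deque>
#include <map>
#include <stdexcept>
#include <string>

// A pool of caller-saved machine registers for temporary values.
class TempRegisterPool {
public:
    TempRegisterPool()
        : free_{"%r10", "%r11", "%rdi", "%rsi", "%rdx", "%rcx", "%r8", "%r9"} {}

    // Give an unallocated temporary vreg a caller-saved machine register.
    std::string allocate(int vreg) {
        auto it = assigned_.find(vreg);
        if (it != assigned_.end()) return it->second;  // already has a register
        if (free_.empty())
            throw std::runtime_error("caller-saved registers exhausted; would spill");
        std::string mreg = free_.front();
        free_.pop_front();
        assigned_[vreg] = mreg;
        return mreg;
    }

    // Called when liveness says the temporary is dead: recycle its register.
    void release(int vreg) {
        auto it = assigned_.find(vreg);
        if (it == assigned_.end()) return;
        free_.push_back(it->second);
        assigned_.erase(it);
    }

private:
    std::deque<std::string> free_;         // machine registers still available
    std::map<int, std::string> assigned_;  // vreg -> machine register
};

In practice, release() is driven by the liveness results: as soon as a temporary's last use has been emitted, its machine register goes back into the pool.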

image-20231211223637397

Allocation for Local Variables

We allocate callee-saved registers for local variables. The overall process is similar to the allocation for temporary values, with two key differences:

  • We have to compute a rank for each local variable and allocate callee-saved registers based on these ranks. The higher a variable's rank, the higher its priority for allocation.
  • Unlike caller-saved registers, callee-saved registers must preserve their values across function calls. Therefore, if we allocate a callee-saved register to a local variable within a function, we need to save and restore the value of that register in the prologue and epilogue using push and pop instructions.

To calculate reasonable ranks, we use a simple rule: each reference to a variable contributes a basic score of 10 raised to the power of n, where n is the nesting depth of the loop containing the reference. The total rank of a local variable is the sum of the basic scores of all its references.
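
A small sketch of this ranking rule (the VariableReference record is invented for illustration): every reference contributes 10^n, where n is its loop-nesting depth.

#include <map>
#include <vector>

// One reference to a local variable in the high-level code.
struct VariableReference {
    int vreg;        // fixed virtual register assigned to the local variable
    int loop_depth;  // 0 = outside any loop, 1 = one loop deep, ...
};

// rank(v) = sum over v's references of 10^loop_depth. Variables with higher
// ranks get callee-saved machine registers first.
std::map<int, long long> compute_ranks(const std::vector<VariableReference>& refs) {
    std::map<int, long long> rank;
    for (const auto& r : refs) {
        long long score = 1;
        for (int i = 0; i < r.loop_depth; ++i) score *= 10;  // 10^loop_depth
        rank[r.vreg] += score;
    }
    return rank;
}

// Example: a variable referenced twice inside a doubly nested loop (depth 2)
// and once outside any loop gets rank 2 * 100 + 1 = 201.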

image-20231211230640827

As previously discussed, the use of callee-saved registers requires save and restore operations so that their values are preserved across function calls.

image-20231211231139192

Improvement of Performance (LVN, Local Copy Propagation, Local Constant Folding, Dead Store Elimination, Local Register Allocation)

Runtime result for example29.c

The execution time was reduced from 3.67 seconds to 1.40 seconds, which represents a performance improvement of approximately 162%.

Version | Average running time (over 3 runs)
--- | ---
Unoptimized | 3.67s
Optimized (local value numbering) | 3.06s
Optimized (local value numbering, local copy propagation) | 3.08s
Optimized (local value numbering, local copy propagation, local constant folding) | 3.02s
Optimized (local value numbering, local copy propagation, local constant folding, dead store elimination) | 2.54s
Optimized (local value numbering, local copy propagation, local constant folding, dead store elimination, local register allocation) | 1.40s

Runtime result for example31.c

The execution time for example31.c was similarly reduced, from 3.49 seconds to 1.33 seconds, a performance improvement of approximately 162%.

Version | Average running time (over 3 runs)
--- | ---
Unoptimized | 3.49s
Optimized (local value numbering) | 3.06s
Optimized (local value numbering, local copy propagation) | 3.06s
Optimized (local value numbering, local copy propagation, local constant folding) | 3.07s
Optimized (local value numbering, local copy propagation, local constant folding, dead store elimination) | 2.37s
Optimized (local value numbering, local copy propagation, local constant folding, dead store elimination, local register allocation) | 1.33s

Low-Level Optimization

After the rounds of high-level optimization discussed above, we now have an improved high-level intermediate representation. The next step is peephole optimization, which further optimizes the low-level code.

Peephole optimization

Peephole optimization is a relatively straightforward technique: it replaces short fragments of inefficient instructions with more efficient equivalents. Since our target assembly language is x86-64, many instructions are available that can replace the inefficient sequences produced by our low-level code generator.

Peephole optimization operates based on predefined rules or patterns. We can define various patterns to match the fragments of instructions that we wish to replace and identify corresponding better idioms that serve as more efficient alternative instructions.

The peephole optimization implemented in our compiler uses matchers that recognize patterns within basic blocks and replace them with better idioms. Liveness analysis is also used to decide whether a rewrite is safe, for example when a destination register's old value is no longer needed.
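
The sketch below illustrates the general shape of such a matcher in C++: each rule inspects a short window of instructions and, when it recognizes a fragment, returns a replacement sequence; the driver repeats until no rule fires. The concrete rule shown (deleting a self-move) is a deliberately trivial stand-in for illustration, not one of the actual patterns discussed below.

#include <functional>
#include <optional>
#include <string>
#include <vector>

// A rule inspects `window` consecutive instructions; if it recognizes the
// fragment, it returns the replacement sequence (possibly empty).
struct PeepholeRule {
    size_t window;
    std::function<std::optional<std::vector<std::string>>(
        const std::vector<std::string>&)> apply;
};

// Example rule: delete a register-to-register move whose source and destination
// are the same register (e.g. "movq %r10, %r10"), which is a no-op.
const PeepholeRule self_move_rule{
    1,
    [](const std::vector<std::string>& w)
        -> std::optional<std::vector<std::string>> {
        const std::string& ins = w[0];
        auto comma = ins.find(',');
        if (ins.rfind("movq ", 0) == 0 && comma != std::string::npos) {
            std::string src = ins.substr(5, comma - 5);
            std::string dst = ins.substr(comma + 1);
            while (!dst.empty() && dst.front() == ' ') dst.erase(0, 1);
            if (src == dst) return std::vector<std::string>{};  // drop the no-op
        }
        return std::nullopt;
    }};

// Apply all rules repeatedly until no more replacements are possible.
std::vector<std::string> run_peephole(std::vector<std::string> code,
                                      const std::vector<PeepholeRule>& rules) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (size_t i = 0; i < code.size(); ++i) {
            for (const auto& rule : rules) {
                if (i + rule.window > code.size()) continue;
                std::vector<std::string> window(code.begin() + i,
                                                code.begin() + i + rule.window);
                if (auto replacement = rule.apply(window)) {
                    code.erase(code.begin() + i, code.begin() + i + rule.window);
                    code.insert(code.begin() + i,
                                replacement->begin(), replacement->end());
                    changed = true;
                }
            }
        }
    }
    return code;
}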

Here, we will discuss the key patterns used in our peephole optimization:

1. Pattern for Array Computation

image-20231212001004287

2. Pattern for Conditional Jump

image-20231212001323371

3. Pattern for Array Index Computation

In some situations, array computations cannot be translated into the most optimized idioms because certain destination operands are still live and cannot be eliminated. In such cases, we can select alternative idioms with the aid of liveness analysis.

image-20231212002815250

Improvement of Performance (LVN, Local Copy Propagation, Local Constant Folding, Dead Store Elimination, Local Register Allocation, Peephole Optimization)

Runtime result for example29.c

The execution time was reduced from 3.67 seconds to 0.74 seconds, which represents a performance improvement of approximately 396%.

Version | Average running time (over 3 runs)
--- | ---
Unoptimized | 3.67s
Optimized (local value numbering) | 3.06s
Optimized (local value numbering, local copy propagation) | 3.08s
Optimized (local value numbering, local copy propagation, local constant folding) | 3.02s
Optimized (local value numbering, local copy propagation, local constant folding, dead store elimination) | 2.54s
Optimized (local value numbering, local copy propagation, local constant folding, dead store elimination, local register allocation) | 1.40s
Optimized (local value numbering, local copy propagation, local constant folding, dead store elimination, local register allocation, peephole optimization) | 0.74s

Runtime result for example31.c

The execution time for example31.c was similarly reduced, from 3.49 seconds to 0.86 seconds, a performance improvement of approximately 306%.

Version | Average running time (over 3 runs)
--- | ---
Unoptimized | 3.49s
Optimized (local value numbering) | 3.06s
Optimized (local value numbering, local copy propagation) | 3.06s
Optimized (local value numbering, local copy propagation, local constant folding) | 3.07s
Optimized (local value numbering, local copy propagation, local constant folding, dead store elimination) | 2.37s
Optimized (local value numbering, local copy propagation, local constant folding, dead store elimination, local register allocation) | 1.33s
Optimized (local value numbering, local copy propagation, local constant folding, dead store elimination, local register allocation, peephole optimization) | 0.86s

Remaining inefficiencies

Although the performance of the generated code has improved significantly, there are still many remaining inefficiencies in the optimizations.

Just Local, We Can Do Global

During most of the optimization processes, including local value numbering, local copy propagation, and local constant folding, our optimization context is always focused on the basic blocks. However, if we leverage data flow analysis to conduct these optimizations globally, the quality of the generated code could be significantly enhanced.

Memory Values

Being able to track memory values would greatly improve our optimization capabilities. In our current approach to local value numbering, local copy propagation, local constant folding, and dead store elimination, we only focus on immediate values or values stored in registers. Memory reference values are not considered. Crafting reasonable algorithms to track memory values during compile time would undoubtedly enhance the quality of the generated code.

Optimization Order

After peephole optimization, a number of machine registers may be freed and become available again. Re-running register allocation after the peephole pass would therefore further improve the efficiency of machine register usage.

Conclusion

This comprehensive report has outlined the numerous optimization processes applied to our compiler, emphasizing both high-level and low-level optimizations. Through meticulous implementation and testing, we have successfully enhanced the performance of complex programs, as evidenced by our benchmarks with example29.c and example31.c.

The high-level optimizations, including local value numbering, local copy propagation, local constant folding, dead store elimination, and local register allocation, were instrumental in refining the high-level code. These optimizations not only streamlined the code by reducing redundancy but also significantly boosted the run-time efficiency. The integration of low-level peephole optimization further augmented this efficiency, showcasing the power of replacing inefficient instruction fragments with more effective alternatives.

Our results are striking. For instance, in example29.c, the execution time was dramatically reduced from 3.67 seconds to 0.74 seconds, demonstrating a 396% improvement. Similar improvements were observed in example31.c, where the execution time was cut down from 3.49 seconds to 0.86 seconds, marking a 306% increase in performance.

However, despite these significant advancements, there remain areas for further enhancement. The potential for global optimization, better memory value tracking, and improved optimization order highlight ongoing opportunities to refine our compiler’s performance even further.

Appendix

Test recording for example29.c

Unoptimized

image-20231210205520223

High-Level Optimization:

  • local value numbering

image-20231212015231111

  • local copy propagation

image-20231212020255237

  • local constant folding

image-20231212021019174

  • dead store elimination

image-20231212021853881

  • local register allocation

image-20231210210136272

Low-Level Optimization

  • Peephole

image-20231210210455866

GCC -O2

image-20231210210653394

Test recording for example31.c

Unoptimized

image-20231210211856860

High-Level Optimization:

  • local value numbering

image-20231212015728603

  • local copy propagation

image-20231212020528232

  • local constant folding

image-20231212021504539

  • dead store elimination

image-20231212022114005

  • local register allocation

image-20231210212035622

Low-Level Optimization

  • Peephole

image-20231210212148848

GCC -O2

image-20231210212533908