What every programmer should know about compiler optimizations

High-level programming languages offer many abstract programming constructs, such as functions, conditional statements, and loops, that make us incredibly productive. However, one drawback of writing code in a high-level language is a potentially significant performance penalty. Ideally, you should write code that is easy to maintain and understand, without compromising performance. For this reason, compilers try to optimize code automatically to improve its performance, and they have become remarkably sophisticated at doing so. They can transform loops, conditional statements, and recursive functions, remove entire blocks of code, and take advantage of the target instruction set architecture (ISA) to make code fast and compact. It is far better to focus on writing understandable code than to perform manual optimizations that result in code that is cryptic and difficult to maintain. In fact, manually optimizing your code might prevent the compiler from performing additional or more powerful optimizations.

Instead of manually optimizing your code, you should consider aspects of your code’s design, such as using faster algorithms, incorporating thread-level parallelism, and using language-specific features (such as move constructors).

This article is about Visual C++ compiler optimizations. I’ll explain the most important optimization techniques and the decisions a compiler must make before applying them. The goal is not to explain how to manually optimize your code, but rather to show why you can trust the compiler to optimize your code for you. This article is not a complete review of the optimizations that the Visual C++ compiler makes. However, it does show the optimizations you are interested in knowing about and how to communicate with the compiler to apply them.

There are other important optimizations that are beyond the current capabilities of any compiler, such as replacing an inefficient algorithm with an efficient one, or changing the design of a data structure to improve its locality. However, these optimizations are outside the scope of this article.

Defining compiler optimizations

An optimization is the process of transforming a piece of code into another functionally equivalent piece of code to improve one or more of its features. The two most important features are speed and code size. Other characteristics include the amount of power required to execute the code, the time required to compile the code, and, if the resulting code requires Just-in-Time (JIT) compilation, the time it takes to JIT compile the code.

Compilers are constantly getting better at the techniques they use to optimize code. However, they are not perfect. Still, instead of spending time manually fine-tuning a program, it is usually much more profitable to use certain features provided by the compiler and let the compiler fine-tune the code.

There are four ways to help the compiler optimize code more efficiently:

  1. Write code that is easy to maintain and understand. Don’t think of Visual C++’s object-oriented features as enemies of performance. The latest version of Visual C++ can keep their overhead to a minimum and, in some cases, eliminate it altogether.
  2. Use compiler directives. For example, tell the compiler to use a function calling convention that is faster than the default one.
  3. Use compiler intrinsic functions. An intrinsic function is a special function whose implementation is provided automatically by the compiler. The compiler has an intimate knowledge of the function and replaces the function call with an extremely efficient sequence of instructions that takes advantage of the target ISA (a short sketch follows this list). The Microsoft .NET Framework does not currently support intrinsic functions, so none of the managed languages support them. However, Visual C++ has extensive support for this feature. Note that while using intrinsic functions can improve the performance of your code, it reduces its readability and portability.
  4. Use profile-guided optimization (PGO). With this technique, the compiler learns more about how the code behaves at run time and can optimize it accordingly.
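
As a quick illustration of the third method, here’s a minimal sketch of an intrinsic in Visual C++ (countBits is a hypothetical helper name; __popcnt is one of the documented Visual C++ intrinsics):

#include <intrin.h> // Visual C++ compiler intrinsics.

// Count the set bits in a value. Instead of emitting a call to a
// loop-based routine, the compiler replaces the call to __popcnt
// with a single POPCNT instruction on processors that support it.
unsigned int countBits(unsigned int value) {
  return __popcnt(value);
}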
The purpose of this article is to show you why you can trust the compiler by demonstrating the optimizations performed on inefficient but understandable code (using the first method). In addition, I give a brief introduction to profile-guided optimization and mention some of the compiler directives that let you fine-tune parts of your code.

There are many compiler optimization techniques, ranging from simple transformations, such as constant folding, to extreme transformations, such as instruction scheduling. However, in this article I’ll limit the discussion to some of the most important optimizations, those that can significantly improve performance (by a double-digit percentage) and reduce code size: function inlining, COMDAT optimizations, and loop optimizations. I’ll cover the first two in the next section, and then show how you can control the optimizations that Visual C++ performs. Lastly, I’ll take a look at optimizations in the .NET Framework. Throughout this article, I’ll use Visual Studio 2013 to build the code.

Link-time code generation

Link-time code generation (LTCG) is a technique for performing whole program optimization (WPO) on C or C++ code. The C or C++ compiler compiles each source file separately and produces the corresponding object file. This means the compiler can only apply optimizations within a single source file rather than across the whole program. However, some important optimizations can be performed only by looking at the whole program. You can apply these optimizations at link time rather than at compile time because the linker has a complete view of the program.

When LTCG is enabled (by specifying the /GL compiler switch), the compiler driver (cl.exe) invokes only the front end of the compiler (c1.dll or c1xx.dll) and postpones the work of the back end (c2.dll) until link time. The resulting object files contain C Intermediate Language (CIL) code rather than machine-dependent assembly code. Then, when the linker (link.exe) is invoked, it sees that the object files contain CIL code and invokes the back end of the compiler, which in turn performs WPO, generates the binary object files, and returns control to the linker, which merges all the object files and produces the executable.

The front-end actually performs some optimizations, such as constant folding, regardless of whether optimizations are enabled or disabled. However, all major optimizations are performed by the compiler back-end and can be controlled by compiler switches.

LTCG enables the back end to perform many optimizations aggressively (by specifying /GL together with the /O1 or /O2 and /Gw compiler switches and the /OPT:REF and /OPT:ICF linker switches). In this article, I’ll discuss only function inlining and COMDAT optimizations. For a full list of LTCG optimizations, refer to the documentation. Note that the linker can perform LTCG on native object files, mixed native/managed object files, pure managed object files, safe managed object files, and safe .netmodules.

I will build a program consisting of two source code files (source1.c and source2.c) and a header file (source2.h). The files source1.c and source2.c are shown in Figure 1 and Figure 2, respectively. The header file, which contains the prototypes for all the functions in source2.c, is fairly simple, so it’s not shown here.

Figure 1 The source1.c File

#include <stdio.h>   // scanf_s and printf.
#include "Source2.h"
int square(int x) { return x*x; }
int main() {
  int n = 5, m;
  scanf_s("%d", &m);
  printf("The square of %d is %d.", n, square(n));
  printf("The square of %d is %d.", m, square(m));
  printf("The cube of %d is %d.", n, cube(n));
  printf("The sum of %d is %d.", n, sum(n));
  printf("The sum of cubes of %d is %d.", n, sumOfCubes(n));
  printf("The %dth prime number is %d.", n, getPrime(n));
}
Figure 2 The source2.c File

#include <math.h>    // sqrt.
#include <stdbool.h> // bool, true and false.
#include "Source2.h"
int cube(int x) { return x*x*x; }
int sum(int x) {
  int result = 0;
  for (int i = 1; i <= x; ++i) result += i;
  return result;
}
int sumOfCubes(int x) {
  int result = 0;
  for (int i = 1; i <= x; ++i) result += cube(i);
  return result;
}
static bool isPrime(int x) {
  for (int i = 2; i <= (int)sqrt(x); ++i) {
    if (x % i == 0) return false;
  }
  return true;
}
int getPrime(int x) {
  int count = 0;
  int candidate = 1;
  while (count != x) {
    ++candidate; // advance to the next candidate on every iteration
    if (isPrime(candidate))
      ++count;
  }
  return candidate;
}
The source1.c file contains two functions: the square function, which takes an integer and returns its square, and the program’s main function. The main function calls the square function and all the functions in source2.c except isPrime. The source2.c file contains five functions: the cube function returns the cube of a given integer; the sum function returns the sum of all integers from 1 to a given integer; the sumOfCubes function returns the sum of the cubes of all integers from 1 to a given integer; the isPrime function determines whether a given integer is prime; and the getPrime function returns the xth prime number. I have omitted error checking because it is not relevant to this article.

The code is simple but useful. There are a number of functions that perform simple computations; some require simple for loops. The getPrime function is the most complex because it contains a while loop and, inside the loop, calls the isPrime function, which also contains a loop. I’ll use this code to demonstrate one of the most important compiler optimizations, namely function inlining, and a few other optimizations.

I will build the code using three different configurations and examine the results to determine how the compiler has transformed it. If you want to follow along, you’ll need the assembler output file (generated with the /FA[s] compiler switch) to examine the resulting assembly code, and the map file (generated with the /MAP linker switch) to determine which COMDAT optimizations have been performed (the linker can also report this if you use the /verbose:icf and /verbose:ref switches). Make sure these switches are specified in all of the configurations discussed below. Also, I’ll use the C compiler (/TC) so the generated code is easier to examine. However, everything discussed here also applies to C++ code.
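
As a sketch, command lines for the two Release configurations discussed below might look like the following (file names come from this article’s example; adjust paths and names as needed):

rem Release, compile-time code generation (a sketch):
cl /O2 /FAs /TC source1.c source2.c /Fetest.exe /link /MAP

rem Release, link-time code generation (a sketch):
cl /O2 /GL /Gw /TC source1.c source2.c /Fetest.exe /link /LTCG /OPT:REF /OPT:ICF /MAP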

The debug configuration

The Debug configuration is used mainly because all back-end optimizations are disabled when the /Od compiler switch is specified without the /GL switch. When building the code with this configuration, the resulting object files contain binary code that corresponds exactly to the source code. You can examine the resulting assembler output files and the map file to confirm this. This configuration is equivalent to Visual Studio’s Debug configuration.

The Release setting for compile-time code generation

This configuration is similar to the Release configuration in that optimizations are enabled (by specifying the /O1, /O2, or /Ox compiler switch), but the /GL compiler switch is not specified. With this configuration, the resulting object files contain optimized binary code, but no whole-program optimizations are performed.

If you examine the assembly listing file generated for source1.c, you’ll notice that two optimizations have been performed. First, the first call to the square function, square(n), in Figure 1 has been eliminated entirely by evaluating the computation at compile time. How did this happen? The compiler determined that the square function is small, so it should be inlined. After inlining it, the compiler determined that the value of the local variable n is known and doesn’t change between the assignment statement and the function call. It therefore concluded that it’s safe to perform the multiplication and substitute the result (25). In the second optimization, the second call to the square function, square(m), is inlined as well. However, because the value of m isn’t known at compile time, the compiler can’t evaluate the computation, so the actual code is emitted.
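
Conceptually, the transformation looks something like this (a sketch of the effect, not the compiler’s literal output):

// Before optimization (from Figure 1):
printf("The square of %d is %d.", n, square(n));
printf("The square of %d is %d.", m, square(m));
// After inlining, constant propagation and constant folding (conceptually):
printf("The square of %d is %d.", 5, 25);    // n = 5 is known at compile time
printf("The square of %d is %d.", m, m * m); // m is read at run time, so code is emitted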

We will now examine the source2.c assembly list file, which is much more interesting. The cube function call in sumOfCubes has been inlined. In turn, this has allowed the compiler to make significant optimizations to the loop (as you’ll see in the “Loop Optimizations” section). Also, the SSE2 instruction set is used in the isPrime function to convert from int to double when calling the sqrt function, and also to convert from double to int when returning from sqrt. And sqrt is called only once before the loop starts. Note that if the /arch switch is not specified in the compiler, the x86 compiler uses SSE2 by default. Most widely implemented x86 processors, as well as all x86-64 processors, support SSE2.
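
The effect on isPrime is roughly equivalent to the following source-level rewrite (a sketch; the compiler applies the transformation to the generated code, not the source):

static bool isPrime(int x) {
  int limit = (int)sqrt((double)x); // converted and computed once, before the loop
  for (int i = 2; i <= limit; ++i) {
    if (x % i == 0) return false;
  }
  return true;
}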

The Release setting for link-time code generation

The LTCG Release configuration is identical to Visual Studio’s Release configuration. With this configuration, optimizations are enabled and the /GL compiler switch is specified. This switch instructs the compiler to emit CIL object files rather than machine-code object files. The linker then invokes the compiler back end to perform WPO, as described earlier. I’ll now discuss several WPO optimizations to show the immense benefit of LTCG. Assembly code listings generated with this configuration are available online.

As long as function inlining is enabled (/Ob2, which is in effect whenever optimizations are requested), the /GL switch allows the compiler to inline functions defined in other translation units, regardless of whether the /Gy compiler switch (discussed later) is specified. The /LTCG linker switch is optional and provides instructions to the linker only.

If you examine the assembly listing file for source1.c, you can see that all function calls except the one to scanf_s have been inlined. As a result, the compiler was able to evaluate the cube, sum, and sumOfCubes computations at compile time. Only the isPrime function was not inlined. However, if it were inlined manually into getPrime, the compiler would still inline getPrime into main.

As you can see, function inlining is important not only because it optimizes away a function call, but also because it enables the compiler to perform many other optimizations as a result. Inlining a function usually improves performance at the cost of increased code size. Excessive use of this optimization leads to a phenomenon known as code bloat. At every call site, the compiler performs a cost/benefit analysis and then decides whether to inline the function.

Because of the importance of inlining, the Visual C++ compiler provides much more control over it than the standard dictates. You can tell the compiler to never inline a range of functions by using the auto_inline pragma. You can tell the compiler to never inline a specific function or method by marking it with __declspec(noinline). You can mark a function with the inline keyword to give the compiler a hint to inline the function (although the compiler may choose to ignore this hint if inlining would be a net loss). The inline keyword has been available since the first version of C++; it was introduced to C in C99. You can use the Microsoft-specific keyword __inline in both C and C++ code; it’s useful when using an old version of C that doesn’t support the inline keyword. Furthermore, you can use the __forceinline keyword (C and C++) to force the compiler to inline a function whenever possible. And last but not least, you can tell the compiler to unroll a recursive function to a specified or indefinite depth by inlining it using the inline_recursion pragma. Note that the compiler currently offers no feature that lets you control inlining at the call site rather than at the function definition.
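
The following sketch shows these controls side by side (all function names are hypothetical):

#pragma auto_inline(off)      // functions defined below are excluded from automatic inlining
void notAutoInlined(void) { /* ... */ }
#pragma auto_inline(on)

__declspec(noinline) void neverInlined(void) { /* ... */ }

inline int hintedInline(int x) { return x + 1; }        // a hint the compiler may ignore
__forceinline int forcedInline(int x) { return x + 1; } // inlined whenever possible

#pragma inline_recursion(on)  // allow a recursive function to be expanded inline
inline int factorial(int n) { return n <= 1 ? 1 : n * factorial(n - 1); }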

The /Ob0 switch completely disables inlining, which is in effect by default. You should use this switch when debugging (it’s automatically specified in the Visual Studio Debug configuration). The /Ob1 switch tells the compiler to consider for inlining only functions marked with inline, __inline, or __forceinline. The /Ob2 switch, which is in effect when /O[1|2|x] is specified, tells the compiler to consider any function for inlining. In my opinion, the only reason to use the inline or __inline keywords is to control inlining with the /Ob1 switch.

The compiler won’t be able to inline a function under certain conditions. One example is calling a virtual function virtually; the function can’t be inlined because the compiler doesn’t know which function will actually be called. Another example is calling a function through a pointer to the function rather than through its name. You should strive to avoid these conditions so that inlining remains possible. Refer to the MSDN documentation for a complete list of such conditions.

Function inlining isn’t the only optimization that’s most effective when applied at the whole-program level. In fact, most optimizations become more effective at that level. In the rest of this section, I’ll discuss a specific class of such optimizations called COMDAT optimizations.

By default, when a translation unit is compiled, all code is stored in a single section in the resulting object file. The linker operates at the section level: it can remove sections, combine sections, and reorder sections. This prevents the linker from performing three optimizations that can significantly (by a double-digit percentage) reduce the size of the executable and improve its performance. The first is eliminating unreferenced functions and global variables. The second is folding identical functions and constant global variables. The third is reordering functions and global variables so that those that lie on the same execution path are located physically closer to each other.

To enable these linker optimizations, you can tell the compiler to package functions and variables into separate sections by specifying the /Gy (function-level linking) and /Gw (global data optimization) compiler switches, respectively. These sections are called COMDATs. You can also mark a particular global data variable with __declspec(selectany) to tell the compiler to package the variable into a COMDAT. Then, by specifying the /OPT:REF linker switch, the linker will eliminate unreferenced functions and global variables. Also, by specifying the /OPT:ICF switch, the linker will fold identical functions and constant global variables (ICF stands for identical COMDAT folding). With the /ORDER linker switch, you can tell the linker to place the COMDATs in the resulting image in a specific order. Note that all of these optimizations are linker optimizations and don’t require the /GL compiler switch. The /OPT:REF and /OPT:ICF switches should be disabled while debugging, for obvious reasons.
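
Here’s a small sketch of these features (all names are hypothetical):

// Compile with /Gy and /Gw so each function and variable gets its own COMDAT.
__declspec(selectany) int lookupTable[4] = { 1, 2, 3, 4 }; // packaged into a COMDAT even without /Gw

int helperA(int x) { return x + 1; }
int helperB(int x) { return x + 1; }      // identical to helperA: /OPT:ICF can fold the two
int unusedHelper(int x) { return x * 2; } // never referenced: /OPT:REF can remove it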

You should use LTCG whenever possible. The only reason not to use it is when you want to distribute the resulting object and library files. Recall that these files contain CIL code rather than machine code. CIL object files can be consumed only by the same version of the compiler and linker that produced them, which can severely limit their usability because developers must have that same compiler version to use them. In this case, unless you’re willing to distribute the object files for every compiler version, you should use compile-time code generation instead. In addition to this limited usability, CIL object files are many times larger than the corresponding assembly object files. However, keep in mind the great benefit of CIL object files: enabling WPO.

Loop optimizations

The Visual C++ compiler supports several loop optimizations, but I’ll discuss just three: loop unrolling, automatic vectorization, and loop-invariant code motion. If you modify the code in Figure 1 so that m is passed to sumOfCubes instead of n, the compiler won’t be able to determine the value of the parameter, so it must compile the function to handle any argument. The resulting function is highly optimized and its size is large enough that the compiler won’t inline it.

Compiling the code with the /O1 switch produces assembly code that is optimized for size. In this case, no optimizations are performed on the sumOfCubes function. Compiling with the /O2 switch produces code optimized for speed. The code size will be considerably larger, but significantly faster, because the loop inside sumOfCubes has been unrolled and vectorized. It’s important to understand that vectorization would not be possible without inlining the cube function. Moreover, loop unrolling would not be as effective without inlining. A simplified graphical representation of the resulting assembly code is shown in Figure 3. The flow chart is the same for the x86 and x86-64 architectures.

Figure 3 sumOfCubes Control Flow Chart

In Figure 3, the green diamond is the entry point and the red rectangles are the exit points. The blue diamonds represent the conditions that are evaluated as part of the sumOfCubes function at run time. If SSE4 is supported by the processor and x is greater than or equal to eight, SSE4 instructions are used to perform four multiplications at a time. The process of performing the same operation on multiple values simultaneously is called vectorization. In addition, the compiler unrolls the loop twice; that is, the loop body is repeated twice in each iteration. The combined effect is that eight multiplications are performed per iteration. When x is less than eight, traditional instructions are used to execute the rest of the computation. Note that the compiler has emitted three exit points containing separate epilogues to the function rather than just one. This reduces the number of jumps.

Loop unrolling is the process of repeating the loop body within the loop so that more than one iteration of the original loop is executed within a single iteration of the unrolled loop. This improves performance because the loop control instructions execute less frequently. Perhaps more importantly, it can enable the compiler to perform many other optimizations, such as vectorization. The downside of unrolling is that it increases register pressure and code size. However, depending on the loop body, it can improve performance by a double-digit percentage.
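
For example, unrolling the loop of the sum function twice by hand would look roughly like the following sketch (the compiler performs this transformation on the generated code, not the source; sumUnrolled is a hypothetical name):

int sumUnrolled(int x) {
  int result = 0;
  int i;
  for (i = 1; i + 1 <= x; i += 2) { // two iterations of the original loop per pass
    result += i;
    result += i + 1;
  }
  if (i <= x)  // handle the leftover iteration when x is odd
    result += i;
  return result;
}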

Unlike x86 processors, all x86-64 processors support SSE2. Additionally, you can take advantage of the AVX/AVX2 instruction sets of the latest x86-64 microarchitectures from Intel and AMD by specifying the /arch switch. Specifying /arch:AVX2 also allows the compiler to use the FMA and BMI instruction sets.

Currently, the Visual C++ compiler doesn’t let you control loop unrolling. However, you can emulate the technique by using templates together with the __forceinline keyword. You can disable auto-vectorization for a particular loop by using the loop pragma with the no_vector option.
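
For example, the following sketch turns off auto-vectorization for a single loop (sumNoVector is a hypothetical name):

int sumNoVector(int x) {
  int result = 0;
#pragma loop(no_vector) // applies only to the loop that follows
  for (int i = 1; i <= x; ++i)
    result += i;
  return result;
}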

Examining the generated assembly code, astute readers will notice that the code could be optimized a bit further. However, the compiler has already done an excellent job, and it won’t spend much more time analyzing the code and applying minor optimizations.

sumOfCubes is not the only function whose loop has been unrolled. If you modify the code so that m is passed to the sum function instead of n, the compiler won’t be able to evaluate the function at compile time and therefore must emit its code. In this case, the loop is unrolled twice.

The last optimization I’ll discuss is loop-invariant code motion. Consider the following code snippet:

int sum(int x) {
  int result = 0;
  int count = 0;
  for (int i = 1; i <= x; ++i) {
    ++count;
    result += i;
  }
  printf("%d", count);
  return result;
}
The only change here is that there’s an additional variable that is incremented in each iteration and then printed. It’s easy to see that this code can be optimized by moving the increment of the count variable out of the loop: I can simply assign x to the count variable. This optimization is called loop-invariant code motion. The “loop-invariant” part indicates that this technique works only when the code doesn’t depend on any expression in the loop header.

Now here’s the kicker: if you apply this optimization manually, the resulting code might exhibit degraded performance under certain conditions. Can you see why? Consider what happens when x is not positive. The loop never executes, which means that in the unoptimized version the count variable is never touched. However, in the manually optimized version, an unnecessary assignment from x to count is executed outside the loop. Moreover, if x were negative, count would hold the wrong value. Both people and compilers are susceptible to such pitfalls. Fortunately, the Visual C++ compiler is smart enough to realize this and emits the loop condition before the assignment, improving performance for all values of x.
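
In effect, the emitted code is roughly equivalent to the following source-level sketch (sumGuarded is a hypothetical name):

#include <stdio.h> // printf.
int sumGuarded(int x) {
  int result = 0;
  int count = 0;
  if (x >= 1) {  // the loop condition is tested before the hoisted assignment
    count = x;   // hoisted: equivalent to incrementing count once per iteration
    for (int i = 1; i <= x; ++i)
      result += i;
  }
  printf("%d", count);
  return result;
}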

In short, if you’re not a compiler or expert in compiler optimizations, you should avoid doing manual transformations on your code just to make it look faster. Relax and trust the compiler to optimize your code.

Control of optimizations

In addition to the /O1, /O2, and /Ox compiler switches, you can control optimizations of specific functions using the optimize pragma, which looks like this:

#pragma optimize( "[optimization-list]", {on | off} )
The optimization list can be empty or contain one or more of the following values: g, s, t, and y. These correspond to the /Og, /Os, /Ot, and /Oy compiler switches, respectively.

An empty list with the off parameter will cause all of these optimizations to be disabled, regardless of what compiler switches have been specified. An empty list with the on parameter will cause the specified compiler switches to act.
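For example, the following sketch compiles one rarely called function for size while the rest of the translation unit follows the command-line switches (rarelyCalled is a hypothetical name):

#pragma optimize("s", on)  // favor small code for the functions that follow
int rarelyCalled(int x) {
  return x * x;
}
#pragma optimize("", on)   // restore the optimizations specified on the command line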

The /Og switch enables global optimizations, which are optimizations that can be performed by examining only the function being optimized, not any of the functions it calls. If LTCG is enabled, /Og enables WPO.

The optimize pragma is useful when you want different functions to be optimized in different ways: some for space and some for speed. However, if you want that level of control, you should consider profile-guided optimization (PGO), which is the process of optimizing your code using a profile that contains recorded behavioral information while running an instrumented version of your code. The compiler uses the profile to make better decisions about how to optimize the code. Visual Studio provides the necessary tools to apply this technique in native and managed code.

Optimizations in .NET

There is no linker involved in the .NET compilation model. However, there is a source-code compiler (the C# compiler) and a JIT compiler. The source-code compiler performs only minor optimizations; for example, it doesn’t perform function inlining or loop optimizations. Instead, these optimizations are handled by the JIT compiler. The JIT compiler that ships with all versions of the .NET Framework up to 4.5 doesn’t support SIMD instructions. However, the JIT compiler that ships with the .NET Framework 4.5.1 and later, called RyuJIT, does support SIMD.

What’s the difference between RyuJIT and Visual C++ in terms of optimization capabilities? Because it does its work at run time, RyuJIT can perform optimizations that Visual C++ can’t. For example, at run time, RyuJIT might be able to determine that the condition of an if statement is never true in this particular run of the application and optimize it away. RyuJIT can also take advantage of the capabilities of the processor it’s running on. For example, if the processor supports SSE4.1, the JIT compiler will emit SSE4.1 instructions for the sumOfCubes function, making the generated code much more compact. However, it can’t spend much time optimizing the code, because the time spent JIT compiling affects the application’s performance. On the other hand, the Visual C++ compiler can spend much more time spotting and exploiting other optimization opportunities. An exciting new technology from Microsoft, called .NET Native, lets you compile managed code into optimized self-contained executables using the Visual C++ back end. Currently, this technology supports only Windows Store apps.

Currently, the ability to control managed code optimizations is limited. The C# and Visual Basic compilers provide only the ability to turn optimizations on or off using the /optimize switch. To control JIT optimizations, you can apply the System.Runtime.CompilerServices.MethodImpl attribute to a method, specifying an option from MethodImplOptions. The NoOptimization option turns off optimizations, the NoInlining option prevents the method from being inlined, and the AggressiveInlining option (.NET 4.5) gives a recommendation (more than just a hint) to the JIT compiler to inline the method.

Summary

All of the optimization techniques described in this article can significantly improve code performance by a double-digit percentage, and all of them are supported by the Visual C++ compiler. What makes these techniques important is that, when applied, they allow the compiler to perform other optimizations. This is by no means a complete explanation of the compiler optimizations that Visual C++ performs. However, I hope I have given you an appreciation of the compiler’s capabilities. Visual C++ can do more, much more, so stay tuned for part 2.

Hadi Brais is a PhD researcher at the Indian Institute of Technology Delhi (IITD), where he researches compiler optimizations for next-generation memory technologies. He spends most of his time writing code in C, C++, and C# and digging into the CLR and CRT. He blogs at hadibrais.wordpress.com. You can contact him at hadi.b@live.com.
