A Handbook on Writing Better C Programs with GNU CC
Contents
- Notes On C Programming For Better Code
- The GNU C Compiler
- GCOV: A Test Coverage Program
- Measuring Code Performance And More
1. Notes On C Programming For Better Code
This section contains some techniques that are usually useful in producing better code while programming in the C language. These techniques help in the optimization process in the sense that they often yield better code to begin with, which is either reasonably optimal, or lends itself to further tuning more easily.
Table Lookup
Sometimes it is useful to pre-compute a table of often used functions (the trigonometric functions are examples), and use an offset into the table instead of actually computing the value of the function. This is an often used trick in graphics programs. Note that if the table happens to be rather large, then the gains will saturate at some point. It is also a possibility that one pre-computes the more often used values only and the rest of the values are computed normally - an idea that is similar to that of the cache (here we are sort of caching the values of the function).
Inlining
Though calling conventions differ across various languages,
there is invariably an overhead. Functions that are small and frequently
called are most prone to this overhead. Inlining such functions is usually
a good idea. Inlining leads to an increase in code size, so one must be wary
of that. Moreover, macros may have side effects, and the problems may be
hard to trace if the macro is called with arguments such as *p++ when
the corresponding argument appears more than once in the macro. Moreover,
code profilers don't see macros.
Strength Reduction
This is a technique wherein a complex expression is replaced by a simpler (easier to compute) expression whose result is the same. Following are a few examples.
| Costlier Expression | Cheaper Expression |
|---|---|
x = y / z + w / z |
x = (y + w) / z |
x = pow(y, 2.0) |
x = y * y |
x = y * 63 |
x = (y << 6) - y |
x = y % 8 |
x = y & 7 |
for (i = 0; i < 10; i++) {
|
for (i = 0; i < 100; i += 10) {
|
#define abs(x) (((x>0)?(x):-(x)) |
static long abs (long x) {
|
Unrolling Loops
Consider the following loop:
for (i = 0; i < 1000; i++) { loop_code(i); }
An unrolled loop looks like the following:
for (i = 0; i <) { loop_code(i); i++; loop_code(i);
i++; loop_code(i); i++; loop_code(i); i++; loop_code(i); i++; loop_code(i);
i++; loop_code(i); i++; loop_code(i); i++; loop_code(i); i++; loop_code(i);
i++; }
The loop will be executed only 100 times, instead of the original 1000 times. If the upper limit is not so big a number (1000 is big), then the loop can be unrolled completely, which will fully eliminate the branching and testing. Loop unrolling may not always be beneficial: some computations may converge on some result, and anyway an unrolled loop is more code - so code bloat is an undesirable side-effect to watch for.
Jam the Loops
If there are adjacent loops that loop over the same range of a single variable, then it is obvious that the loops be combined. This will cause the incrementing and testing of the loop variable to be done half as often. Of course, this is one of the most obvious 'optimizations'.
Loop Inversion
On most machines, there is an instruction for decrementing and testing against 0 (the x86 is an example of such a machine). So, instead of the following:
for (i = 0; i <=
N; i++) { loop_code(i); }
it is better to use the following code:
... for (; i-- >= 0; ) { loop_code(i); }
Loop Invariant Code
That portion of a loop-code that does not use the loop variable, or has no side effects upon the same, should be moved out of the loop. Also consider something like:
... /* somewhere in
a loop */ sum = struct_array[i].x + struct_array[i].y + \ struct_array[i].z;
it is usually better to use:
... p = struct_array[i];
sum = p.x + p.y + p.z;
CSE: Common Subexpression Elimination
Consider a computation that looks like the following:
p = x / y * z; q = x * w / y;
Here x / y is a common subexpression that doesn't need to be calculated twice.
Keeping Variables in Registers
Functions that take pointers to variables are arguments are potentially capable of changing the value of the variables. Then the optimizer will not leave the variable's value in a register. Thus it is usually better not to pass references to variables unless so desired.
Being Friendly with the Stack
Avoid having large arrays as local variables. Instead, have global allocation for them, if possible. The space for such large variables can come from the heap instead of the stack (there are machines on which large stacks are a problem, though it is not the case with the x86).
Consider another case:
int function_foo()
{ int a, b, c; do_something(a, b, c); if (condition_xyz) return another_function();
else return 1; }
Since function_foo()'s last statement is to
call another_function() and function_foo() has
no need of its variables after that point, it can remove its stack frame.
When another_function() is done it returns directly to function_foo()'s
caller. This will reduce the maximum depth of the stack, and it also saves
the execution of some return-from-subroutine code as it will get
executed only once instead of twice (or more, depending on how deeply the
function results are passed along.)
Locality of Reference
It is common practice to exploit locality of reference while accessing, say, multi-dimensional arrays. For example, if one has to loop through the elements of a 2-dimensional array, then the inner loop should run over the rightmost index.
Branch Probabilities
When it happens that there is a high probability that a condition will be satisfied one way or the other, then one can attempt to reduce misprediction of the branch (which may be quite costly in terms of processor cycles in modern-day deeply pipelined processors). Let us consider an example:
if (condition) { case
a; } else { case b; }
One might try to rewrite the above code as follows:
case b; if (condition)
{ undo_effects(case b); case b; }
Of course, this makes sense if it is possible to undo the
effects of case b. If it is, this technique has the advantage
that the jump taken case can be optimized according to the conditional probability
and undo_effects(case b); case a; might be merged together to
be more optimal than executing each separately.
2. The GNU C Compiler
GCC (the GNU C Compiler) is an optimizing compiler with frontends for languages like C, C++, Objective-C, Ada 9X, Fortran, Modula-3 and Pascal. GCC is eminently portable, and is available "freely", under the terms of the GPL (GNU Public License).
GCC has a large number of extensions to the usual language families (for example, it provides several language features not found in ANSI standard C), which makes it a very flexible and powerful compiler. This section discusses program optimization while using GCC.
Portions from GCC documentation follow (almost verbatim):
Options That Control Optimization
These options control various sorts of optimizations:
`-O'
`-O1'
Optimize. Optimizing compilation takes somewhat more time, and a lot more memory for a large function.
Without `-O', the compiler's goal is to reduce the cost
of compilation and to make debugging produce the expected results. Statements
are independent: if you stop the program with a breakpoint between statements,
you can then assign a new value to any variable or change the program counter
to any other statement in the function and get exactly the results you
would expect from the source code.
Without `-O', the compiler only allocates variables declared `register'
in registers.
With `-O', the compiler tries to reduce code size and execution time.
When you specify `-O', the compiler turns on `-fthread-jumps' and `-fdefer-pop' on all machines. The compiler turns on `-fdelayed-branch' on machines that have delay slots, and `-fomit-frame-pointer' on machines that can support debugging even without a frame pointer. On some machines the compiler also turns on other flags.
`-O2'
Optimize even more. GNU CC performs nearly all supported optimizations that do not involve a space-speed tradeoff. The compiler does not perform loop unrolling or function inlining when you specify `-O2'. As compared to `-O', this option increases both compilation time and the performance of the generated code.
`-O2' turns on all optional optimizations except for loop unrolling and function inlining. It also turns on the `-fforce-mem' option on all machines and frame pointer elimination on machines where doing so does not interfere with debugging.
`-O3'
Optimize yet more. `-O3' turns on all optimizations specified by `-O2' and also turns on the `inline-functions' option.
`-O0'
Do not optimize.
`-Os'
Optimize for size. `-Os' enables all `-O2' optimizations that do not typically increase code size. It also performs further optimizations designed to reduce code size.
If you use multiple `-O' options, with or without level numbers, the last such option is the one that is effective.
Options of the form `-fFLAG' specify machine-independent flags.
Most flags have both positive and negative forms; the negative form of `-ffoo' would be `-fno-foo'. In the table below, only one of the forms is listed -- the one which is not the default. You can figure out the other form by either removing `no-' or adding it.
`-ffloat-store'
Do not store floating point variables in registers, and inhibit other options that might change whether a floating point value is taken from a register or memory.
This option prevents undesirable excess precision on machines where the floating registers keep more precision than a `double' is supposed to have (for example, the x86 architecture).
`-fno-default-inline'
Do not make member functions inline by default merely because they are defined inside the class scope (C++ only). Otherwise, when you specify `-O', member functions defined inside class scope are compiled inline by default; i.e., you don't need to add `inline' in front of the member function name.
`-fno-defer-pop'
Always pop the arguments to each function call as soon as that function returns. For machines which must pop arguments after a function call, the compiler normally lets arguments accumulate on the stack for several function calls and pops them all at once.
`-fforce-mem'
Force memory operands to be copied into registers before doing arithmetic on them. This produces better code by making all memory references potential common subexpressions. When they are not common subexpressions, instruction combination should eliminate the separate register-load. The `-O2' option turns on this option.
`-fforce-addr'
Force memory address constants to be copied into registers before doing arithmetic on them. This may produce better code just as `-fforce-mem' may.
`-fomit-frame-pointer'
Don't keep the frame pointer in a register for functions that don't need one. This avoids the instructions to save, set up and restore frame pointers; it also makes an extra register available in many functions.
`-fno-inline'
Don't pay attention to the `inline' keyword. Normally this option is used to keep the compiler from expanding any functions inline. Note that if you are not optimizing, no functions can be expanded inline.
`-finline-functions'
Integrate all simple functions into their callers. The compiler heuristically decides which functions are simple enough to be worth integrating in this way.
If all calls to a given function are integrated, and the function is declared `static', then the function is normally not output as assembler code in its own right.
`-fkeep-inline-functions'
Even if all calls to a given function are integrated, and the function is declared `static', nevertheless output a separate run-time callable version of the function. This switch does not affect `extern inline' functions.
`-fkeep-static-consts'
Emit variables declared `static const' when optimization isn't turned on, even if the variables aren't referenced.
GNU CC enables this option by default. If you want to force the compiler to check if the variable was referenced, regardless of whether or not optimization is turned on, use the `-fno-keep-static-consts' option.
`-fno-function-cse'
Do not put function addresses in registers; make each instruction that calls a constant function contain the function's address explicitly.
This option results in less efficient code, but some strange hacks that alter the assembler output may be confused by the optimizations performed when this option is not used.
`-ffast-math'
This option allows GCC to violate some ANSI or IEEE rules and/or specifications in the interest of optimizing code for speed. For example, it allows the compiler to assume arguments to the `sqrt' function are non-negative numbers and that no floating-point values are NaNs.
This option should never be turned on by any `-O' option since it can result in incorrect output for programs which depend on an exact implementation of IEEE or ANSI rules/specifications for math functions.
The following options control specific optimizations. The `-O2' option turns on all of these optimizations except `-funroll-loops' and `-funroll-all-loops'. On most machines, the `-O' option turns on the `-fthread-jumps' and `-fdelayed-branch' options, but specific machines may handle it differently.
One can use the following flags in the rare cases when "fine-tuning" of optimizations to be performed is desired.
`-fstrength-reduce'
Perform the optimizations of loop strength reduction and elimination of iteration variables.
`-fthread-jumps'
Perform optimizations where we check to see if a jump branches to a location where another comparison subsumed by the first is found. If so, the first branch is redirected to either the destination of the second branch or a point immediately following it, depending on whether the condition is known to be true or false.
`-fcse-follow-jumps'
In common subexpression elimination, scan through jump instructions when the target of the jump is not reached by any other path. For example, when CSE encounters an `if' statement with an `else' clause, CSE will follow the jump when the condition tested is false.
`-fcse-skip-blocks'
This is similar to `-fcse-follow-jumps', but causes CSE to follow jumps which conditionally skip over blocks. When CSE encounters a simple `if' statement with no else clause, `-fcse-skip-blocks' causes CSE to follow the jump around the body of the `if'.
`-frerun-cse-after-loop'
Re-run common subexpression elimination after loop optimizations has been performed.
`-frerun-loop-opt'
Run the loop optimizer twice.
`-fexpensive-optimizations'
Perform a number of minor optimizations that are relatively expensive.
`-fdelayed-branch'
If supported for the target machine, attempt to reorder instructions to exploit instruction slots available after delayed branch instructions.
`-fschedule-insns'
If supported for the target machine, attempt to reorder instructions to eliminate execution stalls due to required data being unavailable. This helps machines that have slow floating point or memory load instructions by allowing other instructions to be issued until the result of the load or floating point instruction is required.
`-fschedule-insns2'
Similar to `-fschedule-insns', but requests an additional pass of instruction scheduling after register allocation has been done. This is especially useful on machines with a relatively small number of registers and where memory load instructions take more than one cycle.
`-ffunction-sections'
Place each function into its own section in the output file if the target supports arbitrary sections. The function's name determines the section's name in the output file.
Use this option on systems where the linker can perform optimizations to improve locality of reference in the instruction space.
One should only use this option when there are significant benefits from doing so. When we specify this option, the assembler and linker will create larger object and executable files and will also be slower. We will not be able to use `gprof' on all systems if we specify this option and we may have problems with debugging if we specify both this option and `-g'.
`-fcaller-saves'
Enable values to be allocated in registers that will be clobbered by function calls, by emitting extra instructions to save and restore the registers around such calls. Such allocation is done only when it seems to result in better code than would otherwise be produced.
This option is enabled by default on certain machines, usually those which have no call-preserved registers to use instead.
`-funroll-loops'
Perform the optimization of loop unrolling. This is only done for loops whose number of iterations can be determined at compile time or run time. `-funroll-loop' implies both `-fstrength-reduce' and `-frerun-cse-after-loop'.
`-funroll-all-loops'
Perform the optimization of loop unrolling. This is done for all loops and usually makes programs run more slowly. `-funroll-all-loops' implies `-fstrength-reduce' as well as `-frerun-cse-after-loop'.
`-fmove-all-movables'
Forces all invariant computations in loops to be moved outside the loop.
`-freduce-all-givs'
Forces all general-induction variables in loops to be strength-reduced.
These options may generate better or worse code; results are highly dependent on the structure of loops within the source code.
These two options are intended to be removed someday, once they have helped determine the efficacy of various approaches to improving loop optimizations.
`-fno-peephole'
Disable any machine-specific peephole optimizations.
`-fbranch-probabilities'
After running a program compiled with `-fprofile-arcs' one can compile it a second time using `-fbranch-probabilities', to improve optimizations based on guessing the path a branch might take.
With `-fbranch-probabilities', GNU CC puts a `REG_EXEC_COUNT' note on the first instruction of each basic block, and a `REG_BR_PROB' note on each `JUMP_INSN' and `CALL_INSN'. These can be used to improve optimization.
`-fregmove'
Some machines only support 2 operands per instruction. On such machines, GNU CC might have to do extra copies. The `-fregmove' option overrides the default for the machine to do the copy before register allocation.
Intel x86 Options
These `-m' options are defined for the ix86 family of computers:
`-mcpu=CPU TYPE'
Assume the defaults for the machine type CPU TYPE when scheduling instructions. The choices for CPU TYPE are: `i386', `i486', `i586' (`pentium'), `pentium', `i686' (`pentiumpro') and `pentiumpro'. While picking a specific CPU TYPE will schedule things appropriately for that particular chip, the compiler will not generate any code that does not run on the i386 without the `-march=CPU TYPE' option being used.
`-march=CPU TYPE'
Generate instructions for the machine type CPU TYPE. The choices for CPU TYPE are: `i386', `i486', `pentium', and `pentiumpro'. Specifying `-march=CPU TYPE' implies `-mcpu=CPU TYPE'.
`-m386'
`-m486'
`-mpentium'
`-mpentiumpro'
Synonyms for -mcpu=i386, -mcpu=i486, -mcpu=pentium, and -mcpu=pentiumpro respectively.
`-mieee-fp'
`-mno-ieee-fp'
Control whether or not the compiler uses IEEE floating point comparisons. These handle correctly the case where the result of a comparison is unordered.
`-msoft-float'
Generate output containing library calls for floating point. Note that these libraries are not a part of the GNU C compiler.
Normally the facilities of the machine's usual C compiler are used, but this can't be done directly in cross-compilation. You must make your own arrangements to provide suitable library functions for cross-compilation.
On machines where a function returns floating point results in the 80387 register stack, some floating point opcodes may be emitted even if `-msoft-float' is used.
`-mno-fp-ret-in-387'
Do not use the FPU registers for return values of functions.
The usual calling convention has functions return values of types `float' and `double' in an FPU register, even if there is no FPU. The idea is that the operating system should emulate an FPU.
The option `-mno-fp-ret-in-387' causes such values to be returned in ordinary CPU registers instead.
`-mno-fancy-math-387'
Some 387 emulators do not support the `sin', `cos' and `sqrt' instructions for the 387. Specify this option to avoid generating those instructions. This option is the default on FreeBSD.
`-malign-double'
`-mno-align-double'
Control whether GNU CC aligns `double', `long double', and `long long' variables on a two word boundary or a one word boundary. Aligning `double' variables on a two word boundary will produce code that runs somewhat faster on a `Pentium' at the expense of more memory.
`-mno-wide-multiply'
`-mwide-multiply'
Control whether GNU CC uses the `mul' and `imul' that produce 64 bit results in `eax:edx' from 32 bit operands to do `long long' multiplies and 32-bit division by constants.
`-mrtd'
Use a different function-calling convention, in which functions that take a fixed number of arguments return with the `ret' NUM instruction, which pops their arguments while returning. This saves one instruction in the caller since there is no need to pop the arguments there.
You can specify that an individual function is called with this calling sequence with the function attribute `stdcall'. You can also override the `-mrtd' option by using the function attribute `cdecl'.
Warning: this calling convention is incompatible with the one normally used on Unix, so you cannot use it if you need to call libraries compiled with the Unix compiler.
Also, you must provide function prototypes for all functions that take variable numbers of arguments (including `printf'); otherwise incorrect code will be generated for calls to those functions.
In addition, seriously incorrect code will result if you call a function with too many arguments. (Normally, extra arguments are harmlessly ignored.)
`-mreg-alloc=REGS'
Control the default allocation order of integer registers. The string REGS is a series of letters specifying a register. The supported letters are: `a' allocate EAX; `b' allocate EBX; `c' allocate ECX; `d' allocate EDX; `S' allocate ESI; `D' allocate EDI; `B' allocate EBP.
`-mregparm=NUM'
Control how many registers are used to pass integer arguments. By default, no registers are used to pass arguments, and at most 3 registers can be used. You can control this behavior for a specific function by using the function attribute `regparm'.
`-malign-loops=NUM'
Align loops to a 2 raised to a NUM byte boundary. If `-malign-loops' is not specified, the default is 2.
`-malign-jumps=NUM'
Align instructions that are only jumped to to a 2 raised to a NUM byte boundary. If `-malign-jumps' is not specified, the default is 2 if optimizing for a 386, and 4 if optimizing for a 486.
`-malign-functions=NUM'
Align the start of functions to a 2 raised to NUM byte boundary. If `-malign-functions' is not specified, the default is 2 if optimizing for a 386, and 4 if optimizing for a 486.
GCOV: A Test Coverage Program
What is `gcov'?
`gcov' is an application which together with GCC can be used to test code coverage in programs, which in turn is useful for making code faster and more efficient. `gcov' can be used with `gprof' to isolate the hotspots in a program. It typically gives the following performance statistics:
- Whether a line of code is executed or not
- How often is a line of code executed
- For how long does a section of code execute
The Cover
Note that coverage is essential for creating testsuites for programs, because it essentially determines what extent of the of the code is tested by the testsuite.
No Optimization While Testing Coverage
The code should be compiled without optimization if `gcov' is to be used.
Also, `gcov' works best when the code contains one statement per line. Complicated macros should be avoided. When `gcov' is run, it would create a logfile "foo.c.gcov" for a code file "foo.c", which contains information as to how many times each line of "foo.c" has executed.
An Example of Dynamic Analysis
A simple example of using `gcov' is given here: consider that we have a file "loop100.c" that contains the following code:
int main() { int i,
j; for (i = 0; i < 100; i++) j = i; exit(0); } }
We have to compile code using two specific GCC options for using `gcov': `-fprofile-arcs' and `-ftest-coverage'. Thus,
% gcc -fprofile-arcs
-ftest-coverage loop100.c
This produces files loop100.bb and loop100.bbg.
Then we run the executable produced.
% ./a.out
This creates a loop100.da file. Now we run `gcov'.
% gcov ./loop100.c 80.00%
of 5 source lines executed in file loop100.c Creating loop100.c.gcov.
The contents of loop100.c.gcov are as follows:
int main() { 1 int i,
j; 101 for (i = 0; i < 100; i++) 100 j = i; 1 exit(0); ###### }
Next, we run `gcov' with the `-b' option, thus:
# gcov -b ./loop100.c
80.00% of 5 source lines executed in file loop100.c 100.00% of 3 branches
executed in file loop100.c 100.00% of 3 branches taken at least once in
file loop100.c 66.67% of 3 calls executed in file loop100.c Creating loop100.c.gcov.
Now loop100.c.gcov contains:
int main() { 1 int i,
j; 101 for (i = 0; i < 100; i++) branch 0 taken = 99% branch 1 taken
= 100% branch 2 taken = 100% 100 j = i; 1 exit(0); call 0 returns = 0%
###### } call 0 never executed call 1 returns = 100%
For each basic block, a line is printed after the last line of the basic block describing the branch or call that ends the basic block. There can be multiple branches and calls listed for a single source line if there are multiple basic blocks that end on that line. In this case, the branches and calls are each given a number. There is no simple way to map these branches and calls back to source constructs. In general, though, the lowest numbered branch or call will correspond to the leftmost construct on the source line.
For a branch, if it was executed at least once, then a percentage indicating the number of times the branch was taken divided by the number of times the branch was executed will be printed. Otherwise, the message "never executed" is printed.
For a call, if it was executed at least once, then a percentage
indicating the number of times the call returned divided by the number of
times the call was executed will be printed. This will usually be 100%, but
may be less for functions call exit or longjmp,
and thus may not return everytime they are called.
These execution counts can be accumulated over multiple runs of the program, which might be desirable as long-term information.
4. Measuring Code Performance And More
The Intel Architecture Time-Stamp Counter
With the Pentium processor onwards, the Intel Architecture defines (and implements) a time-stamp counter mechanism that can be used to monitor and identify the relative time of occurrence of processor events. The time-stamp counter architecture includes an instruction for reading the time-stamp counter (RDTSC), a feature bit (TSC flag) that can be read with the CPUID instruction, a time-stamp counter disable bit (TSD flag) in control register CR4, and a model-specific time-stamp counter.
The time-stamp counter is present in the Intel Pentium family as well as the P6 family of processors (which includes the Pentium Pro and the Pentium II). The counter is a 64-bit counter that is set to 0 when the processor undergoes a hardware reset. Thereafter the counter is incremented every clock cycle, even when the processor is halted by the HLT instruction, or the STPCLK# pin.
We use the RDTSC instruction to read the time-stamp counter and such a read is guaranteed to return a monotonically increasing unique value whenever executed, except for 64-bit counter wraparound. According to Intel documentation, the architecture guarantees that the time-stamp counter frequency and configuration will be such that it will not wraparound within 10 years after being reset to 0 (for example, this wraparound period would be several thousands of years in the current generation Intel processors).
Normally, the RDTSC instruction can be executed by programs and procedures running at an privilege level and in virtual-8086 mode, but the TSD flag in CR4 can be set so that this instruction can be used only by programs and procedures executing at privilege level 0 (this might be desirable in the case of a multi-user operating system, where the instruction is emulated through a user-accessible API).
Thus, the Time Stamp Counter has the following features:
- It is a dedicated and free-running 64-bit counter.
- It is present on chip.
- It increments every clock cycle (currently).
- Monotonically increasing values are reported by the RDTSC instruction.
- Bit 2 in CR4 is the TSD (Time Stamp Disable) bit.
- Programs with privilege level 0 may read the counter using the RDMSR instruction, and also reset/preset the counter using the WRMSR instruction. Note though that Intel warns that this is not an architectrural feature, and may not be supported on future processors.
- Upon being reset, the counter is cleared.
An API For Maintaining Event Counters In Linux
Documentation for the P5MSR kernel module and the P5MSRAPI programming interface is present in the software distribution. A similar interface for FreeBSD already exists.
Amit