kernelthread.com

A Handbook on Writing Better C Programs with GNU CC

Contents

  1. Notes On C Programming For Better Code
  2. The GNU C Compiler
  3. GCOV: A Test Coverage Program
  4. Measuring Code Performance And More

1. Notes On C Programming For Better Code

This section contains some techniques that are usually useful in producing better code while programming in the C language. These techniques help in the optimization process in the sense that they often yield better code to begin with, which is either reasonably optimal, or lends itself to further tuning more easily.

Table Lookup

Sometimes it is useful to pre-compute a table of often used functions (the trigonometric functions are examples), and use an offset into the table instead of actually computing the value of the function. This is an often used trick in graphics programs. Note that if the table happens to be rather large, then the gains will saturate at some point. It is also a possibility that one pre-computes the more often used values only and the rest of the values are computed normally - an idea that is similar to that of the cache (here we are sort of caching the values of the function).

Inlining

Though calling conventions differ across various languages, there is invariably an overhead. Functions that are small and frequently called are most prone to this overhead. Inlining such functions is usually a good idea. Inlining leads to an increase in code size, so one must be wary of that. Moreover, macros may have side effects, and the problems may be hard to trace if the macro is called with arguments such as *p++ when the corresponding argument appears more than once in the macro. Moreover, code profilers don't see macros.

Strength Reduction

This is a technique wherein a complex expression is replaced by a simpler (easier to compute) expression whose result is the same. Following are a few examples.

Costlier Expression Cheaper Expression
x = y / z + w / z x = (y + w) / z
x = pow(y, 2.0) x = y * y
x = y * 63 x = (y << 6) - y
x = y % 8 x = y & 7
for (i = 0; i < 10; i++) {

j = 10 * i;

something(j);

}

for (i = 0; i < 100; i += 10) {

/* */

something(i);

}

#define abs(x) (((x>0)?(x):-(x)) static long abs (long x) {

long t;

t = x >> 31;

return (x ^ t) - t;

}

Unrolling Loops

Consider the following loop:

for (i = 0; i < 1000; i++) { loop_code(i); }

An unrolled loop looks like the following:

for (i = 0; i <) { loop_code(i); i++; loop_code(i); i++; loop_code(i); i++; loop_code(i); i++; loop_code(i); i++; loop_code(i); i++; loop_code(i); i++; loop_code(i); i++; loop_code(i); i++; loop_code(i); i++; }

The loop will be executed only 100 times, instead of the original 1000 times. If the upper limit is not so big a number (1000 is big), then the loop can be unrolled completely, which will fully eliminate the branching and testing. Loop unrolling may not always be beneficial: some computations may converge on some result, and anyway an unrolled loop is more code - so code bloat is an undesirable side-effect to watch for.

Jam the Loops

If there are adjacent loops that loop over the same range of a single variable, then it is obvious that the loops be combined. This will cause the incrementing and testing of the loop variable to be done half as often. Of course, this is one of the most obvious 'optimizations'.

Loop Inversion

On most machines, there is an instruction for decrementing and testing against 0 (the x86 is an example of such a machine). So, instead of the following:

for (i = 0; i <= N; i++) { loop_code(i); }

it is better to use the following code:

... for (; i-- >= 0; ) { loop_code(i); }

Loop Invariant Code

That portion of a loop-code that does not use the loop variable, or has no side effects upon the same, should be moved out of the loop. Also consider something like:

... /* somewhere in a loop */ sum = struct_array[i].x + struct_array[i].y + \ struct_array[i].z;

it is usually better to use:

... p = struct_array[i]; sum = p.x + p.y + p.z;

CSE: Common Subexpression Elimination

Consider a computation that looks like the following:

p = x / y * z; q = x * w / y;

Here x / y is a common subexpression that doesn't need to be calculated twice.

Keeping Variables in Registers

Functions that take pointers to variables are arguments are potentially capable of changing the value of the variables. Then the optimizer will not leave the variable's value in a register. Thus it is usually better not to pass references to variables unless so desired.

Being Friendly with the Stack

Avoid having large arrays as local variables. Instead, have global allocation for them, if possible. The space for such large variables can come from the heap instead of the stack (there are machines on which large stacks are a problem, though it is not the case with the x86).

Consider another case:

int function_foo() { int a, b, c; do_something(a, b, c); if (condition_xyz) return another_function(); else return 1; }

Since function_foo()'s last statement is to call another_function() and function_foo() has no need of its variables after that point, it can remove its stack frame. When another_function() is done it returns directly to function_foo()'s caller. This will reduce the maximum depth of the stack, and it also saves the execution of some return-from-subroutine code as it will get executed only once instead of twice (or more, depending on how deeply the function results are passed along.)

Locality of Reference

It is common practice to exploit locality of reference while accessing, say, multi-dimensional arrays. For example, if one has to loop through the elements of a 2-dimensional array, then the inner loop should run over the rightmost index.

Branch Probabilities

When it happens that there is a high probability that a condition will be satisfied one way or the other, then one can attempt to reduce misprediction of the branch (which may be quite costly in terms of processor cycles in modern-day deeply pipelined processors). Let us consider an example:

if (condition) { case a; } else { case b; }

One might try to rewrite the above code as follows:

case b; if (condition) { undo_effects(case b); case b; }

Of course, this makes sense if it is possible to undo the effects of case b. If it is, this technique has the advantage that the jump taken case can be optimized according to the conditional probability and undo_effects(case b); case a; might be merged together to be more optimal than executing each separately.

2. The GNU C Compiler

GCC (the GNU C Compiler) is an optimizing compiler with frontends for languages like C, C++, Objective-C, Ada 9X, Fortran, Modula-3 and Pascal. GCC is eminently portable, and is available "freely", under the terms of the GPL (GNU Public License).

GCC has a large number of extensions to the usual language families (for example, it provides several language features not found in ANSI standard C), which makes it a very flexible and powerful compiler. This section discusses program optimization while using GCC.

Portions from GCC documentation follow (almost verbatim):

Options That Control Optimization

These options control various sorts of optimizations:

`-O'

`-O1'

`-O2'

`-O3'

`-O0'

`-Os'

Most flags have both positive and negative forms; the negative form of `-ffoo' would be `-fno-foo'. In the table below, only one of the forms is listed -- the one which is not the default. You can figure out the other form by either removing `no-' or adding it.

`-ffloat-store'

`-fno-default-inline'

`-fno-defer-pop'

`-fforce-mem'

`-fforce-addr'

`-fomit-frame-pointer'

`-fno-inline'

`-finline-functions'

`-fkeep-inline-functions'

`-fkeep-static-consts'

`-fno-function-cse'

`-ffast-math'

The following options control specific optimizations. The `-O2' option turns on all of these optimizations except `-funroll-loops' and `-funroll-all-loops'. On most machines, the `-O' option turns on the `-fthread-jumps' and `-fdelayed-branch' options, but specific machines may handle it differently.

One can use the following flags in the rare cases when "fine-tuning" of optimizations to be performed is desired.

`-fstrength-reduce'

`-fthread-jumps'

`-fcse-follow-jumps'

`-fcse-skip-blocks'

`-frerun-cse-after-loop'

`-frerun-loop-opt'

`-fexpensive-optimizations'

`-fdelayed-branch'

`-fschedule-insns'

`-fschedule-insns2'

`-ffunction-sections'

`-fcaller-saves'

`-funroll-loops'

`-funroll-all-loops'

`-fmove-all-movables'

`-freduce-all-givs'

`-fno-peephole'

`-fbranch-probabilities'

`-fregmove'

Intel x86 Options

`-mcpu=CPU TYPE'

`-march=CPU TYPE'

`-msoft-float'

`-mno-fp-ret-in-387'

`-mno-fancy-math-387'

`-malign-double'

`-mno-align-double'

`-mno-wide-multiply'

`-mwide-multiply'

`-mrtd'

`-mreg-alloc=REGS'

`-mregparm=NUM'

`-malign-loops=NUM'

`-malign-jumps=NUM'

`-malign-functions=NUM'

GCOV: A Test Coverage Program

What is `gcov'?

`gcov' is an application which together with GCC can be used to test code coverage in programs, which in turn is useful for making code faster and more efficient. `gcov' can be used with `gprof' to isolate the hotspots in a program. It typically gives the following performance statistics:

  1. Whether a line of code is executed or not
  2. How often is a line of code executed
  3. For how long does a section of code execute

The Cover

Note that coverage is essential for creating testsuites for programs, because it essentially determines what extent of the of the code is tested by the testsuite.

No Optimization While Testing Coverage

The code should be compiled without optimization if `gcov' is to be used.

Also, `gcov' works best when the code contains one statement per line. Complicated macros should be avoided. When `gcov' is run, it would create a logfile "foo.c.gcov" for a code file "foo.c", which contains information as to how many times each line of "foo.c" has executed.

An Example of Dynamic Analysis

A simple example of using `gcov' is given here: consider that we have a file "loop100.c" that contains the following code:

int main() { int i, j; for (i = 0; i < 100; i++) j = i; exit(0); } }

We have to compile code using two specific GCC options for using `gcov': `-fprofile-arcs' and `-ftest-coverage'. Thus,

% gcc -fprofile-arcs -ftest-coverage loop100.c

This produces files loop100.bb and loop100.bbg. Then we run the executable produced.

% ./a.out

This creates a loop100.da file. Now we run `gcov'.

% gcov ./loop100.c 80.00% of 5 source lines executed in file loop100.c Creating loop100.c.gcov.

The contents of loop100.c.gcov are as follows:

int main() { 1 int i, j; 101 for (i = 0; i < 100; i++) 100 j = i; 1 exit(0); ###### }

Next, we run `gcov' with the `-b' option, thus:

# gcov -b ./loop100.c 80.00% of 5 source lines executed in file loop100.c 100.00% of 3 branches executed in file loop100.c 100.00% of 3 branches taken at least once in file loop100.c 66.67% of 3 calls executed in file loop100.c Creating loop100.c.gcov.

Now loop100.c.gcov contains:

int main() { 1 int i, j; 101 for (i = 0; i < 100; i++) branch 0 taken = 99% branch 1 taken = 100% branch 2 taken = 100% 100 j = i; 1 exit(0); call 0 returns = 0% ###### } call 0 never executed call 1 returns = 100%

For each basic block, a line is printed after the last line of the basic block describing the branch or call that ends the basic block. There can be multiple branches and calls listed for a single source line if there are multiple basic blocks that end on that line. In this case, the branches and calls are each given a number. There is no simple way to map these branches and calls back to source constructs. In general, though, the lowest numbered branch or call will correspond to the leftmost construct on the source line.

For a branch, if it was executed at least once, then a percentage indicating the number of times the branch was taken divided by the number of times the branch was executed will be printed. Otherwise, the message "never executed" is printed.

For a call, if it was executed at least once, then a percentage indicating the number of times the call returned divided by the number of times the call was executed will be printed. This will usually be 100%, but may be less for functions call exit or longjmp, and thus may not return everytime they are called.

These execution counts can be accumulated over multiple runs of the program, which might be desirable as long-term information.

4. Measuring Code Performance And More

The Intel Architecture Time-Stamp Counter

With the Pentium processor onwards, the Intel Architecture defines (and implements) a time-stamp counter mechanism that can be used to monitor and identify the relative time of occurrence of processor events. The time-stamp counter architecture includes an instruction for reading the time-stamp counter (RDTSC), a feature bit (TSC flag) that can be read with the CPUID instruction, a time-stamp counter disable bit (TSD flag) in control register CR4, and a model-specific time-stamp counter.

The time-stamp counter is present in the Intel Pentium family as well as the P6 family of processors (which includes the Pentium Pro and the Pentium II). The counter is a 64-bit counter that is set to 0 when the processor undergoes a hardware reset. Thereafter the counter is incremented every clock cycle, even when the processor is halted by the HLT instruction, or the STPCLK# pin.

We use the RDTSC instruction to read the time-stamp counter and such a read is guaranteed to return a monotonically increasing unique value whenever executed, except for 64-bit counter wraparound. According to Intel documentation, the architecture guarantees that the time-stamp counter frequency and configuration will be such that it will not wraparound within 10 years after being reset to 0 (for example, this wraparound period would be several thousands of years in the current generation Intel processors).

Normally, the RDTSC instruction can be executed by programs and procedures running at an privilege level and in virtual-8086 mode, but the TSD flag in CR4 can be set so that this instruction can be used only by programs and procedures executing at privilege level 0 (this might be desirable in the case of a multi-user operating system, where the instruction is emulated through a user-accessible API).

Thus, the Time Stamp Counter has the following features:

  1. It is a dedicated and free-running 64-bit counter.
  2. It is present on chip.
  3. It increments every clock cycle (currently).
  4. Monotonically increasing values are reported by the RDTSC instruction.
  5. Bit 2 in CR4 is the TSD (Time Stamp Disable) bit.
  6. Programs with privilege level 0 may read the counter using the RDMSR instruction, and also reset/preset the counter using the WRMSR instruction. Note though that Intel warns that this is not an architectrural feature, and may not be supported on future processors.
  7. Upon being reset, the counter is cleared.

An API For Maintaining Event Counters In Linux

Documentation for the P5MSR kernel module and the P5MSRAPI programming interface is present in the software distribution. A similar interface for FreeBSD already exists.

Amit