Compilers

CSC 150 Chapter 10: Compilers
primary information resources: An Invitation to Computer Science (Java), Third Edition, G. Michael Schneider and Judith L. Gersting, Course Technology, 2007, and
Compilers: Principles, Techniques, and Tools, Alfred Aho, Ravi Sethi, Jeffrey Ullman, Addison Wesley, 1986.

[ previous | schedule | next ]

Overview

Compilers are language translators.

Nothing magic about a compiler -- it is a program whose input and output are also programs! The input program is written in a particular source language and the output program is the equivalent program expressed in object (assembly or machine) code.

For assembly language, we call these translators assemblers, not compilers. This is because of the one-to-one relationship between assembly language instructions and machine instructions. The translation is pretty easy, like translating between written French and Italian.

For high-level languages, the relationship between source language instructions and machine instructions is one-to-many. The translation process is not as simple, it is like translating between written French and Chinese.

Modern compilers follow a multi-phase compilation process. The major phases are:

Lexical Analysis. Scan source program to recognize tokens (syntactic units, such as keywords)
Syntax Analysis (Parsing). Determine whether tokens are arranged correctly according to language syntax.
Semantics Analysis. Determine correct program meaning.
Code Generation. Produce machine code equivalent.
Optimization. Tweak the generated code so it will run faster or use space better.

The compiler itself is structured accordingly:

These phases are carried out in the order listed.
You can think of the phases as being sequential. In reality, they are usually pipelined.
Each phase corresponds to a section of the compiler.
Some data structures, particularly symbol tables, are shared between phases.

Lexical Analysis

Every language has a set of basic syntactic (structural) units. In spoken and written languages, it is words; in mathematics, it is symbols, operators, and values.

Programming languages have syntactic units too. Java, for instance, has keywords, operators, identifiers, values, punctuation. These are called tokens.

Lexical analysis involves reading the source program a character at a time and translating it into a stream (sequence) of tokens.

Here is a numbered subset of valid Java tokens. Keywords are in bold:
1. identifier
2. number
3. if
4. else
5. for
6. class
7. public
8. private
9. int
10. (
11. )
12. {
13. }
14. +
15. *
16. =
17. ==
18. >
19. <
20. ;

This list of tokens could be stored in a table. As each input token is recognized, its table entry or even its table position is passed on to the parsing phase.

Processing the Java code

if (height > 72) {
ideal = 2*height+30;
} else {
ideal = 3*height+12;
}

Results in the token sequence:
3, 10, 1, 18, 2, 11, 12, 1, 16, 2, 15, 1, 14, 2, 20, 13, 4, 12, 1, 16, 2, 15, 1, 14, 2, 20, 13

Several things to note:

Blank spaces, line feeds, or the lack thereof, do not affect the ability to detect tokens.
More than just tokenizing happens here. Whenever an identifier (entry #1) is detected, it is added to the compiler’s symbol table if not already there. This is the table of program identifiers.
In reality, tokens may need to carry forward additional information (especially for semantic analysis). Look specifically at the number of times that 1 (identifier) and 2 (number) appear in the sequence. Later on, the compiler will need to distinguish individual identifiers and numbers. Therefore additional information such as a pointer to an identifier’s symbol table entry is carried along with the token.
Characters or sequences that do not match valid language tokens are flagged as errors. For example, if you mistype Java’s not symbol ! as @, the Java lexical analyzer will flag it as erroneous because @ is not a valid Java token. Compilers normally define special error tokens to represent these so compilation can continue after the first error is detected.

Syntax Analysis (Parsing)

Syntax refers to the arrangements, or grammar, of language tokens.

The example statement above shows correct syntax for Java (or C or C++ in this case),

if (height > 72) {
ideal = 2*height+30;
} else {
ideal = 3*height+12;
}

The same statement written in Pascal would be:

if height > 72 then begin
ideal := 2*height+30
end else begin
ideal := 3*height+12
end;

Their syntax differs in several minor respects, although both do exactly the same thing (equivalent semantics).

The analogy to natural language

The obvious analogy is to the grammar of natural languages. Words are tokens, and the syntax rules of a natural language specify which arrangements of words are grammatically correct.

Remember the Haiku project in CSC 120? Here is one of the four templates.
// Haiku template:
//    Article-adjective-noun-verb;
//    Article-adjective-noun
//    Preposition-article-adjective-noun

This specifies a syntactically correct form for the Haiku.

Although both consist of valid tokens, the phrase ”The thin man eats;” satisfies the syntax rule for the first line, whereas the phrase “Fast cats eat lunch” does not.

In general, grammar rules are specified hierarchically. Example on text page 478 shows the following rules for sentence structure:

sentence consists of subject followed by verb followed by object
subject consists of noun phrase
object consists of noun phrase
noun phrase consists of article followed by noun
article, noun, and verb are primitives (tokens)

Backus-Naur Form (BNF) Sometimes called Backus Normal Form.

This set of rules, or grammar, becomes clearer when specified using BNF:

This notation is easy to understand.

Each line is a production rule corresponding to a rule in the dotted list.
The vertical bars mean “or”, which is shorter than defining a different rule for every possibility.
The things enclosed in “<” “>” are called nonterminals. They are not themselves elements of the language but represent grammatical categories of the language. Every nonterminal must appear as the left-hand side of at least one production rule.
The words, (a, an, the, aardvark, …) are called terminals. They are elements of the language.
The symbols ::= and | are called metasymbols. They are part of the BNF notation and not the language itself.
One nonterminal is distinguished as the goal symbol. The goal symbol in this example is <sentence>. If, by applying rules, we can reduce a sentence to the symbol <sentence>, then it is a syntactically correct sentence in our language.

The parsing process for a candidate sentence proceeds as follows.
a.   convert the words into nonterminals (<article>, <noun>, or <verb>)
b.   look for a sequence which matches the right hand side of a rule. If none found, go to step e.
c.   replace the matching sequence with the symbol on the left hand side of the rule.
d.   go to step b.
e.   If the phrase consists only of the goal symbol <sentence> the sentence is in the language otherwise it is not.

Example: The aardvark abandons the carrot

1. Convert the words to a sequence of nonterminals.
result is: <article> <noun> <verb> <article> <noun>

2. reduce the first <article> <noun> to <noun phrase> by applying the fourth rule.
result is: <noun phrase> <verb> <article> <noun>

3. reduce the second <article> <noun> to <noun phrase> by applying the fourth rule.
result is: <noun phrase> <verb> <noun phrase>

4. reduce the first <noun phrase> to <subject> by applying the second rule.
result is: <subject> <verb> <noun phrase>

5. reduce the second <noun phrase> to <object> by applying the third rule.
result is: <subject> <verb> <object>

6. reduce the whole thing to <sentence> by applying the first rule.
result is: <sentence>

The series of reductions can be illustrated by drawing a parse tree. The previously-referenced textbook illustration is much better, but I will approximate it here:

      The    aardvark abandons the    carrot
       |        |         |       |        |
1. <article> <noun>    <verb> <article> <noun>
        \     /           |       \        /
2.   <noun phrase>        |        \      /
          |               |         \    /
3.        |               |     <noun phrase>
          |               |           |
4.     <subject>          |           |
           \              |           |
5.          \             |       <object>
              \           |          /
                \         |        /
                  \       |      /
                    \     |    /
                       \ | /
                         \|/
6.                    <sentence>

Example: A carrot aarvark the

1. Convert the words to nonterminals.
result is: <article> <noun> <noun> <article>

2. reduce the <article> <noun> to <noun phrase> by applying the fourth rule.
result is: <noun phrase> <noun> <article>

3. reduce the first <noun phrase> to <subject> by applying the second rule.
result is: <subject> <noun> <article>

4. Now what?? There is no rule with a matching right hand side, so we cannot reduce it further. It is not a valid sentence according to this grammar.

BNF grammars for programming languages

------------------------------------------
Behold the conceptual simplicity of Scheme syntax. Here's a grammar for a Lisp subset posted in 1994 by Barry Margolin of Thinking Machines Corp. Since Scheme is a Lisp dialect, its grammar would be similar.

Until you've seen the BNF grammar for a procedural language like Java or C, which require literally hundreds of production rules, you cannot appreciate the beautiful simplicity of this.
-------------------------------------------

Here is a grammar for some simple arithmetic expressions

<list> ::= <list> + <digit>
| <list> - <digit>
| <digit>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Example: Here is the parse tree to recognize 7+4-3-2. Each line represents a grammar substitution.

     7     +     4    -     3    -     2
     |     |     |    |     |    |     |
<digit> | <digit> | <digit> | <digit>
     |     |     |    |     |    |     |
   <list> |     |    |     |    |     |
      \    |    /    /     /    /     /
        \ | /    /     /     /     /
        <list>   /     /      /     /
            \   /   /       /     /
             <list>        /    /
                \        /    /
                  \    /   /
                    <list>

A grammar can be used either to recognize or to generate valid strings in the language. Here is the corresponding tree to generate 7+4-3-2 from the grammar. Notice that the root is now at the top. Start with the goal symbol and look for matches with the LHS of rules. When match found, substitute RHS.

                     <list>
                       |
             <list>    -    <digit>
                |              |
                |              2
       <list>   -   <digit>
          |            |
          |            3
   <list> + <digit>
     |        |
     |        4
<digit>
     |
     7

Grammars and ambiguity

The following grammar recognizes and generates the same expressions

<string> ::= <string> + <string>
| <string> - <string>
| <digit>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Example: Use this grammar to recognize the same expression 7+4-3-2:

Guess what? There's another way to recognize the same expression:

Different sequence of rule applications, same results. This grammar is ambiguous! This should be avoided if possible in grammar design. Go back to the previous grammar. Is there any other way to get 7+4-3-2?

Selecting the correct rule.

Let's extend our first grammar with two additional rules (in bold). <assign> is goal symbol.

Rule 1. <assign>   ::= <variable> = <list>
Rule 2. <list>     ::= <list> + <digit>
Rule 3.                 | <list> - <digit>
Rule 4.                 | <digit>
Rule 5. <digit>    ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Rule 6. <variable> ::= i | j | k | l | m | n | o

Example: parse the string k = 3 + 2

1. Apply rules 5 and 6 to "tokenize" all the terminals.
result is: <variable> = <digit> + <digit>

2. Apply rule 4 to the first digit.
result is: <variable> = <list> + <digit>

3. Apply rule 2.
result is: <variable> = <list>

4.Apply rule 1.
result is: <assign>

Goal achieved. Here's the parse tree.

          k      =     3     +     2
          |      |     |     |     |
     <variable> | <digit> | <digit>
           \     |     |     |    /
            \    | <list>   |   /
             \   |       \   | /
              \   \       <list>
                \ \        /
                  \ \     /
                   <assign>

Hey!! Look back at step 3. Our parse string was
<variable> = <list> + <digit>
and we applied rule 2 <list> ::= <list> + <digit>

We COULD have applied rule 1 instead,
<assign> ::= <variable> = <list>
What if we HAD applied it instead? Here's the parse tree.

          k      =     3     +     2
          |      |     |     |     |
     <variable> | <digit> | <digit>
           \     |     |     |     |
            \    | <list>   |     |
             \   |   /       |     |
              \ | /        |     |
              <assign>       ?     ?

The result would be
<assign> + <digit>
and no more rules could be applied. The result consists of more than the goal symbol, so we would erroneously reject the string.

This is a hazard to contend with: what if more than one rule applies and the wrong one is selected? The answer is this: when dead end is reached, backtrack and try a different rule. Reject only if all paths lead to dead ends.

There are in fact methods for defining grammars in such a way as to avoid ambiguity and minimize backtracking. Such topics are well beyond the scope of this course.

Moral of the story: good grammars are hard to define!

Semantics Analysis

Syntax analysis focuses on how the program is constructed on the surface.
Semantics analysis looks beneath the surface to the program’s meaning.

This is true for natural languages as well. The small natural language grammar above will accept the sentence “A carrot bit the aardvark.” However, this sentence has no real meaning.

Some semantic errors do not reveal themselves until runtime. For instance, the Java code:

int arr = new int[5];
int limit = Console.readInt(“how many values to enter?”);
for (int i=0; i < limit; i++) {
Arr[i] = Console.readInt(“enter the next value”);
}

will cause a runtime error if the value of limit is greater than 5, but the value of limit is not known until the program is run.

Other semantic errors can be detected at compile time. The foremost example is mismatched data types in expressions. Here is a Java example.

boolean b = false;
int i = 15;
char c = ‘f’;
double d;
d = b + i * c; // semantic errors here

This is syntactically correct but none of the operations in the assignment statement make sense according to Java rules.

How are semantic errors detected? What information is required to find the semantic errors in d = b + i * c;?

Think hard; remember that the semantic analyzer doesn’t see the source code itself – it only sees a stream of tokens!

Using the token table defined in the Lexical Analysis section, the above assignment is translated into the sequence: 1, 16, 1, 14, 1, 15, 1, 20.

The token sequence is sufficient for syntax analysis, but not for semantic analysis. Where does this information come from?

Look again at the example

boolean b = false;
int i = 15;
char c = ‘f’;
double d;
d = b + i * c;

Lexical analysis of this example

When lexical analyzer determines a token is an identifier (e.g. does not match any keywords), it adds the identifier into the symbol table if not already there.

After scanning the first four lines, the lexical analyzer would have this symbol table (the Type column is filled in by semantic analyzer):

Entry Identifier Type

0 b

1 i

2 c

3 d

While scanning the assignment statement, lexical analyzer will find that each identifier is already in the symbol table and will simply fetch its entry number.

The token stream for the last line then becomes 1(3), 16, 1(0), 14, 1(1), 15, 1(2), 20, where each “1” has an attached symbol table entry number in parentheses.

The attached link to symbol table is crucial for what is to come.

Syntax analysis of this example

Skipping ahead to syntax analysis (parsing) of the last line, the tokens are placed into “leaves” of the parse tree. Something like this:

1(3) 16    1(0)    14    1(1)   15   1(2)   20
|     |       \      |     \     |     /     /
\     \       \      \      \   |   /      /
\     \       \      \       \ | /       /
   \     \        \     \      temp1     /
    \     \         \    \      /      /
     \     \          \   \   /      /
      \     \           \ \ /     /
       \     \           temp2   /
         \    \          /    /
           \   \       /   /
             \ \    / /
                \\ / /
                temp3

Where temp1, temp2 and temp3 represent temporary identifiers generated to hold results of calculations. They are not stored in the symbol table.

Semantic Analysis of this example

The semantic analyzer must:

traverse the syntax tree from leaves down to the root,
create a semantic record (supplementary information) for each non-leaf (grammar non-terminal) in parse tree
use semantic records to check for semantic errors

The semantic record attached to each non-terminal in the parse tree contains information such as: identifier name, type, symbol table index.

While processing the segment of the parse tree for the first four statements, identifiers are associated with their types and this information is added to the symbol table. The resulting table is:

Entry Identifier Type

0 b boolean

1 i int

2 c char

3 d double

Semantic records are created too.

Next comes semantic analysis of the last line, the assignment. The multiplication “1(1) 15 1(2)” is done first.

To create the semantic record for “1(1)”, it looks up the symbol table entry 1 and copies the identifier (i) and data type (int) into the semantic record.
To create the semantic record for “1(2)”, it looks up the symbol table entry 2 and copies the identifier (c) and data type (char) into the semantic record.
Finally, it needs to create a semantic record for temp1, the result. It consults the semantic records for the two operands and discovers that the first is an int while the second is a char. Since Java does not allow multiplication for this combination of types, it generates an error.
Had the data types been correct, the semantic record for temp1 would have been filled in with its name (temp1) and type (int).

Code Generation

Given successful semantic analysis, the next phase involves traversing the parse tree from leaves to root once more and generating machine code appropriately along the way.

For readability, our examples will show assembly code instead of machine code.

Recall the parse tree is constructed by repeatedly applying production rules to reduce the structure representing program syntax down to a single point, the tree root which is the goal symbol.

Applying a rule usually corresponds to some action represented by machine code.

Let’s return to an earlier example, annotated with a number for each rule.

Rule 1. <assign>   ::= <variable> = <list>
Rule 2. <list>     ::= <list> + <digit>
Rule 3.                | <list> - <digit>
Rule 4.                | <digit>
Rule 5. <digit>    ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Rule 6. <variable> ::= i | j | k | l | m | n | o

The parse tree for k = 3 + 2 is

Let’s annotate the sequence of rule applications:

1a. Apply rule 6 to transform k to <variable> .
Syntax analysis: nothing special happens here.
Semantic analysis: Check symbol table to assure k has been declared. Create semantic record (k, int).
Code generation: No code needs to be generated here. k has been declared earlier in the program with int k; . The declaration itself is in a different part of the parse tree, which has already been processed through code generation. At that point of the tree, the pseudo-op
k: .DATA 0 was generated.

1b. Apply rule 5 to transform 3 to <digit> .
Syntax analysis: create an internal temporary identifier to represent the 3, call it const3.
Semantic analysis: Create semantic record with (const3, int).
Code generation: generate the pseudo-op const3: .DATA 3

1c. Apply rule 5 to transform 2 to <digit> .
Syntax analysis: create an internal temporary identifier to represent the 2, call it const2.
Semantic analysis: Create semantic record with (const2, int).
Code generation: generate the pseudo-op const2: .DATA 2

2. Apply rule 4 to the first digit.
Syntax analysis: nothing special to do.
Semantic analysis: Create semantic record with (const3, int).
Code generation: No code needs to be generated here.

3. Apply rule 2.
Syntax analysis: create internal temporary identifier to contain result of operation, call it temp1
Semantic analysis: <list> has semantic record (const3, int), <digit> has semantic record (const2, int). Both are int, so the operation is valid and produces an int result. Create semantic record with (temp1 , int).
Code generation: Generate code for storage temp1: .DATA   0
Generate code for the + operation:
   LOAD const3
   ADD   const2
   STORE temp1

4. Apply rule 1.
Syntax analysis: nothing special to do here.
Semantic analysis: <variable> has semantic record (k, int), <list> has semantic record (temp1 , int). Both are int, so the assignment is valid .
Code generation: Generate the code
LOAD temp1
STORE k

Finished. The combined generated code is:
     LOAD const3
     ADD   const2
     STORE temp1
     LOAD temp1
     STORE k
(separated from the instructions)
k:      .DATA 0
const3: .DATA 3
const2: .DATA 2
temp1: .DATA 0

Optimization

You may have noticed that the code produced in the example above can be improved to run faster without changing its result. The code was:

   LOAD   const3
   ADD    const2
   STORE temp1
   LOAD   temp1
   STORE k

k:     .DATA   0
const3 .DATA   3
const2 .DATA   2
temp1 .DATA   0

1. Notice that the result of the addition is stored from register R to temp1, then immediately the value from temp1 is loaded back into R. The LOAD operation is unnecessary, so can be eliminated. Resulting code:

   LOAD const3
   ADD   const2
   STORE temp1
   STORE k

k:     .DATA   0
const3 .DATA   3
const2 .DATA   2
temp1 .DATA   0

2. Next, notice that the result of the addition is stored from register R to temp1, then immediately stored again, from register R to k. Since the value stored in temp1 will not be used for anything, it is unnecessary and can be eliminated. Resulting code:

   LOAD const3
   ADD   const2
   STORE k

k:     .DATA   0
const3 .DATA   3
const2 .DATA   2

3. Finally, notice that both operands of the addition are constants so their sum can be calculated prior to runtime. Create a temporary const5 to hold the result. This eliminates the need to perform the addition. Resulting code:

LOAD const5
STORE k

k: .DATA 0
const5 .DATA 5

All three of these are called local optimization because they were focused on a small segment of code representing one source statement, the optimization window.

Global optimization looks at the bigger picture (e.g. more than one statement) and is more difficult to do. Most global optimization techniques focus on loops, since most execution time is spent iterating.

Example of global optimizations in Java:

int j = limit;
for (int i = 0; i < limit-2; i=i+1) {
     int extra = 3;
     list[i] = i + extra;
     item[i] = factor * (i + extra);
     j = j – 1;
     blah[i] = 4 * j - 3;
}

1. There is no need to allocate storage for extra and assign 3 to it over and over again. Pull this statement up out of the loop so it is done once. Similarly, there is no need to calculate limit-2 repeatedly, so pull this calculation out of the loop. These are code motion optimizations. Result is:

int j = limit;
int extra = 3;
int t1 = limit-2;
for (int i = 0; i < t1; i=i+1) {
     list[i] = i + extra;
     item[i] = factor * (i + extra);
     j = j – 1;
     blah[i] = 4 * j - 3;
}

2. Notice that the expression i+extra is calculated twice. It can’t be moved out of the loop because of i, but can be calculated once and assigned to a temporary. This saves one calculation, and is called common subexpression elimination. Result is:

int j = limit;
int extra = 3;
int t1 = limit-2;
int t2;
for (int i = 0; i < t1; i=i+1) {
     t2 = i + extra;
     list[i] = t2;
     item[i] = factor * t2;
     j = j – 1;
     blah[i] = 4 * j - 3;
}

3. Here’s another optimization that is not at all obvious, but can be correctly applied. Look at the last two assignments.

j = j – 1;
blah[i] = 4 * j - 3;

where j is initialized to limit. Notice that the value of j is reduced by 1 on each iteration. This means the value of 4*j-3 is reduced by 4 on each iteration. So we can substitute subtraction for multiplication. This is an example of strength reduction. Here’s the result:

int j = limit;
int extra = 3;
int t1 = limit - 2;
int t2;
int t3 = 4 * limit - 3;
for (int i = 0; i < t1; i=i+1) {
     t2 = i + extra;
     list[i] = t2;
     item[i] = factor * t2;
     j = j – 1;
     t3 = t3 - 4;
     blah[i] = t3;
}

The original loop contained 2 multiplications, 6 additions/subtractions, 6 assignments and 1 memory allocation per iteration.

The optimized loop contains 1 multiplication, 4 additions/subtractions, 7 assignments and 0 memory allocations per iteration.

The optimized code requires more space for 3 temporaries, and several additional operations just above the loop.

The higher the value of limit, the better the optimized code looks!

[ C SC 150 | Peter Sanderson | Math Sciences server | Math Sciences home page | Otterbein ]

Last updated:
Peter Sanderson (PSanderson@otterbein.edu)