CSC 150 Chapter 10: Compilers
primary information resources:
An Invitation to Computer Science (Java), Third Edition,
G. Michael Schneider and Judith L. Gersting,
Course Technology, 2007, and
Compilers: Principles, Techniques, and Tools,
Alfred Aho, Ravi Sethi, Jeffrey Ullman,
Addison Wesley, 1986.
Compilers are language translators.
Nothing magic about a compiler -- it is a program whose input and output are also programs! The input program is written in a particular source language and the output program is the equivalent program expressed in object (assembly or machine) code.
For assembly language, we call these translators assemblers, not compilers. This is because of the one-to-one relationship between assembly language instructions and machine instructions. The translation is pretty easy, like translating between written French and Italian.
For high-level languages, the relationship between source language instructions and machine instructions is one-to-many. The translation process is not as simple, it is like translating between written French and Chinese.
Modern compilers follow a multi-phase compilation process. The major phases are lexical analysis, syntax analysis (parsing), semantic analysis, code generation, and optimization. The compiler itself is structured accordingly, with a module for each phase.
Every language has a set of basic syntactic (structural) units. In spoken and written languages, it is words; in mathematics, it is symbols, operators, and values.
Programming languages have syntactic units too. Java, for instance, has keywords, operators, identifiers, values, punctuation. These are called tokens.
Lexical analysis involves reading the source program a character at a time and translating it into a stream (sequence) of tokens.
Here is a numbered subset of valid Java tokens (entries 3 through 9 are keywords):
1. identifier
2. number
3. if
4. else
5. for
6. class
7. public
8. private
9. int
10. (
11. )
12. {
13. }
14. +
15. *
16. =
17. ==
18. >
19. <
20. ;
This list of tokens could be stored in a table. As each input token is recognized, its table entry or even its table position is passed on to the parsing phase.
Processing the Java code
if (height > 72) {
ideal = 2*height+30;
} else {
ideal = 3*height+12;
}
Results in the token sequence:
3, 10, 1, 18, 2, 11, 12, 1, 16, 2, 15, 1, 14, 2, 20, 13, 4, 12, 1, 16, 2, 15, 1, 14, 2, 20, 13
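The scanning process just described can be sketched in Java. This is a toy sketch, not the textbook's code: it assumes words, numbers and symbols are cleanly separated as in the example, and it uses the token numbers from the table above.

```java
import java.util.ArrayList;
import java.util.List;

public class Lexer {
    // Token numbers follow the table above: identifier=1, number=2,
    // if=3, else=4, ..., ;=20.
    static int tokenNumber(String lexeme) {
        switch (lexeme) {
            case "if":      return 3;
            case "else":    return 4;
            case "for":     return 5;
            case "class":   return 6;
            case "public":  return 7;
            case "private": return 8;
            case "int":     return 9;
            case "(":  return 10;  case ")":  return 11;
            case "{":  return 12;  case "}":  return 13;
            case "+":  return 14;  case "*":  return 15;
            case "=":  return 16;  case "==": return 17;
            case ">":  return 18;  case "<":  return 19;
            case ";":  return 20;
            default:
                // anything starting with a digit is a number; otherwise identifier
                return Character.isDigit(lexeme.charAt(0)) ? 2 : 1;
        }
    }

    // Read the source a character at a time and emit the token stream.
    static List<Integer> tokenize(String source) {
        List<Integer> tokens = new ArrayList<>();
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            int start = i;
            if (Character.isLetterOrDigit(c)) {
                // scan a whole word or number
                while (i < source.length()
                       && Character.isLetterOrDigit(source.charAt(i))) i++;
            } else if (c == '=' && i + 1 < source.length()
                       && source.charAt(i + 1) == '=') {
                i += 2;  // the two-character token ==
            } else {
                i++;     // a single-character token
            }
            tokens.add(tokenNumber(source.substring(start, i)));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String src = "if (height > 72) { ideal = 2*height+30; } "
                   + "else { ideal = 3*height+12; }";
        // prints [3, 10, 1, 18, 2, 11, 12, 1, 16, 2, 15, 1, 14, 2, 20, 13,
        //         4, 12, 1, 16, 2, 15, 1, 14, 2, 20, 13]
        System.out.println(tokenize(src));
    }
}
```

Running it on the if/else statement above reproduces exactly the token sequence shown.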
Note that the token numbers alone do not record which identifier or which numeric value was actually seen; that information is kept in separate tables, as we will see when we discuss semantic analysis.
Syntax refers to the arrangements, or grammar, of language tokens.
The example statement above shows correct syntax for Java (and for C and C++, in this case):
if (height > 72) {
ideal = 2*height+30;
} else {
ideal = 3*height+12;
}
The same statement written in Pascal would be:
if height > 72 then begin
ideal := 2*height+30
end else begin
ideal := 3*height+12
end;
Their syntax differs in several minor respects, although both do exactly
the same thing (equivalent semantics).
The analogy to natural language
The obvious analogy is to the grammar of natural languages. Words are tokens, and the syntax rules of a natural language specify which arrangements of words are grammatically correct.
Remember the Haiku project in CSC 120? Here is one of the four
templates.
// Haiku template:
// Article-adjective-noun-verb;
// Article-adjective-noun
// Preposition-article-adjective-noun
This specifies a syntactically correct form for the Haiku.
Although both consist of valid tokens, the phrase "The thin man eats;" satisfies the syntax rule for the first line, whereas the phrase "Fast cats eat lunch" does not.
In general, grammar rules are specified hierarchically. The example on text page 478 gives the following rules for sentence structure.
This set of rules, or grammar, becomes clearer when specified using BNF:
<sentence> ::= <subject> <verb> <object>
<subject> ::= <noun phrase>
<object> ::= <noun phrase>
<noun phrase> ::= <article> <noun>
<article> ::= a | an | the
<noun> ::= aardvark | carrot | dog | man
<verb> ::= abandons | abashes | bit
This notation is easy to understand: "::=" means "is defined as", "|" separates alternatives, names in angle brackets are nonterminals, and everything else is a terminal symbol.
Example: The aardvark abandons the carrot
1. Convert the words to a sequence of nonterminals.
result is: <article> <noun> <verb> <article> <noun>
2. reduce the first <article> <noun> to <noun phrase> by applying the fourth rule.
result is: <noun phrase> <verb> <article> <noun>
3. reduce the second <article> <noun> to <noun phrase> by applying the fourth rule.
result is: <noun phrase> <verb> <noun phrase>
4. reduce the first <noun phrase> to <subject> by applying the second rule.
result is: <subject> <verb> <noun phrase>
5. reduce the second <noun phrase> to <object> by applying the third rule.
result is: <subject> <verb> <object>
6. reduce the whole thing to <sentence> by applying the first rule.
result is: <sentence>
The series of reductions can be illustrated by drawing a parse tree. The textbook illustration is much better, but here is an approximation:

   The       aardvark    abandons     the       curly
    |           |           |          |           |
<article>    <noun>      <verb>    <article>    <noun>
      \       /             |            \       /
    <noun phrase>           |         <noun phrase>
         |                  |               |
     <subject>              |           <object>
          \                 |              /
           \                |             /
            \               |            /
                 <sentence>
Example: A carrot aardvark the
1. Convert the words to nonterminals.
result is: <article> <noun> <noun> <article>
2. reduce the <article> <noun> to <noun phrase> by applying the fourth rule.
result is: <noun phrase> <noun> <article>
3. reduce the <noun phrase> to <subject> by applying the second rule.
result is: <subject> <noun> <article>
4. Now what? There is no rule with a matching right-hand side, so we cannot reduce any further. It is not a valid sentence according to this grammar.
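Since this toy grammar generates exactly one sentence shape (article, noun, verb, article, noun), the whole chain of reductions can be collapsed into a direct check. Here is a sketch; the class and method names are invented, and the word lists come from the BNF rules above.

```java
import java.util.Set;

public class SentenceParser {
    // Terminal categories, taken from the BNF rules above.
    static final Set<String> ARTICLES = Set.of("a", "an", "the");
    static final Set<String> NOUNS    = Set.of("aardvark", "carrot", "dog", "man");
    static final Set<String> VERBS    = Set.of("abandons", "abashes", "bit");

    // <sentence> ::= <subject> <verb> <object> expands to exactly
    // article noun verb article noun, so match that shape directly.
    static boolean isSentence(String text) {
        String[] w = text.toLowerCase().split("\\s+");
        return w.length == 5
            && ARTICLES.contains(w[0]) && NOUNS.contains(w[1])
            && VERBS.contains(w[2])
            && ARTICLES.contains(w[3]) && NOUNS.contains(w[4]);
    }

    public static void main(String[] args) {
        System.out.println(isSentence("The aardvark abandons the carrot")); // true
        System.out.println(isSentence("A carrot aardvark the"));            // false
    }
}
```

A real parser cannot flatten the grammar this way; the reduction process shown above is what generalizes.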
BNF grammars for programming languages
------------------------------------------
Behold the conceptual simplicity of Scheme syntax. Here's a grammar
for a Lisp subset posted in 1994 by Barry Margolin of Thinking Machines
Corp. Since Scheme is a Lisp dialect, its grammar would be
similar.
<start> ::= <s_exp>
<s_exp> ::= <atom> | <list>
<atom> ::= NUMBER | SYMBOL | STRING
<list> ::= ( )
         | ( <s_exp_list> )
         | ( <s_exp_list> . <s_exp> )
         | ' <s_exp>
         | SHARP_QUOTE <s_exp>
<s_exp_list> ::= <s_exp> | <s_exp_list> <s_exp>
Until you've seen the BNF grammar for a procedural language like Java or C, which requires literally hundreds of production rules, you cannot appreciate the beautiful simplicity of this.
-------------------------------------------
Here is a grammar for some simple arithmetic expressions:
<list> ::= <list> + <digit>
| <list> - <digit>
| <digit>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Example: Here is the parse tree to recognize 7+4-3-2. Each level represents a grammar substitution:

    7        +        4        -        3        -        2
    |                 |                 |                 |
 <digit>           <digit>           <digit>           <digit>
    |                 |                 |                 |
 <list>               |                 |                 |
      \       |      /                  |                 |
        <list>                          |                 |
             \         |               /                  |
                <list>                                    |
                     \          |                        /
                         <list>
A grammar can be used either to recognize or to generate valid strings in the language. Here is the corresponding tree to generate 7+4-3-2 from the grammar. Notice that the root is now at the top: start with the goal symbol and look for matches with the left-hand side (LHS) of rules; when a match is found, substitute the right-hand side (RHS).
                    <list>
                  /   |    \
             <list>   -   <digit>
            /   |   \        |
       <list>   -  <digit>   2
      /   |   \       |
  <list>  +  <digit>  3
     |          |
  <digit>       4
     |
     7
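A left-recursive rule like <list> ::= <list> + <digit> corresponds to a simple left-to-right scan: accept one digit, then any number of (+ digit) or (- digit) pairs. A sketch, with an invented class name:

```java
public class ListRecognizer {
    // Recognize the left-recursive grammar
    //   <list> ::= <list> + <digit> | <list> - <digit> | <digit>
    // by reading left to right: a digit, then repeated (op digit) pairs.
    static boolean isList(String s) {
        if (s.isEmpty() || !Character.isDigit(s.charAt(0))) return false;
        int i = 1;
        while (i < s.length()) {
            char op = s.charAt(i);
            if ((op != '+' && op != '-') || i + 1 >= s.length()
                || !Character.isDigit(s.charAt(i + 1))) return false;
            i += 2;  // consume the operator and the digit after it
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isList("7+4-3-2")); // true
        System.out.println(isList("7+-2"));    // false
    }
}
```

This left-to-right reading also mirrors the parse tree above: the leftmost <digit> becomes the innermost <list>, and each operator adds one more layer.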
Grammars and ambiguity
The following grammar recognizes and generates the same expressions:

<string> ::= <string> + <string>
           | <string> - <string>
           | <digit>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Example: Use this grammar to recognize the same expression 7+4-3-2:

    7        +        4        -        3        -        2
    |                 |                 |                 |
 <digit>           <digit>           <digit>           <digit>
    |                 |                 |                 |
<string>          <string>          <string>          <string>
      \       |      /                  |                 |
       <string>                         |                 |
             \         |               /                  |
                <string>                                  |
                     \          |                        /
                         <string>
Guess what? There's another way to recognize the same expression:

    7        +        4        -        3        -        2
    |                 |                 |                 |
 <digit>           <digit>           <digit>           <digit>
    |                 |                 |                 |
<string>          <string>          <string>          <string>
    |                 |                  \        |      /
    |                 |                    <string>
    |                  \         |            /
    |                      <string>
     \           |            /
          <string>
Different sequence of rule applications, same result. This grammar is ambiguous! Ambiguity should be avoided if possible in grammar design.
Go back to the previous grammar. Is there any other way to get 7+4-3-2?
Selecting the correct rule.
Let's extend our first grammar with two additional rules, Rule 1 and Rule 6. <assign> is the goal symbol.
Rule 1. <assign> ::= <variable> = <list>
Rule 2. <list> ::= <list> + <digit>
Rule 3.           | <list> - <digit>
Rule 4.           | <digit>
Rule 5. <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Rule 6. <variable> ::= i | j | k | l | m | n | o
Example: parse the string k = 3 + 2
1. Apply rules 5 and 6 to "tokenize" all the terminals.
result is: <variable> = <digit> + <digit>
2. Apply rule 4 to the first digit.
result is: <variable> = <list> + <digit>
3. Apply rule 2.
result is: <variable> = <list>
4. Apply rule 1.
result is: <assign>
Goal achieved. Here's the parse tree.
     k        =        3        +        2
     |                 |                 |
 <variable>         <digit>           <digit>
     |                 |                 |
     |              <list>               |
     |                   \       |      /
     |                     <list>
      \          |        /
           <assign>
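The parse of strings like k = 3 + 2 can be sketched as a left-to-right recognizer for the six rules above. The class and method names here are invented, and the sketch sidesteps backtracking by checking the pieces directly; the variables i through o come from Rule 6.

```java
public class AssignRecognizer {
    // Rule 6: <variable> ::= i | j | k | l | m | n | o (contiguous letters)
    static boolean isVariable(char c) { return c >= 'i' && c <= 'o'; }
    static boolean isDigit(char c)    { return c >= '0' && c <= '9'; }

    // Recognize Rule 1: <assign> ::= <variable> = <list>
    static boolean isAssign(String s) {
        s = s.replace(" ", "");  // ignore spaces between tokens
        if (s.length() < 3 || !isVariable(s.charAt(0)) || s.charAt(1) != '=')
            return false;
        String rest = s.substring(2);  // everything after "=" must be a <list>
        if (rest.isEmpty() || !isDigit(rest.charAt(0))) return false;
        for (int i = 1; i < rest.length(); i += 2) {
            char op = rest.charAt(i);
            if ((op != '+' && op != '-') || i + 1 >= rest.length()
                || !isDigit(rest.charAt(i + 1))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isAssign("k = 3 + 2")); // true
        System.out.println(isAssign("k = 3 +"));   // false
    }
}
```

A general parser cannot always dodge the rule-selection problem this easily, which is exactly the point of the discussion that follows.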
Hey!! Look back at step 3. Our parse string was
<variable> = <list> + <digit>
and we applied rule 2 <list> ::= <list> + <digit>
We COULD have applied rule 1 instead,
<assign> ::= <variable> = <list>
What if we HAD applied it instead? Here's the parse tree.
     k        =        3        +        2
     |                 |                 |
 <variable>         <digit>           <digit>
     |                 |                 |
     |              <list>               |
      \         |      /                 |
         <assign>                        |
             ?                           ?
The result would be
<assign> + <digit>
and no more rules could be applied. The result consists of more
than the goal symbol, so we would erroneously reject the string.
This is a hazard to contend with: what if more than one rule applies and the wrong one is selected? The answer is this: when a dead end is reached, backtrack and try a different rule. Reject the string only if all paths lead to dead ends.
There are in fact methods for defining grammars in such a way as to avoid ambiguity and minimize backtracking. Such topics are well beyond the scope of this course.
Moral of the story: good grammars are hard to define!
Syntax analysis focuses on how the program is constructed on the surface.
Semantics analysis looks beneath the surface to the program’s meaning.
This is true for natural languages as well. The small natural language grammar above will accept the sentence “A carrot bit the aardvark.” However, this sentence has no real meaning.
Some semantic errors do not reveal themselves until runtime. For instance, the Java code:
int[] arr = new int[5];
int limit = Console.readInt("how many values to enter?");
for (int i=0; i < limit; i++) {
    arr[i] = Console.readInt("enter the next value");
}
will cause a runtime error if the value of limit is greater than 5,
but the value of limit is not known until the program is run.
Other semantic errors can be detected at compile time. The foremost example is mismatched data types in expressions. Here is a Java example.
boolean b = false;
int i = 15;
char c = 'f';
double d;
d = b + i * c; // semantic errors here
This is syntactically correct but none of the operations in the assignment statement make sense according to Java rules.
How are semantic errors detected? What information is required to find the semantic errors in d = b + i * c;?
Think hard, and remember that the semantic analyzer doesn't see the source code itself: it only sees a stream of tokens!
Using the token table defined in the Lexical Analysis section, the above assignment is translated into the sequence: 1, 16, 1, 14, 1, 15, 1, 20.
The token sequence is sufficient for syntax analysis, but not for semantic
analysis. Where does this information come from?
Look again at the example
boolean b = false;
int i = 15;
char c = 'f';
double d;
d = b + i * c;
Lexical analysis of this example
When the lexical analyzer determines that a token is an identifier (i.e., it does not match any keyword), it adds the identifier to the symbol table if it is not already there.
After scanning the first four lines, the lexical analyzer would have
this symbol table (the Type column is filled in by semantic analyzer):
Entry | Identifier | Type
------+------------+------
  0   | b          |
  1   | i          |
  2   | c          |
  3   | d          |
While scanning the assignment statement, the lexical analyzer will find that each identifier is already in the symbol table and will simply fetch its entry number. The token stream for the last line then becomes 1(3), 16, 1(0), 14, 1(1), 15, 1(2), 20, where each "1" has an attached symbol table entry number in parentheses.
The attached link to symbol table is crucial for what is to come.
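The symbol table behavior described here can be sketched as follows. The class and method names are invented for illustration; the textbook does not prescribe a particular implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class SymbolTable {
    private final List<String> names = new ArrayList<>();
    private final List<String> types = new ArrayList<>();

    // Lexical analysis: add the identifier if new, return its entry number.
    int lookupOrAdd(String name) {
        int entry = names.indexOf(name);
        if (entry >= 0) return entry;   // already in the table
        names.add(name);
        types.add(null);                // Type column filled in by semantic analysis
        return names.size() - 1;
    }

    // Semantic analysis fills in and consults the Type column.
    void setType(int entry, String type) { types.set(entry, type); }
    String getType(int entry)            { return types.get(entry); }

    public static void main(String[] args) {
        SymbolTable st = new SymbolTable();
        // scanning the four declarations adds b, i, c, d at entries 0..3
        for (String id : new String[]{"b", "i", "c", "d"}) st.lookupOrAdd(id);
        // scanning "d = b + i * c;" just fetches existing entry numbers
        System.out.println(st.lookupOrAdd("d")); // 3
        System.out.println(st.lookupOrAdd("b")); // 0
    }
}
```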
Syntax analysis of this example
Skipping ahead to syntax analysis (parsing) of the last line, the tokens are placed into “leaves” of the parse tree. Something like this:
 1(3)     16     1(0)     14     1(1)     15     1(2)     20
   |       |       |       |       |       |       |
   |       |       |       |        \      |      /
   |       |       |       |          temp1
   |       |        \      |         /
   |       |           temp2
    \      |          /
         temp3
where temp1, temp2 and temp3 represent temporary identifiers generated to hold the results of calculations. They are not stored in the symbol table.
Semantic Analysis of this example
The semantic analyzer must verify that every identifier used has been declared, determine the type of each expression, and check that operand types are valid for each operation.
The semantic record attached to each non-terminal in the parse tree
contains information such as: identifier name, type, symbol table index.
While processing the segment of the parse tree for the first four statements,
identifiers are associated with their types and this information is added
to the symbol table. The resulting table is:
Entry | Identifier | Type
------+------------+---------
  0   | b          | boolean
  1   | i          | int
  2   | c          | char
  3   | d          | double
Semantic records are created too.
Next comes semantic analysis of the last line, the assignment. The multiplication "1(1) 15 1(2)" is done first.
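The compatibility check the semantic analyzer performs at each operator node can be sketched as below. The names are invented, and the typing rule is deliberately simplified to match this chapter's example (operand types must match exactly; real Java is more permissive, e.g. it promotes char to int):

```java
public class TypeChecker {
    // Simplified rule for this sketch: an operation is valid only when
    // both operand types match. "error" propagates up the parse tree.
    static String resultType(String left, String right) {
        if (left.equals(right) && !left.equals("error")) return left;
        return "error";
    }

    public static void main(String[] args) {
        // d = b + i * c;  -- the multiplication i * c is analyzed first
        String mul = resultType("int", "char");   // error under this rule
        String add = resultType("boolean", mul);  // error
        System.out.println(add);                  // prints error
    }
}
```

The types fed into resultType come from the semantic records, which in turn come from the symbol table; this is why the entry numbers attached during lexical analysis matter.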
Given successful semantic analysis, the next phase involves traversing the parse tree from leaves to root once more and generating machine code appropriately along the way.
For readability, our examples will show assembly code instead of machine code.
Recall the parse tree is constructed by repeatedly applying production rules to reduce the structure representing program syntax down to a single point, the tree root which is the goal symbol.
Applying a rule usually corresponds to some action represented by machine code.
Let’s return to an earlier example, annotated with a number for each rule.
Rule 1. <assign> ::= <variable> = <list>
Rule 2. <list> ::= <list> + <digit>
Rule 3.           | <list> - <digit>
Rule 4.           | <digit>
Rule 5. <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Rule 6. <variable> ::= i | j | k | l | m | n | o
The parse tree for k = 3 + 2 is
     k        =        3        +        2
     |                 |                 |
 <variable>         <digit>           <digit>
     |                 |                 |
     |              <list>               |
     |                   \       |      /
     |                     <list>
      \          |        /
           <assign>
Let’s annotate the sequence of rule applications:
1a. Apply rule 6 to transform k to <variable>.
Syntax analysis: nothing special happens here.
Semantic analysis: check the symbol table to assure k has been declared. Create semantic record (k, int).
Code generation: no code needs to be generated here. k has been declared earlier in the program with int k; . The declaration itself is in a different part of the parse tree, which has already been processed through code generation. At that point of the tree, the pseudo-op k: .DATA 0 was generated.
1b. Apply rule 5 to transform 3 to <digit>.
Syntax analysis: create an internal temporary identifier to represent the 3; call it const3.
Semantic analysis: create semantic record (const3, int).
Code generation: generate the pseudo-op const3: .DATA 3
1c. Apply rule 5 to transform 2 to <digit>.
Syntax analysis: create an internal temporary identifier to represent the 2; call it const2.
Semantic analysis: create semantic record (const2, int).
Code generation: generate the pseudo-op const2: .DATA 2
2. Apply rule 4 to the first digit.
Syntax analysis: nothing special to do.
Semantic analysis: Create semantic record with (const3, int).
Code generation: No code needs to be generated here.
3. Apply rule 2.
Syntax analysis: create an internal temporary identifier to contain the result of the operation; call it temp1.
Semantic analysis: <list> has semantic record (const3, int); <digit> has semantic record (const2, int). Both are int, so the operation is valid and produces an int result. Create semantic record (temp1, int).
Code generation: generate storage temp1: .DATA 0 and generate code for the + operation:
LOAD const3
ADD const2
STORE temp1
4. Apply rule 1.
Syntax analysis: nothing special to do here.
Semantic analysis: <variable> has semantic record (k, int); <list> has semantic record (temp1, int). Both are int, so the assignment is valid.
Code generation: generate the code
LOAD temp1
STORE k
Finished. The combined generated code is:
LOAD const3
ADD const2
STORE temp1
LOAD temp1
STORE k
(data pseudo-ops, separated from the instructions)
k: .DATA 0
const3: .DATA 3
const2: .DATA 2
temp1: .DATA 0
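The whole translation of k = 3 + 2 can be sketched as one routine that emits the instructions and pseudo-ops above. This is a sketch for this single statement shape only; the class, method and label names are invented.

```java
public class TinyCodeGen {
    // Compile "var = d1 op d2" (rule 1 over a one-operator <list>)
    // into the pseudo-assembly used in this chapter.
    static String compile(char var, char d1, char op, char d2) {
        String c1 = "const" + d1, c2 = "const" + d2;
        StringBuilder code = new StringBuilder();
        // instructions (steps 3 and 4 above)
        code.append("LOAD ").append(c1).append('\n');
        code.append(op == '+' ? "ADD " : "SUBTRACT ").append(c2).append('\n');
        code.append("STORE temp1\n");
        code.append("LOAD temp1\n");
        code.append("STORE ").append(var).append('\n');
        // data pseudo-ops, separated from the instructions
        code.append(var).append(": .DATA 0\n");
        code.append(c1).append(": .DATA ").append(d1).append('\n');
        code.append(c2).append(": .DATA ").append(d2).append('\n');
        code.append("temp1: .DATA 0\n");
        return code.toString();
    }

    public static void main(String[] args) {
        System.out.print(compile('k', '3', '+', '2'));
    }
}
```

Running it reproduces the combined generated code shown above.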
You may have noticed that the code produced in the example above can be improved to run faster without changing its result. The code was:
LOAD const3
ADD const2
STORE temp1
LOAD temp1
STORE k
k: .DATA 0
const3: .DATA 3
const2: .DATA 2
temp1: .DATA 0
1. Notice that the result of the addition is stored from register R to temp1, then immediately the value from temp1 is loaded back into R. The LOAD operation is unnecessary, so it can be eliminated. Resulting code:
LOAD const3
ADD const2
STORE temp1
STORE k
k: .DATA 0
const3: .DATA 3
const2: .DATA 2
temp1: .DATA 0
2. Next, notice that the result of the addition is stored from register R to temp1, then immediately stored again, from register R to k. Since the value stored in temp1 will not be used for anything, it is unnecessary and can be eliminated. Resulting code:
LOAD const3
ADD const2
STORE k
k: .DATA 0
const3: .DATA 3
const2: .DATA 2
3. Finally, notice that both operands of the addition are constants so their sum can be calculated prior to runtime. Create a temporary const5 to hold the result. This eliminates the need to perform the addition. Resulting code:
LOAD const5
STORE k
k: .DATA 0
const5: .DATA 5
All three of these are called local optimizations because they focus on a small segment of code representing one source statement, the optimization window.
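The first local optimization (removing a LOAD that immediately follows a STORE of the same identifier, since the value is already in the register) is easy to express as a pass over the instruction list. A sketch, with invented names:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Peephole {
    // Delete every "LOAD X" that immediately follows "STORE X":
    // the register still holds X's value, so the LOAD is redundant.
    static List<String> removeRedundantLoads(List<String> code) {
        List<String> out = new ArrayList<>();
        for (String instr : code) {
            if (!out.isEmpty()
                && instr.startsWith("LOAD ")
                && out.get(out.size() - 1).equals("STORE " + instr.substring(5))) {
                continue;  // skip the redundant LOAD
            }
            out.add(instr);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> code = Arrays.asList(
            "LOAD const3", "ADD const2", "STORE temp1",
            "LOAD temp1", "STORE k");
        // prints [LOAD const3, ADD const2, STORE temp1, STORE k]
        System.out.println(removeRedundantLoads(code));
    }
}
```

Optimizers that slide a small window over the instruction stream looking for patterns like this are called peephole optimizers.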
Global optimization looks at the bigger picture (e.g. more than one statement) and is more difficult to do. Most global optimization techniques focus on loops, since most execution time is spent iterating.
Example of global optimizations in Java:
int j = limit;
for (int i = 0; i < limit-2; i=i+1) {
int extra = 3;
list[i] = i + extra;
item[i] = factor * (i + extra);
j = j - 1;
blah[i] = 4 * j - 3;
}
1. There is no need to allocate storage for extra and assign 3 to it over and over again. Pull this statement up out of the loop so it is done once. Similarly, there is no need to calculate limit-2 repeatedly, so pull this calculation out of the loop. These are code motion optimizations. Result is:
int j = limit;
int extra = 3;
int t1 = limit-2;
for (int i = 0; i < t1; i=i+1) {
list[i] = i + extra;
item[i] = factor * (i + extra);
j = j - 1;
blah[i] = 4 * j - 3;
}
2. Notice that the expression i+extra is calculated twice. It can’t be moved out of the loop because of i, but can be calculated once and assigned to a temporary. This saves one calculation, and is called common subexpression elimination. Result is:
int j = limit;
int extra = 3;
int t1 = limit-2;
int t2;
for (int i = 0; i < t1; i=i+1) {
t2 = i + extra;
list[i] = t2;
item[i] = factor * t2;
j = j - 1;
blah[i] = 4 * j - 3;
}
3. Here’s another optimization that is not at all obvious, but can be correctly applied. Look at the last two assignments.
j = j – 1;
blah[i] = 4 * j - 3;
where j is initialized to limit. Notice that the value of j is reduced by 1 on each iteration. This means the value of 4*j-3 is reduced by 4 on each iteration. So we can substitute subtraction for multiplication. This is an example of strength reduction. Here’s the result:
int j = limit;
int extra = 3;
int t1 = limit - 2;
int t2;
int t3 = 4 * limit - 3;
for (int i = 0; i < t1; i=i+1) {
t2 = i + extra;
list[i] = t2;
item[i] = factor * t2;
j = j - 1;
t3 = t3 - 4;
blah[i] = t3;
}
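It is worth convincing yourself that the strength reduction really is safe: the running value updated by t3 = t3 - 4 must equal 4*j - 3 on every iteration. A quick check (the class name is invented):

```java
public class StrengthReductionCheck {
    // For a given limit, verify that the strength-reduced update
    // t3 = t3 - 4 tracks the original expression 4*j - 3 each iteration.
    static boolean matches(int limit) {
        int j = limit;
        int t3 = 4 * limit - 3;           // initial value, computed once
        for (int i = 0; i < limit - 2; i = i + 1) {
            j = j - 1;
            t3 = t3 - 4;                  // strength-reduced form
            if (t3 != 4 * j - 3) return false;  // compare with the original
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(matches(10));  // prints true
    }
}
```

The subtraction is cheaper than the multiplication on most machines, which is where the name "strength reduction" comes from.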
The original loop contained 2 multiplications, 6 additions/subtractions, 6 assignments and 1 memory allocation per iteration.
The optimized loop contains 1 multiplication, 4 additions/subtractions, 7 assignments and 0 memory allocations per iteration.
The optimized code requires more space for 3 temporaries, and several additional operations just above the loop.
The higher the value of limit, the better the optimized code
looks!