COMP 3200 Programming Assignment #3

In this assignment, you'll write a scanner for a hypothetical language called NotJava. You'll be using a Java-based scanner generator called JFlex for the majority of your solution to this assignment.

JFlex: a scanner generator for Java

JFlex is a scanner generator, which takes as input a script that uses regular expressions to describe a set of lexeme patterns, then generates a Java class implementing a scanner that recognizes those patterns. Associated with each pattern is an action, which is a chunk of Java code; each time the scanner recognizes a pattern, the associated action is executed. JFlex generates an efficient scanner, saving you the trouble of designing and implementing a DFA by hand or, worse yet, hand-coding the scanner from scratch.

Understanding JFlex requires a rough understanding of the class that it will generate, as well as an awareness of how various settings can be modified so that the class will behave as you'd like it to. The following sections seek to give you this understanding.

The general structure of a JFlex script

JFlex scripts follow this general structure:

User code

%%

Options and declarations

%%

Lexical rules

The three sections of the script are separated by %% appearing by itself on a line.

The user code section contains code that appears outside of the scanner class; most typically, this will contain import or package declarations, but could include an additional class if you'd like. It is better design to put additional classes into separate .java files. Remember that all of the code related to your scanner doesn't have to go into the JFlex script. The purpose of the JFlex script is just to generate the class that represents the scanner. The code in this section will appear above the generated scanner class in the .java file generated by JFlex.

The options and declarations section contains three things:

The specification of various options, such as the name of the scanner class, the return type and name of its "next token" method, etc. A full list of options is listed in the JFlex user manual. You shouldn't need any options other than the ones specified in the example below.
A declaration of additional code that is placed into the scanner class. If you need to add fields or helper methods into the scanner class, this is where you put them.
Declarations of regular expression macros, which establish shorthands for commonly-appearing subexpressions, or complex subexpressions that will make more sense if given a name.

Here's a brief example of an options and declarations section:

%class MyScanner        // sets the name of the scanner class to MyScanner
%public                 // makes the scanner class public
%type Token             // sets the return type of the scanner method to Token
%function getNextToken  // sets the name of the scanner method to getNextToken

%{
    // This code is copied into the scanner class verbatim.  Use it to add
    // additional fields and methods to the scanner class, if necessary.

    private int anExtraField;
    
    public int aHelperMethod()
    {
        return anExtraField;
    }
%}

// These are macros for regular expressions, which can be used in the next
// section for building bigger regular expressions. Why you would want to
// use these is basically the same reason why you want to have named constants
// in any program: for readability and writability!

LineTerminator = \r | \n | \r\n
Whitespace = {LineTerminator} | [ \t\f]

The example above obviously isn't precisely what you want for this assignment, but gives you an idea of what this section is used for.

The lexical rules section is the heart of the scanner. In this section, you specify a regular expression that describes a type of lexeme that you'd like to recognize, along with an action that should be executed whenever a matching lexeme is found. Most of the time, you'll want your actions to contain a return statement in them; that statement will return a value from the scanner method. (In the case of patterns that indicate text that should be ignored, such as whitespace and comments, you may want the action to be empty instead.) It stands to reason, then, that the type of value you return in your action should be compatible with the return type you specified in the options section.

Here are a few examples of lexical rules:

"public"               { return new Token(Token.PUBLIC, yytext()); }
"protected"            { return new Token(Token.PROTECTED, yytext()); }
"private"              { return new Token(Token.PRIVATE, yytext()); }
[a-z]+                 { return new Token(Token.IDENTIFIER, yytext()); }

In the examples above, the action associated with each rule specifies the return of a Token, which is a hypothetical object that represents one token in a programming language. One way to design a scanner is to have it return token/lexeme pairs, which is what this example does; each Token is constructed from a token type (which is assumed to be an int constant defined in the Token class, such as Token.IDENTIFIER) and the lexeme that was recognized. The recognized lexeme can be accessed by calling the yytext() method, as above.

What JFlex does with a script

Given an input script, JFlex generates one corresponding .java file that is primarily made up of a class that implements your scanner. By default, the class has the name Lexer, though you can change its name to anything you want by specifying a value for the %class option in your script. I also advise making the class public, by specifying the %public option as well.

The heart of the scanner class is a method that returns the next token from the input stream. You can control the specification of this method, including its name and return type, by specifying options. Here, again, is an example of the relevant options:

    %type NotJavaToken
    %function getNextToken

...which will give the "next token" method the following signature:

    public NotJavaToken getNextToken() throws java.io.IOException

You can also control other aspects of this method's signature, including what exceptions it throws and what it does when the end of the input has been reached. If you're interested in more details about that, refer to the JFlex user manual.

The primary reason why you need this kind of control over the signature of the scanning method is that a scanner is just a small piece of a compiler. Its central job is to return the next token in the input whenever asked. The next stage of compilation, the parser, will continually ask the scanner for the next token as it does its work. Making sure that the scanner and parser can integrate together correctly is vitally important, especially if you're using an automated parser generator tool.

The scanning method will work its way through the input from wherever it left off previously, looking for an occurrence of one of the patterns that you specified. It will always match the longest occurrence of one of your patterns that it can find. For example, if you've specified these two patterns:

    for        // the "for" keyword
    [a-z]+     // one or more lowercase letters

...if the remaining input begins with format followed by whitespace, JFlex will match format to the second pattern, even though it could have matched for to the first pattern (with mat left over). The reason for this is simple: it's how we expect our programming languages to behave. For example, if we wrote the following statement in a Java program:

    returnx;

...we shouldn't be the slightest bit surprised when we get a compiler error that says something like "Hey, you never declared a variable called returnx!" We would be crazy to expect that the compiler would discover that this statement could really be interpreted to mean "return x". (Some older programming languages, most notably Fortran, treated whitespace as optional in this way, but such rules are now considered to cause more problems than they solve.)

The order in which you specify your patterns also turns out to be important. Some strings will match more than one pattern. With the two patterns above, for and [a-z]+, the string for actually matches both of them. JFlex resolves such a conflict by always choosing the first pattern you specified. So, if you specified your patterns in this order:

    [a-z]+
    for

...the string for would be considered an identifier, rather than the for keyword! Exercise caution when you're ordering your patterns.

Creating and using a scanner object

JFlex creates a scanner class with two constructors, one that takes a java.io.Reader as a parameter, and another that takes a java.io.InputStream as a parameter. So, it's easy to create a scanner that reads its input from either a file or System.in by calling the appropriate constructor. Here are two examples of creating a scanner object, assuming a scanner class called MyScanner:

    // creates a scanner that reads its input from a file called "filename"
    MyScanner s1 = new MyScanner(new FileReader("filename"));

    // creates a scanner that reads its input from System.in
    MyScanner s2 = new MyScanner(System.in);

Once your scanner object has been created, you can repeatedly call its scanning method until the end of the input is reached. You should leave the default behavior of returning null when there is no more input.

Using JFlex

JFlex is primarily driven via the command line, so to use it, start by bringing up a Command Prompt window.

You can compile a JFlex script (e.g. blah.flex) with the command jflex blah.flex. This will create a Java source file (e.g. Lexer.java). Don't forget to compile this .java file before executing your program!

What is the actual assignment?

Your assignment is to use JFlex to write a scanner for a hypothetical programming language called NotJava. A scanner isn't normally a stand-alone component; it is generally integrated into a program that processes input text for some greater purpose, such as compilation or interpretation, or even the finding and replacing of text. For the time being, however, I'd like your scanner to stand alone, meaning that it will need to be surrounded by a driver program that creates a scanner object, breaks an input file (whose name is specified as a command-line parameter) into token/lexeme pairs, and prints the token/lexeme pairs to System.out as it goes along. Only lexemes that matter should be printed, not those which are evident from the token type. This roughly means that you should print the lexeme associated with tokens like identifiers and string literals, but not keywords or operators.

You might be tempted to place the printing of this output into the scanner itself by embedding it into the action associated with each lexical rule. I'm requiring that your scanner return the token/lexeme pair back to the caller (the driver program) and that the caller does this printing. The reason for this is clear from a software engineering perspective: a properly-designed scanner should be functional in many contexts, meaning that it should break the input into its tokens and lexemes, then return the token/lexeme pairs to the caller for further processing, be it parsing, printing, or whatever else. If we were to continue with this NotJava example by building a parser in a subsequent assignment, you would be able to use your scanner with your parser without any modification.

To make it easier to grade the assignments, I have a couple of requirements about naming:

Your JFlex script must appear in a file called notjava.flex.
Your driver program must have its main() method in a public class called Driver, appearing in a file called Driver.java.
You may include as many other .java files as you'd like.
None of your files should use the package directive.

The effect of these requirements is that we'll be able to compile and run your program using the following commands:

    jflex notjava.flex
    javac *.java
    java Driver inputfile.nj

...where inputfile.nj is the name of a file containing a NotJava program.

The NotJava language

The NotJava language is a simple imperative-style language with two basic data types (integers and strings), functions with zero or more arguments and one return type, procedures with zero or more arguments and no return type, reasonably straightforward mathematical, relational, and logical operations, and a few kinds of imperative-style statements (e.g. assignment statements, an "if" statement, and one kind of loop). Of course, to build a scanner, it's not necessary to know the complete specification of a programming language (though, in some cases, it helps), so I will stick to a brief example program and a lexical specification (i.e. what each kind of token looks like, without regard for how these tokens are put together to write a NotJava program).

An example program

Here's an example NotJava program that demonstrates many language features:

# the square function returns the square of some integer
function square (integer n) -> integer:
[
    if n == 0 then
        Result <- 0        # the Result "variable" holds the function's
                           # return value
    else
    [
        var integer x;     # declares a variable x
        x <- n * n;
        Result <- x
    ]
]


# the program procedure is where the program begins its execution,
# somewhat akin to the main() method in Java
procedure program():
[
    var integer a, b;
    a <- square(3);
    b <- square(a);
    print(a);
    var string s;
    s <- " squared is ";
    print(s);
    print(b)
]

Lexical specification of NotJava

A NotJava program is made up of a sequence of lexemes, each of which can be classified as being of one particular token type. Unless otherwise specified, the lexeme that should be returned by your scanner is the actual chunk of source text, such as 35 or <-. The types of tokens, and the rules that govern their appearance, are as follows:

Identifiers must begin with an uppercase or lowercase letter, followed (optionally) by any number of uppercase letters, lowercase letters, digits, or underscores ('_'). An identifier may not end with an underscore, nor may it contain two or more consecutive underscores.
An integer literal is a sequence of digits, optionally preceded by a '-' character (indicating a negative number).
A string literal is a sequence of characters within double quotes. Unlike in languages like Java, there are no special escape sequences such as \n. There are a couple of rules that need to be specified about string literals, however:
- An end of line may not occur within a string literal. This is always an error. (You need not handle this error in your scanner, but you may if you wish.)
- If you want to put a double quote character into a string literal, you can do so by putting two double quotes in a row. For example, the string literal "My favorite song is ""Where the Streets Have No Name"" today" is a legal NotJava string literal, which represents the following string of text: My favorite song is "Where the Streets Have No Name" today.
It should be noted that the contents of a string literal should be returned as its lexeme. So, in the example above, the lexeme should be My favorite song is "Where the Streets Have No Name" today, without the beginning and ending double quotes, and with the pairs of double quotes within compressed to one each.
The following keywords are reserved in NotJava. Each should be recognized as a separate token type:
- and, do, else, function, if, implies, not, or, procedure, then, var, while, xor
The following are characters or character sequences with special meaning, such as operators and parentheses. Each should be recognized as a separate token type:
- Left and right parentheses: ( )
- Left and right brackets: [ ]
- Arithmetic operators: + - * /
- Relational operators: < <= > >= == /=
- The return type opearator: ->
- The assignment opearator: <-
- String concatenation: &
- Semicolon: ;
- Colon: :
- Comma: ,
Comments begin with # and continue until the end of the line on which they appear. Comments should be ignored by the scanner, meaning that they don't constitute a lexeme.
Whitespace in a NotJava program should be ignored; it does not constitute a lexeme.

Sample input and output

Sample input:

# a sample NotJava program
procedure program():
[
    var integer a;
    a <- 3 * 3;
    print("hello ");   # notice how comments don't show up in the output
    print(a)
]

Sample output:

PROCEDURE
IDENTIFIER (program)
LEFT_PAREN
RIGHT_PAREN
COLON
LEFT_BRACKET
VAR
IDENTIFIER (integer)
IDENTIFIER (a)
SEMICOLON
IDENTIFIER (a)
ASSIGNMENT_OP
INTEGER_LITERAL (3)
MULTIPLICATION_OP
INTEGER_LITERAL (3)
SEMICOLON
IDENTIFIER (print)
LEFT_PAREN
STRING_LITERAL (hello )
RIGHT_PAREN
SEMICOLON
IDENTIFIER (print)
LEFT_PAREN
IDENTIFIER (a)
RIGHT_PAREN
RIGHT_BRACKET

Your program's output need not be exactly like mine, but it should convey the same message; on each line, it should show one token and its associated lexeme.

Naturally, it's necessary for your program to work with more than just the sample input. I'll be testing it with several input programs that, together, test all of the lexical specification as described in the previous section. You need not provide error handling; you may assume that the input file consists of a legal NotJava program.

Deliverables

Submit your JFlex script and the .java files that comprise your driver program. You need not submit the .java file created by JFlex; I'll be regenerating it during the grading process anyway. Remember that the name of your JFlex script must be notjava.flex, and that your driver program's main() method must appear in a file called Driver.java!

Limitations

There are a few JFlex features that I would prefer that you didn't use in your assignment. If you don't know what the features on this list are, there's no need for you to find out about them unless you're curious.

You may not use yypushback.
You may not use lookahead (i.e. trailing context).
You may use line, column, and character counting if you wish, though it does not serve any purpose in this assignment.