In this assignment, you'll write a scanner for a hypothetical language called NotJava. You'll be using a Java-based scanner generator called JFlex for the majority of your solution to this assignment.
JFlex: a scanner generator for Java
JFlex is a scanner generator, which takes as input a script that uses regular expressions to describe a set of lexeme patterns, then generates a Java class implementing a scanner that recognizes those patterns. Associated with each pattern is an action, which is a chunk of Java code; each time the scanner recognizes a pattern, the associated action is executed. JFlex generates an efficient scanner, saving you the trouble of designing and implementing a DFA by hand or, worse yet, hand-coding the scanner from scratch.
Understanding JFlex requires a rough understanding of the class that it will generate, as well as an awareness of how various settings can be modified so that the class will behave as you'd like it to. The following sections seek to give you this understanding.
The general structure of a JFlex script
JFlex scripts follow this general structure:
User code %% Options and declarations %% Lexical rules
The three sections of the script are separated by %% appearing by itself on a line.
The user code section contains code that appears outside of the scanner class; most typically, this will contain import or package declarations, but could include an additional class if you'd like. It is better design to put additional classes into separate .java files. Remember that all of the code related to your scanner doesn't have to go into the JFlex script. The purpose of the JFlex script is just to generate the class that represents the scanner. The code in this section will appear above the generated scanner class in the .java file generated by JFlex.
The options and declarations section contains three things:
Here's a brief example of an options and declarations section:
%class MyScanner // sets the name of the scanner class to MyScanner %public // makes the scanner class public %type Token // sets the return type of the scanner method to Token %function getNextToken // sets the name of the scanner method to getNextToken %{ // This code is copied into the scanner class verbatim. Use it to add // additional fields and methods to the scanner class, if necessary. private int anExtraField; public int aHelperMethod() { return anExtraField; } %} // These are macros for regular expressions, which can be used in the next // section for building bigger regular expressions. Why you would want to // use these is basically the same reason why you want to have named constants // in any program: for readability and writability! LineTerminator = \r | \n | \r\n Whitespace = {LineTerminator} | [ \t\f]
The example above obviously isn't precisely what you want for this assignment, but gives you an idea of what this section is used for.
The lexical rules section is the heart of the scanner. In this section, you specify a regular expression that describes a type of lexeme that you'd like to recognize, along with an action that should be executed whenever a matching lexeme is found. Most of the time, you'll want your actions to contain a return statement in them; that statement will return a value from the scanner method. (In the case of patterns that indicate text that should be ignored, such as whitespace and comments, you may want the action to be empty instead.) It stands to reason, then, that the type of value you return in your action should be compatible with the return type you specified in the options section.
Here are a few examples of lexical rules:
"public" { return new Token(Token.PUBLIC, yytext()); } "protected" { return new Token(Token.PROTECTED, yytext()); } "private" { return new Token(Token.PRIVATE, yytext()); } [a-z]+ { return new Token(Token.IDENTIFIER, yytext()); }
In the examples above, the action associated with each rule specifies the return of a Token, which is a hypothetical object that represents one token in a programming language. One way to design a scanner is to have it return token/lexeme pairs, which is what this example does; each Token is constructed from a token type (which is assumed to be an int constant defined in the Token class, such as Token.IDENTIFIER) and the lexeme that was recognized. The recognized lexeme can be accessed by calling the yytext() method, as above.
What JFlex does with a script
Given an input script, JFlex generates one corresponding .java file that is primarily made up of a class that implements your scanner. By default, the class has the name Lexer, though you can change its name to anything you want by specifying a value for the %class option in your script. I also advise making the class public, by specifying the %public option as well.
The heart of the scanner class is a method that returns the next token from the input stream. You can control the specification of this method, including its name and return type, by specifying options. Here, again, is an example of the relevant options:
%type NotJavaToken %function getNextToken
...which will give the "next token" method the following signature:
public NotJavaToken getNextToken() throws java.io.IOException
You can also control other aspects of this method's signature, including what exceptions it throws and what it does when the end of the input has been reached. If you're interested in more details about that, refer to the JFlex user manual.
The primary reason why you need this kind of control over the signature of the scanning method is that a scanner is just a small piece of a compiler. Its central job is to return the next token in the input whenever asked. The next stage of compilation, the parser, will continually ask the scanner for the next token as it does its work. Making sure that the scanner and parser can integrate together correctly is vitally important, especially if you're using an automated parser generator tool.
The scanning method will work its way through the input from wherever it left off previously, looking for an occurrence of one of the patterns that you specified. It will always match the longest occurrence of one of your patterns that it can find. For example, if you've specified these two patterns:
for // the "for" keyword [a-z]+ // one or more lowercase letters
...if the remaining input begins with format followed by whitespace, JFlex will match format to the second pattern, even though it could have matched for to the first pattern (with mat left over). The reason for this is simple: it's how we expect our programming languages to behave. For example, if we wrote the following statement in a Java program:
returnx;...we shouldn't be the slightest bit surprised when we get a compiler error that says something like "Hey, you never declared a variable called returnx!" We would be crazy to expect that the compiler would discover that this statement could really be interpreted to mean "return x". (Some older programming languages, most notably Fortran, treated whitespace as optional in this way, but such rules are now considered to cause more problems than they solve.)
The order in which you specify your patterns also turns out to be important. Some strings will match more than one pattern. With the two patterns above, for and [a-z]+, the string for actually matches both of them. JFlex resolves such a conflict by always choosing the first pattern you specified. So, if you specified your patterns in this order:
[a-z]+ for
...the string for would be considered an identifier, rather than the for keyword! Exercise caution when you're ordering your patterns.
Creating and using a scanner object
JFlex creates a scanner class with two constructors, one that takes a java.io.Reader as a parameter, and another that takes a java.io.InputStream as a parameter. So, it's easy to create a scanner that reads its input from either a file or System.in by calling the appropriate constructor. Here are two examples of creating a scanner object, assuming a scanner class called MyScanner:
// creates a scanner that reads its input from a file called "filename" MyScanner s1 = new MyScanner(new FileReader("filename")); // creates a scanner that reads its input from System.in MyScanner s2 = new MyScanner(System.in);
Once your scanner object has been created, you can repeatedly call its scanning method until the end of the input is reached. You should leave the default behavior of returning null when there is no more input.
Using JFlex
JFlex is primarily driven via the command line, so to use it, start by bringing up a Command Prompt window.
You can compile a JFlex script (e.g. blah.flex) with the command jflex blah.flex. This will create a Java source file (e.g. Lexer.java). Don't forget to compile this .java file before executing your program!
What is the actual assignment?
Your assignment is to use JFlex to write a scanner for a hypothetical programming language called NotJava. A scanner isn't normally a stand-alone component; it is generally integrated into a program that processes input text for some greater purpose, such as compilation or interpretation, or even the finding and replacing of text. For the time being, however, I'd like your scanner to stand alone, meaning that it will need to be surrounded by a driver program that creates a scanner object, breaks an input file (whose name is specified as a command-line parameter) into token/lexeme pairs, and prints the token/lexeme pairs to System.out as it goes along. Only lexemes that matter should be printed, not those which are evident from the token type. This roughly means that you should print the lexeme associated with tokens like identifiers and string literals, but not keywords or operators.
You might be tempted to place the printing of this output into the scanner itself by embedding it into the action associated with each lexical rule. I'm requiring that your scanner return the token/lexeme pair back to the caller (the driver program) and that the caller does this printing. The reason for this is clear from a software engineering perspective: a properly-designed scanner should be functional in many contexts, meaning that it should break the input into its tokens and lexemes, then return the token/lexeme pairs to the caller for further processing, be it parsing, printing, or whatever else. If we were to continue with this NotJava example by building a parser in a subsequent assignment, you would be able to use your scanner with your parser without any modification.
To make it easier to grade the assignments, I have a couple of requirements about naming:
The effect of these requirements is that we'll be able to compile and run your program using the following commands:
jflex notjava.flex javac *.java java Driver inputfile.nj
...where inputfile.nj is the name of a file containing a NotJava program.
The NotJava language
The NotJava language is a simple imperative-style language with two basic data types (integers and strings), functions with zero or more arguments and one return type, procedures with zero or more arguments and no return type, reasonably straightforward mathematical, relational, and logical operations, and a few kinds of imperative-style statements (e.g. assignment statements, an "if" statement, and one kind of loop). Of course, to build a scanner, it's not necessary to know the complete specification of a programming language (though, in some cases, it helps), so I will stick to a brief example program and a lexical specification (i.e. what each kind of token looks like, without regard for how these tokens are put together to write a NotJava program).
An example program
Here's an example NotJava program that demonstrates many language features:
# the square function returns the square of some integer function square (integer n) -> integer: [ if n == 0 then Result <- 0 # the Result "variable" holds the function's # return value else [ var integer x; # declares a variable x x <- n * n; Result <- x ] ] # the program procedure is where the program begins its execution, # somewhat akin to the main() method in Java procedure program(): [ var integer a, b; a <- square(3); b <- square(a); print(a); var string s; s <- " squared is "; print(s); print(b) ]
Lexical specification of NotJava
A NotJava program is made up of a sequence of lexemes, each of which can be classified as being of one particular token type. Unless otherwise specified, the lexeme that should be returned by your scanner is the actual chunk of source text, such as 35 or <-. The types of tokens, and the rules that govern their appearance, are as follows:
Sample input and output
Sample input:
# a sample NotJava program procedure program(): [ var integer a; a <- 3 * 3; print("hello "); # notice how comments don't show up in the output print(a) ]
Sample output:
PROCEDURE IDENTIFIER (program) LEFT_PAREN RIGHT_PAREN COLON LEFT_BRACKET VAR IDENTIFIER (integer) IDENTIFIER (a) SEMICOLON IDENTIFIER (a) ASSIGNMENT_OP INTEGER_LITERAL (3) MULTIPLICATION_OP INTEGER_LITERAL (3) SEMICOLON IDENTIFIER (print) LEFT_PAREN STRING_LITERAL (hello ) RIGHT_PAREN SEMICOLON IDENTIFIER (print) LEFT_PAREN IDENTIFIER (a) RIGHT_PAREN RIGHT_BRACKET
Your program's output need not be exactly like mine, but it should convey the same message; on each line, it should show one token and its associated lexeme.
Naturally, it's necessary for your program to work with more than just the sample input. I'll be testing it with several input programs that, together, test all of the lexical specification as described in the previous section. You need not provide error handling; you may assume that the input file consists of a legal NotJava program.
Deliverables
Submit your JFlex script and the .java files that comprise your driver program. You need not submit the .java file created by JFlex; I'll be regenerating it during the grading process anyway. Remember that the name of your JFlex script must be notjava.flex, and that your driver program's main() method must appear in a file called Driver.java!
Limitations
There are a few JFlex features that I would prefer that you didn't use in your assignment. If you don't know what the features on this list are, there's no need for you to find out about them unless you're curious.