Assignment 2 (Due Sunday, December 8nd at 13:00)

Table of Contents

1 A reader for Scheme S-exprs

1.1 Description

In this problem you will write the first stage in the pipeline of your Scheme compiler. You should keep the code for this problem, as you will continue to use it in subsequent assignments, all the way up to the final project.

The reader is a parser for sexprs: It reads text from a string, and outputs an Abstract Syntax Tree for sexprs. You will need to use the parsing combinators package from the first assignment to write mutually-referential parsers for the various kinds of sexprs, and use these to define pSexpr, the parser for sexprs. As you recognize different kinds of sexprs, you will need to provide a pack wrapper for the parsers, that will invoke the constructors for the corresponding sexpr classes. That way, pSexpr.match(str) will return an AST for the sexpr the concrete syntax of which is in str, and any remaining characters.

Your parser should be built up similarly to the parser you wrote for molecules, in assignment 1: It should raise a NoMatch exception in case it is not possible to read an sexpr from the head of the input string, and should return a pair of the expression, and the remaining string of unread characters.

1.2 Comments & whitespaces

Your sexpr may contain whitespace characters and comments. Your parser will have to know to skip over these as it constructs the AST for the sexpr it's reading.

1.2.1 Whitespaces

As in the previous assignment, any character less than or equal to the space character, is considered a whitespace for the purpose of this assignment.

1.2.2 Line comments

Line comments start with the semicolon character ; and continue until an end-of-line is reached. The semicolon may appear anywhere on the line, and need not be the first charcter!

This kind of comment is used to document your code.

1.2.3 Sexpr comments

Scheme includes another kind of comment, not so much to document your code as to “hide” the next sexpr without actually removing it from the source file. This kind of “commenting out” is very handy when debugging code.

Sexpr comments start with the characters #; and continue to the end of the following sexpr. There may be no space or comment between the # and the ; characters. This sexpr can be any valid sexpr: A number, a Boolean, a symbol, but more often than not, a deeply-nested expression with balanced parentheses.

Please note that while removing comments is usually “easy”, sexpr comments are non-trivial, because they rely on your ability to recognize what is a valid sexpr!

1.3 The concrete syntax of sexprs

1.3.1 Boolean

There are two Boolean values in Scheme: #f (for false), and #t (for true). I would like your reader to recognize these in a case-insensitive manner, so that #t and #T are one and the same, etc.

1.3.2 Number

Scheme has a rich numerical tower. We will not be supporting the full numerical tower in our compiler, but we do want to experience the polymorphism of Scheme procedures, so we will support two kinds of numbers: integers & fractions.

  1. Integers

    Integers can be positive or negative. They may begin with any number of zeros. They may come with an optional sign, which can either be positive or negative. Here are valid numbers in our language:

    • 1234
    • 01234
    • -1234
    • -012
    • -0
    • +1234
    • +432

    In standard Scheme, numbers may not start with an initial +, but we shall permit this here as an extension.

    Numbers (but not the sign) may be prefixed with 0x, 0X, 0h, 0H, in which case, the number is given in hexadecimal, or base 16. The following are valid hexadecimal numbers:

    • 0x1234
    • -0H432
    • +0x1234
    • 0H432

    Hexadecimal numbers include the digits 0, … , 9, a, … , f, either in uppercase or lowercase, with the letters a, … , f having the values:

    a 10
    b 11
    c 12
    d 13
    e 14
    f 15

    Hexadecimal numbers are an extension to the Scheme standard.

  2. Fractions

    Fractions are exact quantities represented using a numerator and a denominator. The numerator is an integer (either positive or negative), and the denominator is an unsigned integer. These integers can be given in the full generality of the integers in the previous section. The numerator and the denominator are separated by a slash character /. Here are some examples of fractions:

    • 4/5
    • -2/3
    • -0x234abc/37
    • +5/7

    The denominator should not have the numerical value of zero. If it does, your parser should not identify it as a fraction.

1.3.3 Symbol

In an interactive Scheme system, a symbol is represented internally as a hashed string. Our compiler, however, is not an interactive (or online) compiler, but a batch (or offline) one, so the AST for symbols will just contain the string and its length.

In principle, a symbol in Scheme can contain any character. This is the case for symbols that have been created dynamically from strings, using string->symbol. Constant symbols that are read in by the reader obey a stricter syntax. The characters in a symbol may include the following characters:

  • The lowercase letters: a, … , z
  • The uppercase letters: A, … , Z
  • Digits: 0, … , 9
  • Punctuation: !$^*-_=+<>/?

Scheme has traditionally been case-insensitive with respect to symbols, although this has changed in R6RS (the 6th version of the Scheme standard). For the purpose of this course, I would like you to assume that strings are case-insensitive in the following way: Your parser should convert all symbol characters to uppercase. Hence the symbols abc, Abc, aBc, and ABC are all the same object, which is stored internally as ABC.

1.3.4 String

Strings are monomorphic arrays of characters. Syntactically, strings in Scheme are delimited by double quote marks, may contain any character, and span across several lines. Here are some examples of strings:

"moshe"

"a string"

"This is a very long
string that spills across
several lines."
  1. String meta-chars

    Some characters cannot appear in a string without being preceeded by a special backslash prefix. This is the exact same situation as in C, C++, Java, Python, and many other programming languages. The list of meta-characters you need to support are:

    Char ASCII/Unicode Concrete syntax
    newline ASCII 10 \n
    return ASCII 13 \r
    tab ASCII 9 \t
    formfeed ASCII 12 \f
    backslash ASCII 93 \\
    double quote ASCII 34 \"
    lambda UTF8 0x03bb \l

    The last meta-character, lambda was added as our own extension to the language.

1.3.5 Char

Characters are denoted by the character prefix #\. There can be no whitespace or comments between the the # & \ characters. What follows the prefix falls under one of several categories:

  1. Named chars

    Some characters are denoted by their full name:

    Char ASCII/Unicode Concrete syntax
    newline ASCII 10 #\newline
    return ASCII 13 #\return
    tab ASCII 9 #\tab
    formfeed ASCII 12 #\page
    lambda UTF8 0x03bb #\lambda

    The last named character, lambda was added as our own extension to the language.

    Named characters are case insensitive, so that #\page, #\Page, and #\PAGE are all denote the same character.

  2. Hexadecimal chars

    For some characters, their position, either in the ASCII or Unicode character sets, would be simpler to note than their name. Such characters can be entered using the hexadecimal character prefix #\x followed by either two or four hexadecimal digits in case-insensitive, hexadecimal notation. Here are some example:

    • #\x03bb is another way of writing #\lambda (λ)
    • #\x05D0 is the letter alef (א)
    • #\xFDFA is the Unicode Arabic glyph ‎ﷺ‎ (‏صلى الله عليه وسلم‎)
  3. Visible chars

    Characters that are in the visible range (i.e., have a larger ASCII or Unicode value than 32) can be entered “as-is”, with the character prefix. Here are some examples:

    • #\א
    • #\ༀ
    • #\☺
    • #\M
    • #\?

    Notice that visible characters are case-sensitive, so that #\a & #\A are distinct.

1.3.6 Nil

Nil, or the empty list, is simply a matching pair of parentheses: (). These may enclose whitespaces and comments.

1.3.7 Pair

The concrete syntax for pairs respects the two dot rules, generating proper lists and improper lists. The formal grammar for these is as follows:

\begin{eqnarray*} \left\langle\mathrm{ProperList}\right\rangle & ::= & \mathtt{(} ~ \left\langle\mathrm{Sexpr}\right\rangle^{*} ~ \mathtt{)} \\ \left\langle\mathrm{ImproperList}\right\rangle & ::= & \mathtt{(} ~ \left\langle\mathrm{Sexpr}\right\rangle^{+} ~ \mathtt{.} ~ \left\langle\mathrm{Sexpr}\right\rangle ~ \mathtt{)} \end{eqnarray*}

You will need to convert both proper and improper lists to nested instances of Pair.

This is your first production that is recursive: A pair is an sexpr, and can contain other sexprs, including other pairs.

1.3.8 Vector

The grammar for the concrete syntax of vectors is similar to the first production for pairs:

\begin{eqnarray*} \left\langle\mathrm{Vector}\right\rangle & ::= & \mathtt{\#(} ~ \left\langle\mathrm{Sexpr}\right\rangle^{*} ~ \mathtt{)} \end{eqnarray*}

Notice that vectors too are recursive.

1.3.9 Quote-like forms

You need to support 4 quote-like forms:

\begin{eqnarray*} \left\langle\mathrm{Quoted}\right\rangle & ::= & \mathtt{'} \left\langle\mathrm{Sexpr}\right\rangle \\ \left\langle\mathrm{QQuoted}\right\rangle & ::= & \mathtt{`} \left\langle\mathrm{Sexpr}\right\rangle \\ \left\langle\mathrm{UnquotedSpliced}\right\rangle & ::= & \mathtt{,\!\!@} \left\langle\mathrm{Sexpr}\right\rangle \\ \left\langle\mathrm{Unquoted}\right\rangle & ::= & \mathtt{,} \left\langle\mathrm{Sexpr}\right\rangle \\ \end{eqnarray*}

For these, you should generate the sexprs Pair(Symbol(name), Pair(sexpr, Nil()), where name is one of quote, quasiquote, unquote-splicing, and unquote, respectively, and sexpr is the sexpr that follows the quote-like tag.

1.4 Classes for representing sexprs abstractly

You should represent the abstract syntax of sexprs using the following classes:

class AbstractSexpr
class Void(AbstractSexpr)   
class Nil(AbstractSexpr) 
class Boolean(AbstractSexpr) 
class Char(AbstractSexpr) 
class AbstractNumber(AbstractSexpr)
class Integer(AbstractNumber)
class Fraction(AbstractNumber)
class String(AbstractSexpr)
class Symbol(AbstractSexpr)
class Pair(AbstractSexpr)
class Vector(AbstractSexpr)

Notice that while we haven't mentioned void in the discussion of the concrete syntax for sexprs, we are representing it in the abstract syntax. This is because void is one of the valid constant data in Scheme, so we need to have it along with integer, pair, etc. There is no way, however, to input the constant void, so it is not generated by any of the parsers for sexprs. Void will be used in the next problem.

The AbstractSexpr class should define the __str__ method, so that the str(…) procedure in Python will know how to print sexprs.

The AbstractSexpr class should define the static method readFromString that takes a string, and returns a pair of an sexpr and the string of the remaining characters.

1.5 How to test your parser

Once you have implemented __str__, testing the reader is simple: Apply str(…) to the sexpr constructed by your reader, and then compare it to the original text in the input string. Both should be equal? under Scheme. If they are not, then something is wrong. In other words, you can use your favourite Scheme system to check that your reader is correct.

1.6 How to organize your code

  • Create a directory by the name scheme-compiler. You will be using this directory throughout the semester, so make sure to keep it neat.
  • Place the latest copy of pc.py in scheme-compiler.
  • The code containing the classes for the abstract syntax of sexprs should be placed in sexprs.py in the scheme-compiler directory.
  • The parser code should go into reader.py in the scheme-compiler directory.

2 A tag-parser for Scheme

2.1 Description

2.2 The concrete syntax of Scheme code

2.2.1 Core forms

  1. Constants

    Constants come in two forms: quoted and unquoted. The field of any quoted form is a constant. Self-evaluating forms (Booleans, chars, numbers, strings) are constants too, even if they haven't been quoted.

  2. Variables

    The concrete syntax of variables is given as unquoted symbols that are not reserved words. For each variable, you should generate an instance of Variable.

  3. Conditionals

    You should support both the if-then & if-then-else forms of conditions in Scheme. If-then forms expand into if-then-else forms, where the else field has the value Constant(Void()).

  4. Lambda Expressions

    There are 3 kinds of λ-expressions in Scheme, simple, which are represented using instances of LambdaSimple, with optional arguments, which are represented using instances of LambdaOpt, and variadic, which are represented using instances of LambdaVar.

  5. Applications

    These are represented using instances of Applic.

  6. Disjunctions

    Disjunctions are simply or-expressions. Recall that we shall be supporting or-expressions as a core form, while macro-expanding and-expressions.

  7. Definitions

    There are two ways to write definitions in Scheme: The basic way, and what I call “the MIT-syntax for define”, which is used to define procedures, and which appears throughout the book The Structure and Interpretation of Computer Programs.

    • Simple define-expressions are of the form (define <name> <expr>). The <name> is the variable name, and the <expr> is the expression the value of which is to be assigned to the given variable. Both the variable name and the expression need to be parsed, giving two values that are instances of AbstractSchemeExpr (the first of which is always a Variable). These two values are packaged as an instances of Def.
    • MIT-style define-expressions are of the form (define(<name> . <argl>) <expr>), where
      • <name> is the name of the variable
      • <argl> represents the list of parameters
      • <expr> is an expression.

    The MIT-style define-expression should be macro-expanded into an ordinary define-expression, and then tag-parsed recursively. The above meta-expression expands into: (define <name> (lambda <argl> <expr>)).

    • If <argl> is a symbol, then the entire λ-expression is variadic.
    • If <argl> is a proper list, then the entire λ-expression is simple.
    • If <argl> is an improper list, then the entire λ-expression is one with optional arguments.

2.2.2 Syntactic sugar

  1. Quasiquoted expressions

    Upon recognizing a quasiquoted-expression, you should expand the expression, and call the tag-parser recursively over the expanded form.

    A Scheme implementation for a quasiquote-expander for non-nested quasiquoted-expressions will be posted onto Mayer Goldberg's website, just under the assignment statement.

  2. cond

    You should expand cond-expressions into nested if-expressions. For example, the expression

    (cond (T1 E1)
          (T2 E2)
          (T3 E3)
          (T4 E4)
          (else E))
    

    should be expanded into:

    (if T1 E1 (if T2 E2 (if T3 E3 (if T4 E4 E))))
    

    The tag-parser should be called recursively over the expanded form.

    Please keep in mind that the else keyword is not necessary. A cond without an else returns Const(Void()) if none of the tests match.

  3. let

    Expand these into applications, and call the tag-parsed recursively.

  4. let*

    Expand these into nested let-expressions, and call the tag-parser re- cursively.

  5. letrec

    The letrec form should only be used to define procedures. The expression

    (letrec ((g1 <LE1>)
    	 (g2 <LE2>)
    	 ...
    	 (gn <LEn>))
      <expr>)
    

    where <LE1>, <LE2>, etc, are λ-expressions, and <expr> is some instance of AbstractSchemeExpr, should be macro-expanded into the form:

    (Yag
     (lambda (g0 g1 g2 ... gn) <expr>)
     (lambda (g0 g1 g2 ... gn) <LE1>)
     (lambda (g0 g1 g2 ... gn) <LE2>)
     ...
     (lambda (g0 g1 g2 ... gn) <LEn>))
    

    where g0 is a fresh variable name, and Yag is the name of a global variable the value of which you will get towards the final stage in your compiler pipeline. For the time being, it is just the global name of some semi-mysterious procedure…

    The tag-parser should then be called recursively on the expanded form.

  6. and

    Expand these into nested if-expressions, and call the tag-parsed recursively.

2.3 Classes representing the abstract syntax of Scheme code

class AbstractSchemeExpr
class Constant(AbstractSchemeExpr)
class Variable(AbstractSchemeExpr)
class IfThenElse(AbstractSchemeExpr)
class AbstractLambda(AbstractSchemeExpr)
class LambdaSimple(AbstractLambda) 
class LambdaOpt(AbstractLambda) 
class LambdaVar(AbstractLambda) 
class Applic(AbstractSchemeExpr)
class Or(AbstractSchemeExpr)
class Def(AbstractSchemeExpr)

2.4 Functionality

2.4.1 Parsing

The class AbstractSchemeExpr should have a static method parse, which takes a string, invokes the reader, and tag-parses the result. The parse method should return a pair of a parsed and macro-expanded Scheme-expression, and the remaining string of text.

2.4.2 Converting to strings

Concrete instances of subclasses of AbstractSchemeExpr should implement the __str__ method, so that the procedure str( ... ) will operate on Scheme expressions. The output should be a string representing the concrete syntax of the macro-expanded form. This means that

>>> expr, remaining = AbstractSchemeExpr.parse('(lambda (a b c) (a (b c))'))
>>> remaining
''
>>> str(expr)
'(lambda (a b c) (a (b c))'

We will post additional examples later on.

2.5 How to test your tag-parser

Testing your parser is actually rather simple:

  1. Start with a Scheme expression the value of which you know (because you can compute it in Chez Scheme).
  2. Call the string containing the concrete representation for the above expression by the name A.
  3. invoke the AbstractSchemeExpr.parse method on A.
  4. Use str( ... ) to get a string representation of the macro-expanded & parsed form. Call the resulting string B.
  5. A & B should give the same value when typed in the same Scheme system (where “same” is in the sense of equal? returning #t for both).

2.6 How to organize your code

  • Add the file tag_parser.py to the existing scheme-compiler directory.

3 Generating a Scanner for a Set-based Data Structure Language

This problem is given by Roman Manevich.

The goal of this exercise is to implement a scanner for a language called ADTL (for Abstract Data Type Language), which we describe next.

Definition of ADTL

An ADTL program defines a data type in a similar way to a class in most object-oriented languages, in addition to features such as: sets are first-class types, locking statements, and path expressions. A variation of ADTL was used to guide an advanced static analysis in finding optimization opportunities for graph algorithms with optimistic parallelization.

Here is a link to an example ADTL program, which defines a graph data type.

Lexical Considerations.

  • Identifiers start with an underscore or a letter and may continue with an underscore, letter, or a digit.
  • Integer literals do not have extra leading 0's. Examples of legal literals are \0" and \-5". An example of an illegal literal is \007".
  • String literals may span at most one line and may contain escape sequences for whitespace, new line, tab, backslash, and quotation marks.
  • ADTL supports single-line comments and multi-line comments as in other languages such as C++ and Java.

The following table lists all token types in ADTL, associating a (Java constant name) with a description of each one:

Description Constant (in sym.java)
/ DIVIDE
{ LCBR
<= LTE
rev REV
; SEMI
integer constants INT
- MINUS
assert ASSERT
| BAR
! NOT
< LT
choose CHOOSE
in IN
( LP
, COMMA
lock LOCK
) RP
+ PLUS
quoted strings QUOTE
= ASSIGN
if IF
identifiers ID
. DOT
} RCBR
&& LAND
return RETURN
new NEW
+= ASSIGN_PLUS
|| LOR
!= NEQ
-= ASSIGN_MINUS
== EQ
= GTE
* TIMES
: COLON
else ELSE
@ AT
> GT
set SET

Grammatical Considerations.

Although it is not used in this assignment, the full grammar of ADTL is shown below (in the next assignment you will write a parser for this grammar).

We use the following notational conventions:

  • Non-terminals start with a capital letter and appear in boldface. For example: Stmt.
  • Terminals appear between single quotation marks. For example '||' is the terminal used for logical or.
  • The terminals for identifiers, integer constants and quoted strings appear as id, int, and quote.
  • We use the Kleene star to denote possibly empty sequences. For example Stmt* is a sequence of 0 or more statements. Similarly, the + version is used to denote non-empty sequences. So, Stmt+ is a sequence of 1 or more statements.
  • We denote a possibly-empty sequence of elements separated by a token by the notation Element⊗token and non-empty sequences by Element⊕token. Examples are Arg⊗',', which denotes a sequence of 0 or more arguments separated by commas, and PathElement⊕'.', which denotes a sequence of 1 or more path elements separated by dots.

 id             Identifier token
 int            Integer token
 quote          Quoted string token
 
 ADT → id '{' Element* '}'
 ElementFieldDef';' | MethodDef
 FieldDef → id ':' Type
 Type → id | 'set'
 MethodDef → id '(' Arg⊗',' ')' '{' Stmt* '}'
 Arg → id ':' Type

Stmt → 'return' Expr';'
        | Assign
        | 'if' '('Expr')' SingleOrBlockStmt
        | 'if' '('Expr')' SingleOrBlockStmt 'else' SingleOrBlockStmt
        | 'assert' Expr';'
        | 'assert' Expr ':' quote';'
        | 'lock' PathExpr';'
        
AssignPathExpr '=' Expr
AssignPathExpr '+=' Expr
AssignPathExpr '-=' Expr
SingleOrBlockStmtStmt | '{' Stmt+ '}'

Expr → 'choose' Expr
        | int
        | '|' Expr '|'
        | 'new' Type '(' AssignedArg* ')'
        | '-' Expr
        | Expr '+' Expr
        | Expr '*' Expr
        | Expr '-' Expr
        | Expr '/' Expr                        
        | Expr '<' Expr
        | Expr '<=' Expr
        | Expr '>' Expr
        | Expr '>=' Expr
        | Expr '==' Expr
        | Expr '!=' Expr
        | '!' Expr
        | Expr '&&' Expr
        | Expr '||' Expr
        | Expr 'in' Expr
        | PathExpr
        | '(' Expr ')'

AssignedArg → id '=' Expr
        
PathExprPathElement⊕'.'
PathElement → id ['@'] | 'rev' '(' id ['@'] ')'
 

What to Implement

You will implement the scanner using the JFlex scanner generation tool.

As part of this exercise, you are given a skeleton implementation, which already contains classes that implement the following: lexical error exceptions; a token class; a driver class for running the scanner on input files; and an Ant build.xml file to help you automate tasks such as generating the scanner, compiling the sources, cleaning, etc. Using the build file and Ant is completely optional and up to you (we will not consider it part of your submission).

The lib sub-directory already contains the JFlex.jar, which you can use to generate the scanner.

You should complete the ADTL.lex file with all the token definitions and other necessary content in order to implement the scanner for ADTL. You should also include one constant per token type in sym.java.

Running JFlex on ADTL.lex should generate a file Lexer.java, which should be compiled with the rest of the sources. The Main.java class uses the scanner to read an ADTL input file and print out the tokens appearing there, one token per line. It should be invoked using the following format: java ADTL.Main -tokens file-name where file-name is the name of the input ADTL file.

Your scanner should also detect and report any lexical analysis errors it may encounter. Whenever the program encounters a lexical error, the scanner must throw a LexicalError exception with a meaningful description of the error (the existing implementation already adds the fact that this is a lexical error and the line number and text). Your program must always report the first lexical error in the file.

The test sub-directory of the code given to you contains an example of an input and corresponding output of the scanner.

Testing the Scanner. We expect you to perform your own testing of the scanner. You should develop a thorough test suite that tests all legal tokens and as many lexical errors as you can think of. We will test your scanner against our own test cases, including programs that are lexically correct, and also programs that contain lexical errors. You may find it useful to go over the ASCII table to make sure your scanner doesn't miss cases. Make sure you handle comments, quoted strings, and escape sequences appropriately.

Other Tools. You may consider the automation of this process using make files, shell scripts, or other similar tools. We recommend using the Apache Ant tool, which is a kind of a platform-independent make utility (JFlex and JavaCup come with Ant tasks). We highly recommend using the Eclipse IDE. Eclipse has many useful features such as code navigation, text completion, unit testing, and debugging. All of these can significantly help increase your productivity in this project.

What to Turn In

You should submit your ADTL.lex file. You should also submit the test suite programs you used to test your scanner on.

4 How to submit

  • You should submit a zip file hw2.zip that creates a directory hw2 containing:
  • scheme-compiler
    • pc.py
    • sexprs.py
    • reader.py
    • tag_parser.py
  • problem-3
    • ADTL.lex
    • Any additional test suite programs you used to test your scanner on.
  • A file readme.txt that contains the names, IDs, and email addresses of all group members (you may either work alone, or with a single partner), as well as the following statement:

I(We), <fill in your name(s)>, declare the following to be true:

  • We worked on this assignment alone. We did not consult with others about this assignment, other than the teaching staff for this course.
  • We did not copy code either from other students or from the internet.
  • We did not make available this code to other students taking this course.
  • We are aware of the University, faculty (as specified in the ‏שנתון‎), departmental, and course policies (as specified in the syllabus) on academic dishonesty: We realize that if we are found to have committed academic dishonesty that we shall be sent before the disciplinary committee (‏ועדת משמעת‎), fail the course, and possibly incur other penalties as decided by the disciplinary committee.

We will not grade any work submitted that does not include the above statement. Please do not forget to include it, with your names spelled out. If you do not do this, for any reason, then at the very least, you will incur the 5% per diem late penalty (as specified in the course syllabus).

4.1 Last minute instructions

  • Unlike the first assignment, we are now asking you to include the parsing combinators package pc.py with your work.
  • PLEASE do make sure that your work runs on the departmental Linux machines. It will save you much grief later on!
  • PLEASE do go over the submission instructions carefully, and make sure you did not leave out anything.

Author: Mayer Goldberg

Created: 2013-11-26 Tue 20:19

Validate XHTML 1.0