Parsing Combinators in Scheme

Table of Contents

1 The basic package: parsing-combinators.scm

Parsing combinators are a technique for implementing top-down, recursive descent parsers in a way that is functional and compositional. The file parsing-combinators.scm implements parsing combinators and a set of supporting procedures for embedding parsers into Scheme. In principle, this is the only file you need to start implementing a parser for your favourite language.

1.1 Basic concepts, and the structure of the parser

A parser is a closure of the form

(lambda (s match fail)
  ... )

The argument s contains a list of tokens. These can be characters or symbols or numbers or any Scheme object. A parser is written in Continuation-Passing Style (CPS), where each parser takes two continuations, match and fail, corresponding to a standard technique of using success & fail continuations to manage two courses of actions simultaneously.

The parser will then attempt to construct an expression out of the tokens at the head of the list s. If successful, the match continuation will be applied to two arguments: the expression & the remaining tokens. If unsuccessful, the fail continuation will be applied with no arguments.

When defining the parser, it is possible to specify procedures that do postprocessing on the expression before it is passed to the match continuation. If you do not specify a postprocessor, the parser will “do the right thing”, which in most cases means using the identity function for a postprocessor.

While parsing combinators are implemented in this package using CPS, you do not need to know or use CPS in order to use this package. Part of the beauty of parsing combinators is that they compose parsers (and are themselves written in CPS), but the expressions the expressions that combine parsing combinators are themselves written in direct style, i.e., do not themselves use CPS.

1.2 How parsing combinators are used to construct parsers

Parsing combinators are higher-order functions that compose parsers for smaller languages into parsers for larger languages. The idea is to start with constant parsers, that recognize single tokens, and then combine them using catenation, disjunction and the Kleene Star into larger parsers.

For example, the parser for recognizing the string "fubar" is defined as a catenation of the parsers that recognize the individual characters #\f, #\u, #\b, #\a, & #\r.

Accordingly, the procedures that form the basis for the parsing combinator package are const – for defining constants, caten – for returning the catenation parsers, disj – for returning the disjunction of parsers, and star – for returning the Kleene Star of a given parser.

1.3 The main entry points

1.3.1 test

The procedure test is provided so as to simplify testing and debugging of parsers that were written using parsing combinators. It takes 2 arguments:

  1. A parser
  2. A list of tokens

It returns a “report”, which means either the symbol no-match!, in case the parser was not able to construct an expression from the first tokens in the list of tokens, or a list that details the expression the parser had generated and the remaining tokens.

The procedure test is not meant as a delivery vehicle for the parser, but rather just for testing. Examples of how test is used will appear throughout this page.

1.3.2 const

The const procedure is how parsers are defined that recognize the simplest expressions – usually at the level of individual characters. For the sake of generality, const takes a predicate match? and an optional postprocessor cb. If the list of tokens is non-empty & its head satisfies match?, then the head is passed on to the postprocessor, and whatever is returned by the postprocessor is passed onto the success continuation match, along with the remainder of the list of tokens. If the list of tokens is empty, or match? is not satisfied, then the fail continuation fail is called.

Here are some examples:

The parser p1 recognizes the number 37 (how boring is that??). We first call it with a list that does not start with 37, and then with a list that does:

> (define p1 (const (lambda (x) (and (number? x) (= x 37)))))
> (test p1 '(1 2))
> (test p1 '(37 1 2))
((expression: 37) (tokens left: (1 2)))

The parser p2 also recognizes the number 37. It is different from p1 in that we provide a postprocessor that takes the argument 37 and replaces with it with the symbol thirty-seven. The difference in the behaviour of p2 should be clear:

> (define p2 (const (lambda (x) (and (number? x) (= x 37))) (lambda (_) 'thirty-seven)))
> (test p2 '(1 2))
 > (test p2 '(37 1 2))
((expression: thirty-seven) (tokens left: (1 2)))

1.3.3 caten

$$\mathtt{caten}(\mathit{parser}_{G_1}, \ldots, \mathit{parser}_{G_n}) = \mathit{parser}_{(G_1 \cdots G_n)} $$

The caten procedure takes any number of parsers as arguments (it is a variadic procedure), and returns their catenation. For example, if p1 recognizes the number 1, p2 recognizes the number 2, & p3 recognizes the number 3, then the catenation of p1, p2, & p3 is a parser that recognizes the numbers 1 2 3, in order, in the start of the list of tokens.

> (define number
    (lambda (n)
       (lambda (x)
         (and (number? x)
              (= x n))))))
> (define p4 ((caten (number 1) (number 2) (number 3))))
> (test p4 '(3 2 1 4))
> (test p4 '(1 2 3 4))
((expression: (1 2 3)) (tokens left: (4)))

Notice that in the above example, caten has two pairs of parenthesis around it, that it, the application of caten returns a closure that is applied again. Because caten takes any number of parsers as arguments, and because we wish to be able to specify an additional, optional postprocessor, the simplest solution is to define caten as two nested, variadic procedures: The outermost is applied to the parsers, and returns a variadic closure that can be applied to the optional argument – the postprocessor. In the above example, we did not pass any postprocessor, which is why the second variadic closure was applied without any arguments.

The postprocessor that is used by caten is different from all other postprocessor procedures that are used in this package: All other postprocessors take but one single argument. The postprocessor used by caten takes as many arguments as were concatenated. Here is an example:

> (define p5 
    ((caten (number 1) (number 2) (number 3))
     (lambda (a b c)
       `(the tokens begin with ,a ,b and ,c))))
> (test p5 '(hi mom!))
> (test p5 '(1 2 3 hi mom!))
((expression: (the tokens begin with 1 2 and 3))
  (tokens left: (hi mom!)))

The motivation behind making conj be variadic, as opposed to binary, is not merely to avoid gratuitous nesting & parenthesis: In the algebra of parsing combinators, the empty conjuction, i.e., the conjuction of zero parsers, is the ε-parser, i.e., the parser that recognizes ε (epsilon). This parser does not shorten the list of tokens, and matches the empty list (). So the variadic catenation is both convenient and theoretically-sound, and saves us from having to define yet-another global entry point – the ε-parser (which is used in defining the star procedure).

1.3.4 disj

$$\mathtt{disj}(\mathit{parser}_{G_1}, \ldots, \mathit{parser}_{G_n}) = \mathit{parser}_{(G_1 \mid \cdots \mid G_n)} $$

The disj procedure takes any number of parsers as arguments, and returns their disjunction. Starting with the first parser, it attempts to construct an expression out of the list of incoming tokens. If it succeeds, the parser passed the expression and remaining tokens to the success continuation match; If it fails, it moves on to the next parser. If none of the parsers match an initial sublist of the incoming tokens, then the fail continuation fail is applied. The empty disjunction, i.e., the disjunction of zero parsers, is the parser that invokes the fail continuation for any incoming list of tokens.

As with caten, the procedure disj is a variadic closure that is applied to any number of parsers, and returns a variadic closure that is applied to an optional postprocessor.

Consider the following examples:

> (define p6 ((disj (number 3) (number 5) (number 8))))
> (test p6 '())
> (test p6 '(2 3 4 5))
> (test p6 '(3 4))
((expression: 3) (tokens left: (4)))
> (test p6 '(5 6))
((expression: 5) (tokens left: (6)))
> (test p6 '(8 8 8))
((expression: 8) (tokens left: (8 8)))

1.3.5 star

$$\mathtt{star}(\mathit{parser}_G) = \mathit{parser}_{(G^{*})} $$

The star procedure takes a parser p and an optional postprocessor cb, and returns a parser for the grammar that matches lists of zero or more strings matched by p. The resulting list is then passed to the postprocessor, and the return value of the postprocessor, together with the remaining sublist of tokens, is passed to the success continuation match. The parser (star p) always matches some, possibly empty, list, even if p matches nothing. Consider the following examples:

> (define p7 (star (number 23)))
> (test p7 '(2 3 4 5))
((expression: ()) (tokens left: (2 3 4 5)))
> (test p7 '(23 23 23 23 4 5 6))
((expression: (23 23 23 23)) (tokens left: (4 5 6)))
> (define p8 
      (lambda (ch) 
        (not (and (char-ci<=? #\a ch)
                  (char-ci<=? ch #\z)))))
     ;; pack the punctuation chars:
     (lambda (s) `(punctuation ,(list->string s)))))
> (test p8 (string->list "[1] an item..."))
((expression: (punctuation "[1] "))
    (#\a #\n #\space #\i #\t #\e #\m #\. #\. #\.)))

1.3.6 maybe

$$\mathtt{maybe}(\mathit{parser}_G) = \mathit{parser}_{(G^{?})} = \mathit{parser}_{(\epsilon \mid G)} $$

The maybe procedures takes a parser p and an optional postprocessor cb, and returns a parser for the grammar that matches either zero or one expressions that are matched by p. It is not the argument that is passed onto the postprocessor, but rather a list containing the argument. If no argument is matched, the postprocessor is passed the empty list. It is therefore possible to test for whether there was zero or one arguments simply by testing for null?. Consider the following examples:

> (define p8 (maybe (number 23)))
> (test p8 '())
((expression: ()) (tokens left: ()))
> (test p8 '(2 3 4 5))
((expression: ()) (tokens left: (2 3 4 5)))
> (test p8 '(23 4 5))
((expression: (23)) (tokens left: (4 5)))
> (define p9 (maybe (const null?)))
> (test p9 '())
((expression: ()) (tokens left: ()))
> (test p9 '(()))
((expression: (())) (tokens left: ()))
> (test p9 '(() () ()))
((expression: (())) (tokens left: (() ())))

1.4 Building extensions

2 A scanner & reader for Scheme: parsing-combinators-sexprs.scm

This file contains an implementation of a reader, i.e., a parser for the grammar of S-expressions. Our reader supports some custom extensions to the syntax of S-expression, so as to demonstrate that the grammar we are parsing is really unrelated to the grammar in which the parser is written – a distinction that is very clear when the implemented language and the language in which the implementation is written are distinct, but which is blurred when the two languages are the same.

Neither do we support the full syntax of S-expressions in Scheme. The main omission is in the numerical tower, out of which we only support integers, signed and unsigned, of arbitrary-precisions. It would make for a nice exercise to extend the reader to handle the full numerical tower of Scheme.

2.1 The main entry points

2.1.1 read-sexpr

The procedure read-sexpr is a wrapper around the testing procedure for parsing combinators test. It takes a string, expands it to a list of characters, and passes it to test with the parser <sexpr>. The return value is whatever test returns. Here are some examples:

> (load "parsing-combinators-sexprs.scm")
> (read-sexpr "4")
((expression: 4) (tokens left: ()))
> (read-sexpr "4 5")
((expression: 4) (tokens left: (#\space #\5)))
> (read-sexpr "((a . b) #(() #f #\\A) #;(this sexpr will be removed!))")
((expression: ((a . b) #(() #f #\A))) (tokens left: ()))

Note that since read-sexpr takes a string for an argument, some characters cannot be entered without the use of the backslash character (\), which is the most commonly-used meta-character for strings. This means that within a string the character object #\A must be entered as #\\A. String objects within strings must have their matching double quotes prefixed by a backslash.

Because read-sexpr uses the <sexpr> parser, it attempts to read only one S-expression. Any remaining characters are left un-consumed in the token stream. This single S-expression, however, can be nested arbitrarily, and include subexpressions that are commented out via the S-expression comment prefix #;. Here is a more complex example to show that the string can be spread across several lines and include different kinds of comments:

> (read-sexpr "(;;; This is ignored!
;;; This is another comment line!
first ; this is ignored to the <eoln>!
second #;third
#;(the sexpr comment (can be) (arbitrarily . nested))
((expression: (first second)) (tokens left: ()))

2.1.2 read-sexpr*

Similarly to read-sexpr, the procedure read-sexpr* is a wrapper around the testing procedure for parsing combinators test

> (read-sexpr* "1 2 3 5 8 13")
((expression: (1 2 3 5 8 13)) (tokens left: ()))
> (read-sexpr* "() (()) #;((())) ()")
((expression: (() (()) ())) (tokens left: ()))
> (read-sexpr* "\"first string\"\"second string\"")
((expression: ("first string" "second string"))
  (tokens left: ()))

2.2 Extensions to the syntax of S-expressions

The language we are parsing need not be identical to the language in which we are implementing the parser! In fact, this is one pedagogical weakness of meta-circular interpreters – that they blur the distinction between the implementation language and the language being implemented.

Therefore, we deem it particularly important, for pedagogical purposes, that we implement a language the syntax of which contains various non-standard extensions to Scheme that are not supported by the host system. Exploiting the fact that the current standard now supports Unicode, we extend the syntax of Scheme to include a variety of special characters:

  • The Hebrew letter alef (א) is denoted by the named chars #\alef & #\aleph, and within a string, by the meta-chars \{alef} & \{aleph}.
  • The Hebrew letter bet (ב) is denoted by the named char #\bet, and within a string, by the meta-char \{bet}.
  • The Hebrew letter gimel (ג) is denoted by the named char #\gimel, and within a string, by the meta-char \{gimel}.
  • The smiley character (☺) is denoted by the named char #\smiley, and within a string, by the meta-char \{smiley}.
  • The Tibetan character for the auspicious sound Om (ༀ) is denoted by the named char #\tibetan-om, and within a string, by the meta-char \{tibetan-om}.
  • The glyph (ﷺ), which stands for the Arabic benediction ‏صلى الله عليه وسلم‎ (meaning Peace Be Unto Him) is denoted by the named char #\pbuh, and within a string, by the meta-char \{pbuh}.

Additionally, we extend both characters and strings to support single-byte octal characters, and single & double-byte hexadecimal characters, so in fact, any Unicode character can be represented:

> (read-sexpr "(#\\alef #\\tibetan-om . 
                #(\"Peace Be Unto Him: \\{pbuh} \\{smiley}\"))")
((expression: (#\א #\ༀ . #("Peace Be Unto Him: ﷺ ☺")))
  (tokens left: ()))
> (read-sexpr "(#\\x41 #\\x0fc7 #\\o141 \"\\x0f68 \\o041\")")
((expression: (#\A #\࿇ #\a "ཨ !")) (tokens left: ()))

3 A parser for the λ-calculus: parsing-combinators-lambda.scm

This file contains an implementation of a parser for the λ-calculus. The intention of posting it here, in addition to providing another demonstration of using parsing combinators, is to use the parser as a starting point for proramming projects that concern themselves with the λ-calculus: Reducers of various kinds, etc. In light of this, I strove to make the syntax as inclusive and as general as possible, allowing for many different ways to encode the same λ-expressions.

3.1 Entry points

3.1.1 Parser for expressions

The parser for reading in expressions is <expr>:

> (test <expr> (string->list "λa b.(a (mul b) c1)"))
   (lambda a
     (lambda b
         (applic (var a) (applic (var mul) (var b)))
         (lambda s (var s))))))
  (tokens left: ())) 

The syntax for expressions is rich, and includes automatic Currying, left-associativity for applications, and some syntactic sugar to simplify the encoding of λ-expressions. Consult the section on syntactic sugar for additional details.

3.1.2 parser for definitions

The parser for reading in definitions is <definition>:

> (test <definition> (string->list "K ::= \\x y.x"))
((expression: (define K (lambda x (lambda y (var x)))))
  (tokens left: ()))
> (test <definition> (string->list "B ← \\x y z.(x (y z))"))
   (define B
     (lambda x
       (lambda y
         (lambda z (applic (var x) (applic (var y) (var z))))))))
  (tokens left: ()))
> (test <definition> (string->list "K := \\x y.x"))
((expression: (define K (lambda x (lambda y (var x)))))
  (tokens left: ()))
> (test <definition> (string->list "K :: \\x y.x"))
((expression: (define K (lambda x (lambda y (var x)))))
  (tokens left: ()))
> (test <definition> (string->list "K ≣ \\x y.x"))
((expression: (define K (lambda x (lambda y (var x)))))
  (tokens left: ()))
> (test <definition> (string->list "K == \\x y.x"))
((expression: (define K (lambda x (lambda y (var x)))))
  (tokens left: ()))
> (test <definition> (string->list "K <- \\x y.x"))
((expression: (define K (lambda x (lambda y (var x)))))
  (tokens left: ()))

3.1.3 Parser for files

The parser for reading in complete files is <file>. It reads any number of definitions, load commands, followed by an optional, single expression:

> (test <file> (string->list "

s+ ::= L n s z.(s (n s z))
k ::= L x y . x
f ::= L x y . y
i ::= L x . x
m ::= λx.(x x)
(m m)
#<procedure >>
> ((expression:
   ((define s+
      (lambda n
        (lambda s
          (lambda z
              (var s)
              (applic (applic (var n) (var s)) (var z))))))) (define k (lambda x (lambda y (var x))))
     (define f (lambda x (lambda y (var y))))
     (define i (lambda x (var x))) (load "")
     (define m (lambda x (applic (var x) (var x))))
     (applic (var m) (var m))))
  (tokens left: ()))

3.2 Variables

The syntax for variables roughly follows the syntax of symbols in Scheme. A variable can start with lowercase letter, and be followed by digits and certain punctuation characters. Also, some variables that start with punctuation characters are permitted.

Note that some variables are reserved for syntactic sugar and have builtin meaning that cannot be re-defined. Please consult the section on syntactic sugar.

3.3 Abstractions

There are many ways to write abstractions in the λ-calculus. We support lambda, lam, and a capital letter L. We also support the backslash (\), though it requires a backslash within a string, an ampersand (&), and the unicode character λ. Please consult the following examples:

> (test <expr> (string->list "\\ x x"))
((expression: (lambda x (var x))) (tokens left: ()))
> (test <expr> (string->list "\\ x y z.x"))
((expression: (lambda x (lambda y (lambda z (var x)))))
  (tokens left: ()))
> (test <expr> (string->list "\\ x.\\y . x"))
((expression: (lambda x (lambda y (var x))))
  (tokens left: ()))
> (test <expr> (string->list "λ x y z:x"))
((expression: (lambda x (lambda y (lambda z (var x)))))
  (tokens left: ()))
> (test <expr> (string->list "λ x y z.x"))
((expression: (lambda x (lambda y (lambda z (var x)))))
  (tokens left: ()))
> (test <expr> (string->list "Lx y z.x"))
((expression: (lambda x (lambda y (lambda z (var x)))))
  (tokens left: ()))
> (test <expr> (string->list "&x y z.x"))
((expression: (lambda x (lambda y (lambda z (var x)))))
  (tokens left: ()))
> (test <expr> (string->list "lambda x y z.x"))
((expression: (lambda x (lambda y (lambda z (var x)))))
  (tokens left: ()))
> (test <expr> (string->list "lam x y z.x"))
((expression: (lambda x (lambda y (lambda z (var x)))))
  (tokens left: ()))
> (test <expr> (string->list "lambdax y z.x"))
((expression: (lambda x (lambda y (lambda z (var x)))))
  (tokens left: ()))
> (test <expr> (string->list "lamx y z.x"))
((expression: (lambda x (lambda y (lambda z (var x)))))
  (tokens left: ()))

As you can see in the examples, we tried to make spaces optional where possible. The dot is interchangeable with a colon.

3.4 Parenthesized expressions & application

Parentheses are used both for grouping and for delimiting applications. All applications associate to the left:

> (test <expr> (string->list "(((((x)))))"))
((expression: (var x)) (tokens left: ()))
> (test <expr> (string->list "(a b c d e)"))
     (applic (applic (applic (var a) (var b)) (var c)) (var d))
     (var e)))
  (tokens left: ()))
> (test <expr> (string->list "(a b c (d e))"))
     (applic (applic (var a) (var b)) (var c))
     (applic (var d) (var e))))
  (tokens left: ()))

3.5 Syntactic sugar

Texts on the λ-calculus employ all sorts of abbreviations and meta-notation. This parser recognizes three:

  1. Convenient nation for Church numerals
  2. Ordered n-tuples
  3. Projections

Church numerals & projections can be thought of as referring to special builtin names that have pre-defined values and which cannot be re-defined. Ordered n-tuples are a form of syntactic sugar.

3.5.1 Church numerals

Church numerals can be written as

  • natural numbers (following Church's Calculi of Lambda Conversion),
  • c3, church10, etc
  • c_5, church_14, etc
  • c-1, church-18, etc

They are expanded into the corresponding λ-expressions. Note that \begin{eqnarray*} c_1 & = & \lambda ~=_{\eta}~ \lambda s.s \end{eqnarray*} and accordingly, our parser generates (lambda s (var s)).

Natural numbers can either be zero, or else they must start with a non-zero digit.

3.5.2 Ordered n-tuples

The parser supports the standard syntax for ordered n-tuples, either using square brackets or using angle-brackets:

> (test <expr> (string->list "<a, b, c>"))
   (lambda g0
       (applic (applic (var g0) (var a)) (var b))
       (var c))))
  (tokens left: ()))
> (test <expr> (string->list "[a, b, c]"))
   (lambda g1
       (applic (applic (var g1) (var a)) (var b))
       (var c))))
  (tokens left: ()))
> (test <expr> (string->list "〈a, b, c〉"))
   (lambda g2
       (applic (applic (var g2) (var a)) (var b))
       (var c))))
  (tokens left: ()))

The standard encoding for the ordered n-tuples requires a fresh variable, which we create using the gensym procedure. It is possible to encode an n-tuple without a fresh variable, as an application of the n-tuple-maker to the n λ-expressions, but while this definition is β-equivalent to the standard definition, it is significantly more cumbersome. If the use of gensym must be avoided, then you can simply re-define the procedure ^n-tuple to the more elaborate definition described above.

Note that we support real Unicode angle brackets in addition to the greater than and less than characters that are used to simulate real angle brackets.

3.5.3 Projections

Projections take an ordered n-tuple and return the k-th tuple. This parser allows for a variety of notation, in accordance with the literature on the λ-calculus:

> (test <expr> (string->list "p12"))
   (lambda z
     (applic (var z) (lambda x1 (lambda x2 (var x1))))))
  (tokens left: ()))
> (test <expr> (string->list "p_12"))
   (lambda z
     (applic (var z) (lambda x1 (lambda x2 (var x1))))))
  (tokens left: ()))
> (test <expr> (string->list "p_1_2"))
   (lambda z
     (applic (var z) (lambda x1 (lambda x2 (var x1))))))
  (tokens left: ()))
> (test <expr> (string->list "proj_1_2"))
   (lambda z
     (applic (var z) (lambda x1 (lambda x2 (var x1))))))
  (tokens left: ()))
> (test <expr> (string->list "proj1_2"))
   (lambda z
     (applic (var z) (lambda x1 (lambda x2 (var x1))))))
  (tokens left: ()))
> (test <expr> (string->list "π12"))
   (lambda z
     (applic (var z) (lambda x1 (lambda x2 (var x1))))))
  (tokens left: ()))
> (test <expr> (string->list "π1_2"))
   (lambda z
     (applic (var z) (lambda x1 (lambda x2 (var x1))))))
  (tokens left: ()))
> (test <expr> (string->list "π21"))
   (lambda z
     (applic (var z) (lambda x1 (lambda x2 (var x1))))))
  (tokens left: ()))

The last example should make it clear that the order of the n & k is not important, and the parser will interpret this correctly, i.e., take the \(\min(n, k)\) index of the tuple out of an ordered \(\max(n, k)\). Also, a separator is only needed between n & k when they are longer than 1-digit each.


Because I envisioned this parser as having to support real work, I tried to invest in flexible syntax for comments.

3.6.1 Line comments

A line comment start with ; or # and goes on until and end-of-line is encountered. These comments are in the spirit of LISP, Scheme, and various shell & scripting languages of the Unix variety.

3.6.2 λ-expression comments

Because λ-expressions can span several lines, and because comments are often used to disable a definition, or a part of a definition, rather than to document the code, the parser supports comments at the level of entire λ-expressions. In the spirit of S-expr-comments in Scheme, comments starting with #; disable entire λ-expressions.

3.6.3 XML comments

To comment-out large areas of code, we support the XML tags <comment> ... </comment>. Anything in between is commented out. Note, however, that XML comments cannot be nested.

3.7 Loading files

The <file> parser supports syntax for specifying the names of files that should be loaded — presumably files with definitions of various terms. The syntax is diverse:

> (test <file> (string->list "
(load \"\")
load []
   ((load "")
     (load "")
     (load "")
     (load "")
     (load "")))
  (tokens left: ()))

The rationale for supporting different ways of specifying a filename is that it is inconvenient to use double quotes within strings, because they must be preceeded by a backslash character. So we support the Scheme/LISP/C conventions for strings, as well as the Pascal convention (of using single quotes), as well as a bracket version.

4 Quo vadis?

Date: 2013-06-06T22:53+0300

Author: Mayer Goldberg

Validate XHTML 1.0