1 Informal quickstart
2 Introduction
2.1 Installation
2.2 Example: a small DSL for ASCII diagrams
2.3 Syntax and semantics
2.4 Parsing the concrete syntax
2.5 From parsing to interpretation
2.6 From interpretation to compilation
3 The language
3.1 Syntax and terminology
3.2 Syntax errors
3.3 Semantics
4 Support API
exn: fail: parsing
5 Caveats and things to do
6 Miscellaneous and thanks

ragg: a Racket AST Generator Generator

Danny Yoo <dyoo@hashcollision.org>

1 Informal quickstart

Salutations! Let’s consider the following scenario: say that we’re given the following string:

"(radiant (humble))"

(... and pretend that we don’t already know about the built-in read function.)

How do we go about turning this kind of string into a structured value? That is, how would we parse it?

We need to first consider the shape of the things we’d like to parse. The string above looks like a deeply nested list of words. How might we describe this formally? A convenient notation to describe the shape of these things is Backus-Naur Form (BNF). So let’s try to notate the structure of nested word lists in BNF.

nested-word-list: WORD

                | LEFT-PAREN nested-word-list* RIGHT-PAREN

What we intend by this notation is this: nested-word-list is either an atomic WORD, or a parenthesized list of any number of nested-word-lists. We use the character * to represent zero or more repetitions of the previous thing, and we treat the uppercased LEFT-PAREN, RIGHT-PAREN, and WORD as placeholders for atomic tokens.

See Installation for instructions on installing ragg.

Here are a few examples of tokens:
> (require ragg/support)
> (token 'LEFT-PAREN)

(token-struct 'LEFT-PAREN #f #f #f #f #f #f)

> (token 'WORD "crunchy" #:span 7)

(token-struct 'WORD "crunchy" #f #f #f 7 #f)

> (token 'RIGHT-PAREN)

(token-struct 'RIGHT-PAREN #f #f #f #f #f #f)

Have we made progress? At this point, we only have a BNF description in hand, but we’re still missing a parser, something to take that description and use it to make structures out of a sequence of tokens.

It’s clear that we don’t yet have a program because there’s no #lang line. We should add one. Put #lang ragg at the top of the BNF description, and save it as a file called "nested-word-list.rkt".


#lang ragg

nested-word-list: WORD

                | LEFT-PAREN nested-word-list* RIGHT-PAREN

Now it is a proper program. But what does it do?

> (require "nested-word-list.rkt")
> parse


It gives us a parse function. Let’s investigate what parse does for us. What happens if we pass it a sequence of tokens?

> (define a-parsed-value
    (parse (list (token 'LEFT-PAREN "(")
                 (token 'WORD "some")
                 (token 'LEFT-PAREN "[")
                 (token 'WORD "pig")
                 (token 'RIGHT-PAREN "]")
                 (token 'RIGHT-PAREN ")"))))
> a-parsed-value

#<syntax (nested-word-list "(" (nested...>

Wait... that looks suspiciously like a syntax object!
> (syntax->datum a-parsed-value)



  (nested-word-list "some")

  (nested-word-list "[" (nested-word-list "pig") "]")


That’s (some [pig]), essentially.

What happens if we pass it a more substantial source of tokens?
; tokenize: string -> (sequenceof token-struct?)
; Generate tokens from a string:
> (define (tokenize s)
    (for/list ([str (regexp-match* #px"\\(|\\)|\\w+" s)])
      (match str
         (token 'LEFT-PAREN str)]
         (token 'RIGHT-PAREN str)]
         (token 'WORD str)])))
; For example:
> (define token-source (tokenize "(welcome (to (((ragg)) ())))"))
> (define v (parse token-source))
> (syntax->datum v)



  (nested-word-list "welcome")



   (nested-word-list "to")





     (nested-word-list "(" (nested-word-list "ragg") ")")


    (nested-word-list "(" ")")




Welcome to ragg.

2 Introduction

ragg is a parsing framework for Racket with the design goal to be easy to use. It includes the following features:
  • It provides a #lang for writing extended BNF grammars. A module written in #lang ragg automatically generates a parser. The output of this parser tries to follow HTDP doctrine; the structure of the grammar informs the structure of the Racket syntax objects it generates.

  • The language uses a few conventions to simplify the expression of grammars. The first rule in the grammar is automatically assumed to be the starting production. Identifiers in uppercase are assumed to represent terminal tokens, and are otherwise the names of nonterminals.

  • Tokenizers can be developed completely independently of parsers. ragg takes a liberal view on tokens: they can be strings, symbols, or instances constructed with token. Furthermore, tokens can optionally provide location: if tokens provide location, the generated syntax objects will as well.

  • The underlying parser should be able to handle ambiguous grammars.

  • It should integrate with the rest of the Racket language toolchain.

2.1 Installation

2.2 Example: a small DSL for ASCII diagrams

To motivate ragg’s design, let’s look at the following toy problem: we’d like to define a language for drawing simple ASCII diagrams. We’d like to be able write something like this:

3 9 X;

6 3 b 3 X 3 b;

3 9 X;

whose interpretation should generate the following picture:













2.3 Syntax and semantics

We’re being very fast-and-loose with what we mean by the program above, so let’s try to nail down some meanings. Each line of the program has a semicolon at the end, and describes the output of several rows of the line drawing. Let’s look at two of the lines in the example:

Then each line consists of a repeat number, followed by pairs of (number, character) chunks. We will assume here that the intent of the lowercased character b is to represent the printing of a 1-character whitespace " ", and for other uppercase letters to represent the printing of themselves.

Once we have a better idea of the pieces of each line, we have a better chance to capture that meaning in a formal notation. Once we have each instruction in a structured format, we should be able to interpret it with a straighforward case analysis.

Here is a first pass at expressing the structure of these line-drawing programs.

2.4 Parsing the concrete syntax


#lang ragg

drawing: rows*

rows: repeat chunk+ ";"

repeat: INTEGER


Syntax and terminology describes ragg’s syntax in more detail.

We write a ragg program as an extended BNF grammar, where patterns can be:
  • the names of other rules (e.g. chunk)

  • literal and symbolic token names (e.g. ";", INTEGER)

  • quantified patterns (e.g. + to represent one-or-more repetitions)

The result of a ragg program is a module with a parse function that can parse tokens and produce a syntax object as a result.

Let’s exercise this function:
> (require ragg/support)
> (require "simple-line-drawing.rkt")
> (define stx
    (parse (list (token 'INTEGER 6)
                 (token 'INTEGER 2)
                 (token 'STRING " ")
                 (token 'INTEGER 3)
                 (token 'STRING "X")
> (syntax->datum stx)

'(drawing (rows (repeat 6) (chunk 2 " ") (chunk 3 "X") ";"))

Tokens can either be: plain strings, symbols, or instances produced by the token function. (Plus a few more special cases, one in which we’ll describe in a moment.)

Preferably, we want to attach each token with auxiliary source location information. The more source location we can provide, the better, as the syntax objects produced by parse will incorporate them.

Let’s write a helper function, a lexer, to help us construct tokens more easily. The Racket standard library comes with a module called parser-tools/lex which can help us write a position-sensitive tokenizer:

> (require parser-tools/lex)
> (define (tokenize ip)
    (port-count-lines! ip)
    (define my-lexer
        [(repetition 1 +inf.0 numeric)
         (token 'INTEGER (string->number lexeme))]
         (token 'STRING lexeme)]
         (token 'STRING " ")]
         (token ";" lexeme)]
         (token 'WHITESPACE lexeme #:skip? #t)]
    (define (next-token) (my-lexer ip))
> (define a-sample-input-port (open-input-string "6 2 b 3 X;"))
> (define token-thunk (tokenize a-sample-input-port))
; Now we can pass token-thunk to the parser:
> (define another-stx (parse token-thunk))
> (syntax->datum another-stx)

'(drawing (rows (repeat 6) (chunk 2 " ") (chunk 3 "X") ";"))

; The syntax object has location information:
> (syntax-line another-stx)


> (syntax-column another-stx)


> (syntax-span another-stx)


There are a few things to note from this lexer example:
  • The parse function can consume either sequences of tokens, or a function that produces tokens. Both of these are considered sources of tokens.

  • As a special case for acceptable tokens, a token can also be an instance of the position-token structure of parser-tools/lex, in which case the token will try to derive its position from that of the position-token.

  • The parse function will stop reading from a token source if any token is void.

  • The parse function will skip over any token with the #:skip? attribute. Elements such as whitespace and comments will often have #:skip? set to #t.

2.5 From parsing to interpretation

We now have a parser for programs written in this simple-line-drawing language. Our parser will give us back syntax objects:
> (define parsed-program
    (parse (tokenize (open-input-string "3 9 X; 6 3 b 3 X 3 b; 3 9 X;"))))
> (syntax->datum parsed-program)


  (rows (repeat 3) (chunk 9 "X") ";")

  (rows (repeat 6) (chunk 3 " ") (chunk 3 "X") (chunk 3 " ") ";")

  (rows (repeat 3) (chunk 9 "X") ";"))

Moreover, we know that these syntax objects have a regular, predictable structure. Their structure follows the grammar, so we know we’ll be looking at values of the form:

(drawing (rows (repeat <number>)
               (chunk <number> <string>) ... ";")

where drawing, rows, repeat, and chunk should be treated literally, and everything else will be numbers or strings.

Still, these syntax object values are just inert structures. How do we interpret them, and make them print? We did claim at the beginning of this section that these syntax objects should be fairly easy to case-analyze and interpret, so let’s do it.

This is a very quick-and-dirty treatment of syntax-parse. See the syntax/parse documentation for a gentler guide to its features.

Racket provides a special form called syntax-parse in the syntax/parse library. syntax-parse lets us do a structural case-analysis on syntax objects: we provide it a set of patterns to parse and actions to perform when those patterns match.

As a simple example, we can write a function that looks at a syntax object and says #t if it’s the literal yes, and #f otherwise:

> (require syntax/parse)
; yes-syntax-object?: syntax-object -> boolean
; Returns true if the syntax-object is yes.
> (define (yes-syntax-object? stx)
    (syntax-parse stx
      [(~literal yes)
> (yes-syntax-object? #'yes)


> (yes-syntax-object? #'nooooooooooo)


Here, we use ~literal to let syntax-parse know that yes should show up literally in the syntax object. The patterns can also have some structure to them, such as:

({~literal drawing} rows-stxs ...)

which matches on syntax objects that begin, literally, with drawing, followed by any number of rows (which are syntax objects too).

Now that we know a little bit more about syntax-parse, we can use it to do a case analysis on the syntax objects that our parse function gives us. We start by defining a function on syntax objects of the form (drawing rows-stx ...).
> (define (interpret-drawing drawing-stx)
    (syntax-parse drawing-stx
      [({~literal drawing} rows-stxs ...)
       (for ([rows-stx (syntax->list #'(rows-stxs ...))])
         (interpret-rows rows-stx))]))

When we encounter a syntax object with (drawing rows-stx ...), then interpret-rows each rows-stx.

Let’s define interpret-rows now:
> (define (interpret-rows rows-stx)
    (syntax-parse rows-stx
      [({~literal rows}
        ({~literal repeat} repeat-number)
        chunks ... ";")
       (for ([i (syntax-e #'repeat-number)])
         (for ([chunk-stx (syntax->list #'(chunks ...))])
           (interpret-chunk chunk-stx))

For a rows, we extract out the repeat-number out of the syntax object and use it as the range of the for loop. The inner loop walks across each chunk-stx and calls interpret-chunk on it.

Finally, we need to write a definition for interpret-chunk. We want it to extract out the chunk-size and chunk-string portions, and print to standard output:

> (define (interpret-chunk chunk-stx)
    (syntax-parse chunk-stx
      [({~literal chunk} chunk-size chunk-string)
       (for ([k (syntax-e #'chunk-size)])
         (display (syntax-e #'chunk-string)))]))

Here are the definitions in a single file: interpret.rkt.

With these definitions in hand, now we can pass it syntax objects that we construct directly by hand:

> (interpret-chunk #'(chunk 3 "X"))


> (interpret-drawing #'(drawing (rows (repeat 5) (chunk 3 "X") ";")))






or we can pass it the result generated by our parser:
> (define parsed-program
    (parse (tokenize (open-input-string "3 9 X; 6 3 b 3 X 3 b; 3 9 X;"))))
> (interpret-drawing parsed-program)













And now we’ve got an interpreter!

2.6 From interpretation to compilation

For a gentler tutorial on writing #lang extensions, see: F*dging up a Racket.

(Just as a warning: the following material is slightly more advanced, but shows how writing a compiler for the line-drawing language reuses the ideas for the interpreter.)

Wouldn’t it be nice to be able to write something like:

3 9 X;

6 3 b 3 X 3 b;

3 9 X;

and have Racket automatically compile this down to something like this?
(for ([i 3])
  (for ([k 9]) (displayln "X"))
(for ([i 6])
  (for ([k 3]) (displayln " "))
  (for ([k 3]) (displayln "X"))
  (for ([k 3]) (displayln " "))
(for ([i 3])
  (for ([k 9]) (displayln "X"))

Well, of course it won’t work: we don’t have a #lang line.

Let’s add one.


#lang ragg/examples/simple-line-drawing

3 9 X;

6 3 b 3 X 3 b;

3 9 X;

Now "letter-i.rkt" is a program.

How does this work? From the previous sections, we’ve seen how to take the contents of a file and interpret it. What we want to do now is teach Racket how to compile programs labeled with this #lang line. We’ll do two things:

The second part, the writing of the transformation rules, will look very similar to the definitions we wrote for the interpreter, but the transformation will happen at compile-time. (We could just resort to simply calling into the interpreter we just wrote up, but this section is meant to show that compilation is also viable.)

We do the first part by defining a module reader: a module reader tells Racket how to parse and compile a file. Whenever Racket sees a #lang <name>, it looks for a corresponding module reader in "<name>/lang/reader".

Here’s the definition for "ragg/examples/simple-line-drawing/lang/reader.rkt":


#lang s-exp syntax/module-reader
#:read my-read
#:read-syntax my-read-syntax
#:whole-body-readers? #t
(require ragg/examples/simple-line-drawing/lexer
(define (my-read in)
  (syntax->datum (my-read-syntax #f in)))
(define (my-read-syntax src ip)
  (list (parse src (tokenize ip))))

We use a helper module syntax/module-reader, which provides utilities for creating a module reader. It uses the lexer and ragg-generated parser we defined earlier (saved into lexer.rkt and grammar.rkt modules), and also tells Racket that it should compile the forms in the syntax object using a module called "semantics.rkt".

For a systematic treatment on capturing the semantics of a language, see Programming Languages: Application and Interpretation.

Let’s look into "semantics.rkt" and see what’s involved in compilation:


#lang racket/base
(require (for-syntax racket/base syntax/parse))
(provide #%module-begin
         ;; We reuse Racket's treatment of raw datums, specifically
         ;; for strings and numbers:
         ;; And otherwise, we provide definitions of these three forms.
         ;; During compiliation, Racket uses these definitions to
         ;; rewrite into for loops, displays, and newlines.
         drawing rows chunk)
;; Define a few compile-time functions to do the syntax rewriting:
  (define (compile-drawing drawing-stx)
    (syntax-parse drawing-stx
      [({~literal drawing} rows-stxs ...)
     (syntax/loc drawing-stx
       (begin rows-stxs ...))]))
  (define (compile-rows rows-stx)
    (syntax-parse rows-stx
      [({~literal rows}
        ({~literal repeat} repeat-number)
        chunks ...
       (syntax/loc rows-stx
         (for ([i repeat-number])
           chunks ...
  (define (compile-chunk chunk-stx)
    (syntax-parse chunk-stx
      [({~literal chunk} chunk-size chunk-string)
       (syntax/loc chunk-stx
         (for ([k chunk-size])
           (display chunk-string)))])))
;; Wire up the use of "drawing", "rows", and "chunk" to these
;; transformers:
(define-syntax drawing compile-drawing)
(define-syntax rows compile-rows)
(define-syntax chunk compile-chunk)

The semantics hold definitions for compile-drawing, compile-rows, and compile-chunk, similar to what we had for interpretation with interpret-drawing, interpret-rows, and interpret-chunk. However, compilation is not the same as interpretation: each definition does not immediately execute the act of drawing, but rather returns a syntax object whose evaluation will do the actual work.

There are a few things to note:

Altogether, ragg’s intent is to be a parser generator generator for Racket that’s easy and fun to use. It’s meant to fit naturally with the other tools in the Racket language toolchain. Hopefully, it will reduce the friction in making new languages with alternative concrete syntaxes.

The rest of this document describes the ragg language and the parsers it generates.

3 The language

3.1 Syntax and terminology

A program in the ragg language consists of the language line #lang ragg, followed by a collection of rules and line comments.

A rule is a sequence consisting of: a rule identifier, a colon ":", and a pattern.

A rule identifier is an identifier that is not in upper case.

A token identifier is an identifier that is in upper case.

An identifier is a character sequence of letters, numbers, and characters in "-.!$%&/<=>?^_~@". It must not contain * or +, as those characters are used to denote quantification.

A pattern is one of the following:
  • an implicit sequence of patterns separated by whitespace

  • a terminal: either a literal string or a token identifier

  • a rule identifier

  • a choice pattern: a sequence of patterns delimited with | characters.

  • a quantifed pattern: a pattern followed by either * (“zero or more”) or + (“one or more”)

  • an optional pattern: a pattern surrounded by [ and ]

  • an explicit sequence: a pattern surrounded by ( and )

A line comment begins with either # or ; and continues till the end of the line.

For example, in the following program:

#lang ragg

;; A parser for a silly language

sentence: verb optional-adjective object

verb: greeting

optional-adjective: ["happy" | "frumpy"]

greeting: "hello" | "hola" | "aloha"

object: "world" | WORLD

the elements sentence, verb, greeting, and object are rule identifiers. The first rule, sentence: verb optional-adjective object, is a rule whose right side is an implicit pattern sequence of three sub-patterns. The uppercased WORLD is a token identifier. The fourth rule in the program associates greeting with a choice pattern.

More examples:
  • A BNF for binary strings that contain an equal number of zeros and ones.

    #lang ragg

    equal: [zero one | one zero]   ;; equal number of "0"s and "1"s.

    zero: "0" equal | equal "0"    ;; has an extra "0" in it.

    one: "1" equal | equal "1"     ;; has an extra "1" in it.

  • A BNF for JSON-like structures.

    #lang ragg

    json: number | string

        | array  | object

    number: NUMBER

    string: STRING

    array: "[" [json ("," json)*] "]"

    object: "{" [kvpair ("," kvpair)*] "}"

    kvpair: ID ":" json

The ragg github source repository includes several more examples.

3.2 Syntax errors

Besides the basic syntax errors that can occur with a malformed grammar, there are a few other classes of situations that #lang ragg will consider as syntax errors.

ragg will raise a syntax error if the grammar:
  • doesn’t have any rules.

  • has a rule with the same left hand side as any other rule.

  • refers to rules that have not been defined. e.g. the following program:

    #lang ragg

    foo: [bar]

    should raise an error because bar has not been defined, even though foo refers to it in an optional pattern.

  • uses the token name EOF; the end-of-file token type is reserved for internal use by ragg.

  • contains a rule that has no finite derivation. e.g. the following program:

    #lang ragg

    infinite-a: "a" infinite-a

    should raise an error because no finite sequence of tokens will satisfy infinite-a.

Otherwise, ragg should be fairly tolerant and permit even ambiguous grammars.

3.3 Semantics

A program written in #lang ragg produces a module that provides a few bindings. The most important of these is parse:


(parse [source] token-source)  syntax?

  source : any/c = #f
  token-source : 
(or/c (sequenceof token)
      (-> token))
Parses the sequence of tokens according to the rules in the grammar, using the first rule as the start production. The parse must completely consume token-source.

The token source can either be a sequence, or a 0-arity function that produces tokens.

A token in ragg can be any of the following values:

A token whose type is either void or 'EOF terminates the source.

If parse succeeds, it will return a structured syntax object. The structure of the syntax object follows the overall structure of the rules in the BNF. For each rule r and its associated pattern p, parse generates a syntax object #'(r p-value) where p-value’s structure follows a case analysis on p:

Consequently, it’s only the presence of rule identifiers in a rule’s pattern that informs the parser to introduces nested structure into the syntax object.

If the grammar has ambiguity, ragg will choose and return a parse, though it does not guarantee which one it chooses.

If the parse cannot be performed successfully, or if a token in the token-source uses a type that isn’t mentioned in the grammar, then parse raises an instance of exn:fail:parsing.

It’s often convenient to extract a parser for other non-terminal rules in the grammar, and not just for the first rule. A ragg-generated module also provides a form called make-rule-parser to extract a parser for the other non-terminals:


(make-rule-parser name)

Constructs a parser for the name of one of the non-terminals in the grammar.

For example, given the ragg program "simple-arithmetic-grammar.rkt":


#lang ragg

expr : term ('+' term)*

term : factor ('*' factor)*

factor : INT

the following interaction shows how to extract a parser for terms.
> (require "simple-arithmetic-grammar.rkt")
> (define term-parse (make-rule-parser term))
> (define tokens (list (token 'INT 3)
                       (token 'INT 4)))
> (syntax->datum (parse tokens))

'(expr (term (factor 3) "*" (factor 4)))

> (syntax->datum (term-parse tokens))

'(term (factor 3) "*" (factor 4))

> (define another-token-sequence
    (list (token 'INT 1) "+" (token 'INT 2)
          "*" (token 'INT 3)))
> (syntax->datum (parse another-token-sequence))

'(expr (term (factor 1)) "+" (term (factor 2) "*" (factor 3)))

; Note that term-parse will break on another-token-sequence
; as it does not know what to do with the "+"
> (term-parse another-token-sequence)

Encountered parsing error near token '+ ("+") while parsing

#f [line=#f, column=#f, offset=#f]

Finally, the module provides a set of all the used token types in the grammar in all-token-types:


all-token-types : (setof symbol?)

A set of all the token types used in a grammar.

For example:
> (require "simple-arithmetic-grammar.rkt")
> all-token-types

(set '+ '* 'INT)

4 Support API

 (require ragg/support)

The ragg/support module provides functions to interact with ragg programs. The most useful is the token function, which produces tokens to be parsed.


(token type    
  #:line line    
  #:column column    
  #:offset offset    
  #:span span    
  #:skip? skip?])  token-struct?
  type : (or/c string? symbol?)
  val : any/c = #f
  line : (or/c positive-integer? #f) = #f
  column : (or/c natural-number? #f) = #f
  offset : (or/c positive-integer? #f) = #f
  span : (or/c natural-number? #f) = #f
  skip? : boolean? = #f
Creates instances of token-structs.

The syntax objects produced by a parse will inject the value val in place of the token name in the grammar.

If #:skip? is true, then the parser will skip over it during a parse.


(struct token-struct (type val offset line column span skip?)
  #:extra-constructor-name make-token-struct
  type : symbol?
  val : any/c
  offset : (or/c positive-integer? #f)
  line : (or/c natural-number? #f)
  column : (or/c positive-integer? #f)
  span : (or/c natural-number? #f)
  skip? : boolean?
The token structure type.

Rather than directly using the token-struct constructor, please use the helper function token to construct instances.


(struct exn:fail:parsing exn:fail (message
  #:extra-constructor-name make-exn:fail:parsing)
  message : string?
  continuation-marks : continuation-mark-set?
  srclocs : (listof srcloc?)
The exception raised when parsing fails.

exn:fail:parsing implements Racket’s prop:exn:srcloc property, so if this exception reaches DrRacket’s default error handler, DrRacket should highlight the offending locations in the source.

5 Caveats and things to do

Here are a few caveats and future aims for ragg.

6 Miscellaneous and thanks

Thanks to Matthew Flatt for pointing me to cfg-parser from the cfg-parser library. Joe Politz gave me good advice and feedback. Also, he suggested the name “ragg”. Other alternatives I’d been considering were “autogrammar” or “chompy”. Thankfully, he is a better Namer than me. Daniel Patterson provided feedback that led to make-rule-parser. Robby Findler and Guillaume Marceau provided steadfast suggestions to look into other parsing frameworks like SDF and SableCC. Special thanks to Shriram Krishnamurthi, who convinced me that other people might find this package useful.