Introduction
============

One project that I thought might be interesting is to write a Python interpreter in PLT Scheme, using only the library tools that come with PLT Scheme. There are a few things I want to get out of this:

* A stronger knowledge of Python. If I try implementing Python, I can't help but understand it better. At least, that's my hope.

* A working knowledge of PLT Scheme. I have not worked with PLT Scheme in any serious project before.

* An applied knowledge of language theory. Hey, if I'm going to be a graduate student in programming language theory, I'd better have a project that I'm proud of. (Or, at least, something that I can point at and say: well, at least I'm trying something.)

The notes here are my own personal documentation of what I'm learning as I go along. This is a living document that will be revised as I learn more. It might be interesting to polish up these notes at the end and see if I get anything coherent, but don't expect too much. *grin*

I'm also using this to learn more about [Markdown](http://daringfireball.net/projects/markdown/), which appears to be a great way to write these informal notes for myself.

Previous work in this area
==========================

What have people done in this area before?

The project [From Python to PLT Scheme](http://www.python.org/pycon/dc2004/papers/25/) documents the addition of Python to PLT Scheme as an experimental language. Its authors appear to have taken the approach of talking directly to the CPython runtime. That isn't what I'm planning to do: I'm thinking of direct interpretation from Scheme itself.

The [PyPy](http://codespeak.net/pypy/index.cgi?news) project is a metacircular interpreter of Python in Python.

Basically, I'm doing this as a learning experience, so I don't mind if I'm reinventing the wheel a bit. *grin* But I should look at other projects from time to time, to better understand the mistakes that I'm repeating. PLT Scheme comes with its own Algol 60 system, so I should definitely read that sometime before I really get into this project.

Plan of Attack
==============

I'll start by writing a lexer, the component that takes an input port and produces tokens. I'll then go at the parser. After that, a quick-and-dirty interpreter. Maybe after that, I'll finish EOPL.

Lexing
------

According to [Grammar/Grammar](http://cvs.sourceforge.net/viewcvs.py/python/python/dist/src/Grammar/Grammar?view=markup) in the Python source distribution, there are only a few simple token types:

    #diagram:token NAME
    #diagram:token NUMBER
    #diagram:token STRING
    #diagram:token NEWLINE
    #diagram:token ENDMARKER
    #diagram:token INDENT
    #diagram:token DEDENT

plus a bunch of keywords and operators, but that shouldn't be too bad.

PLT Scheme comes with a nice lexer package in its parser-tools collection, but the documentation on it seems a little scant; I probably don't know how to find it yet. I wonder, though, why the documentation for many of the collections doesn't show up in the [PLT Documentation](http://download.plt-scheme.org/doc/). I guess a lot of the documentation will simply have to be dug out of the collections' respective doc.txt files.

I'll start by handling simple integer numbers, and will work from there.
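As a very rough sketch of that first step, here's my guess at what an integer-only lexer might look like with the parser-tools lexer (more on the lexer interface below); the module name `int-lexing` and the rules are placeholders of my own, not anything taken from the Python grammar:

    (module int-lexing mzscheme
      (require (prefix : (lib "lex-sre.ss" "parser-tools"))
               (lib "lex.ss" "parser-tools"))
      (provide (all-defined))

      ;; A first cut: skip whitespace, and turn runs of digits into numbers.
      (define int-lexer
        (lexer
         [(:+ whitespace) (int-lexer input-port)]
         [(:+ (:/ #\0 #\9)) (string->number lexeme)]
         [(eof) eof])))

If I'm reading the documentation right, repeatedly applying `int-lexer` to `(open-input-string "12 345")` should give back 12, then 345, then the eof object.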
The [tokenize](http://www.python.org/doc/lib/module-tokenize.html) module in the Python Standard Library seems really useful as a guide, although I can't always trust it: it appears that Python's tokenize itself has a bug ([SF#1224621](https://sourceforge.net/tracker/index.php?func=detail&aid=1224621&group_id=5470&atid=105470)) that I'll have to fix.

Another great tool is PLT Scheme itself: many of the collections use parser-tools themselves, and they make really useful examples. I've been reusing lots of sections of the scheme-lexer.ss file in the syntax-color collection with good success.

As a surprise to myself: it looks like PLT Scheme comes with a really powerful regular expression package that allows arbitrary boolean operations on regexes, something that I've always wanted! This is described in Janusz A. Brzozowski's paper [Derivatives of Regular Expressions](http://portal.acm.org/citation.cfm?id=321249) from the 1960s. But I wonder why this isn't used as a core library in other standard libraries!

PLT Scheme's lexer package has a fairly simple interface: we provide regular expressions and actions to perform when those expressions match, and out pops a lexer function. This lexer function takes an input port and returns a single token each time it is applied. For example, the following is a really silly example of a lexer that breaks a string apart on whitespace:

    (module test-lexing mzscheme
      (require (prefix : (lib "lex-sre.ss" "parser-tools"))
               (lib "lex.ss" "parser-tools"))
      (provide (all-defined))

      (define my-simple-lexer
        (lexer
         [(:+ whitespace) (my-simple-lexer input-port)]
         [(:+ (:~ whitespace)) lexeme]
         [(eof) eof]))

      ;; split: string -> (listof string)
      ;; Splits a string into a list of whitespace-separated elements.
      (define split
        (lambda (s)
          (let ((input (open-input-string s)))
            (let loop ((token (my-simple-lexer input)))
              (if (not (eof-object? token))
                  (cons token (loop (my-simple-lexer input)))
                  '()))))))

Of course, no sane person would use a lexer in such a simple situation as this. Anyway, there is a set of predefined regular expression "abbreviations" (*whitespace* being one of them), and a mechanism for defining more abbreviations. The core of a lexer is its regex/action pairs, so when we hit something like:

    [(:+ whitespace) (my-simple-lexer input-port)]

we are asking the lexer to skip one or more *whitespace* characters by reapplying itself to the rest of the given *input-port*.

One thing that's probably so obvious that the documentation doesn't mention it: the regular expression syntax defined in doc.txt can't be used outside of the lexer macro. Just something to note.

There's a mechanism for defining different token types: the example above just returns strings as tokens, but there's a more general way to define specific token types that will work nicely with the rest of the PLT parser-tools.
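I haven't gotten that far yet, but from skimming doc.txt, `define-tokens` and `define-empty-tokens` look like the relevant forms. Here's my guess at how the Python token names from above might be declared; the module name and the grouping into `value-tokens` and `marker-tokens` are placeholders of my own:

    (module python-tokens mzscheme
      (require (lib "lex.ss" "parser-tools"))
      (provide (all-defined))

      ;; Tokens that carry a payload, such as the lexeme or a parsed value.
      (define-tokens value-tokens (NAME NUMBER STRING))

      ;; Tokens that are pure markers and carry no payload.
      (define-empty-tokens marker-tokens (NEWLINE INDENT DEDENT ENDMARKER)))

If I understand it correctly, a lexer rule would then return something like `(token-NUMBER (string->number lexeme))` instead of a raw string, and those tokens are what the parser side of parser-tools expects to see.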