Introduction
============

One project that I thought might be interesting is to write a Python interpreter in PLT Scheme, using only the library tools that come with PLT Scheme. There are a few things I want to get out of this:

* A stronger knowledge of Python. If I try implementing Python, I can't help but understand it better. At least, that's my hope.

* A working knowledge of PLT Scheme. I have not worked with PLT Scheme in any serious project before.

* An applied knowledge of language theory. Hey, if I'm going to be a graduate student in programming language theory, I'd better have a project that I'm proud of. (Or, at least, something that I can point at and say: well, at least I'm trying something.)

The notes here are my own personal documentation of what I'm learning as I go along. This is a living document that will be revised as I learn more. It might be interesting to polish up these notes at the end and see if I get anything coherent, but don't expect too much. *grin*

I'm also using this to learn more about [Markdown](http://daringfireball.net/projects/markdown/), which appears to be a great way to write these informal notes for myself.

Previous work in this area
==========================

What have people done in this area before?

The project [From Python to PLT Scheme](http://www.python.org/pycon/dc2004/papers/25/) documents the addition of Python to PLT Scheme as an experimental language. Its authors appear to have taken the approach of talking directly to the CPython runtime. That isn't what I'm planning to do: I'm thinking of direct interpretation from Scheme itself.

The [PyPy](http://codespeak.net/pypy/index.cgi?news) project is a metacircular interpreter of Python in Python.

Basically, I'm doing this as a learning experience, so I don't mind if I'm reinventing the wheel a bit. *grin* But I should look at other projects from time to time, to better understand the mistakes that I'm repeating. PLT Scheme comes with its own Algol 60 system, so I should definitely read that sometime before I really get into this project.

Plan of Attack
==============

I'll start by writing a lexer, the component that takes an input port and produces tokens. I'll then go at the parser. After that, a quick-and-dirty interpreter. Maybe after that, I'll finish EOPL.

Lexing
------

According to [Grammar/Grammar](http://cvs.sourceforge.net/viewcvs.py/python/python/dist/src/Grammar/Grammar?view=markup) in the Python source distribution, there are only a few simple token types:

    #diagram:token NAME
    #diagram:token NUMBER
    #diagram:token STRING
    #diagram:token NEWLINE
    #diagram:token ENDMARKER
    #diagram:token INDENT
    #diagram:token DEDENT

plus a bunch of keywords and operators, but that shouldn't be too bad.

PLT Scheme comes with a nice lexer package in its parser-tools collection, but the documentation on it seems a little scant; I probably don't know how to find it yet. I wonder, though, why the documentation for many of the collections doesn't show up in the [PLT Documentation](http://download.plt-scheme.org/doc/). I guess a lot of the documentation will simply have to be dug out of the collections' respective doc.txt files.

I'll start by handling simple integer numbers, and will work from there.
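As a very rough sketch of that first step, here's my guess at what an integer-only lexer might look like with the parser-tools lexer (more on the lexer interface below); the module name `int-lexing` and the rules are placeholders of my own, not anything taken from the Python grammar:

    (module int-lexing mzscheme
      (require (prefix : (lib "lex-sre.ss" "parser-tools"))
               (lib "lex.ss" "parser-tools"))
      (provide (all-defined))

      ;; A first cut: skip whitespace, and turn runs of digits into numbers.
      (define int-lexer
        (lexer
         [(:+ whitespace) (int-lexer input-port)]
         [(:+ (:/ #\0 #\9)) (string->number lexeme)]
         [(eof) eof])))

If I'm reading the documentation right, repeatedly applying `int-lexer` to `(open-input-string "12 345")` should give back 12, then 345, then the eof object.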
The [tokenize](http://www.python.org/doc/lib/module-tokenize.html) module in the Python Standard Library seems really useful as a guide, although I can't always trust it: it appears that Python's tokenize itself has a bug ([SF#1224621](https://sourceforge.net/tracker/index.php?func=detail&aid=1224621&group_id=5470&atid=105470)) that I'll have to fix.

Another great tool is PLT Scheme itself: many of the collections use parser-tools themselves, and they make really useful examples. I've been reusing lots of sections of the scheme-lexer.ss file in the syntax-color collection with good success.

As a surprise to myself: it looks like PLT Scheme comes with a really powerful regular expression package that allows arbitrary boolean operations on regexes, something that I've always wanted! This is described in Janusz A. Brzozowski's paper [Derivatives of Regular Expressions](http://portal.acm.org/citation.cfm?id=321249) from the 1960s. But I wonder why this isn't used as a core library in other standard libraries!

PLT Scheme's lexer package has a fairly simple interface: we provide regular expressions and actions to perform when those expressions match, and out pops a lexer function. This lexer function takes an input port and returns a single token each time it is applied. For example, the following is a really silly example of a lexer that breaks a string apart on whitespace:

    (module test-lexing mzscheme
      (require (prefix : (lib "lex-sre.ss" "parser-tools"))
               (lib "lex.ss" "parser-tools"))
      (provide (all-defined))

      (define my-simple-lexer
        (lexer
         [(:+ whitespace) (my-simple-lexer input-port)]
         [(:+ (:~ whitespace)) lexeme]
         [(eof) eof]))

      ;; split: string -> (listof string)
      ;; Splits a string into a list of whitespace-separated elements.
      (define split
        (lambda (s)
          (let ((input (open-input-string s)))
            (let loop ((token (my-simple-lexer input)))
              (if (not (eof-object? token))
                  (cons token (loop (my-simple-lexer input)))
                  '()))))))

Of course, no sane person would use a lexer in such a simple situation as this. Anyway, there is a set of predefined regular expression "abbreviations" (*whitespace* being one of them), and a mechanism for defining more abbreviations. The core of a lexer is its regex/action pairs, so when we hit something like:

    [(:+ whitespace) (my-simple-lexer input-port)]

we are asking the lexer to skip one or more *whitespace* characters by reapplying itself to the rest of the given *input-port*.

One thing that's probably so obvious that the documentation doesn't mention it: the regular expression syntax defined in doc.txt can't be used outside of the lexer macro. Just something to note.

There's a mechanism for defining different token types: the example above just returns strings as tokens, but there's a more general way to define specific token types that will work nicely with the rest of the PLT parser-tools.
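I haven't gotten that far yet, but from skimming doc.txt, `define-tokens` and `define-empty-tokens` look like the relevant forms. Here's my guess at how the Python token names from above might be declared; the module name and the grouping into `value-tokens` and `marker-tokens` are placeholders of my own:

    (module python-tokens mzscheme
      (require (lib "lex.ss" "parser-tools"))
      (provide (all-defined))

      ;; Tokens that carry a payload, such as the lexeme or a parsed value.
      (define-tokens value-tokens (NAME NUMBER STRING))

      ;; Tokens that are pure markers and carry no payload.
      (define-empty-tokens marker-tokens (NEWLINE INDENT DEDENT ENDMARKER)))

If I understand it correctly, a lexer rule would then return something like `(token-NUMBER (string->number lexeme))` instead of a raw string, and those tokens are what the parser side of parser-tools expects to see.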