xml-pull: pull-style parsing for large XML files


Quick Example
-------------

    > (require (planet "xml-pull.ss" ("dyoo" "xml-pull.plt" 1 0)))
    > (define a-taffy
        (start-xml-pull
         (open-input-string #<<EOF
<test-xml>
<person><name>Sue Rhee</name></person>
<person><name>Dan Garcia</name></person>
<person><name>Mike Clancy</name></person>
</test-xml>
EOF
    )))

We can consume this XML structure by morsels:

    > (pull-morsel a-taffy)
    #3(struct:start-element test-xml ())
    > (pull-morsel a-taffy)
    #3(struct:characters "\n" "")
    > (pull-morsel a-taffy)
    #3(struct:start-element person ())

At this point, we are at the start-element of a person. When we see a
start-element that is interesting to us, we can _pull-sexp_ the rest of
that element as a normalized SXML fragment:

    > (pull-sexp a-taffy)
    (person (@) (name (@) "Sue Rhee"))

The advantage of this approach is that we consume only as much XML from
the input port as we need, and memory usage is bounded by the size of
the fragment being represented.


Structures
----------

There are two structures in this module: taffy and morsel.

  * taffy

    A _taffy_ is the core structure that maintains the state of the XML
    parse. Conceptually, a taffy is an iterator of morsels and SXML
    fragments.

  * morsel

    A _morsel_ is one of the following:

      * (make-start-element name attributes)
        where name is a symbol and attributes is a
        (listof (list symbol string))

      * (make-end-element name attributes)
        where name is a symbol and attributes is a
        (listof (list symbol string))

      * (make-characters s1 s2)
        where s1 and s2 are strings

      * (make-exhausted)

    Most of these are self-explanatory. We produce an _exhausted_
    structure when there are no more elements in the XML to parse.

The expected predicates and selectors are also available:

> taffy? : any -> boolean
> morsel? : any -> boolean
> start-element? : any -> boolean
> end-element? : any -> boolean
> characters? : any -> boolean
> exhausted? : any -> boolean

> start-element-name : start-element -> symbol
> start-element-attributes : start-element -> (listof (list symbol string))
> end-element-name : end-element -> symbol
> end-element-attributes : end-element -> (listof (list symbol string))


Functions
---------

> start-xml-pull: input-port -> taffy

Given an input-port, starts the XML parse and returns a taffy.

> pull-morsel: taffy -> morsel

Takes a taffy and rips off a morsel.

> pull-sexp: taffy -> (union sexp exhausted)

Assuming that the most recently pulled morsel is a start-element, pulls
enough further morsels to reproduce that whole element. If the last
morsel is not a start-element, raises an error.

> pull-sexps/g: taffy symbol -> (generatorof sexp)

The result is a _generator_ whose elements are the s-expressions whose
names match the given input symbol. See
http://planet.plt-scheme.org/#generator.plt2.0 for more details.


Parameters
----------

> current-namespace-translate: symbol -> symbol

If provided, this is used to translate the namespace portion of element
names in an XML document. By default, this is bound to the identity
function. (This is experimental; I might remove it in a later release
of this software in favor of a simpler substitution map similar to what
ssax:xml->sxml takes in.)


More extensive example
----------------------

Here is code that takes a large XML document, the collection of common
ontology terms used in bioinformatics, and prints out the first hundred
terms:

    (module test-xml-pull-2 mzscheme
      (require (lib "url.ss" "net")
               (lib "inflate.ss")
               (lib "pretty.ss")
               (planet "xml-pull.ss" ("dyoo" "xml-pull.plt" 1 0))
               (planet "generator.ss" ("dyoo" "generator.plt" 2 0)))

      ;; wrap-gunzip: input-port -> input-port
      ;; Wraps an uncompressor around the input stream.
      (define (wrap-gunzip original-ip)
        (define-values (ip op) (make-pipe 32768))
        (thread (lambda ()
                  (gunzip-through-ports original-ip op)))
        ip)

      (define my-url
        (string->url
         "http://archive.godatabase.org/latest-termdb/go_daily-termdb.rdf-xml.gz"))

      (define my-input-port (wrap-gunzip (get-pure-port my-url)))

      (define my-taffy (start-xml-pull my-input-port))

      (define generated-terms
        (pull-sexps/g my-taffy
                      'http://www.geneontology.org/dtds/go.dtd#:term))

      ;; pretty-print the first 100 terms in the Gene Ontology
      (let loop ([i 0])
        (when (< i 100)
          (pretty-print (generator-next generated-terms))
          (loop (add1 i)))))


Thanks
------

Thanks to the PLT folks for writing tools that are very enjoyable to
play with. Special thanks to the bioinformaticians at TAIR
(http://arabidopsis.org) who taught me to appreciate very large XML
datasets.


References
----------

SSAX (http://ssax.sourceforge.net/)

SXML (http://okmij.org/ftp/Scheme/SXML.html)

About Pulldom and Minidom (http://www.prescod.net/python/pulldom.html)