Version: 5.1.2

A Racket Slice: munging IRC chat logs

Danny Yoo <dyoo@hashcollision.org>

Source code can be found at https://github.com/dyoo/racket-slices. The latest version of this document lives at http://hashcollision.org/racket-slices/irc-parsing.

1 The Problem

Let’s say that we have some source of text, such as IRC chat logs, and we’d like to extract information from them. To use the precise technical term, we’d like to munge.

2 Exploration

Can we take a slice at this problem using the Racket language?

Let’s explore this and fire up Racket. The following is an interactive session between us and the Racket REPL, using the full-fledged racket language.
> (require net/url)
We’ll want to use the net/url library, which allows us to suck the content out of a URL. First, let’s open up an input port.
> (define irc-port
    (get-pure-port
     (string->url "http://racket-lang.org/irc-logs/20110802.txt")))

Note that identifiers such as get-pure-port are hyperlinked in the online version of this document. Following the link shows where the function lives. In this case, get-pure-port lives in net/url, as does string->url.

An input port is a source for stuff. Let’s suck up ten lines of stuff and see what it looks like. We can walk along the lines in our port with a for loop.
> (for ([line (in-lines irc-port)]
        [i (in-range 10)])
    (printf "~s\n" line))

"00:00 (join) neilv"

"00:02 neilv: if i want to install the racket version that is on the path to becoming 5.1.2, do i get it from http://pre.racket-lang.org/installers/ or somewhere else?"

"00:03 neilv: > You are about to download: plt-5.1.2.3-src-unix.tgz (16M)"

"00:03 neilv: seems promising, but I'm just double-checking"

"00:04 offby1: couldn't tell ya"

"00:08 (join) kennyd"

"00:11 (join) asurai"

"00:11 (quit) asurai: Client Quit"

"00:14 neilv: the c part of the build seems a lot faster since the gui ffi stuff, unsurprisingly"

"00:14 (quit) bmp: Ping timeout: 260 seconds"

Ok, good! It looks like we’re getting back strings. When we look at those strings more closely, it seems that they have a fairly regular structure. There’s some chunk in front that looks like a timestamp, followed by one of two things:
  • An IRC administrative action, like "00:00 (join) neilv", or

  • A chat message, like "00:04 offby1: couldn't tell ya".

When we have strings with regular structure, we can use regular expressions to search through them. For example, we can try to match a regular expression pattern against a string like this:
> (regexp-match #px"^(\\d\\d):(\\d\\d)"
                "12:42")

'("12:42" "12" "42")

> (regexp-match #px"^(\\d\\d):(\\d\\d)"
                "twelve:forty-two")

#f

You can find more details about regular expressions in the Guide.

The #px"^(\\d\\d):(\\d\\d)" is a Perl-compatible regular expression that captures the pattern: “two grouped digits, followed by a colon, followed by two more grouped digits.” When we match, we get back a list which includes the groups. If we don’t, well, we get back #f, which is fine.
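Since regexp-match hands back either a list or #f, it combines nicely with and. Here is a minimal sketch of that idea. The helper timestamp->minutes is a made-up name for illustration and is not part of the original session.

```racket
#lang racket

;; timestamp->minutes: string -> (or/c exact-nonnegative-integer? #f)
;; Hypothetical helper: turns a leading "HH:MM" into minutes since midnight,
;; or returns #f when the string does not start with a timestamp.
(define (timestamp->minutes s)
  ;; m is either (list whole-match hour-string minute-string) or #f
  (define m (regexp-match #px"^(\\d\\d):(\\d\\d)" s))
  (and m
       (+ (* 60 (string->number (second m)))
          (string->number (third m)))))

(timestamp->minutes "12:42")            ; evaluates to 762
(timestamp->minutes "twelve:forty-two") ; evaluates to #f
```

The first element of the match list is the whole matched text, which is why the groups start at second.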

Let’s squirrel away two regular expressions that we’ll use to pattern match those IRC chat lines.
> (define action-regexp
    #px"^(\\d\\d):(\\d\\d) [(](.+)[)] (.+)")
> (define chat-regexp
    #px"^(\\d\\d):(\\d\\d) ([^()]+) (.+)")

Can we use these patterns to match across all of the lines? Let’s see! Let’s go through a few more lines and see if we can match them.
> (for ([line (in-lines irc-port)]
        [i (in-range 5)])
    (cond
      [(regexp-match action-regexp line)
       (printf "I matched an action.\n")]
      [(regexp-match chat-regexp line)
       (printf "I matched a message\n")]
      [else
       (error 'oops-i-did-it-again)]))

I matched a message

I matched an action.

I matched an action.

I matched an action.

I matched an action.

If things had broken, we’d have seen an error. We don’t, so obviously things are perfect.

... uh. Probably not. We’re certainly munging.

We don’t necessarily want to deal with strings all the time. We can use structures to represent the parsed data we’re getting from this IRC log. Let’s define two of them.
> (struct action (hour minute type msg) #:transparent)
> (struct chat (hour minute who msg) #:transparent)
We want to make the structure transparent by using the #:transparent option to struct. Otherwise, structures act very much like black boxes, and we don’t get to printf them out in a way that makes it easy to see their contents.
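To see what #:transparent buys us, here is a small sketch. The structure names opaque-point and clear-point are made up for this comparison and have nothing to do with the IRC code.

```racket
#lang racket

;; Hypothetical structures, used only to compare the two kinds.
(struct opaque-point (x y))
(struct clear-point (x y) #:transparent)

;; Transparent structures are compared field by field by equal?.
(equal? (clear-point 1 2) (clear-point 1 2))   ; evaluates to #t

;; Opaque structures are only equal? to themselves.
(equal? (opaque-point 1 2) (opaque-point 1 2)) ; evaluates to #f

;; Printing shows the same difference: only the transparent
;; structure reveals its fields.
(printf "~s\n" (clear-point 1 2))
(printf "~s\n" (opaque-point 1 2))
```

The equal? behavior matters later on, because a test that compares a parsed result against an expected structure only works when the structure is transparent.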

Ok, now that we’ve defined our structures, let’s do this. We’ll write a function to take a line and parse it into either an action or a chat.
> (define (parse-irc a-line)
    (define (on-action-line a-match)
      (action (second a-match)
              (third a-match)
              (fourth a-match)
              (fifth a-match)))
  
    (define (on-chat-line a-match)
      (chat (second a-match)
            (third a-match)
            (fourth a-match)
            (fifth a-match)))
  
    (cond
      [(regexp-match action-regexp a-line)
       => on-action-line]
      [(regexp-match chat-regexp a-line)
       => on-chat-line]
      [else
       (error 'oops-i-did-it-again)]))
> (parse-irc (read-line irc-port))

(action "01" "15" "join" "neilv")

> (parse-irc (read-line irc-port))

(action "01" "30" "quit" "dnolen: Quit: dnolen")

Nice! We’re using an advanced feature of cond: the arrow (=>) lets us say that if the left-hand side evaluates to a true value, then cond calls the function named on the right-hand side with that value.
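The arrow is not tied to regular expressions. It works with any test that returns something useful when it succeeds. Here is a small sketch using assoc. The table nick-colors and the function describe-nick are invented for this example.

```racket
#lang racket

;; Hypothetical lookup table, for illustration only.
(define nick-colors
  '(("neilv" . "blue") ("offby1" . "green")))

;; describe-nick: string -> string
(define (describe-nick nick)
  (cond
    ;; assoc returns the matching pair, or #f when there is none.
    ;; With =>, the pair itself is passed to the procedure on the right.
    [(assoc nick nick-colors)
     => (lambda (pair)
          (format "~a is shown in ~a" (car pair) (cdr pair)))]
    [else
     (format "~a has no color" nick)]))

(describe-nick "neilv")  ; evaluates to "neilv is shown in blue"
(describe-nick "kennyd") ; evaluates to "kennyd has no color"
```

Without the arrow we would have to call assoc twice, or bind its result with a define before the cond.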

Hmmm. In retrospect, though, using second, third, etc. is a slightly verbose, error-prone way to destructure the list that we’re getting back from regexp-match. Can we do better?

We can, with the structure-matching library match, which lets us express the code more nicely. See the documentation of racket/match for more information on the pattern-matching library. Let’s try this again...
> (define (parse-irc a-line)
    (match a-line
      [(regexp action-regexp
               (list _ hour minute type msg))
       (action hour minute type msg)]
      [(regexp chat-regexp
               (list _ hour minute who msg))
       (chat hour minute who msg)]
      [else
       (error 'oops-i-did-it-again)]))
> (parse-irc (read-line irc-port))

(action "01" "34" "quit" "jonrafkind: Ping timeout: 250 seconds")

> (parse-irc (read-line irc-port))

(action "01" "44" "join" "hkBst")

Ok, better. We can probably keep at it to make parse-irc even smaller, but we should probably stop fiddling with it.
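For the curious, here is one more thing that match patterns can do, sketched on a smaller problem. The function parse-time is a made-up name. It uses the app pattern from racket/match to convert the captured strings to numbers while matching, and the wildcard _ as the catch-all clause.

```racket
#lang racket

;; parse-time: string -> (or/c (list/c number? number?) #f)
;; Hypothetical helper: extracts the leading "HH:MM" as two numbers.
(define (parse-time s)
  (match s
    ;; (regexp rx pat) runs regexp-match and matches pat against the result.
    ;; (app f pat) applies f to the value and matches pat against f's result.
    [(regexp #px"^(\\d\\d):(\\d\\d)"
             (list _
                   (app string->number hour)
                   (app string->number minute)))
     (list hour minute)]
    ;; _ matches anything, so this clause handles every other string.
    [_ #f]))

(parse-time "01:44 (join) hkBst") ; evaluates to '(1 44)
(parse-time "no timestamp here")  ; evaluates to #f
```

Whether this reads better than converting the strings afterwards is a matter of taste.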

Let’s use this function on a few lines.
> (for ([line (in-lines irc-port)]
        [i (in-range 5)])
    (printf "~s\n" (parse-irc line)))

#(struct:action "01" "57" "quit" "dherman: Quit: dherman")

#(struct:action "02" "14" "quit" "hussaibi: Ping timeout: 252 seconds")

#(struct:action "02" "14" "quit" "hussaibi_: Ping timeout: 252 seconds")

#(struct:action "03" "05" "quit" "sheikra: Ping timeout: 255 seconds")

#(struct:action "03" "16" "quit" "neilv: Ping timeout: 255 seconds")

Wow! That’s a lot of quitting. That’s probably a sign that this session should wind down as well. Let’s look through just a few more, just to see a few chats.
> (define chat-lines
    (for/list ([line (in-lines irc-port)]
               #:when (chat? (parse-irc line)))
      (parse-irc line)))
> (for ([a-chat chat-lines]
        [i (in-range 5)])
    (printf "~s\n" a-chat))

#(struct:chat "08" "50" "RacketCommitBot: [racket] plt pushed 1 new commit to master:" "https://github.com/plt/racket/commit/fba1777b8ad6d84200e17c85896f9f6d210b0d1d")

#(struct:chat "08" "50" "RacketCommitBot: [racket/master] fix contract - Matthew" "Flatt")

#(struct:chat "12" "26" "ChibaPet: Eli, are you" "about?")

#(struct:chat "12" "59" "clklein: I was pleasantly surprised to see that `for-template' works in `provide' but now unpleasantly confused about how to use" "it.")

#(struct:chat "13" "01" "drdo: net/imap doesn't support SEARCH or am i missing" "something?")

Ooops! It looks like our regular expression pattern chat-regexp isn’t quite right. That’s why it’s called munging, I suppose. But maybe we should have written test cases.

No “maybe” about it. See rackunit for more details on how to write unit test cases.

Finally, let’s go back and package what we’ve learned into a module (and add a test case to let us know that we’ll need to fix something).

"parse-irc.rkt"

#lang racket
 
;; Munging IRC chat logs
 
(require net/url
         rackunit)
 
;; An IRC port contains both actions and chats.
(struct action (hour minute type msg) #:transparent)
(struct chat (hour minute who msg) #:transparent)
 
;; Regular expressions to parse out the lines in a chat log.
(define action-regexp
  #px"^(\\d\\d):(\\d\\d) [(](.+)[)] (.+)")
 
;; FIXME: this pattern is not quite right...
(define chat-regexp
  #px"^(\\d\\d):(\\d\\d) ([^()]+) (.+)")
 
 
;; parse-irc: string -> (U action chat)
(define (parse-irc a-line)
  (match a-line
    [(regexp action-regexp
             (list _ hour minute type msg))
     (action hour minute type msg)]
    [(regexp chat-regexp
             (list _ hour minute who msg))
     (chat hour minute who msg)]
    [else
     (error 'oops-i-did-it-again)]))
 
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 
;; Let's try it out:
(define irc-port
  (get-pure-port
   (string->url "http://racket-lang.org/irc-logs/20110802.txt")))
 
(define parsed-irc
  (for/list ([line (in-lines irc-port)])
    (parse-irc line)))
 
;; We can look at the final results here:
parsed-irc
 
 
;; A test case that, at present, will fail on us.
(check-equal? (parse-irc "01:42 dyoo: ... And that's a wrap!  See you around!")
              (chat "01" "42" "dyoo" "... And that's a wrap!  See you around!"))
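Why does the test fail? The group ([^()]+) in chat-regexp is greedy and happily matches spaces and colons, so it swallows almost the whole line and leaves only the last word for the message. One possible repair is sketched below. It is not the author's fix, and it assumes that every chat line looks like "HH:MM nick: message" with a nickname that contains no spaces, colons, or parentheses. The names chat-regexp/fixed and parse-chat are made up for this sketch.

```racket
#lang racket

(require rackunit)

(struct chat (hour minute who msg) #:transparent)

;; Assumed repair: the nickname stops at the first colon or space,
;; and the ": " separator is matched explicitly.
(define chat-regexp/fixed
  #px"^(\\d\\d):(\\d\\d) ([^(): ]+): (.+)")

;; parse-chat: string -> (or/c chat? #f)
;; Hypothetical helper that handles chat lines only.
(define (parse-chat a-line)
  (match a-line
    [(regexp chat-regexp/fixed
             (list _ hour minute who msg))
     (chat hour minute who msg)]
    [_ #f]))

;; The test case from the module passes with this pattern.
(check-equal? (parse-chat "01:42 dyoo: ... And that's a wrap!  See you around!")
              (chat "01" "42" "dyoo" "... And that's a wrap!  See you around!"))

;; Administrative lines are not mistaken for chats.
(check-false (parse-chat "00:00 (join) neilv"))
```

Real logs may contain line shapes that this pattern does not cover, so it is worth running it over a whole log and looking at what fails before trusting it.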

3 Acknowledgements

Thanks to Sam Tobin-Hochstadt for the improved version of parse-irc!