A Racket Slice: munging IRC chat logs
Danny Yoo <dyoo@hashcollision.org>
Source code can be found at: https://github.com/dyoo/racket-slices. The latest version of this document lives in http://hashcollision.org/racket-slices/irc-parsing.
1 The Problem
Let’s say that we have some source of text, such as IRC chat logs, and we’d like to extract information from them. To use the precise technical term, we’d like to munge.
2 Exploration
The following is an interactive session between us and the Racket REPL, using the full-fledged racket language.
> (require net/url)
> (define irc-port (get-pure-port (string->url "http://racket-lang.org/irc-logs/20110802.txt")))
Note that identifiers such as get-pure-port are hyperlinked in this document. Clicking a link shows where the function lives. In this case, get-pure-port lives in net/url, as does string->url.
> (for ([line (in-lines irc-port)] [i (in-range 10)]) (printf "~s\n" line))
"00:00 (join) neilv"
"00:02 neilv: if i want to install the racket version that is on the path to becoming 5.1.2, do i get it from http://pre.racket-lang.org/installers/ or somewhere else?"
"00:03 neilv: > You are about to download: plt-5.1.2.3-src-unix.tgz (16M)"
"00:03 neilv: seems promising, but I'm just double-checking"
"00:04 offby1: couldn't tell ya"
"00:08 (join) kennyd"
"00:11 (join) asurai"
"00:11 (quit) asurai: Client Quit"
"00:14 neilv: the c part of the build seems a lot faster since the gui ffi stuff, unsurprisingly"
"00:14 (quit) bmp: Ping timeout: 260 seconds"
Each line appears to be one of two things:
An IRC administrative action, like "00:00 (join) neilv", or
A chat message, like "00:04 offby1: couldn't tell ya".
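To experiment with the same kind of loop without a network connection, a string port can stand in for the log. This is only a minimal sketch: sample-log, sample-port, and first-two are names made up for illustration, and the sample lines are copied from the output above. The second clause, [i (in-range 2)], advances in parallel with the first, so the loop ends after two lines even though the port holds three.

```racket
#lang racket

;; Assumption: a tiny in-memory sample stands in for the real log.
(define sample-log
  (string-append "00:00 (join) neilv\n"
                 "00:04 offby1: couldn't tell ya\n"
                 "00:08 (join) kennyd\n"))

(define sample-port (open-input-string sample-log))

;; The two clauses step together; the loop stops as soon as either
;; sequence runs out, so this collects at most two lines.
(define first-two
  (for/list ([line (in-lines sample-port)]
             [i (in-range 2)])
    line))

first-two
```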
> (regexp-match #px"^(\\d\\d):(\\d\\d)" "12:42")
'("12:42" "12" "42")
> (regexp-match #px"^(\\d\\d):(\\d\\d)" "twelve:forty-two")
#f
You can find more details about regular expressions in the Guide.
> (define action-regexp #px"^(\\d\\d):(\\d\\d) [(](.+)[)] (.+)")
> (define chat-regexp #px"^(\\d\\d):(\\d\\d) ([^()]+) (.+)")
> (for ([line (in-lines irc-port)]
        [i (in-range 5)])
    (cond
      [(regexp-match action-regexp line)
       (printf "I matched an action.\n")]
      [(regexp-match chat-regexp line)
       (printf "I matched a message\n")]
      [else
       (error 'oops-i-did-it-again)]))
I matched a message
I matched an action.
I matched an action.
I matched an action.
I matched an action.
... uh. Probably not. We’re certainly munging.
> (struct action (hour minute type msg) #:transparent)
> (struct chat (hour minute who msg) #:transparent)
> (define (parse-irc a-line)
    (define (on-action-line a-match)
      (action (second a-match)
              (third a-match)
              (fourth a-match)
              (fifth a-match)))
    (define (on-chat-line a-match)
      (chat (second a-match)
            (third a-match)
            (fourth a-match)
            (fifth a-match)))
    (cond
      [(regexp-match action-regexp a-line) => on-action-line]
      [(regexp-match chat-regexp a-line) => on-chat-line]
      [else (error 'oops-i-did-it-again)]))
> (parse-irc (read-line irc-port))
(action "01" "15" "join" "neilv")
> (parse-irc (read-line irc-port))
(action "01" "30" "quit" "dnolen: Quit: dnolen")
Hmmm. In retrospect, though, using second, third, etc. is a verbose, error-prone way to destructure the list that we’re getting back from regexp-match. Can we do better?
See the documentation of racket/match for more information on the pattern-matching library.
> (define (parse-irc a-line)
    (match a-line
      [(regexp action-regexp (list _ hour minute type msg))
       (action hour minute type msg)]
      [(regexp chat-regexp (list _ hour minute who msg))
       (chat hour minute who msg)]
      [else
       (error 'oops-i-did-it-again)]))
> (parse-irc (read-line irc-port))
(action "01" "34" "quit" "jonrafkind: Ping timeout: 250 seconds")
> (parse-irc (read-line irc-port))
(action "01" "44" "join" "hkBst")
Ok, better. We could keep at it to make parse-irc even smaller, but we should probably stop fiddling with it.
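The regexp pattern form used in parse-irc can also be tried in isolation. The following is a small sketch rather than part of the original session: split-time is a hypothetical helper, and its pattern is the time-only regular expression from earlier with a trailing $ added. The sub-pattern (list _ hour minute) is matched against whatever regexp-match returns, with _ skipping the whole-match string.

```racket
#lang racket

;; Hypothetical helper: split "HH:MM" into its two parts, or return #f.
(define (split-time a-string)
  (match a-string
    ;; (regexp rx pat) succeeds when rx matches the string and pat
    ;; matches the list that regexp-match would return.
    [(regexp #px"^(\\d\\d):(\\d\\d)$" (list _ hour minute))
     (list hour minute)]
    ;; _ is a wildcard: anything else falls through to here.
    [_ #f]))

(split-time "12:42")
(split-time "twelve:forty-two")
```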
> (for ([line (in-lines irc-port)] [i (in-range 5)]) (printf "~s\n" (parse-irc line)))
#(struct:action "01" "57" "quit" "dherman: Quit: dherman")
#(struct:action "02" "14" "quit" "hussaibi: Ping timeout: 252 seconds")
#(struct:action "02" "14" "quit" "hussaibi_: Ping timeout: 252 seconds")
#(struct:action "03" "05" "quit" "sheikra: Ping timeout: 255 seconds")
#(struct:action "03" "16" "quit" "neilv: Ping timeout: 255 seconds")
> (define chat-lines
    (for/list ([line (in-lines irc-port)]
               #:when (chat? (parse-irc line)))
      (parse-irc line)))
> (for ([a-chat chat-lines] [i (in-range 5)]) (printf "~s\n" a-chat))
#(struct:chat "08" "50" "RacketCommitBot: [racket] plt pushed 1 new commit to master:" "https://github.com/plt/racket/commit/fba1777b8ad6d84200e17c85896f9f6d210b0d1d")
#(struct:chat "08" "50" "RacketCommitBot: [racket/master] fix contract - Matthew" "Flatt")
#(struct:chat "12" "26" "ChibaPet: Eli, are you" "about?")
#(struct:chat "12" "59" "clklein: I was pleasantly surprised to see that `for-template' works in `provide' but now unpleasantly confused about how to use" "it.")
#(struct:chat "13" "01" "drdo: net/imap doesn't support SEARCH or am i missing" "something?")
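In the output above, the who field holds most of each message and msg holds only the last word. A likely explanation, sketched here against a sample line from earlier in the log, is that [^()]+ also matches spaces and is greedy, so it gives back only as much as is needed for the final space and (.+) to match. The name greedy-result is made up for this illustration.

```racket
#lang racket

;; Same pattern as chat-regexp in the session above.
(define chat-regexp #px"^(\\d\\d):(\\d\\d) ([^()]+) (.+)")

;; [^()]+ accepts any run of non-parenthesis characters, spaces included.
;; Being greedy, it backs off only as far as the last space in the line.
(define greedy-result
  (regexp-match chat-regexp "00:04 offby1: couldn't tell ya"))

greedy-result
;; => '("00:04 offby1: couldn't tell ya" "00" "04" "offby1: couldn't tell" "ya")
```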
No “maybe” about it. See rackunit for more details on how to write unit test cases.
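As a rough sketch of what such test cases might look like, the checks below exercise only the action side of the parser, using lines copied from the log output earlier. parse-action is a hypothetical, cut-down helper introduced so that the example stands alone; it is not the document's parse-irc.

```racket
#lang racket
(require rackunit)

;; Definitions repeated from the session so the sketch stands alone.
(struct action (hour minute type msg) #:transparent)
(define action-regexp #px"^(\\d\\d):(\\d\\d) [(](.+)[)] (.+)")

;; Hypothetical helper covering only action lines; returns #f otherwise.
(define (parse-action a-line)
  (match a-line
    [(regexp action-regexp (list _ hour minute type msg))
     (action hour minute type msg)]
    [_ #f]))

;; check-equal? compares with equal?, which looks inside transparent structs.
(check-equal? (parse-action "00:00 (join) neilv")
              (action "00" "00" "join" "neilv"))
(check-equal? (parse-action "00:11 (quit) asurai: Client Quit")
              (action "00" "11" "quit" "asurai: Client Quit"))
(check-false (parse-action "00:04 offby1: couldn't tell ya"))
```

Passing checks print nothing; a failing check reports the actual and expected values.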
"parse-irc.rkt"
#lang racket
;; Munging IRC chat logs
(require net/url
         rackunit)

;; An IRC port contains both actions and chats.
(struct action (hour minute type msg) #:transparent)
(struct chat (hour minute who msg) #:transparent)

;; Regular expressions to parse out the lines in a chat log.
(define action-regexp #px"^(\\d\\d):(\\d\\d) [(](.+)[)] (.+)")
;; FIXME: this pattern is not quite right...
(define chat-regexp #px"^(\\d\\d):(\\d\\d) ([^()]+) (.+)")

;; parse-irc: string -> (U action chat)
(define (parse-irc a-line)
  (match a-line
    [(regexp action-regexp (list _ hour minute type msg))
     (action hour minute type msg)]
    [(regexp chat-regexp (list _ hour minute who msg))
     (chat hour minute who msg)]
    [else
     (error 'oops-i-did-it-again)]))

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Let's try it out:
(define irc-port
  (get-pure-port
   (string->url "http://racket-lang.org/irc-logs/20110802.txt")))

(define parsed-irc
  (for/list ([line (in-lines irc-port)])
    (parse-irc line)))

;; We can look at the final results here:
parsed-irc

;; A test case that, at the present, will fail on us.
(check-equal? (parse-irc "01:42 dyoo: ... And that's a wrap!  See you around!")
              (chat "01" "42" "dyoo" "... And that's a wrap!  See you around!"))
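The FIXME is left open in the file above. One possible repair, offered only as a sketch and not as the author's fix, is to stop the nickname at the first colon. It assumes that a nickname contains no spaces, colons, or parentheses and is always followed by ": "; the names fixed-chat-regexp and parse-chat are hypothetical.

```racket
#lang racket
(require rackunit)

(struct chat (hour minute who msg) #:transparent)

;; Assumption: a nickname has no spaces, colons, or parentheses, and is
;; followed by ": ". The original pattern lets the nickname group run on.
(define fixed-chat-regexp #px"^(\\d\\d):(\\d\\d) ([^(): ]+): (.+)")

;; Hypothetical helper covering only chat lines; returns #f otherwise.
(define (parse-chat a-line)
  (match a-line
    [(regexp fixed-chat-regexp (list _ hour minute who msg))
     (chat hour minute who msg)]
    [_ #f]))

(check-equal? (parse-chat "01:42 dyoo: ... And that's a wrap! See you around!")
              (chat "01" "42" "dyoo" "... And that's a wrap! See you around!"))
(check-false (parse-chat "00:00 (join) neilv"))
```

Because the nickname group cannot cross a colon, a message that itself contains ": " stays whole in msg. Whether real nicknames in the logs always satisfy the assumption would need checking against the data.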
3 Acknowledgements
Thanks to Sam Tobin-Hochstadt for the improved version of parse-irc!