ahocorasick -- search for matches with a keyword tree

[Downloads] [Introduction] [Usage] [Development Details] [Related Modules] [Bugs/Changelog]

Downloads

I'm a bit lazy, so I'll do a rough copy-and-paste from the README. *grin*

Introduction

There was a thread in March 2005 where Andre Soereng mentioned that he wanted something to search text for a set of keywords:

http://mail.python.org/pipermail/python-list/2005-March/268953.html
and I suggested using either suffix trees or an Aho-Corasick automaton. I'd written a wrapper for an implementation of suffix trees before, but I hadn't written one for the Aho-Corasick automaton, so symmetry demanded that I write a wrapper here too. *grin*

Usage

>>> import ahocorasick
>>> tree = ahocorasick.KeywordTree()
>>> tree.add("alpha")          
>>> tree.add("alpha beta")
>>> tree.add("gamma")
>>>
>>> tree.make()
>>>
>>> tree.search("I went to alpha beta the other day to pick up some spam")
(10, 15)
>>> tree.search_long("I went to alpha beta the other day to pick up some spam")
(10, 20)
>>> tree.search("and also got some alphabet soup")
(18, 23)
>>> tree.search("but no waffles")
>>> 
>>> tree.search_long("oh, gamma rays are not tasty")
(4, 9)
>>>
>>> tree.findall("I went to alpha beta to pick up alphabet soup")

>>> for match in tree.findall("I went to alpha beta to pick up alphabet soup"):
...     print match
... 
(10, 15)
(32, 37)

The 'ahocorasick' module provides a single class called KeywordTree. KeywordTree has the following methods:

Development details

The Aho-Corasick automaton is a data structure that can quickly do a multiple-keyword search across text. It's described in the classic paper 'Efficient string matching: an aid to bibliographic search':

http://portal.acm.org/citation.cfm?id=360855&dl=ACM&coll=GUIDE
The majority of the code here was wilfully stolen..., er... "adapted" from source code that I found in the Fairly Fast Packet Filter (FFPF) project:
http://ffpf.sourceforge.net/general/overview.php
One of the filters that they include is a fairly clean and simple C implementation of the Aho-Corasick keyword tree data structure, so I just took that and built a wrapper around it.

Related Modules

Nicolas Lehuen's pytst looks interesting --- I must find some time to play with it! He's implemented a Ternary Search Tree, and his implementation appears tight and efficient.

Bugs / Changelog

Back to Python.