Friday, April 28, 2006

Aho-Corasick Algorithm String Matching

The Code Project - Aho-Corasick string matching in C#

"In this article, I will describe the implementation of an efficient Aho-Corasick algorithm for pattern matching. In simple words, this algorithm can be used for searching a text for specified keywords. The following code is useful when you have a set of keywords and you want to find all occurrences of a keywords in the text or check if any of the keywords is present in the text. You should use this algorithm especially if you have a large number of keywords that don’t change often, because in this case, it is much more efficient than other algorithms that can be simply implemented using the .NET class library.

Aho-Corasick algorithm

In this section, I’ll try to describe the concept of this algorithm. For more information and for a more exact explanation, please take a look at the links at the end of this article. The algorithm consists of two parts. The first part is the building of the tree from keywords you want to search for, and the second part is searching the text for the keywords using the previously built tree (state machine). Searching for a keyword is very efficient, because it only moves through the states in the state machine.
 
...

I decided to implement this algorithm when I had to ban some words in a community web page (vulgarisms etc.). This is a typical use case because searching should be really fast, but blocked keywords don’t change often (and the creation of the keyword tree can be slower).

...

Conclusion

This implementation of the Aho-Corasick search algorithm is very efficient if you want to find a large number of keywords in a text of any length, but if you want to search only for a few keywords, it is better to use a simple method like String.IndexOf. The code can be compiled in both .NET 1.1 and .NET 2.0 without any modifications. ..."
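
To make the two parts described in the quote concrete, here is a minimal Python sketch. It is not the article's C# code: the class name `AhoCorasick` and the methods `iter_matches`, `find_all`, and `contains_any` are my own hypothetical names, and the layout (parallel lists for transitions, failure links, and outputs) is just one simple way to do it. The constructor builds the keyword tree and its failure links; the search then makes a single pass over the text, following state transitions.

```python
from collections import deque


class AhoCorasick:
    """Illustrative keyword matcher; names and layout are assumptions."""

    def __init__(self, keywords):
        # State 0 is the root. goto[s] maps a character to the next state,
        # fail[s] is the fallback state, out[s] lists keywords ending at s.
        self.goto = [{}]
        self.fail = [0]
        self.out = [[]]
        # Part 1a: build the keyword tree (skip empty and duplicate words).
        for word in dict.fromkeys(keywords):
            if word:
                self._add(word)
        # Part 1b: turn the tree into a state machine.
        self._build_failure_links()

    def _add(self, word):
        state = 0
        for ch in word:
            nxt = self.goto[state].get(ch)
            if nxt is None:
                nxt = len(self.goto)
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[state][ch] = nxt
            state = nxt
        self.out[state].append(word)

    def _build_failure_links(self):
        # Breadth-first, so every shallower state is finished first.
        # States directly under the root keep the root as their fallback.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                # Inherit keywords that end at the fallback state.
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def iter_matches(self, text):
        # Part 2: one pass over the text, yielding (start_index, keyword).
        state = 0
        for i, ch in enumerate(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for word in self.out[state]:
                yield (i - len(word) + 1, word)

    def find_all(self, text):
        return list(self.iter_matches(text))

    def contains_any(self, text):
        # Stops at the first hit, e.g. for a banned-word check.
        return next(self.iter_matches(text), None) is not None
```

Building the matcher once and reusing it fits the banned-words case from the quote: something like `AhoCorasick(["he", "she", "his", "hers"]).contains_any(post_text)` could be called for every submitted post (`post_text` being a placeholder for whatever string you are checking), while the slower construction step runs only when the keyword list changes.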

I know you see me say this a lot, but there’s got to be some way I can use this...
