Tuesday, July 12, 2011

That’s a hOOt! A from-scratch, C#-based full text indexer and search engine

CodeProject - hOOt - full text search engine

“Introduction

hOOt is an extremely small and fast embedded full text search engine for .net, built from scratch using an inverted WAH bitmap index. Most people are familiar with an Apache project by the name of Lucene.net, which is a port of the original java version. Many people have complained in the past that the .net version of lucene is not maintained, and many unsupported ports of the original exist. To circumvent this I have created this project, which does the same job and is smaller, simpler and faster. hOOt is part of my upcoming RaptorDB document store database, and was so successful that I decided to release it as a separate entity in the meantime.
hOOt uses the following articles:

Based on the response and reaction of users to this project, I will upgrade and enhance hOOt to full feature compatibility with lucene.net, so show your love.

Why Another Full Text Indexer?

I was always fascinated by how Google searches in general, and by the lucene indexing technique and its internal algorithms, but the code was just too difficult to follow; anyone who has worked with lucene.net will attest that it is a complicated and convoluted piece of code. While some people are trying to create a more .net optimized version, the fact of the matter is that it is not easy to do with that code base. What amazes me is that nobody has rewritten it from scratch. hOOt is much simpler, smaller and faster than lucene.net.

One of the reasons for creating hOOt was to implement full text search on string columns in RaptorDB - the document store version. Hopefully more people will be able to use and extend hOOt instead of lucene.net, as it is much easier to understand and change.

Features

hOOt has been built with the following features in mind:

  • Blazing fast operating speed (see performance test section)
  • Incredibly small code size.
  • Uses WAH compressed BitArrays to store information.
  • Multi-threaded implementation, meaning you can query while indexing.
  • Tiny size: only a 38kb DLL (lucene.net is ~300kb).
  • Highly optimized storage, typically ~30% smaller than lucene.net (the more in the index the greater the difference).
  • Query strings are parsed on spaces with the AND operator (i.e. all words must exist).

Limitations

The following limitations are in this release:


What I really liked about this project was that the author not only provided some cool functionality and code (and the implementation of the IFilter stuff), but also explained the concepts behind the code.
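To give a flavor of those concepts, here is a minimal sketch of my own of an inverted bitmap index where query strings are split on spaces and combined with AND. This is not hOOt's code, and it is in Python rather than C# just to keep it short. The class name TinyBitmapIndex and its methods are made up for illustration, and plain integers stand in for the WAH compressed BitArrays, so nothing is actually compressed here.

```python
from collections import defaultdict


class TinyBitmapIndex:
    """Toy inverted index: each word maps to a bitmap of document numbers.

    Hypothetical illustration only; hOOt's real classes and storage differ.
    """

    def __init__(self):
        self.docs = []                    # document number -> original text
        self.bitmaps = defaultdict(int)   # word -> int used as a bit set

    def add(self, text):
        """Index a document and return its document number."""
        doc_id = len(self.docs)
        self.docs.append(text)
        for word in set(text.lower().split()):
            self.bitmaps[word] |= 1 << doc_id   # set this document's bit
        return doc_id

    def search(self, query):
        """Return the numbers of documents containing ALL words in the query."""
        words = query.lower().split()
        if not words:
            return []
        # Start with every document selected, then AND in each word's bitmap.
        hits = (1 << len(self.docs)) - 1
        for word in words:
            hits &= self.bitmaps.get(word, 0)   # unknown word -> no documents
        return [i for i in range(len(self.docs)) if (hits >> i) & 1]


index = TinyBitmapIndex()
index.add("hOOt is a small full text search engine")
index.add("Lucene is a full text search library")
index.add("RaptorDB is a document store")

print(index.search("full text"))    # [0, 1]
print(index.search("hoot search"))  # [0]
print(index.search("hoot lucene"))  # []
```

The appeal of the bitmap approach is that an AND query becomes a bitwise AND over one bitmap per word, which is cheap no matter how many documents each word appears in.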

What I couldn't help thinking about was how cool this project would be if it were mashed up with the Interactive WinForm Tag Cloud Control (think “Cool, I can add a Word/Tag Cloud thing to my WinForm app!”)… hmm…
