Monday, January 05, 2009

Two new Lucene.Net Articles, Text Analysis and Custom Synonym Analyzer

CodeProject - Lucene.Net - Text Analysis

“Lucene.Net is a high-performance Information Retrieval (IR) library, also known as a search engine library. Lucene.Net contains powerful APIs for creating full-text indexes and implementing advanced, precise search technologies in your programs. Some people may confuse Lucene.Net with a ready-to-use application such as a web search/crawler or a file search application, but Lucene.Net is not such an application; it's a framework library. Lucene.Net provides a framework for implementing these difficult technologies yourself. Lucene.Net does not discriminate on what you can index and search, which gives you a lot more power compared to other full-text indexing/searching implementations; you can index anything that can be represented as text. There are also ways to get Lucene.Net to index HTML, Office documents, PDF files, and much more.

What are Analyzers?

An Analyzer has a single job, and that is to be an advanced word breaker: an object that reads a stream of text and breaks the words apart into objects called Tokens. The Token class will generally hold the results of the analysis as individual words. This is a very brief summary of what an Analyzer can do and how it affects your full-text index. A good Analyzer will not only break the words apart, but will also perform a transformation of the text to make it more suitable for indexing. One simple transformation an Analyzer can do is to lowercase everything it comes across, so that your index will be case insensitive.

In the Lucene framework there are two major spots where an Analyzer is used: when indexing and when searching. For the indexing portion, the direct output of the Analyzer is what gets indexed. For example, with the Analyzer described earlier that converts everything to lowercase, if we come across the word "CAT", the Analyzer will output "cat", and in the full-text index a Term of "cat" will be associated with the Document. For an even bigger example, if we use an Analyzer that breaks the words apart at the spaces and then converts them all to lowercase, the results should look something like this.

Attached to this article is the Analyzer Viewer application that I made. Attached are both the source and a ready-to-run binary of the application. The sample is more like a little utility to see how the basic Analyzers included with Lucene.Net will view text. The application will allow you to directly input some text, and it will show you all the results of the text analysis: how it split the text up into tokens and what transformations it applied.

Some interesting things to look at include typing in email addresses, numbers with letters, numbers alone, acronyms, alternating cases, and anything else you want to play with to see how the indexing process goes.

Implementations of a Tokenizer

As I mentioned earlier, the Tokenizer class is an abstract class that derives from TokenStream. Lucene.Net provides a few implementations of a Tokenizer that it uses in some of the Analyzers. Here are a couple of them and a small description of each.

KeywordTokenizer - This Tokenizer will read the entire stream of text and return the whole thing as a single Token.

…”
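To make the excerpt above a bit more concrete, here is a minimal, library-free sketch of the two ideas it describes: breaking text apart at whitespace and lowercasing the pieces, and the KeywordTokenizer behaviour of returning the whole input as a single token. This is not Lucene.Net code and not taken from the article; the `Token` class and the function names are hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Token:
    """Hypothetical stand-in for a token: term text plus character offsets."""
    text: str
    start: int
    end: int


def whitespace_tokens(text: str) -> Iterator[Token]:
    """Yield one Token per run of non-whitespace characters."""
    start = None
    for i, ch in enumerate(text):
        if ch.isspace():
            if start is not None:
                yield Token(text[start:i], start, i)
                start = None
        elif start is None:
            start = i
    if start is not None:
        yield Token(text[start:], start, len(text))


def lowercase(tokens: Iterable[Token]) -> Iterator[Token]:
    """Filter step: lowercase each term so matching becomes case insensitive."""
    for token in tokens:
        yield Token(token.text.lower(), token.start, token.end)


def keyword_tokens(text: str) -> Iterator[Token]:
    """Return the entire input as one Token, as the excerpt describes."""
    yield Token(text, 0, len(text))


def analyze(text: str) -> List[str]:
    """Chain the tokenizer and the filter, returning just the terms."""
    return [token.text for token in lowercase(whitespace_tokens(text))]
```

With this sketch, `analyze("The CAT sat")` gives `["the", "cat", "sat"]`, while `keyword_tokens("The CAT sat")` yields a single token covering the whole string. Chaining a tokenizer with a filter is only one plausible way to arrange the steps; the real Analyzers may be structured differently.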

CodeProject - Lucene.Net – Custom Synonym Analyzer

“…

How Do I Get Lucene.Net to Work with Synonyms?

The goal here is to be able to search for a word and retrieve results that contain words with the same meaning as the words you are searching for. This lets you search, to some degree, by meaning rather than by keyword.

We can easily get Lucene.Net to work with synonyms by creating a custom Analyzer class. The Analyzer will be able to inject the synonyms into the full text index. For some details on the internals of an Analyzer, please see my previous article Lucene.Net – Text Analysis.

Points of Interest

The SynonymAnalyzer is really great for indexing, but I think it might junk up a Query if you plan to use the SynonymAnalyzer with a QueryParser to construct a query. One way around this is to modify the SynonymFilter and SynonymAnalyzer to have a bool switch that turns the synonym injection on and off. That way you could turn the synonym injection off while you are using it with a QueryParser.

The attached code includes the Analyzer Viewer application from my last article, updated to include our brand new synonym analyzer.

…”
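The synonym injection and the on/off switch described in the excerpt can be sketched roughly as follows. Again, this is a library-free illustration, not the article's SynonymFilter: the synonym table, the function names, and the `enabled` flag are all hypothetical. Emitting each synonym with a position increment of 0, so that it shares the position of the original term, is an assumption about how such injection is usually done, not something the excerpt states.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical synonym table, for illustration only.
SYNONYMS: Dict[str, List[str]] = {
    "fast": ["quick", "rapid"],
    "car": ["automobile"],
}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it at whitespace."""
    return text.lower().split()


def inject_synonyms(
    tokens: Iterable[str],
    synonyms: Dict[str, List[str]],
    enabled: bool = True,
) -> Iterator[Tuple[str, int]]:
    """Yield (term, position_increment) pairs.

    Original terms advance the position by 1. When injection is enabled,
    each synonym follows its original term with an increment of 0, so it
    occupies the same position in the stream.
    """
    for term in tokens:
        yield term, 1
        if enabled:
            for synonym in synonyms.get(term, []):
                yield synonym, 0
```

At index time, `list(inject_synonyms(tokenize("Fast car"), SYNONYMS))` produces `[("fast", 1), ("quick", 0), ("rapid", 0), ("car", 1), ("automobile", 0)]`. Passing `enabled=False`, as you might when feeding a QueryParser, leaves only `[("fast", 1), ("car", 1)]`, so the query is not cluttered with extra terms.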

 

Two new cool Lucene.net articles from Andrew Smith (blog). I swear that I’m going to use Lucene.Net one of these days… ;)

 

Related Past Post XRef:
Five pages to getting started with Lucene.Net - Introducing Lucene.Net
Lucene.Net & C# Indexing and Searching WinForm Example
Lucene.Net Resource List – Books, links and API’s, oh my…
LINQ to Lucene
Using Lucene.Net to Index And Search C# Source
Lucene.Net 2.0 Final Released
"DotLucene / Lucene.Net has moved to ASF"
Indexing Database Content with dotLucene
DotLucene: Full-Text Search for Your Intranet or Website using 37 Lines of Code
