CodeProject - Lucene.Net - Text Analysis
“Lucene.Net is a high-performance Information Retrieval (IR) library, also known as a search engine library. Lucene.Net contains powerful APIs for creating full-text indexes and implementing advanced and precise search technologies in your programs. Some people may confuse Lucene.Net with a ready-to-use application like a web search/crawler or a file search application, but Lucene.Net is not such an application; it is a framework library. Lucene.Net provides a framework for implementing these difficult technologies yourself. Lucene.Net does not discriminate on what you can index and search, which gives you a lot more power compared to other full-text indexing/searching implementations; you can index anything that can be represented as text. There are also ways to get Lucene.Net to index HTML, Office documents, PDF files, and much more.
…
What are Analyzers?
An Analyzer has a single job, and that is to be an advanced word breaker: an object that reads a stream of text and breaks the words apart into objects called Tokens. The Token class will generally hold the results of the analysis as individual words. This is a very brief summary of what an Analyzer can do and how it affects your full-text index. A good Analyzer will not only break the words apart, but will also transform the text to make it more suitable for indexing. One simple transformation an Analyzer can perform is to lowercase everything it comes across, so that your index is case insensitive.
In the Lucene framework there are two major places where an Analyzer is used: when indexing and when searching. For the indexing portion, the direct output of the Analyzer is what gets indexed. For example, with the previously mentioned Analyzer that converts everything to lowercase, if we come across the word "CAT", the analyzer will output "cat", and in the full-text index a Term of "cat" will be associated with the Document. For a bigger example, if we use an Analyzer that breaks the words apart at the spaces and then converts everything to lowercase, the results should look something like this.
…
Attached to this article is the Analyzer Viewer application that I made. Both the source and a ready-to-run binary of the application are attached. The sample is a small utility for seeing how the basic Analyzers included with Lucene.Net view text. The application lets you directly input some text, and it shows you the results of the text analysis: how the text was split into tokens and what transformations were applied.
Some interesting things to look at include typing in email addresses, numbers with letters, numbers alone, acronyms, alternating cases, and anything else you want to play with to see how the indexing process goes.
…
Implementations of a Tokenizer
As I mentioned earlier, the Tokenizer class is an abstract base class derived from TokenStream. Lucene.Net provides a few implementations of a Tokenizer that it uses in some of the Analyzers. Here are a couple of them, with a short description of each.
KeywordTokenizer - This Tokenizer will read the entire stream of text and return the whole thing as a single Token.
…”
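To make the analysis steps quoted above concrete, here is a minimal, language-neutral sketch in Python. It is not Lucene.Net code and does not use the Lucene.Net API: the `Token` dataclass and the functions `whitespace_tokenize`, `lowercase_filter`, `keyword_tokenize`, and `simple_analyze` are hypothetical names made up for illustration. It only shows the idea of splitting on spaces, lowercasing each token, and, for contrast, returning the whole input as one token.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for a token: term text plus character offsets."""
    text: str
    start: int
    end: int


def whitespace_tokenize(text: str) -> Iterator[Token]:
    """Yield one Token per run of non-whitespace characters."""
    for match in re.finditer(r"\S+", text):
        yield Token(match.group(), match.start(), match.end())


def lowercase_filter(tokens: Iterable[Token]) -> Iterator[Token]:
    """Lowercase each token's text, leaving its offsets unchanged."""
    for token in tokens:
        yield Token(token.text.lower(), token.start, token.end)


def keyword_tokenize(text: str) -> Iterator[Token]:
    """Return the entire input as a single Token (the keyword-style behaviour)."""
    yield Token(text, 0, len(text))


def simple_analyze(text: str) -> List[Token]:
    """Assumed analyzer chain: split on whitespace, then lowercase."""
    return list(lowercase_filter(whitespace_tokenize(text)))


if __name__ == "__main__":
    print([t.text for t in simple_analyze("The CAT sat")])    # ['the', 'cat', 'sat']
    print([t.text for t in keyword_tokenize("The CAT sat")])  # ['The CAT sat']
```

The filter wraps the tokenizer rather than being built into it, which mirrors the general tokenizer-plus-filter layering the article describes; the real Lucene.Net classes carry more state than this sketch does.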
CodeProject - Lucene.Net – Custom Synonym Analyzer
“…
How Do I Get Lucene.Net to Work with Synonyms?
The goal here is to be able to search for a word and retrieve results that contain words with the same meaning as the words you are searching for. This allows you to search by meaning rather than by exact keywords.
We can easily get Lucene.Net to work with synonyms by creating a custom Analyzer class. The Analyzer will be able to inject the synonyms into the full-text index. For some details on the internals of an Analyzer, please see my previous article, Lucene.Net – Text Analysis.
…
Points of Interest
The SynonymAnalyzer is really great for indexing, but I think it might junk up a Query if you plan to use the SynonymAnalyzer with a QueryParser to construct a query. One way around this is to modify the SynonymFilter and SynonymAnalyzer to have a bool switch that turns the synonym injection on and off. That way you could turn the synonym injection off while you are using it with a QueryParser.
The attached code includes the Analyzer Viewer application from my last article, updated to include our brand-new synonym analyzer.
…”
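The synonym injection and the on/off switch suggested in the excerpt can be sketched roughly as follows. This is a Python illustration, not the article's SynonymFilter or SynonymAnalyzer and not the Lucene.Net API: the `SYNONYMS` table, the function names, and the `inject` flag are all hypothetical. Each synonym is emitted with a position increment of 0 to model the assumption that it occupies the same position as the original term.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical synonym table; a real application would load this from a thesaurus.
SYNONYMS: Dict[str, List[str]] = {
    "quick": ["fast", "speedy"],
    "cat": ["feline"],
}


def synonym_filter(
    terms: Iterable[str],
    synonyms: Dict[str, List[str]],
    inject: bool = True,
) -> Iterator[Tuple[str, int]]:
    """Yield (term, position_increment) pairs.

    Original terms advance the position by 1. When `inject` is True, each
    synonym follows its original term with an increment of 0.
    """
    for term in terms:
        yield term, 1
        if inject:
            for synonym in synonyms.get(term, []):
                yield synonym, 0


def analyze(text: str, inject_synonyms: bool = True) -> List[Tuple[str, int]]:
    """Assumed chain: lowercase, split on whitespace, then optionally add synonyms."""
    terms = text.lower().split()
    return list(synonym_filter(terms, SYNONYMS, inject_synonyms))


if __name__ == "__main__":
    # Index time: synonyms are injected alongside the original terms.
    print(analyze("Quick cat"))
    # Query time: injection switched off so the query is not cluttered.
    print(analyze("Quick cat", inject_synonyms=False))
```

With the flag on, "Quick cat" yields the original terms plus their synonyms; with it off, only `quick` and `cat` come back, which is the behaviour the excerpt proposes for use with a QueryParser.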
Two cool new Lucene.Net articles from Andrew Smith (blog). I swear that I’m going to use Lucene.Net one of these days… ;)
Related Past Post XRef:
Five pages to getting started with Lucene.Net - Introducing Lucene.Net
Lucene.Net & C# Indexing and Searching WinForm Example
Lucene.Net Resource List – Books, links and API’s, oh my…
LINQ to Lucene
Using Lucene.Net to Index And Search C# Source
Lucene.Net 2.0 Final Released
"DotLucene / Lucene.Net has moved to ASF"
Indexing Database Content with dotLucene
DotLucene: Full-Text Search for Your Intranet or Website using 37 Lines of Code