Two new Lucene.Net Articles, Text Analysis and Custom Synonym Analyzer
CodeProject - Lucene.Net - Text Analysis
“Lucene.Net is a high performance Information Retrieval (IR) library, also known as a search engine library. Lucene.Net contains powerful APIs for creating full text indexes and implementing advanced and precise search technologies in your programs. Some people may confuse Lucene.Net with a ready-to-use application like a web search/crawler or a file search application, but Lucene.Net is not such an application; it's a framework library. Lucene.Net provides a framework for implementing these difficult technologies yourself. Lucene.Net does not discriminate on what you can index and search, which gives you a lot more power compared to other full text indexing/searching implementations; you can index anything that can be represented as text. There are also ways to get Lucene.Net to index HTML, Office documents, PDF files, and much more.
…
What are Analyzers?
An Analyzer has a single job, and that is to be an advanced word breaker: an object that reads a stream of text and breaks the words apart into objects called Tokens. The Token class will generally hold the results of the analysis as individual words. This is a very brief summary of what an Analyzer can do and how it affects your full text index. A good Analyzer will not only break the words apart, but will also transform the text to make it more suitable for indexing. One simple transformation an Analyzer can do is to lowercase everything it comes across; that way your index will be case insensitive.
In the Lucene framework there are two major spots where an Analyzer is used: when indexing and when searching. For the indexing portion, the direct output of the Analyzer is what gets indexed. So, for example, with the previously mentioned Analyzer that converts everything to lowercase, if we come across the word "CAT", the analyzer will output "cat", and in the full text index a Term of "cat" will be associated with the Document. For an even bigger example, if we use an Analyzer that breaks the words apart at the spaces and then converts them all to lowercase, the results should look something like this.…
Attached to this article is the Analyzer Viewer application that I made, as both the source and a ready-to-run binary. The sample is more like a little utility to see how the basic Analyzers included with Lucene.Net will view text. The application lets you directly input some text, and it shows you all the results of the text analysis: how it split the text into tokens and what transformations it applied.
Some interesting things to look at include typing in email addresses, numbers with letters, numbers alone, acronyms, alternating cases, and just anything else you want to play with to see how the indexing process goes.
…
Implementations of a Tokenizer
As I mentioned earlier, the Tokenizer class is an abstract class derived from TokenStream. Lucene.Net provides a few implementations of a Tokenizer that it uses in some of the Analyzers. Here are a couple of them with a small description of each.
KeywordTokenizer - This Tokenizer will read the entire stream of text and return the whole thing as a single Token.
…”
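To make the quoted description a little more concrete, here is a rough sketch in plain Python of the ideas it covers: a tokenizer that splits on whitespace, a filter that lowercases, and a keyword-style tokenizer that returns the whole input as one token. This is not Lucene.Net code and does not use its API; the names `Token`, `whitespace_tokenize`, `lowercase_filter`, `keyword_tokenize` and `analyze` are made up for illustration, and the character offsets are an assumption about what a token might carry.

```python
# Conceptual sketch only: plain Python, not the Lucene.Net API.
# All names below are hypothetical and exist just for this illustration.
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Token:
    text: str
    start: int  # offset of the first character in the source text
    end: int    # offset one past the last character


def whitespace_tokenize(text: str) -> Iterator[Token]:
    """Break the text apart at whitespace, remembering where each word was."""
    start = None
    for i, ch in enumerate(text):
        if ch.isspace():
            if start is not None:
                yield Token(text[start:i], start, i)
                start = None
        elif start is None:
            start = i
    if start is not None:
        yield Token(text[start:], start, len(text))


def lowercase_filter(tokens: Iterable[Token]) -> Iterator[Token]:
    """Transform each token to lowercase so matching is case insensitive."""
    for t in tokens:
        yield Token(t.text.lower(), t.start, t.end)


def keyword_tokenize(text: str) -> Iterator[Token]:
    """Return the entire input as a single token."""
    yield Token(text, 0, len(text))


def analyze(text: str) -> List[Token]:
    """Chain the tokenizer and the filter, the way an analyzer would."""
    return list(lowercase_filter(whitespace_tokenize(text)))


if __name__ == "__main__":
    for token in analyze("The CAT sat"):
        print(token)
```

The same `analyze` function would be run over both the documents being indexed and the search text, which is why "CAT" in a document and "cat" in a query end up as the same term.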
CodeProject - Lucene.Net – Custom Synonym Analyzer
“…
How Do I Get Lucene.Net to Work with Synonyms?
The goal here is to search for a word and retrieve results that contain words with the same meaning as the words you are searching for. This lets you search, in a sense, by meaning rather than by keywords.
We can easily get Lucene.Net to work with synonyms by creating a custom Analyzer class. The Analyzer will be able to inject the synonyms into the full text index. For some details on the internals of an Analyzer, please see my previous article Lucene.Net – Text Analysis.
…
Points of Interest
The SynonymAnalyzer is really great for indexing, but I think it might junk up a Query if you plan to use the SynonymAnalyzer with a QueryParser to construct a query. One way around this is to modify the SynonymFilter and SynonymAnalyzer to have a bool switch to turn the synonym injection on and off. That way you could turn the synonym injection off while you are using it with a QueryParser.
The code attached includes the Analyzer Viewer application that I had in my last article, updated to include our brand new synonym analyzer.
…”
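The synonym injection and the on/off switch suggested in the quote can be sketched roughly as follows. Again this is plain Python, not the article's SynonymFilter or any Lucene.Net API: the `SYNONYMS` table, the `synonym_filter` and `analyze` functions, and the `inject` flag are all hypothetical. Emitting each synonym with a position increment of 0, so it sits at the same position as the original word, follows a common convention in Lucene-style analysis; the excerpt does not show whether the article does it that way.

```python
# Conceptual sketch only: plain Python, not the article's SynonymFilter.
# The table, function names and `inject` flag are hypothetical.
from typing import Dict, Iterable, Iterator, List, Tuple

# Made-up synonym table for illustration.
SYNONYMS: Dict[str, List[str]] = {
    "fast": ["quick", "rapid"],
    "car": ["automobile"],
}


def synonym_filter(
    terms: Iterable[str],
    synonyms: Dict[str, List[str]],
    inject: bool = True,
) -> Iterator[Tuple[str, int]]:
    """Yield (term, position_increment) pairs.

    The original term advances the position by 1. Each injected synonym
    uses an increment of 0 so it shares the original term's position.
    """
    for term in terms:
        yield term, 1
        if inject:
            for synonym in synonyms.get(term, []):
                yield synonym, 0


def analyze(text: str, inject: bool = True) -> List[Tuple[str, int]]:
    """Lowercase, split on whitespace, then optionally inject synonyms."""
    return list(synonym_filter(text.lower().split(), SYNONYMS, inject))


if __name__ == "__main__":
    print(analyze("Fast car"))                # indexing: synonyms injected
    print(analyze("Fast car", inject=False))  # query parsing: switch off
```

With the switch on, the index gets the extra terms; with it off, a query built from the same text keeps only the words the user typed.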
Two cool new Lucene.Net articles from Andrew Smith (blog). I swear that I’m going to use Lucene.Net one of these days… ;)
Related Past Post XRef:
Five pages to getting started with Lucene.Net - Introducing Lucene.Net
Lucene.Net & C# Indexing and Searching WinForm Example
Lucene.Net Resource List – Books, links and API’s, oh my…
LINQ to Lucene
Using Lucene.Net to Index And Search C# Source
Lucene.Net 2.0 Final Released
"DotLucene / Lucene.Net has moved to ASF"
Indexing Database Content with dotLucene
DotLucene: Full-Text Search for Your Intranet or Website using 37 Lines of Code