Sunday, September 07, 2008

SpeedyFx, MASH and Fast Text Extraction for indexing, searching and classification

Extremely Fast Text Feature Extraction for Classification and Indexing

“Most research in speeding up text mining involves algorithmic improvements to induction algorithms, and yet for many large scale applications, such as classifying or indexing large document repositories, the time spent extracting word features from texts can itself greatly exceed the initial training time. This paper describes a fast method for text feature extraction that folds together Unicode conversion, forced lowercasing, word boundary detection, and string hash computation. We show empirically that our integer hash features result in classifiers with equivalent statistical performance to those built using string word features, but require far less computation and less memory.

7. CONCLUSIONS & FUTURE WORK
We have shown that using SpeedyFx integer hashes in place of actual words is faster, requires less memory for transmission and use of multiple classifiers, and has an effect on classification performance that is practically noise compared to the effect of other common parameters in model selection. We showed that MASH has strong, uniform performance for words, though not appropriate for long blocks of text. Moreover, we have demonstrated the ability to classify a text by many classifiers many times faster than is possible with typical code.

…”

There’s some interesting work in this 16-page PDF from HP Labs that I’d like to take a closer look at when I have the time. We do a great deal of text extraction, and anything that can speed that up would be welcome…
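
To get a feel for what "folding together" lowercasing, word boundary detection, and hashing might look like, here is a minimal Python sketch. It is not the paper's SpeedyFx code, and it does not use the MASH hash. As a stand-in it uses an FNV-1a-style hash applied to code points, and it assumes that any run of alphanumeric characters counts as a word. The function name `extract_features` is my own.

```python
# Stand-in constants: 32-bit FNV-1a parameters (NOT the paper's MASH hash).
FNV_OFFSET = 0x811C9DC5
FNV_PRIME = 0x01000193
MASK32 = 0xFFFFFFFF


def extract_features(text):
    """Return one 32-bit integer hash per word, in a single pass over text.

    Assumption: a word is a maximal run of alphanumeric characters.
    No intermediate word strings are built; the hash is updated
    character by character and emitted at each word boundary.
    """
    features = []
    h = FNV_OFFSET
    in_word = False
    for c in text:
        if c.isalnum():
            # Lowercase on the fly; lower() can yield more than one character.
            for ch in c.lower():
                h = ((h ^ ord(ch)) * FNV_PRIME) & MASK32
            in_word = True
        elif in_word:
            # Word boundary: emit the finished hash and reset the state.
            features.append(h)
            h = FNV_OFFSET
            in_word = False
    if in_word:
        # Flush the last word if the text ends without a separator.
        features.append(h)
    return features


if __name__ == "__main__":
    # "Hello" and "hello" map to the same integer feature.
    print(extract_features("Hello, hello WORLD"))
```

The point of the sketch is only the shape of the loop: the text is scanned once and the classifier sees integers rather than strings. Whatever speed the paper reports comes from its own implementation, not from anything shown here.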

(via Complex Discovery - Extremely Fast Text Feature Extraction)
