Tesseract OCR - Released as Open Source - Greg's Cool [Insert Clever Name] of the Day

Wednesday, September 06, 2006

Tesseract OCR - Released as Open Source

"We wanted to let you all know that a few months ago we quietly released - or actually re-released - an Optical Character Recognition (OCR) engine into open source. You might wonder why Google is interested in OCR? In a nutshell, we are all about making information available to users, and when this information is in a paper document, OCR is the process by which we can convert the pages of this document into text that can then be used for indexing.

This particular OCR engine, called Tesseract, was in fact not originally developed at Google! It was developed at Hewlett Packard Laboratories between 1985 and 1995....

...A few things to know about Tesseract OCR: for now it only supports the English language, and does not include a page layout analysis module (yet), so it will perform poorly on multi-column material. It also doesn't do well on grayscale and color documents, and it's not nearly as accurate as some of the best commercial OCR packages out there. Yet, as far as we know, despite its shortcomings, Tesseract is far more accurate than any other Open Source OCR package out there..."

This could be an interesting project...

Currently the SourceForge download doesn't include a binary and won't compile for me (it seems to be missing ccutil\mfcpch.cpp?), and I don't have the bandwidth to check out the source via CVS.

Still, now that Google is behind it, I'm adding this project to my watch list...

(via TheMadAdmin - Google Open Sources an OCR program.)