Friday, January 09, 2009

Using Tesseract (Open Source OCR from Google) as a tool for learning .NET console application launching and monitoring

Rick Minerich's Development Wonderland - Processes in .NET Part 3 – Interfacing With Simple Console Programs by Example, Tesseract OCR

“Did you know that .NET provides an easy way to interact with and control console programs?  In this article I will walk you through this process by creating a wrapper class for Google’s Tesseract OCR application.  At the end of this post, I will provide a complete WinForms-based frontend for Google’s Tesseract OCR Engine.

You would think that this is a particularly simple case as Tesseract only needs to be passed in parameters and requires no flow control.  Ideally, we will simply leverage the Process class to control how Tesseract is launched and read from its output.  Initially, this is only a small jump from what we learned in Processes in .NET Part 2.  The only real difference here is that instead of using Verbs we are specifying behavior through the ProcessStartInfo’s Arguments property.

Unfortunately, while this very simple example will work in many cases, this is not one.  This is because Tesseract.exe secretly launches a separate process and immediately exits.  This makes the WaitForExit() call look like it was successful but, as OCR takes a while, when you try to read from the output file it will either not yet exist or it will be locked for writing by the Tesseract process.

There are many different ways to approach this problem.  In this case an easy method would be to try repeatedly to access Tesseract’s log file using a timeout to ensure our program doesn't lock up. …

Designing a Wrapper Class

Tesseract has a number of quirks which make it somewhat annoying to deal with, at least when compared with most other command line applications.  It’s important to be on the lookout for these kinds of small quirks when building an interface to an application.  For completeness, I’ll list what I’ve found for Tesseract here along with solutions.

An Asynchronous Wrapper for Easy WinForms Integration

Once you have all of the little quirks of your application covered, the only issue left is that calling your ExtractText method leaves your application locked up until it has returned.  The best way to deal with this is to use a DynamicInvoke on a delegate and manage the update to your console application via a callback.  To make this easy I wrote an asynchronous child class.

…”

There are a ton of cool lessons in this post, from dealing with Tesseract, to handling unusual command line/console apps, to writing non-blocking WinForms code…
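
To make the launch-then-poll idea from the excerpt a little more concrete, here is a minimal sketch of my own. It is not Rick's wrapper class. The class name OcrLauncherSketch, the method names Run and ReadWhenReady, and the timeout values are all hypothetical, chosen only for illustration. Run starts a console program through ProcessStartInfo's Arguments property and waits for it, and ReadWhenReady keeps trying to open a file until it exists and is no longer locked, giving up after a timeout so the caller doesn't hang.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Threading;

// Hypothetical helper for illustration only; not the wrapper from the quoted article.
static class OcrLauncherSketch
{
    // Starts a console program with the given arguments and waits for THAT process to exit.
    // If the program hands its work to a second process, this returns before the work is done.
    public static int Run(string exePath, string arguments)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = exePath,
            Arguments = arguments,
            UseShellExecute = false, // launch the executable directly, not through the shell
            CreateNoWindow = true
        };

        using (var process = Process.Start(startInfo))
        {
            process.WaitForExit();
            return process.ExitCode;
        }
    }

    // Polls until the file exists and can be opened exclusively, then returns its contents.
    // Returns null if the timeout elapses first.
    public static string ReadWhenReady(string path, TimeSpan timeout, TimeSpan pollInterval)
    {
        var stopwatch = Stopwatch.StartNew();
        while (true)
        {
            try
            {
                if (File.Exists(path))
                {
                    // FileShare.None makes the open fail while another process still has the file open.
                    using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.None))
                    using (var reader = new StreamReader(stream))
                    {
                        return reader.ReadToEnd();
                    }
                }
            }
            catch (IOException)
            {
                // The file is still locked (or vanished between the check and the open); try again.
            }

            if (stopwatch.Elapsed >= timeout)
            {
                return null;
            }
            Thread.Sleep(pollInterval);
        }
    }

    static void Main()
    {
        // Stand-in for an output file; a real caller would first call Run with the
        // path to the console program and whatever arguments it expects.
        string path = Path.GetTempFileName();
        File.WriteAllText(path, "recognized text would appear here");

        string text = ReadWhenReady(path, TimeSpan.FromSeconds(5), TimeSpan.FromMilliseconds(100));
        Console.WriteLine(text ?? "(timed out)");

        File.Delete(path);
    }
}
```

In a WinForms app you would still want to call something like this off the UI thread, since both methods block until they finish or time out.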

(via Reflective Perspective - The Morning Brew #261)

 

Related Past Post XRef:
Tesseract 1.01
Tesseract OCR - Released as Open Source
.NET and integration with BCP

1 comment:

Naveen said...

Can Use this application to OCR Pdf documents