Friday, December 02, 2005

"Converting PDF to Text in C#" with PDFBox/IKVM.NET

Converting PDF to Text in C# - The Code Project - C# Programming

"How to parse PDF files?

When extending the indexing solution for an intranet built using the DotLucene fulltext search library I decided to add support for PDF files. But DotLucene can only handle plain text so the PDF files needed to be converted.

After hours of googling I found a reasonable solution which uses 'pure' .NET - at least there are no other dependencies than a few assemblies of IKVM.NET. Before we start with the solution let's take a look at the other ways I tried..."


This is a cool example of extracting text from PDFs using PDFBox. I like how the author talks about the other methods he tried, the Adobe PDF IFilter and iTextSharp, before ending up with PDFBox.

This is the first time I've seen PDFBox. It looks pretty cool, and like something I might be able to use...

From the PDFBox site:
"PDFBox is an open source Java PDF library for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents. PDFBox also includes several command line utilities.

Features
PDF to text extraction
Merge PDF Documents
PDF Document Encryption/Decryption
Lucene Search Engine Integration
Fill in form data FDF and XFDF
Create a PDF from a text file
Create images from PDF pages
Print a PDF"


PDFBox - .NET Version
"Even though PDFBox is written in Java, there is also a .NET version that is available. It utilizes IKVM to create a fully functioning PDF library for the .NET framework. The released version contains a bin directory with all of the required DLL files. For the command line applications that are available in the Java version a native windows executable is also included. This page contains information that is specific to using the .NET version of PDFBox."

Some interesting stuff...

Related Past Post XRef:
Java Implementation for Mono/.Net (IKVM.NET)
iTextSharp - PDF Lib for .Net

1 comment:

Anonymous said...

Thanks for this. I got a solution. For a complete example (docx to pdf), this may help you: http://aspnettutorialonline.blogspot.com/2012/05/how-to-convert-docxword-to-pdf-in.html