Converting PDF to Text in C# - The Code Project - C# Programming
"How to parse PDF files?
When extending the indexing solution for an intranet built using the DotLucene fulltext search library I decided to add support for PDF files. But DotLucene can only handle plain text so the PDF files needed to be converted.
After hours of googling I found a reasonable solution which uses 'pure' .NET - at least there are no other dependencies than a few assemblies of IKVM.NET. Before we start with the solution let's take a look at the other ways I tried..."
This is a cool example of extracting text from PDFs using PDFBox. I like how the author talks about the other methods he tried, Adobe PDF IFilter and iTextSharp, before ending up with PDFBox.
This is the first time I've seen PDFBox. It looks pretty cool, and like something I might be able to use...
From the PDFBox site:
"PDFBox is an open source Java PDF library for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents. PDFBox also includes several command line utilities.
Features
PDF to text extraction
Merge PDF Documents
PDF Document Encryption/Decryption
Lucene Search Engine Integration
Fill in form data FDF and XFDF
Create a PDF from a text file
Create images from PDF pages
Print a PDF"
PDFBox - .NET Version
"Even though PDFBox is written in Java, there is also a .NET version that is available. It utilizes IKVM to create a fully functioning PDF library for the .NET framework. The released version contains a bin directory with all of the required DLL files. For the command line applications that are available in the Java version a native windows executable is also included. This page contains information that is specific to using the .NET version of PDFBox. "
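To give a feel for what the extraction looks like from C#, here is a minimal sketch. It assumes the 0.7.x-era .NET build of PDFBox, where the Java packages show up as the `org.pdfbox.pdmodel` and `org.pdfbox.util` namespaces, and it assumes the project references the PDFBox and IKVM assemblies from the release's bin directory. The `PdfToText` class and `ExtractText` method are names I made up for illustration; later PDFBox releases use different package names, so treat this as a sketch rather than the definitive API.

```csharp
using System;
// Assumption: PDFBox 0.7.x package names, exposed as .NET namespaces by IKVM.
using org.pdfbox.pdmodel;
using org.pdfbox.util;

// Hypothetical wrapper class, not part of PDFBox.
class PdfToText
{
    // Loads the PDF at 'path' and returns its text content as one string.
    static string ExtractText(string path)
    {
        PDDocument doc = PDDocument.load(path);
        try
        {
            // PDFTextStripper walks the pages and collects the text.
            PDFTextStripper stripper = new PDFTextStripper();
            return stripper.getText(doc);
        }
        finally
        {
            // Release the underlying file handle even if extraction fails.
            doc.close();
        }
    }

    static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.Error.WriteLine("usage: PdfToText <file.pdf>");
            return;
        }
        Console.WriteLine(ExtractText(args[0]));
    }
}
```

The method names are lowercase (`load`, `getText`, `close`) because they come straight from the Java library. The string that comes back is plain text, which is the form DotLucene needs for indexing.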
Some interesting stuff...
Related Past Post XRef:
Java Implementation for Mono/.Net (IKVM.Net)
iTextSharp - PDF Lib for .Net
Friday, December 02, 2005
1 comment:
Thanks for this. I got a solution. For a complete example (docx to PDF), this may help you: http://aspnettutorialonline.blogspot.com/2012/05/how-to-convert-docxword-to-pdf-in.html