Sunday, April 27, 2008

Linq to HTML via Linq to XML

Beth Massi - Sharing the goodness that is VB - Querying HTML with LINQ to XML

"Often times we need to parse HTML for data. Sure in a perfect world everything would have a nice service or API wrapped around it but as we all know this is not always the case. Many times we're left with parsing files or "screen scraping" to get the data we need from other applications. Sure this is brittle, but sometimes it's the best we can do. And sometimes you're just trying to get the data once so "good enough" is really good enough.

I was faced with that challenge myself this week. Yes even here not all systems expose services or if they do, finding the documentation or person to consult would take longer than writing a simple program. ;-) At the core all I needed to do was query a couple pieces of data from a bunch of web pages. This seemed like the perfect opportunity to use LINQ to XML because the structure of the page was pretty well formed HTML. However there were a couple tricks to figure out mainly because LINQ to XML doesn't support HTML entities. It only supports character entities and the built in XML entities (&lt; &gt; &quot; &amp; &apos;).

...

And that's it. For this simple utility this is good enough for me and took me about 15 minutes to program using LINQ. The trick to loading the HTML document into an XElement is to remove all the unsupported HTML entity references first.

Enjoy!"

This is a cool hack/trick... The challenge, as Beth points out, is dealing with malformed HTML and reserved XML characters. But even there, she provides a method of dealing with the XML character side of the problem.
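
To make the entity trick concrete, here is a rough sketch of the same idea. It is not Beth's code: it is Python rather than VB, it uses xml.etree.ElementTree as a stand-in for LINQ to XML, and instead of stripping the HTML-only entities it swaps the ones it recognizes for numeric character references (unknown ones are dropped). The sample page, the class names, and the helper name replace_html_entities are all made up for illustration, and the approach only works when the page is otherwise well-formed.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities XML predefines; a plain XML parser already accepts these.
XML_PREDEFINED = {"lt", "gt", "amp", "quot", "apos"}

# Matches named entity references such as &nbsp; or &eacute;.
_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def replace_html_entities(markup: str) -> str:
    """Rewrite HTML-only named entities so an XML parser will accept them.

    Hypothetical helper: known HTML entities become numeric character
    references (which XML supports); unknown names are removed.
    """
    def swap(match):
        name = match.group(1)
        if name in XML_PREDEFINED:
            return match.group(0)      # leave XML's own entities alone
        codepoint = name2codepoint.get(name)
        if codepoint is None:
            return ""                  # unknown entity: drop it
        return f"&#{codepoint};"
    return _ENTITY.sub(swap, markup)


# Hypothetical, well-formed sample page (not from the original post).
SAMPLE_PAGE = """<html><body>
<table>
<tr><td class="name">Caf&eacute;&nbsp;One</td><td class="price">&pound;3</td></tr>
<tr><td class="name">Fish &amp; Chips</td><td class="price">&pound;7</td></tr>
</table>
</body></html>"""

# Clean the entities first, then parse as XML and query the tree.
root = ET.fromstring(replace_html_entities(SAMPLE_PAGE))
names = [td.text for td in root.iter("td") if td.get("class") == "name"]
```

Real-world pages with unclosed tags or unquoted attributes will still fail to parse this way, which is exactly the malformed-HTML half of the problem.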

The next time I need a quick and dirty way to do LINQ to HTML, I'm going to give this a go...

1 comment:

keith said...

Hi, you can get a full LINQ to HTML library based on LINQ to XML but with tag-soup parsing here:

http://www.justagile.com/linq-to-html.aspx