Friday, January 18, 2008

HTML Agility Pack (Think Managed HTML DOM Parser, with XPath)

I came across the HTML Agility Pack while researching an issue I was having using XmlReader to parse an HTML page...

The issue was that the HTML was not well formed (imagine that), so I needed a more forgiving HTML parser. I didn't want to fall back to the WebControl, COM or interop with MSHTML. I just wanted an "HTMLReader" or a simple HTML DOM parser.

That's what led me to the HTML Agility Pack...

CodePlex - HTML Agility Pack

"...

Now, erhh... what is exactly the Html Agility Pack? All right, all right, I will tell you know:

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
Sample applications:

  • Page fixing or generation. You can fix a page the way you want, modify the DOM, add nodes, copy nodes, well... you name it.
  • Web scanners. You can easily get to img/src or a/hrefs with a bunch XPATH queries.
  • Web scrapers. You can easily scrap any existing web page into an RSS feed for example, with just an XSLT file serving as the binding. An example of this is provided.


There is no dependency on anything else than .Net's XPATH implementation. There is no dependency on Internet Explorer's MSHTML dll or W3C's HTML tidy or ActiveX / COM object, or anything like that. There is also no adherence to XHTML or XML, although you can actually produce XML using the tool. ..."

With very little work I was able to rip out XmlReader and replace it with the HTML Agility Pack. In the end, it let me write better code in fewer lines, and so far it's worked like a charm.
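For a sense of what that kind of code looks like, here is a minimal sketch (not the actual code from this project). It assumes the HtmlAgilityPack assembly is referenced; the sample markup and the LinkLister class name are made up for illustration. It loads a fragment with unclosed tags, which a strict XML parser would reject, and pulls out the links with a single XPath query:

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack assembly is referenced

// Hypothetical example class; the name is illustrative only.
class LinkLister
{
    static void Main()
    {
        // Made-up sample markup, deliberately not well formed:
        // the <p> and <b> elements are never closed.
        string html =
            "<html><body><p>Links: <a href=\"/one\">One</a>" +
            "<p><b>and <a href=\"/two\">Two</a></body></html>";

        // Build the DOM from the string; no well-formedness is required.
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null)
        {
            Console.WriteLine("No links found.");
            return;
        }

        foreach (HtmlNode link in links)
        {
            // GetAttributeValue falls back to the supplied default if the attribute is missing.
            string href = link.GetAttributeValue("href", string.Empty);
            Console.WriteLine(link.InnerText + " -> " + href);
        }
    }
}
```

The null check is worth keeping: unlike System.Xml's SelectNodes, the Agility Pack version returns null when the query matches nothing.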
