Sunday, May 09, 2010

HTML Agility Pack gets LINQ’able (and more) in the new v1.4 release

Html Agility Pack - 1.4.0 Stable

“…

Release Notes

1.4.0 adds some serious new features to Html Agility Pack to make it work nicer in a LINQ-driven .NET world. The HtmlNodeCollection and HtmlAttributeCollection are now generic ILists and expose IEnumerable<T> methods to mimic LINQ to XML. This opens an alternative to XPath for querying the HTML tree. Beyond this, 1.4.0 introduces tons of code cleanups and removes all the old non-generic classes (no more ArrayLists :).

1.4.0 also brings basic MSDN-like documentation and a new program called HAP Explorer for viewing the HTML tree.

Changes from Beta 2.

  • The biggest changes are better support for character encoding and support for medium trust environments.
  • Removed DescendantNodes() function since it was identical to the Descendants() function.
  • Patch# 4706. Added UserAgent property to HtmlWeb class to be used in web requests. Minor update to code supplied by radicull.
  • Patch# 4432. Applied HtmlEntity decoding of Unicode HTML entities supplied by tsai.
  • Patch# 4396. Applied UTF-8 changes from JudahGabriel.
  • Applied JonGalloway's HAPExplorer patch.
  • Added Visual Studio 2010 Beta 2 Solution
  • Fixed compatibility in Medium Trust environments. Added a default list of extensions and content types to be used when the registry is not available.
  • Updated charset detection to use a Dictionary<string,string> instead of ArrayLists of NameValuePair.
  • Added search tag support in HAPExplorer.

…”

CodePlex - Html Agility Pack

What exactly is the Html Agility Pack (HAP)?
This is an agile HTML parser that builds a read/write DOM and supports plain XPath or XSLT (you don't actually HAVE to understand XPath or XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to the one System.Xml offers, but for HTML documents (or streams).

Html Agility Pack now supports LINQ to Objects (via a LINQ to XML-like interface). Check out the new beta to play with this feature.

Sample applications:

  • Page fixing or generation. You can fix a page the way you want, modify the DOM, add nodes, copy nodes, well... you name it.
  • Web scanners. You can easily get to img/src or a/hrefs with a bunch of XPath queries.
  • Web scrapers. You can easily scrape any existing web page into an RSS feed, for example, with just an XSLT file serving as the binding. An example of this is provided.

There is no dependency on anything other than .NET's XPath implementation. There is no dependency on Internet Explorer's MSHTML DLL, W3C's HTML Tidy, ActiveX / COM objects, or anything like that. There is also no adherence to XHTML or XML, although you can actually produce XML using the tool. The version posted here on CodePlex is for the .NET Framework 2.0. If you need the old version, please go to the old page or drop me a note.

…”

I’ve used this project for a number of years and it’s saved me countless hours… If you’re doing HTML parsing, you owe it to yourself to check out this project.
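To give a flavor of the LINQ-style querying the release notes mention, here's a minimal sketch that pulls the href values out of a page two ways: once with Descendants() plus LINQ, once with the older XPath route. It assumes HtmlAgilityPack 1.4 or later is referenced; the sample markup and variable names are made up purely for illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class LinkLister
{
    static void Main()
    {
        // Hypothetical sample markup (note the unclosed <p>); not taken from any real page.
        string html =
            "<html><body><p>Links:" +
            "<a href='http://example.com/a'>A</a>" +
            "<a name='anchor'>no href here</a>" +
            "<a href='/b'>B</a>" +
            "</body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // LINQ-style: walk all <a> descendants and keep those that have an href attribute.
        var linqHrefs = doc.DocumentNode
            .Descendants("a")
            .Where(a => a.Attributes["href"] != null)
            .Select(a => a.Attributes["href"].Value)
            .ToList();

        foreach (string href in linqHrefs)
        {
            Console.WriteLine("LINQ:  " + href);
        }

        // XPath-style: the same query expressed as an XPath expression.
        // SelectNodes returns null (not an empty collection) when nothing matches.
        var xpathNodes = doc.DocumentNode.SelectNodes("//a[@href]");
        if (xpathNodes != null)
        {
            foreach (HtmlNode node in xpathNodes)
            {
                Console.WriteLine("XPath: " + node.Attributes["href"].Value);
            }
        }
    }
}
```

Both loops should list the same links; which style reads better is mostly a matter of taste, though the LINQ version composes more naturally with the rest of a LINQ to Objects pipeline.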

(via J-Maxx Net - Html Agility Pack 1.4.0 Released)

 

Related Past Post XRef:
HTML Agility Pack (Think Managed HTML DOM Parser, with XPath)
.NET Html Agility Pack
