Monday, April 16, 2012

Get that off the screen... Using HAP (HTML Agility Pack) on WP7 to webscrape

Timdams's Blog - Writing a WP7 website scraper application

"In this tutorial I will explain how you write a WP7 application using the HtmlAgility Pack in order to use information scraped from a website.

Website scraping is the act of retrieving information from a website page. An act by some considered stealing, by others borrowing. Let’s leave that debate to the others. In this post I will show how easy it is to scrape content from a website so that you can (re)use it in your Windows Phone 7 application. As it is, this information will for the most part also work in other, non-WP7 projects, of course.
Sometimes website scraping is the only means available to consume certain information from a website. If the website doesn’t have a publicly available API or web service you can use, you’re pretty much left with scraping, whether you like it or not.

Now before reading on, it is extremely important to understand that there are legal issues concerning scraping: basically, as far as I understand it, you’re only allowed to use scraped data if you have clearance to do so by the website owner (i.e. the one that ‘owns’ the data)."



HAP is one of my favorite HTML utilities, one that I use all too often (sometimes when I don't even need to; when I think HTML parsing, I just naturally jump to HAP). And while webscraping can be very fragile, sometimes it's all you can do to get the data you need, and HAP makes it almost too easy. If you are parsing HTML, webscraping, etc., and you're not using HAP, you should give it a look sooner rather than later.


Related Past Post XRef:
HTML Agility Pack gets LINQ’able (and more) in the new v1.4 release
HTML Agility Pack (Think Managed HTML DOM Parser, with XPath)
.NET Html Agility Pack
