Monday, March 20, 2006

"Simple Crawler using C# sockets"

The Code Project - Simple Crawler using C# sockets

"Introduction

A web crawler (also known as a web spider or ant) is a program that browses the World Wide Web in a methodical, automated manner. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a web site, such as checking links or validating HTML code. Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam).

 
Crawler Overview

In this article I introduce a simple Web crawler with a simple interface, to describe the crawling story in a straightforward C# program. My crawler borrows the input interface of an ordinary Internet browser to keep things familiar: the user just enters the URL to be crawled in the navigation bar and clicks "Go".

The crawler has a URL queue that is equivalent to the URL server in any large-scale search engine. The crawler uses multiple threads to fetch URLs from the queue, and the retrieved pages are then saved in the storage area, as in the previous figure.

The fetched URLs are requested from the Web using the C# sockets library, to avoid blocking inside any of the higher-level C# libraries. Retrieved pages are parsed to extract new URL references, which are put back into the crawler queue, up to a certain depth defined in the settings.

In the next sections I will describe the program views and discuss some technical points related to the interface.
 
..."

Sounds like something I might be able to use...

One of my personal projects is to write a Blogger Backup/Export utility. The problem is that all the Blogger APIs and components I’ve found have a limit of 100 or so posts. That kind of hobbles me, since I have 1,695 posts. So I’ve resigned myself to having to crawl my blog and manually extract the data.

This Code Project sounds like something that will help me a good bit on my way...


4 comments:

Thomas said...

G'day Greg,

When you do complete this project, let me know - I've been looking forward to saving my Blogger posts in some way* for a while now.

Cheers,

Thomas

* My ideal would be a diary-type format, printable, e.g. "Williams World Blog - the first 3 years", just so if Blogger goes down, I don't lose all my thoughts and ramblings :-)

Greg said...

Sure thing...

There's no ETA or anything, but like you, I'd like an easily usable/browsable blog backup/archive sooner rather than later...

Anonymous said...

I actually went through an export process with Blogger. What I did was insert comments like [!--beginTitle--] and [!--endTitle--] for all the things I wanted to extract, and then I simply republished. I then took those static files and, using the language of my choice (PHP), extracted the lot to SQL statements for import into my new WordPress install. The key is to know what metadata to grab out, and since my early posts had no titles, I had to sort of "fake" the titles for WordPress.
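
(The extraction step is really just "grab whatever sits between each begin/end marker". As a very rough sketch, in C# rather than the PHP I actually used, and with made-up file and column names, it amounts to something like:)

    // Rough sketch of the "grab what's between the markers" step.
    // Marker names match the [!--beginTitle--]/[!--endTitle--] idea above;
    // the file name and SQL columns are made up for illustration.
    using System;
    using System.IO;
    using System.Text.RegularExpressions;

    class MarkerExtract
    {
        static string Between(string html, string name)
        {
            Match m = Regex.Match(html,
                Regex.Escape("[!--begin" + name + "--]") + "(.*?)" + Regex.Escape("[!--end" + name + "--]"),
                RegexOptions.Singleline);
            return m.Success ? m.Groups[1].Value.Trim() : "";
        }

        static void Main()
        {
            string html = File.ReadAllText("post0001.html");   // one republished static page
            string title = Between(html, "Title");
            string body = Between(html, "Body");
            // From here it's just formatting one INSERT statement per post.
            Console.WriteLine("INSERT INTO posts (title, body) VALUES ('{0}', '{1}');",
                title.Replace("'", "''"), body.Replace("'", "''"));
        }
    }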

Greg said...

Cool...

I'd really like to try to avoid template changes (even low-impact ones like the approach you used).

I'm thinking of trying to use the different CSS class names to help parse out the contents.

For example, look for the h3 tag with the CSS class of "post-title" and then grab the innerHTML as the title; then the div with the "post-body" class, and so on.
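
Something like the rough sketch below is what I have in mind. The class names ("post-title", "post-body") are just what the standard Blogger templates appear to use, the URL is made up, and the regexes are a placeholder for whatever real HTML parsing ends up being needed:

    // Quick sketch of the CSS-class idea: pull titles out of <h3 class="post-title">
    // and bodies out of <div class="post-body">.
    using System;
    using System.Net;
    using System.Text.RegularExpressions;

    class BloggerScrape
    {
        static void Main()
        {
            string html;
            using (WebClient client = new WebClient())
                html = client.DownloadString("http://someblog.blogspot.com/");   // made-up URL

            // Titles: <h3 class="post-title">...</h3>
            foreach (Match m in Regex.Matches(html,
                @"<h3[^>]*class=""post-title""[^>]*>(.*?)</h3>",
                RegexOptions.Singleline | RegexOptions.IgnoreCase))
            {
                Console.WriteLine("Title: " + m.Groups[1].Value.Trim());
            }

            // Bodies: <div class="post-body">...</div>
            // Naive: nested divs inside a post body would break this pattern.
            foreach (Match m in Regex.Matches(html,
                @"<div[^>]*class=""post-body""[^>]*>(.*?)</div>",
                RegexOptions.Singleline | RegexOptions.IgnoreCase))
            {
                Console.WriteLine("Body length: " + m.Groups[1].Value.Trim().Length);
            }
        }
    }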

At least that's my plan....

It probably won't work for those with highly modified templates, but for the average Blogger blog I hope to be able to point the utility at a Blogger URL, select an output format (PDF, HTML, DB, BlogML, etc.), and click Go...

Grand plans and all that... ;)