Wednesday, December 14, 2011

Three Apache Hadoop On Windows TechNet Wiki Articles - Home page to FAQ to FTP...

TechNet Articles - Apache Hadoop On Windows

This article contains links to information about using Apache Hadoop on Windows, or with other Microsoft technologies. It also provides a brief overview of Hadoop as well as overview information for the Hadoop offerings provided by Microsoft.

Table of Contents

Topics:
  • Hadoop Overview
  • Hadoop on Windows Overview
  • Apache Hadoop on Windows Server
  • Apache Hadoop on Windows Azure
  • Elastic Map Reduce on Windows Azure
  • Learning Hadoop
  • Hadoop on Windows
  • Hadoop Best Practices
  • Managing Hadoop
  • Developing with Hadoop
  • Using Hadoop with other BI Technologies

Content Types:
  • How To
  • Code Examples
  • Videos
  • Audio


Hadoop Overview

Apache Hadoop is an open source software framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It consists of two primary components: the Hadoop Distributed File System (HDFS), a reliable distributed data store, and MapReduce, a parallel and distributed processing system.

HDFS is the primary distributed storage used by Hadoop applications. As you load data into a Hadoop cluster, HDFS splits the data into blocks, creates multiple replicas of each block, and distributes them across the nodes of the cluster to enable reliable and extremely rapid computation.
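As a toy illustration (not Hadoop's actual implementation), the split-and-replicate idea can be sketched in Python; the tiny 8-byte block size and round-robin replica placement here are simplifying assumptions, since real HDFS uses blocks of 64 MB or more and rack-aware placement:

```python
# Sketch of the HDFS idea: split a file into fixed-size blocks and
# assign each block to several distinct nodes. Block size and the
# round-robin placement policy are toy assumptions for illustration.

def split_into_blocks(data: bytes, block_size: int = 8):
    """Split a byte string into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, nodes, replication: int = 3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"a quick brown fox jumps over it", 8)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
```

Because every block lives on several nodes, the loss of any single node costs no data, and computations can be scheduled on whichever replica is closest.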

Hadoop MapReduce is a software framework for writing applications that rapidly process vast amounts of data in parallel on a large cluster of compute nodes. A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.
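The map/sort/reduce flow described above can be sketched in plain Python with the canonical word-count example. This is a local simulation of the data flow, not the Hadoop API; real jobs implement Mapper and Reducer classes and run distributed across a cluster:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in an input line."""
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    """Reduce: sum all counts emitted for one word."""
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle/sort: collect and sort intermediate pairs by key,
# as the framework does between the map and reduce phases.
intermediate = sorted(pair for line in lines for pair in map_phase(line))

result = dict(
    reduce_phase(word, (count for _, count in group))
    for word, group in groupby(intermediate, key=itemgetter(0))
)
# result: {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```

The framework's real value is everything this sketch omits: distributing the map tasks across nodes, moving intermediate data between machines, and re-running any task that fails.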

Some of the main advantages of Hadoop are that it can quickly and efficiently process vast amounts of data (hundreds of terabytes, even petabytes), handle both structured and unstructured data, perform the processing where the data resides rather than moving the data to the processing, and detect and handle failures by design.

There are two other technologies that are related to Hadoop: Hive and Pig. Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems such as HDFS. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.


Hadoop on Windows

The links in this section provide information on deploying Apache Hadoop to Microsoft Windows Platforms.

  • Getting Started with Apache Hadoop for Windows: An overview of the Getting Started Guides currently available.
  • Getting Started Deploying an On-Premise Apache Hadoop Cluster: A walkthrough for deploying Apache Hadoop to a set of servers that you manage.
  • Getting Started with the Windows Azure Deployment of Apache Hadoop for Windows: A walkthrough for deploying Apache Hadoop compute instances on your Windows Azure subscription.
  • Getting Started using a Windows Azure Deployment of Hadoop on the Elastic Map Reduce Portal: A walkthrough for provisioning a temporary Apache Hadoop cluster using the Elastic Map Reduce (EMR) Portal.


TechNet Articles - Apache Hadoop Based Services for Windows Azure How To and FAQ Guide

This content is a work in progress for the benefit of the Hadoop Community.

Please feel free to contribute to this wiki page based on your expertise and experience with Hadoop.

To ask questions, please use the Yahoo Group.


  1. Setup your Hadoop on Azure cluster
  2. How to run a job on Hadoop on Azure
  3. Interactive Console
    1. Tasks with the Interactive JavaScript Console
      • How to run Pig Latin jobs from the Interactive JavaScript Console
      • How to create and run a JavaScript Map Reduce Job
    2. Tasks with Hive on the Interactive Console
  4. Remote Desktop
    1. Using the Hadoop command shell
    2. View the Job Tracker
    3. View HDFS
  5. Open Ports
    1. How to connect Excel to Hadoop on Azure via HiveODBC
    2. How to FTP data to Hadoop on Azure
  6. Manage Data
    1. Import Data From DataMarket
    2. Setup ASV - use your Windows Azure Blob Store account
    3. Setup S3 - use your Amazon S3 account


TechNet Articles - How To FTP Data To Hadoop on Windows Azure


The Apache Hadoop distribution for Windows includes an FTP server that operates directly on the Hadoop Distributed File System (HDFS). The FTPS protocol is used for secure transfers. FTP communication is wire-efficient and especially suited for transferring large data sets. The steps below describe how to use the FTP server.

  1. Log into the portal on .
  2. Click the Open Ports tile to access the FTP server port configuration.
  3. ...
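Once the port is open, the upload itself can be scripted with Python's standard ftplib, which supports FTPS via FTP_TLS. This is only a sketch: the host, port, and credentials are placeholders (the actual values come from your cluster's portal), and the function is not tied to any specific Hadoop release:

```python
from ftplib import FTP_TLS

def upload_to_hdfs_ftp(host, port, user, password, local_path, remote_path):
    """Upload a local file over FTPS to an FTP server fronting HDFS.

    Sketch only: host, port, and credentials are placeholders taken
    from the cluster portal's Open Ports configuration.
    """
    ftps = FTP_TLS()
    ftps.connect(host, port)
    ftps.login(user, password)
    ftps.prot_p()  # switch the data channel to TLS as well
    with open(local_path, "rb") as f:
        ftps.storbinary("STOR " + remote_path, f)
    ftps.quit()
```

The prot_p() call matters: without it the control channel is encrypted but the file data itself travels in the clear.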


While I'm not Hadoop'ing yet, when I saw these I knew I wanted to grab them for future reference....


Related Past Post XRef:
Big day for Big Data... Hadoop coming to Windows Azure, Windows Server and SQL Server
Hadoop for the SQL Server DBA...
Do you Hadoop? Angel has your links, news and resources round-up...
Microsoft SQL Server Connector for Apache Hadoop RTW

1 comment:

Kiran Kumar said...

Thanks for detailed article on Hadoop.
Hadoop is a framework made up of several components, such as MapReduce, HDFS, HBase, and Hive.
HDFS stores data blocks as files on the cluster nodes; there are no tables or columns in HDFS.
MapReduce provides powerful parallel processing of the data located on the clustered nodes.
Hive is a data warehousing tool and SQL wrapper for processing large amounts of data; it can be used for OLAP processing.
HBase is a database on top of HDFS; it can be used for real-time, i.e. OLTP, processing.

Please click Why Hadoop is introduced to learn more about the basics of Hadoop and its different subcomponents.