This article contains links to information about using Apache Hadoop on Windows, or with other Microsoft technologies. It also provides a brief overview of Hadoop itself and of the Hadoop offerings provided by Microsoft.
Topics:
- Hadoop Overview
- Hadoop on Windows Overview
- Apache Hadoop on Windows Server
- Apache Hadoop on Windows Azure
- Elastic Map Reduce on Windows Azure
- Learning Hadoop
- Hadoop on Windows
- Hadoop Best Practices
- Managing Hadoop
- Developing with Hadoop
- Using Hadoop with other BI Technologies

Content Types:
- How To
- Code Examples
- Videos
- Audio
- General
Apache Hadoop is an open source software framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It consists of two primary components: the Hadoop Distributed File System (HDFS), a reliable, distributed data store, and MapReduce, a parallel, distributed processing system.
HDFS is the primary distributed storage used by Hadoop applications. As you load data into a Hadoop cluster, HDFS splits the data into blocks, creates multiple replicas of each block, and distributes the replicas across the nodes of the cluster to enable reliable and extremely rapid computations.
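To make the block-and-replica arithmetic concrete, here is a minimal sketch. It assumes the classic 64 MB default block size and replication factor of 3 (both are configurable per cluster, so treat these constants as illustrative):

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # assumed classic HDFS default block size (64 MB)
REPLICATION = 3                # assumed default replication factor

def hdfs_layout(file_size_bytes):
    """Return (number of blocks, total block replicas) for a file of the given size."""
    blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    return blocks, blocks * REPLICATION

blocks, replicas = hdfs_layout(1 * 1024**3)  # a 1 GB file
# -> 16 blocks, stored as 48 replicas spread across the cluster's nodes
```

Because each block lives on several nodes, a computation can be scheduled on whichever node already holds a local copy, and the loss of a single node does not lose data.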
Hadoop MapReduce is a software framework for writing applications that rapidly process vast amounts of data in parallel on a large cluster of compute nodes. A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
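The map/sort/reduce flow described above can be sketched in a few lines of plain Python. This is a toy word count, not Hadoop code: the real framework runs the map and reduce functions on many nodes and handles the shuffle itself, but the data flow is the same.

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    # Map task: emit a (word, 1) pair for every word in its input split.
    for word in record.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # The framework sorts map output and groups all values by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reduce_phase(key, values):
    # Reduce task: combine all values for one key (here, sum the counts).
    return (key, sum(values))

splits = ["the quick brown fox", "the lazy dog"]      # two independent input chunks
mapped = chain.from_iterable(map_phase(s) for s in splits)
result = dict(reduce_phase(k, v) for k, v in shuffle(mapped))
# result["the"] == 2; every other word appears once
```

Each map task sees only its own split, and each reduce task sees all values for a key, which is what lets both phases run in parallel across a cluster.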
Some of the main advantages of Hadoop are that it can:
- Process vast amounts of data, from hundreds of terabytes to petabytes, quickly and efficiently.
- Process both structured and unstructured data.
- Perform the processing where the data resides, rather than moving the data to the processing.
- Detect and handle failures by design.
There are two other technologies that are related to Hadoop: Hive and Pig. Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems such as HDFS. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
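To give a feel for the query shape, the example below runs a HiveQL-style aggregation against an in-memory SQLite database. This is purely illustrative: the table name and columns are invented for the example, and real HiveQL syntax differs in places (table creation, data loading, and some functions), but a simple SELECT ... GROUP BY like this one is valid in both dialects. In Hive, the same query would be compiled into one or more MapReduce jobs.

```python
import sqlite3

# Stand-in data set; in Hive this table would project structure onto
# files already sitting in HDFS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("a", "/home"), ("b", "/home"), ("a", "/docs")],
)

# A SQL-like aggregation of the kind HiveQL is designed for.
rows = conn.execute(
    "SELECT url, COUNT(*) AS hits FROM page_views GROUP BY url ORDER BY hits DESC"
).fetchall()
# rows -> [('/home', 2), ('/docs', 1)]
```

The point of Hive is exactly this: analysts can express the aggregation declaratively instead of writing the equivalent map and reduce functions by hand.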
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
The links in this section provide information on deploying Apache Hadoop to Microsoft Windows platforms.
- Getting Started with Apache Hadoop for Windows: An overview of the Getting Started guides currently available.
- Getting Started Deploying an On-Premise Apache Hadoop Cluster: A walkthrough for deploying Apache Hadoop to a set of servers that you manage.
- Getting Started with the Windows Azure Deployment of Apache Hadoop for Windows: A walkthrough for deploying Apache Hadoop compute instances on your Windows Azure subscription.
- Getting Started using a Windows Azure Deployment of Hadoop on the Elastic Map Reduce Portal: A walkthrough for provisioning a temporary Apache Hadoop cluster using the Elastic Map Reduce (EMR) Portal.
This content is a work in progress for the benefit of the Hadoop Community.
Please feel free to contribute to this wiki page based on your expertise and experience with Hadoop.
To ask questions, please use the Yahoo! Group: http://tech.groups.yahoo.com/group/hadooponazurectp/
- Set up your Hadoop on Azure cluster
- How to run a job on Hadoop on Azure
- Interactive Console
- Tasks with Hive on the Interactive Console
- Remote Desktop
- Using the Hadoop command shell
- View the Job Tracker
- View HDFS
- Open Ports
- Manage Data
- Import data from DataMarket
- Set up ASV - use your Windows Azure Blob Storage account
- Set up S3 - use your Amazon S3 account
How To FTP Data To Hadoop on Windows Azure
The Apache Hadoop distribution for Windows includes an FTP server that operates directly on the Hadoop Distributed File System (HDFS). The FTPS protocol is used for secure transfers. FTP communication is wire-efficient and especially suited to transferring large data sets. The steps below describe how to use the FTP server.
- Log into the portal at http://www.hadooponazure.com/.
- Click the Open Ports tile to access the FTP server port configuration.
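Once the port is open, a client that speaks FTPS can push files straight into HDFS. Below is a minimal sketch using Python's standard-library `ftplib`. The host, port, user name, password, and remote directory are placeholders (assumptions), not values from the portal; substitute the ones shown for your own cluster.

```python
import os
from ftplib import FTP_TLS

def hdfs_remote_path(remote_dir, local_path):
    # Target path in HDFS: the remote directory plus the local file's name.
    return remote_dir.rstrip("/") + "/" + os.path.basename(local_path)

def upload_over_ftps(host, port, user, password, local_path, remote_dir="/user/demo"):
    """Upload local_path to the HDFS-backed FTPS server; return the remote path.

    All connection parameters are placeholders -- use the host, port, and
    credentials configured for your cluster in the portal.
    """
    remote_path = hdfs_remote_path(remote_dir, local_path)
    ftps = FTP_TLS()
    ftps.connect(host, port)
    ftps.login(user, password)
    ftps.prot_p()  # encrypt the data channel as well as the control channel
    with open(local_path, "rb") as f:
        ftps.storbinary("STOR " + remote_path, f)
    ftps.quit()
    return remote_path
```

Calling `prot_p()` matters: without it, `ftplib` encrypts only the control channel, while the file contents would travel in the clear.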