How to get started with Hadoop and Hive
Install prerequisites to manage your cluster
| Prerequisite | Link |
|---|---|
| Install .NET Framework 4.5 | http://www.microsoft.com/en-gb/download/details.aspx?id=30653 |
| Install PowerShell 4 | http://www.microsoft.com/en-gb/download/details.aspx?id=40855 |
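To confirm both prerequisites are in place, something like the following can be run in a PowerShell window. This is a minimal sketch: `Test-DotNet45OrLater` is a hypothetical helper name, and the check relies on the `Release` registry value that .NET Framework 4.5 and later write (378389 corresponds to 4.5).

```powershell
# Hypothetical helper: .NET Framework 4.5 and later record a "Release" number
# in the registry; 378389 is the value for 4.5 itself.
function Test-DotNet45OrLater {
    param([int]$Release)
    return $Release -ge 378389
}

# PowerShell 4 reports a Major version of 4.
"PowerShell version: $($PSVersionTable.PSVersion)"

# Read the Release value; a missing key leaves $release empty, which counts as "not installed".
$key = 'HKLM:\SOFTWARE\Microsoft\NET Framework Setup\NDP\v4\Full'
$release = (Get-ItemProperty -Path $key -ErrorAction SilentlyContinue).Release
".NET 4.5 or later installed: $(Test-DotNet45OrLater -Release $release)"
```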
Log into Windows Azure account
Sign up using the free trial link at http://azure.microsoft.com/en-us/
Then click the portal link to manage your Azure services. You should end up with something like this menu on the side
Create a new Storage account
- Click on the storage link in the Azure left side menu
- Then click the new link at the bottom. This will prompt you with the below options to create a new storage account.
- Choose a unique name for your URL. If the tick box turns green, your account name is unique
- Choose Create Storage Account at the bottom
- Azure then starts creating your storage account. You may need to wait about 5 minutes for it to complete
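Before typing a name into the portal, its format can be checked locally. Azure storage account names must be 3 to 24 characters long and contain only lowercase letters and digits. The sketch below checks the format only; uniqueness is still decided by the portal. `Test-StorageAccountName` and the sample names are hypothetical.

```powershell
# Hypothetical helper: checks the naming rule for storage accounts
# (3-24 characters, lowercase letters and digits only).
function Test-StorageAccountName {
    param([string]$Name)
    # -cmatch is case-sensitive, so uppercase letters are rejected.
    return $Name -cmatch '^[a-z0-9]{3,24}$'
}

Test-StorageAccountName -Name 'myhadoopstore01'   # valid format
Test-StorageAccountName -Name 'My-Storage'        # invalid: uppercase and hyphen
```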
Create a new HDInsight cluster
- Click on the HDInsight link on the Azure left side menu
- Then click the new link at the bottom. This will prompt you with the options below to create a new Hadoop cluster
- Choose a unique name for your URL
- Choose 1 data node for the cluster size (unless you want to go crazy then be my guest)
- Select the storage you created in the above section

- Click Create HDInsight Cluster. This takes a while, especially the first time: anywhere between 5 and 40 minutes
Connecting to your Cluster
- When you click All Items in the top left menu, you should see something like this. Confirm your HDInsight cluster is running

- Open PowerShell ISE
- Run the following:

      Get-AzureSubscription
      Get-AzureHDInsightCluster

- Download the publish settings file to your local computer and keep note of the path
- Click the right arrow next to your HDInsight cluster

- Then choose Dashboard

- Take note of your subscription name and your cluster name
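The download and note-taking steps above can also be done from PowerShell. This is a sketch that assumes the same Azure PowerShell module as the rest of this post; `Get-AzurePublishSettingsFile` opens a browser window where you sign in and save the file, and the path placeholder is yours to fill in.

```powershell
# Opens the download page for the publish settings file in your browser.
Get-AzurePublishSettingsFile

# After saving the file, import it so the subscription is known to PowerShell.
Import-AzurePublishSettingsFile "<FULL_PATH_TO_PUBLISH_SETTINGS_FILE>"

# List the subscription names, then show the clusters in full so the name can be read off.
Get-AzureSubscription | Select-Object SubscriptionName
Get-AzureHDInsightCluster | Format-List
```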

Running Hive Queries against your Cluster
- Run a new script in PowerShell, replacing the placeholder values where necessary:

      Import-AzurePublishSettingsFile "<FULL_PATH_TO_PUBLISH_SETTINGS_FILE>"
      $subscriptionName = "<SUBSCRIPTION_NAME>"
      $clusterName = "<CLUSTER_NAME>"
      $queryString = "select country, state, count(*) as records from hivesampletable group by country, state order by records desc limit 5"
      Select-AzureSubscription -SubscriptionName $subscriptionName
      Use-AzureHDInsightCluster $clusterName
      Invoke-Hive -Query $queryString

  Here is an example I have used:

      Import-AzurePublishSettingsFile "C:\Powershell\Hadoop\jeremyking77Azure.publishsettings"
      $subscriptionName = "Visual Studio Professional with MSDN"
      $clusterName = "jeremyking77"
      $queryString = "select country, state, count(*) as records from hivesampletable group by country, state order by records desc limit 5"
      Select-AzureSubscription -SubscriptionName $subscriptionName
      Use-AzureHDInsightCluster $clusterName
      Invoke-Hive -Query $queryString

- You should get output like the following:

      Successfully connected to cluster jeremyking77
      Submitting Hive query..
      Started Hive query with jobDetails Id : job_1405933745625_0003
      Hive query completed Successfully
      United States    California       6881
      United States    Texas            6539
      United States    Illinois         5120
      United States    Georgia          4801
      United States    Massachusetts    4450
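If you want to work with the result rows rather than just read them, they can be turned into objects. The sketch below is an illustration, not part of the Azure module: `ConvertFrom-HiveRows` is a hypothetical helper, and it assumes the rows come back as tab-separated text with the three columns of the sample query. The sample text stands in for real output so the helper can be tried without a cluster.

```powershell
# Hypothetical helper: turn tab-separated result rows into objects.
# Assumes three columns (country, state, records), matching the sample query.
function ConvertFrom-HiveRows {
    param([string]$Text)

    foreach ($line in ($Text -split "`r?`n")) {
        $fields = $line -split "`t"
        # Skip status messages and blank lines, which do not have three fields.
        if ($fields.Count -ne 3) { continue }
        [pscustomobject]@{
            Country = $fields[0]
            State   = $fields[1]
            Records = [int]$fields[2]
        }
    }
}

# Sample text in the assumed format.
$sample = "Hive query completed Successfully`nUnited States`tCalifornia`t6881`nUnited States`tTexas`t6539"
$rows = ConvertFrom-HiveRows -Text $sample
$rows | Sort-Object Records -Descending | Format-Table -AutoSize
```

Against a real cluster, the same helper could be fed with something like `ConvertFrom-HiveRows -Text (Invoke-Hive -Query $queryString | Out-String)`, provided the output really is tab-separated.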

