How to get started with Hadoop and Hive
Install prerequisites to manage your cluster
Prereq | Link |
---|---|
Install .NET 4.5 | http://www.microsoft.com/en-gb/download/details.aspx?id=30653 |
Install PowerShell 4 | http://www.microsoft.com/en-gb/download/details.aspx?id=40855 |
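If you are not sure whether these are already installed, a quick check from a PowerShell prompt can save a download. This is only a sketch: it assumes a Windows machine and relies on the registry `Release` value that .NET 4.5 and later write (378389 or higher).

```powershell
# Report the running PowerShell version; this walkthrough expects 4.0 or later.
$psVersion = $PSVersionTable.PSVersion
"PowerShell version: $psVersion"
if ($psVersion.Major -lt 4) {
    Write-Warning "PowerShell 4 or later is expected; see the download link in the table above."
}

# .NET 4.5 and later store a Release value of 378389 or higher under this key (Windows only).
$ndpKey  = 'HKLM:\SOFTWARE\Microsoft\NET Framework Setup\NDP\v4\Full'
$release = (Get-ItemProperty -Path $ndpKey -Name Release -ErrorAction SilentlyContinue).Release
if ($release -ge 378389) {
    ".NET Framework 4.5 or later is installed (Release $release)"
} else {
    Write-Warning ".NET Framework 4.5 or later was not detected."
}
```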
Log into Windows Azure account
Sign up using the free trial link at http://azure.microsoft.com/en-us/
Then click the portal link to manage your Azure services. You should end up with a menu of services down the left side.
Create a new Storage account
- Click on the storage link in the Azure left side menu
- Then click the new link at the bottom. This will prompt you with the options below to create a new storage account
- Choose a unique name for your URL. If the tick box turns green, your account name is unique
- Choose Create Storage Account at the bottom
- This starts creating your storage account. You may need to wait about 5 minutes for it to complete
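The portal tells you whether a name is unique, but it can help to know the format rule up front: Azure storage account names must be 3 to 24 characters long and use only lowercase letters and digits. A minimal local check, assuming a hypothetical helper called `Test-StorageAccountNameFormat` (it is not part of any Azure module, and it cannot tell you whether the name is already taken):

```powershell
function Test-StorageAccountNameFormat {
    param([string]$Name)
    # -cmatch is case-sensitive, so uppercase letters are rejected.
    return $Name -cmatch '^[a-z0-9]{3,24}$'
}

# Sample inputs only; substitute the name you intend to use.
Test-StorageAccountNameFormat 'jeremyking77'   # True
Test-StorageAccountNameFormat 'Jeremy_King'    # False
```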
Create a new HDInsight cluster
- Click on the HDInsight link on the Azure left side menu
- Then click the new link at the bottom. This will prompt you with the options below to create a new Hadoop cluster
- Choose a unique name for your URL
- Choose 1 data node for the cluster size (unless you want to go crazy then be my guest)
- Select the storage you created in the above section
- Click Create HDInsight Cluster. This takes a while, especially the first time: anywhere between 5 and 40 minutes
Connecting to your Cluster
- When you click All Items in the top left menu, you should see a list of your services. Confirm your HDInsight cluster is running
- Open PowerShell ISE
- Run the following
    Get-AzureSubscription
    Get-AzureHDInsightCluster
- Download the publish settings file to your local computer and keep note of the path
- Click the right arrow next to your HDInsight cluster
- Then choose Dashboard
- Take note of your subscription name and your cluster name
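If any of the commands in this post fail with a "not recognized" error, the Azure cmdlets are probably not loaded in your session. They ship with the Azure PowerShell module, which is not listed among the prerequisites above. A small sketch that checks for the cmdlets used in this post, using only the built-in `Get-Command`:

```powershell
# Cmdlet names taken from the commands used in this walkthrough.
$required = @(
    'Get-AzureSubscription',
    'Get-AzureHDInsightCluster',
    'Import-AzurePublishSettingsFile',
    'Select-AzureSubscription',
    'Use-AzureHDInsightCluster',
    'Invoke-Hive'
)

# Keep only the names that Get-Command cannot resolve in this session.
$missing = @($required | Where-Object { -not (Get-Command -Name $_ -ErrorAction SilentlyContinue) })

if ($missing.Count -gt 0) {
    Write-Warning "Not available in this session: $($missing -join ', ')"
} else {
    "All cmdlets used in this walkthrough are available."
}
```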
Running Hive Queries against your Cluster
- Run a new script in PowerShell, replacing the placeholder values where necessary
    Import-AzurePublishSettingsFile "<FULL_PATH_TO_PUBLISH_SETTINGS_FILE>"
    $subscriptionName = "<SUBSCRIPTION_NAME>"
    $clusterName = "<CLUSTER_NAME>"
    $queryString = "select country, state, count(*) as records from hivesampletable group by country, state order by records desc limit 5"
    Select-AzureSubscription -SubscriptionName $subscriptionName
    Use-AzureHDInsightCluster $clusterName
    Invoke-Hive -Query $queryString
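It is easy to run the script above with one of the placeholders still in place. As a rough guard, something like the following could be run first. The `$settings` hashtable and its key names are my own for illustration; it uses only built-in PowerShell and does not contact Azure.

```powershell
# Same three settings as the script above; the values here are still the unedited placeholders.
$settings = @{
    PublishSettingsFile = "<FULL_PATH_TO_PUBLISH_SETTINGS_FILE>"
    SubscriptionName    = "<SUBSCRIPTION_NAME>"
    ClusterName         = "<CLUSTER_NAME>"
}

# Collect the names of settings whose value still looks like <PLACEHOLDER>.
$unfilled = @($settings.Keys | Where-Object { $settings[$_] -match '^<.*>$' } | Sort-Object)

if ($unfilled.Count -gt 0) {
    Write-Warning "Still set to placeholder text: $($unfilled -join ', ')"
} elseif (-not (Test-Path -LiteralPath $settings.PublishSettingsFile)) {
    Write-Warning "Publish settings file not found: $($settings.PublishSettingsFile)"
} else {
    "Settings look complete."
}
```

With the values left as shown, all three settings are reported as unfilled; once you paste in your own values, the check moves on to confirming the publish settings file exists.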
Here is an example I have used:
    Import-AzurePublishSettingsFile "C:\Powershell\Hadoop\jeremyking77Azure.publishsettings"
    $subscriptionName = "Visual Studio Professional with MSDN"
    $clusterName = "jeremyking77"
    $queryString = "select country, state, count(*) as records from hivesampletable group by country, state order by records desc limit 5"
    Select-AzureSubscription -SubscriptionName $subscriptionName
    Use-AzureHDInsightCluster $clusterName
    Invoke-Hive -Query $queryString
- You should get output like the following
Successfully connected to cluster jeremyking77
Submitting Hive query..
Started Hive query with jobDetails Id : job_1405933745625_0003
Hive query completed Successfully
United States California 6881
United States Texas 6539
United States Illinois 5120
United States Georgia 4801
United States Massachusetts 4450
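To see what the query is doing, it may help to run the same shape of operation locally: group rows by country and state, count each group, sort by the count descending, and keep the top 5. The sketch below does that in plain PowerShell on a handful of invented rows. The data is made up for illustration and is not taken from hivesampletable; only the column names come from the query.

```powershell
# Invented sample rows: three for California, two for Texas, one for Illinois.
$rows = @(
    [pscustomobject]@{ country = 'United States'; state = 'California' }
    [pscustomobject]@{ country = 'United States'; state = 'Texas' }
    [pscustomobject]@{ country = 'United States'; state = 'California' }
    [pscustomobject]@{ country = 'United States'; state = 'Illinois' }
    [pscustomobject]@{ country = 'United States'; state = 'Texas' }
    [pscustomobject]@{ country = 'United States'; state = 'California' }
)

# group by country, state -> count(*) as records -> order by records desc -> limit 5
$top = $rows |
    Group-Object -Property country, state |
    Sort-Object -Property Count -Descending |
    Select-Object -First 5 |
    ForEach-Object {
        [pscustomobject]@{
            country = $_.Group[0].country
            state   = $_.Group[0].state
            records = $_.Count
        }
    }

$top | Format-Table -AutoSize
```

On the cluster, Hive does the same grouping and counting across the whole table, which is why the real output above shows counts in the thousands.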