HaaS – Hadoop-as-a-Service


Choosing Hadoop to build a Big Data platform is becoming an easier decision by the day for today’s enterprises. As soon as technical due diligence indicates that going “data driven with Big Data” will make a positive impact (among other considerations), the first option that comes to mind is Hadoop.

Hadoop-as-a-Service (HaaS) generally refers to Hadoop in the cloud: a handy alternative to on-premise Hadoop deployments for organisations whose overwhelmed data centre administrators need to incorporate Hadoop but don’t have the resources to do so. So what is the natural decision cycle an enterprise has to go through?

Here comes the torment of choice:

[Figure: Screen Shot 2014-09-11 at 8.37.57 pm, showing the options referred to below]

The basic challenge of employing Hadoop is setting up and managing a cluster, and doing so efficiently and cost-effectively is largely out of reach for small and medium-sized enterprises. Running Hadoop in house is more economical for large-scale, data-driven companies like Yahoo or Facebook. As the figure above shows, smaller enterprises and startups will find it easier to get a data platform set up if they go with option 2 or 3. Beware, though: these two options may be limited, so check which use cases each offering covers. There are many HaaS providers, and here is a good article on how to choose the best HaaS offering for your requirements.

# Enter HaaS

In fact, options 2 and 3 aren’t very different: both take the Apache versions of Hadoop, customise them a bit to suit enterprise needs, and add bells and whistles (newer open source add-ons) that make it easier to provision a Hadoop cluster and deploy solutions on top. The major differences between options 2 and 3 are the level of abstraction offered to end users and, perhaps, the types of use cases supported.

For instance, Amazon Web Services Elastic MapReduce is an example of a basic platform service offering a rudimentary Hadoop ecosystem. The Qubole Data Service (QDS), by contrast, is an example of a comprehensive, software-like Hadoop service that provides a complete Hadoop stack with a front end. Such services come with a simple GUI through which users can play with a Hadoop flavour; which flavour you get depends on the HaaS solution you go with.

While there are many offerings under option 2, I tried the most accessible one, which is free to try (no credit card required): qubole.com. It comes with pay-as-you-go and bulk payment options.

# The experiment

  1. I signed up at Qubole.com and got myself an AWS free tier account (credit card required).
  2. In the AWS console, I used the IAM link to create a test user with the admin access policy. This gives you an AWS access key and secret.
  3. Qubole.com asks you to link your AWS account by entering your AWS key/secret for both S3 and EC2, along with the AWS region you are set up in and an S3 bucket name.
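
To make step 2 a little more concrete: in IAM's JSON policy language, an admin access policy boils down to a single statement that allows every action on every resource. The sketch below builds such a document as a plain Python dict and prints it. The function name `admin_policy_document` is hypothetical, and the policy the console actually attaches may be named or managed differently; treat this as an illustration of the shape, not as the exact policy used in the experiment.

```python
import json


def admin_policy_document():
    """Return an allow-everything IAM policy document as a plain dict.

    Hypothetical helper for illustration only; it does not call AWS.
    """
    return {
        "Version": "2012-10-17",  # current IAM policy language version
        "Statement": [
            # One statement: allow any action on any resource.
            {"Effect": "Allow", "Action": "*", "Resource": "*"}
        ],
    }


# Print the document as JSON, the form in which IAM policies are written.
print(json.dumps(admin_policy_document(), indent=2))
```

A policy this broad is convenient for a throwaway test user like the one above; for anything longer lived you would probably want to narrow it to the S3 and EC2 permissions the service needs.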

[Figure: Screen Shot 2014-09-11 at 8.10.28 pm]

 

That’s it, you are all set.

  • You can use the data wizard under the ‘Analyze’ section to create external Hive tables pointing to the files in your S3 buckets.
  • You can generate and run queries using the Smart Query wizard under the ‘SmartQuery’ section with a few clicks.
  • You can start a 2-node (default) cluster with the click of a button and schedule custom MapReduce jars with input data from S3.
  • As you can see from my homepage history, I have tried them all.
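
For readers curious what an external Hive table over S3 files looks like behind a wizard, here is a minimal sketch that assembles a HiveQL `CREATE EXTERNAL TABLE` statement. It is not what Qubole generates; the helper `external_table_ddl`, the table and column names, the bucket `my-example-bucket`, and the `s3://` scheme are all assumptions made for illustration, and it only covers delimited text files.

```python
def external_table_ddl(table, columns, s3_location, delimiter=","):
    """Build a HiveQL CREATE EXTERNAL TABLE statement for delimited text.

    columns is a list of (name, hive_type) pairs. s3_location should be
    a directory (prefix) holding the files, not a single file.
    """
    if not columns:
        raise ValueError("at least one column is required")
    # Render each column as "name TYPE", one per line.
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"  {cols}\n"
        ")\n"
        "ROW FORMAT DELIMITED\n"
        f"FIELDS TERMINATED BY '{delimiter}'\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{s3_location}';"
    )


# Hypothetical table over CSV files in a hypothetical bucket.
ddl = external_table_ddl(
    "page_views",
    [("user_id", "STRING"), ("views", "INT")],
    "s3://my-example-bucket/page_views/",
)
print(ddl)
```

Because the table is external, dropping it removes only the table definition in Hive; the files in the S3 bucket stay where they are.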