Big Data Ingestion

In a previous post I discussed choosing the right data ingestion technique for the Hadoop ecosystem and how critical it is to get this first step right. In my experience, the subsequent steps, such as applying data science to the ingested data and turning it into insights, are comparatively less complicated. As discussed in the previous post, there are multiple ways to ingest data into the Hadoop ecosystem, and the right one typically depends on the kind of data: transactional (data in motion) or non-transactional (data at rest). Breaking it down further, it also depends on latency (the data rate and how quickly you have to respond), the fault tolerance of the system (can you afford to miss data, and how reliable is the source), infrastructure maintenance, the skills on hand, and the overall budget you have in mind. Let me share a glimpse of one of my recent architecture blueprints, which highlights the approach we adopted to ingest data in real time.

*Shoply big-picture architecture*

Before getting into the details of the above implementation, here are some typical data sources from which one might want to ingest data into Hadoop.

  • RDBMS – one could use Sqoop scripts, scheduled with Oozie.
  • Enterprise data warehouse – ETL providers such as Informatica have great tools for loading data into Hadoop; one could use those.
  • Low-latency, real-time data – this is usually referred to as streaming, and some of the popular tools are Apache Flume, Apache Kafka, and Amazon Kinesis. Here are some examples of real-time streaming data:
    • Social media streams.
    • Weather sensor or traffic data.
    • Web click streams.
    • Mobile activity streams.
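
For the RDBMS case, a full-table Sqoop import boils down to a single command line. Here is a minimal sketch in Python that assembles one; the JDBC URL, table name, and HDFS target directory are illustrative assumptions, not values from the implementation above.

```python
def sqoop_import_cmd(jdbc_url, table, target_dir, num_mappers=4):
    """Return the argv list for a full-table Sqoop import into HDFS.

    num_mappers controls how many parallel map tasks Sqoop uses to
    split the table; it must be serialized as a string for the CLI.
    """
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(num_mappers),
    ]

# Hypothetical example: import an "orders" table from a MySQL source.
cmd = sqoop_import_cmd("jdbc:mysql://db.example.com/shop",
                       "orders", "/data/raw/orders")
print(" ".join(cmd))
```

In practice you would hand this argv list to `subprocess.run` (or paste the joined string into an Oozie shell/Sqoop action) and schedule it with an Oozie coordinator.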

Now, in the sample implementation shown above, even though we didn't have to respond to the events generated from the mobile app in real time, we chose a streaming solution as a strategic way to load data into the Hadoop ecosystem. Our tool of choice was AWS Kinesis (read why Kinesis is better than its cousins). Some of the reasons:

  • Ease of implementation (no big initial setup; up and running in a few minutes)
  • Low development effort (the Java KCL and SDK can integrate streams of data directly into temporary Hive tables)
  • Managed and scalable (Kinesis streams can practically be replicated to every data centre AWS supports)
  • Cheaper (the overall cost is lower; check out AWS Kinesis pricing)
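
To make the "low development effort" point concrete, here is a minimal sketch of the producer side in Python. The stream name and event fields are illustrative assumptions (our actual producer used the Java SDK, as noted above); the shape of the record, however, is what Kinesis `PutRecord` expects: a byte payload plus a partition key.

```python
import json

def make_record(event, stream="mobile-clicks"):
    """Serialize an event dict into Kinesis PutRecord arguments.

    Records with the same partition key land on the same shard, so
    keying on the user id keeps one user's events in order.
    """
    return {
        "StreamName": stream,
        "Data": json.dumps(event).encode("utf-8"),
        "PartitionKey": str(event["user_id"]),
    }

# With boto3 installed and AWS credentials configured, the record
# would be sent like this (not run here):
#   import boto3
#   kinesis = boto3.client("kinesis", region_name="us-east-1")
#   kinesis.put_record(**make_record({"user_id": 42, "action": "tap"}))
```

On the consuming side, a KCL worker (or an EMR job) reads the shards and lands the decoded JSON into staging Hive tables.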

Knowing the different methods for ingesting different types of data into Hadoop, you can mix and match and arrive at a solution that makes sense for your situation. This link has some good examples of each of the above data ingestion flavours.

What is your experience with big data ingestion?