Beginner’s guide to Fast Data

This is a 4 min read

What is Fast data?

Fast data is becoming a catchphrase as we speak. If you are hearing about it for the first time, please don’t worry. We are going to talk about it in detail in this post (though I am going to assume some big data background from you). Let me start by showing graphically the difference between Fast data and Big data. (Credit to guys for explaining the difference so simply.)

[Image: Fast data]

[Image: Big data]

Simple as that. As far as Fast data is concerned, the value is in the freshness of the data, its temporal aspect. It’s for the business user who says, “I want to know what happens now… not after an hour or a day”.

So Fast data is about reacting and/or responding to data the instant it arrives, i.e. while the data is in flight. Processing streams of events (data in flight) is called stream processing, so stream processing is the core of Fast data platforms.
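To make "reacting the instant it arrives" concrete, here is a minimal sketch in plain Python. It is not tied to any streaming framework, and the function name and the vote data are made up for illustration. The point is only that a result is available after every event instead of after a whole batch.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator


def running_counts(events: Iterable[str]) -> Iterator[Dict[str, int]]:
    """Yield an updated snapshot of the counts after every single event."""
    counts: Counter = Counter()
    for event in events:      # each event is handled as soon as it is read
        counts[event] += 1
        yield dict(counts)    # a fresh result, without waiting for a batch


# Hypothetical vote stream; a real source would be unbounded.
votes = ["alice", "bob", "alice", "alice"]
snapshots = list(running_counts(votes))
for snapshot in snapshots:
    print(snapshot)
```

A batch job would produce only the last snapshot, and only once all the input had been collected.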

Why would one want to know “what happens now”?

Think about the scenarios below:

  • Is the transaction fraudulent?
  • How do we make real-time decisions based on a stream of user clicks in a website or mobile app?
  • Is my customer within a 1 mile radius of my store? (Location-based marketing strategy)
  • Real-time dashboards: counting things (tweet counter, American Idol vote counter…)
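As an illustration of the location scenario, the "within a 1 mile radius" question can be sketched with the haversine great-circle distance. This is only one reasonable way to do it. The coordinates and function names below are hypothetical, and a real geo-fencing service may use a different method.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def distance_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def inside_geofence(customer, store, radius_miles=1.0):
    """True if the customer position lies within the radius around the store."""
    return distance_miles(*customer, *store) <= radius_miles


# Hypothetical coordinates, for illustration only.
STORE = (40.7580, -73.9855)
near_customer = (40.7614, -73.9776)
far_customer = (40.6892, -74.0445)

print(inside_geofence(near_customer, STORE))
print(inside_geofence(far_customer, STORE))
```

In a Fast data setting this check would run on every location update as it streams in, not on positions already sitting in a database.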

We cannot let a transaction get into a banking system database and only then determine whether it is fraudulent, can we? When the mobile app reports that a customer is inside a geo-fence (as set up by the business), you should be able to push a location-aware offer and engage the customer in real time. These are simple needs, but huge volumes of real-time data complicate things. Thus we need a high-performance Fast data platform backed by low-latency stream processors.

But before moving forward, think about this for a minute: isn’t somebody doing something like this already? For instance, don’t stock markets monitor time-series updates of stock prices and tell us whether they went up or down? In fact, they can give a lot more insight in real time. So yes, a lot of people have been doing Fast data for a long time now. OK, so what?

Acting on data as it arrives has long been thought of as costly and impractical, if not impossible, and has been restricted to the high-value use cases of Wall Street firms and stock markets. But thanks to the commoditisation of hardware, memory and processing power are cheaper than ever. Thanks also go to companies like Google, Facebook and LinkedIn for generously open-sourcing their secrets on how to do big and fast data effectively and efficiently on commodity hardware.

So what I am saying is that today even a small startup can do big/fast data and derive valuable insights without spending a boatload of money on complex systems.

So what makes a Fast data processing platform?

To process data arriving at tens of thousands to millions of events per second, a Fast data processing platform needs three technologies:

  • An event streaming system capable of receiving and delivering events as fast as they come in, a.k.a. the firehose.
  • A stream processing system to process events as fast as they arrive.
  • A data store capable of storing the processed data.
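The three pieces above can be sketched as a toy, single-machine pipeline. This is only an analogy under loud assumptions: a standard-library queue stands in for the firehose, a worker thread for the stream processor, and a plain dict for the data store. The click events are invented for the example.

```python
import queue
import threading

firehose = queue.Queue()   # 1. event streaming system (stand-in)
store = {}                 # 3. data store (stand-in)
_DONE = object()           # sentinel that ends this finite demo stream


def process(events, out):
    """2. Stream processor: update page-view counts as each event arrives."""
    while True:
        event = events.get()          # blocks until the next event is delivered
        if event is _DONE:
            break
        page = event["page"]
        out[page] = out.get(page, 0) + 1


worker = threading.Thread(target=process, args=(firehose, store))
worker.start()

# Hypothetical click events pushed into the firehose.
for page in ["/home", "/cart", "/home"]:
    firehose.put({"page": page})
firehose.put(_DONE)
worker.join()

print(store)
```

A real platform distributes each of the three roles across many machines, but the division of responsibilities is the same.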

Stream processing frameworks

Without getting into the details of stream processing flavours like CEP (complex event processing), let’s look at some of the top streaming frameworks you can consider using.

  1. Apache Storm, Storm Trident, DRPC (by Nathan Marz, the guy who coined the term lambda architecture)
  2. Apache Spark/Spark Streaming (by UC Berkeley)

Both Apache Storm and Spark Streaming are distributed real-time stream processing frameworks. Of course, without integration with Apache Kafka, neither Spark Streaming nor Storm would be a complete stream processing solution. Kafka is a distributed publish/subscribe messaging framework typically positioned as the firehose. While most stream-processing implementers are split between the above two frameworks, there are also other less popular but good frameworks. And here is a surprise package and a recent addition: VoltDB, a database company that calls Volt “an in-memory transaction database”, claims that it can handle stream processing and complex event processing without any fuss and with less hardware to deploy. It sounds very promising, but I cannot comment on it as I have not used it.
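To give a feel for the publish/subscribe idea behind the firehose, here is a toy in-memory sketch. It is not Kafka's API. The Topic class and the subscriber names are hypothetical, and it only mimics the general idea of an append-only log where each subscriber tracks its own read position (offset).

```python
from collections import defaultdict


class Topic:
    """Toy append-only log with one read offset per subscriber."""

    def __init__(self):
        self._log = []                      # messages in arrival order
        self._offsets = defaultdict(int)    # subscriber name -> next index

    def publish(self, message):
        self._log.append(message)

    def poll(self, subscriber):
        """Return the messages this subscriber has not seen yet."""
        start = self._offsets[subscriber]
        batch = self._log[start:]
        self._offsets[subscriber] = len(self._log)
        return batch


clicks = Topic()
clicks.publish("click:1")
clicks.publish("click:2")
first_storm = clicks.poll("storm-topology")    # sees both messages
clicks.publish("click:3")
second_storm = clicks.poll("storm-topology")   # sees only the new one
spark = clicks.poll("spark-job")               # independent reader, sees all

print(first_storm, second_storm, spark)
```

Because each subscriber keeps its own position, several downstream processors can read the same stream independently and at their own pace.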

Over to you

That’s about it for a basic introduction to Fast data. If you want deeper insight into how Storm and Spark Streaming compare and contrast, refer to one of my older posts. Also, if you want to understand lambda architecture, refer to another older post, which talks about how lambda architecture fits into sports. Hopefully this was informative, and if you have read this far you should have got at least some takeaways.

In another instalment I would like to share the simple counting program that I wrote using Storm and Trident to update a real-time dashboard; this should give you the next level of understanding of putting a stream processing framework to use. The exercise will involve reading streams of data from external sources, processing the data, maintaining/storing state and updating the dashboard. By dashboard, please don’t imagine anything flashy; it’s just an HTML file served from a local node.js web server.