The term Big Data is in the verge of becoming a household term. Have you ever wondered why there is so much movement around that phrase? Let me give you my view on this. Information systems are about data collection, processing, storing, retrieving and deleting. Between a decade ago and now there isn’t any change in what Information systems do, but they just do the things differently.
Reason ? Web 2.0 , social media, mobile proliferation and IoT. Did I lose you there ? Hold on to that thought for a minute, let me explain.
So what has changed between then and now? How does that impact Information systems?
Trend 1: User empowerment
Introduction of web 2.0 concepts like Wikipedia and Social Media has been a huge paradigm shifts. Amount videos, photos, text coming from Youtubes, Facebooks, Twitters and other user empowered social mediums are bigger than ever. Neither the type of the data nor the volume is familiar for us decade ago.
Trend 2: Affordable, better storage and processing power
Storage and processing are getting better and cheaper as we speak. There are software frameworks which can use a cluster of computers to do a complex processing by splitting the load across the cluster.
Trend 3: IoT
Next level of empowering “everything” to be a data source is connecting consumer electronic “things” to internet.
Let’s build a live score card for cricket
Of course when I meant sport I would like to talk about my favourite sport here, cricket. Like any other sport cricket has been a source of huge volumes of data. With advent of big data and related tools/frameworks showing ball-by-ball updates will never be the same. Remember the times when we have to refresh the live score card ? If you observer closely sites like starsports.com, espncricinfo.com and most of the news sites show live score cards the score cards updates itself without refreshing the entire screen. No we are not talking about advancements in web technologies to open persistent pipes like web-sockets to get latest data, we are talking about how data is collected, processed and displayed at an affordable scale.
Lets walkthrough the steps involved in building a mobile app which shows a live score card. Remember key requirement is – The dashboard is real-time and the score updates are automatic without any visible lag.
Clearly we need to get ball-by-ball ticks from some source to build what we are willing to build. There are lot of providers for these kind of live scores, cricbuzz.com provides live ball-by-ball feeds for a paid subscriber. For this exercise lets imagine we have subscribed to cricbuzz already for multiple feeds like ball-by-ball updates, live commentary etc..Already you should be relating this to us building something which works like a publish-subscribe model.
Now before getting into the next steps, lets just see what kind of information a live score card has. Above is a screen short of one such live score cards. As you could see the scorecard was opened at a point of time where game has been progressed to a score of 244/1 at 41.4 overs. Part of the data displayed is “data-at-rest” (highlighted in red) and the other part is the data that is coming from the field as they play i.e. data-in-flight (highlighted in green).
The paradigm shift of “reacting to data as it comes” pretty much was made possible due to big data tools and frameworks. So in our scenario where the game has progressed until 41.4 overs, we need a querying layer that works on data-at-rest as well as data-in-flight.
So essentially at any point in time we always will have a tiny margin of the game not processed and will be coming straight from the field. Addressing this lag and serving queries by merging data-at-rest as well as data-in-flight is called the Lambda Architecture (LA). Storm fame Nathan Marz practically came up this and coined this term, below diagram should give you an idea. I am not saying live score cards are built using LA, I am proposing to build one using LA.
How does lambda architecture apply to our case ? Here is how our architecture for our live score card system will look like. As you can see this is a classical implementation of lambda architecture. So as per lambda every single event, i.e. ball-by-ball tick will be fed simultaneously into a batch layer and a speed layer.
Each layer can have multiple views. For instance batch layer can have 3 possible precomputed views from the ball-by-ball tick as shown below. The batch layer precomputes the data and stores in the data store.
Speed layer will have multiple real-time views. Lets talk about one such possible view , in our case when the user opened the live score card a single query was fired to get the current score. The serving layer sends the query to the batch layer to get the last batch view which probably returned 244/1 at 41.4 overs, which means the batch layer basically says “thats the latest value I have and you will have to get the real time progress after what I have from the Real-time view of the speed layer”. Then the serving layer sends query to the speed layer. The merged result could be 244/1 at 41.4 + all the progress after that i.e. 1 run in 41.5, 2 runs in 41.6 as you can see from the figure.
As you might have sensed by now the real time views are data views of very short time window. So, how to come up with that window? Think about it – if the ball-by-ball tick going into a batch layer will take 5 mins to get realised as a batch view, then I better be having 5 mins worth of data in real time views before I flush it. Feel nervous about measuring the time taken for realising your batch views ? just for safety have 2x of the batch window worth data in real time view.
Another possible real-time view in speed layer could be an aggregating layer, the last flow shown in the architecture diagram. How do you think updates like “batsman x has reached 5000 runs in ODI” or “This is the highest 6th wicket partnership in world cup history” is calculated? These are classical examples of aggregating ball-by-ball ticks against specific rules or milestones.
The former flavour of speed layer is called Stream processing (SP) while the later is called Complex event processing (CEP). SP and CEP are 2 important concepts in real time processing of big data.
SP and CEP are topics of their own for a separate post, but I hope I made sense trying to implement a LA for a Cricket Live Score card.