Update 4, Nov 2016: When I first wrote this post, it was met with outright mockery and contempt. But the Google Dataflow paper (the unified Google framework for batch (FlumeJava) and stream processing (MillWheel)) and the Google MillWheel paper clearly explain that this is exactly the approach the Google team has taken to solve the duplicate events problem… Continue reading
Post Category → Big Data
Art of choosing a datastore
Update 3, Nov 2016: When I first wrote this post, there were a lot of opinions/comments (on my older blog) about how I was wrong to think that choosing a datastore is almost like choosing a data structure when writing a program. Here is an excerpt from Nathan Marz’s book “Big Data: principles and best practices of… Continue reading
Please don’t call Kafka as a messaging system
Update, 11/Nov/2016: Originally this post was titled “Please don’t call Kafka as a messaging system”. I had to change it, as some people went “what else would you call it?”. The Kafka tagline used to be something like “Message processing rethought”, but it looks like they later changed it to “A Distributed streaming platform”. So, I am changing the… Continue reading
Internals of Spark Streaming
Some context… As the title of the post suggests, this is not a Spark Streaming primer. Frankly, this post is written for readers who want to build on an existing foundation in Spark and Spark Streaming. I also find a surprising number of developers programming in Spark Streaming without knowing the inner… Continue reading
What the heck is Apache ZooKeeper anyway ?
Here is an attempt to intuitively explain how ZooKeeper works and how it can be used. 1. At a high level, ZooKeeper is a service that gives clients access to a tree-like structure, or a hierarchical namespace, as the ZooKeeper documentation calls it. So why do we need this tree? Of course for… Continue reading
What you need to know before writing Streaming APIs
What are Streaming APIs? Streaming APIs are not to be confused with multimedia streaming services like Netflix or YouTube. The industry is starting to use a newer breed of REST APIs, called Streaming APIs, that offer a “high-throughput” pipeline for receiving curated data. With these APIs you can capture information in real time. It’s… Continue reading
Beginner’s guide to Fast Data
What is Fast Data? Fast data is becoming a catchphrase as we speak. If you are hearing about it for the first time, please don’t worry. We are going to talk about it in detail in this post. (But I am going to assume you have some big data background.) Let me start by graphically telling the… Continue reading
Lambda Architecture and Sports
The term Big Data is on the verge of becoming a household term. Have you ever wondered why there is so much movement around that phrase? Let me give you my view on this. Information systems are about collecting, processing, storing, retrieving and deleting data. Between a decade ago and now, there hasn’t been any change in… Continue reading
What Wikipedia can’t tell you about Apache Storm and Apache Spark Streaming
I am seeing a lot of questions about Spark Streaming and Storm on Quora: when to choose which, and what their performance, reliability and support are like. As usual, there are a lot of comparisons available on the web, which you can find if you google around. But instead of comparing them side by side, I thought of talking… Continue reading
What you didn’t know about Real-time notification systems
I have been intrigued by event notification systems for a long time now. In fact, this started in my programming days in legacy environments like iSeries. So I started working on a toy project, which evolved into a solid project. I thought I would muse about that recent project, RealTimeNotification. But before going into the details of the… Continue reading