Apache Flink CEP and ATM Fraud Use Case – Part 2

This is a 2 min read. In the first part of this multi-part series on the Apache Flink CEP library, I briefly covered the case for a dedicated CEP framework among the toolsets of open-source stream processing frameworks. Quick recap of the use case: for a customer, an ATM withdrawal transaction >= 10,000 made more than three times in a location > 50 mile radius… Continue reading

Apache Flink CEP Library – Part 1

This is a 4 min read. I am presuming you know the whats and whys of Apache Flink, touted as one of the best data processing frameworks, able to do both batch and stream processing. Recently Flink announced a cool new CEP library. Hang on with me: before going any further, let me state the reason for this post. Some time back a lot… Continue reading

Wide row data modelling with Apache Cassandra

This is a 4 min read. I have always been intrigued by the performance claims of Apache Cassandra. So I wanted to put “wide rows”, and the performance edge that the wide-row data model is said to offer, to the test. Rumour has it that Facebook hired ex-Amazon engineers who wrote Dynamo to build Cassandra. Anyway, a sound starting point is to… Continue reading

Why is ZooKeeper always configured with an odd number of nodes?

This is a 2 min read. Someone on Quora.com asked me, “Why is ZooKeeper always configured with an odd number of nodes?” That’s a great question, but the sad part is that not many practitioners, even those who use ZooKeeper in production, can explain it simply. I will try to keep this really simple, I promise. ZooKeeper (ZK) is a highly-available, highly-reliable and… Continue reading

Terminology confusion: Column Stores and Column-oriented databases

This is a 3 min read. This is my attempt to clear the air on the subjects of column stores and column-oriented databases (at both the terminology and the understanding level). I will be talking a bit about how terrible the idea is of grouping column-oriented databases as a flavour of NoSQL data stores. What is a column store, really? There is no scope… Continue reading

How does the Log-Structured-Merge-Tree work?

This is a 6 min read. If you are wondering why you should care about the LSM-Tree: in one of my previous posts, Art of choosing a datastore, I briefly touched upon LSM-Trees. But this write-up is the best out there if you want to learn the inner workings of an LSM-Tree. How does the Log-Structured-Merge-Tree work? This was a Quora answer by David Jeske… Continue reading

Introducing FunnelCloud – A lightweight abstraction atop Apache Storm

This is a 4 min read. The idea of building a lightweight abstraction on top of Storm is to bring together the best of micro-batching and the processing flexibility of Storm. FunnelCloud also has a few added practical features. Gwen Shapira of Confluent explains the value of micro-batching and how it improves throughput in distributed architectures where network round trips are inevitable. Here is the full post. Let’s say due… Continue reading

“Exactly-once” with a Kafka-Storm Integration

This is a 4 min read. Update 4, Nov 2016: When I first wrote this post it was met with outright mockery and contempt. But the Google Dataflow paper (the unified Google framework for batch (FlumeJava) and stream processing (MillWheel)) and the Google MillWheel paper clearly explain that this is exactly the approach the Google team has taken to solve the duplicate events problem… Continue reading

Art of choosing a datastore

This is an 8 min read. Update 3, Nov 2016: When I wrote this post, there were a lot of opinions/comments (on my older blog) about how I am wrong in thinking that choosing a NoSQL datastore is very much like choosing a data structure when writing a program. Here is an excerpt from Nathan Marz’s book “Big Data: principles and best practices of scalable real-time data systems” that… Continue reading