Someone asked me on Quora, “Should I use Gobblin or Spark Streaming to ingest data from Kafka to HDFS?” Here is what I wrote: This introduces a new architecture pattern called continuous streaming integration (CSI) with streaming data platforms (SDP) for solving the app and data integration challenges. Short answer: If your data sink is… Continue reading
Post Category → Fast data, streaming, CEP
TAPO for Airports – A Streaming usecase
Airports, especially the busy ones, face an interesting challenge when it comes to serving commuters: they need a smoother way to handle passengers in queues without long, frustrating waits, and thereby elevate the overall experience. No one likes to stand in long queues. But airports, unfortunately, have lots of queues: one for check-in, baggage… Continue reading
Apache Flink CEP and ATM Fraud usecase- Part 2
In the first part of this multi-part series on the Apache Flink CEP library, I briefly covered the case for a dedicated CEP framework among the toolsets of open-source stream processing frameworks. Quick recap on the use case: For a customer, an ATM Withdrawal Txn >= 10,000 made more than ‘3 times’ in a location > 50 mile radius… Continue reading
Apache Flink CEP Library – Part 1
I am presuming you know the whats and whys of Apache Flink, touted as one of the best data processing frameworks, able to do both batch and stream processing. Recently Flink announced a cool new CEP library. Hang on with me: before going any further, let me state the reason for this post. Some time back a lot… Continue reading
Introducing FunnelCloud – A lightweight abstraction atop Apache Storm
The idea of building a lightweight abstraction on top of Storm is to bring together the best of micro-batching and the processing flexibility of Storm. FunnelCloud also has a few added practical features. Gwen Shapira of Confluent explains the value of micro-batching and how it improves throughput in distributed architectures where network round trips are inevitable. Here is the full post. Let’s say due… Continue reading
Please don’t call Kafka as a messaging system
Update, 11/Nov/2016: Originally this post was titled “Please don’t call Kafka as a messaging system”; I had to change it as some people went “what else would you call it?”. The Kafka tagline used to be something like “Message processing rethought”, but it looks like they later changed it to “A distributed streaming platform”. So, I am changing the… Continue reading
Internals of Spark Streaming
Some context… As the title of the post suggests, this is not a Spark Streaming primer. Frankly, this post is written for readers who already have a foundation in Spark and Spark Streaming and want to build on it. I also find a surprising number of developers programming in Spark Streaming without knowing the inner… Continue reading
What you need to know before writing Streaming APIs
What are Streaming APIs? Streaming APIs are not to be confused with multimedia streaming services like Netflix or YouTube. The industry is starting to use a newer breed of REST APIs, called Streaming APIs, to offer a “high-throughput” pipeline for receiving curated data. With these APIs you can capture information in real time. It’s… Continue reading
Beginner’s guide to Fast Data
What is fast data? Fast data is becoming a catchphrase as we speak. If you are hearing of it for the first time, please don’t worry. We are going to talk about it in detail in this post. (But I am going to assume you have some big data background.) Let me start by graphically telling the… Continue reading