Introducing FunnelCloud – A lightweight abstraction atop Apache Storm

The idea behind building a lightweight abstraction on top of Storm is to bring together the best of micro-batching and the processing flexibility of Storm. FunnelCloud also adds a few practical features.

Gwen Shapira of Confluent explains the value of micro-batching and how it improves throughput in distributed architectures where network round trips are inevitable. Here is the full post.

 Let’s say due to network round trip times, it takes 2ms to send a single Kafka message. By sending one message at a time, we have latency of 2ms and throughput of 500 messages per second. Instead we could wait for a few milliseconds and send a larger batch – let’s say we decided to wait 8ms and managed to accumulate 1000 messages. Our latency is now 10ms, but our throughput is up to 100,000 messages per second! That’s the incentive of going for micro-batches. By adding a tiny delay (10ms is usually acceptable for most critical applications), the throughput is 200 times greater.
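To make that trade-off concrete, here is a minimal sketch (not from the post above) of how a Kafka producer can be tuned for micro-batching. The broker address and topic name are assumptions for illustration; linger.ms and batch.size are the standard producer settings that control how long and how much the client accumulates before sending a request.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class BatchingProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // Wait up to 10 ms for more records before sending a request,
        // trading a little latency for much higher throughput.
        props.put("linger.ms", 10);
        // Cap each in-memory batch at 64 KB per partition.
        props.put("batch.size", 64 * 1024);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 1000; i++) {
                producer.send(new ProducerRecord<>("events", Integer.toString(i), "payload-" + i));
            }
        } // close() flushes any records still sitting in a batch
    }
}
```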

As you may know, Apache Storm works on data in motion, and contrary to the popular axiom, here data is moved towards the processing. Your spout could be running on one instance and your processing bolt on a different one. Applying the same principle, sending micro-batches of tuples towards the processing bolt helps us gain throughput. The Trident abstraction of Storm pretty much fills this gap: it brings the concept of micro-batching along with SQL-like, high-level, expressive APIs for processing batches of tuples. Here is a great experiment that supports the idea of micro-batched tuples for better performance.
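For context, this is roughly what the Trident style looks like: a minimal word-count sketch adapted from the standard example in the Storm documentation (assuming Storm 1.x+ package names). The test spout emits micro-batches of sentence tuples, and the high-level operations (each, groupBy, persistentAggregate) work on those batches.

```java
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.BaseFunction;
import org.apache.storm.trident.operation.TridentCollector;
import org.apache.storm.trident.operation.builtin.Count;
import org.apache.storm.trident.testing.FixedBatchSpout;
import org.apache.storm.trident.testing.MemoryMapState;
import org.apache.storm.trident.tuple.TridentTuple;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class TridentWordCountSketch {

    // Splits each sentence tuple into individual word tuples.
    public static class Split extends BaseFunction {
        @Override
        public void execute(TridentTuple tuple, TridentCollector collector) {
            for (String word : tuple.getString(0).split(" ")) {
                collector.emit(new Values(word));
            }
        }
    }

    public static TridentTopology buildTopology() {
        // Test spout that emits micro-batches of at most 3 sentence tuples.
        FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 3,
                new Values("the cow jumped over the moon"),
                new Values("the man went to the store"));
        spout.setCycle(true);

        TridentTopology topology = new TridentTopology();
        topology.newStream("sentences", spout)
                .each(new Fields("sentence"), new Split(), new Fields("word"))
                .groupBy(new Fields("word"))
                .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));
        return topology;
    }
}
```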

So what’s the catch? Look at the diagram below: a batch, “batch 1”, of N tuples, where the first N-1 tuples are successfully processed, processing of tuple N fails for some reason, and you want to fail it and have it replayed by Trident.

So let’s go ahead and investigate, one at a time, the features Trident has to offer.

Fault and Failure recovery

How does the Trident framework handle fault and failure recovery?


Trident replays the entire batch from tuple 1 because acking works at the batch level instead of the tuple level. So if you have N tuples in batch 1, say N = 100, and the 100th tuple fails for some reason, Trident fails the entire batch and replays it; you need logic to take care of, or ignore, the 99 already-processed tuples. Think about this: your streaming data platform, which persists data, must either take care of idempotency (i.e. repeated processing of the same tuples won’t affect the state of the system) or have some logic to intelligently identify tuples that re-enter the system (not necessarily from Trident; they could be duplicates from the data source itself). Either way it means reprocessing and/or extra effort on your side to implement this logic on top of your business logic.
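A minimal sketch of the kind of guard you end up writing yourself; the in-memory set here stands in for whatever durable store your platform actually uses. The business logic is applied only if the tuple’s key has not been seen before, so replaying the 99 already-processed tuples becomes a no-op.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Skips tuples that were already applied in a previous (failed and replayed) batch. */
public class IdempotentProcessor {

    // In production this would be a durable store (e.g. keyed by tuple id in a
    // database or cache); an in-memory set is used here only to illustrate the check.
    private final Set<String> processedKeys = ConcurrentHashMap.newKeySet();

    public void process(String tupleKey, Runnable businessLogic) {
        // add() returns false if the key was already seen, so the business
        // logic runs at most once per tuple key even when a batch is replayed.
        if (processedKeys.add(tupleKey)) {
            businessLogic.run();
        }
    }
}
```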


How can fault and failure recovery be handled better?

How about writing your own abstraction on top of Storm that handles micro-batching, so you have complete control over how you batch the tuples? Once you have that control, you can have the best of both worlds – micro-batching, but replaying only the failed tuples, thus avoiding any unwanted reprocessing. This can be critical for some applications.
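A minimal sketch of what such an abstraction could look like at the bolt level, assuming Storm 2.x package names and signatures: tuples are buffered into batches by count or by time (via tick tuples), but each tuple is acked or failed on its own, so a failure replays only that tuple.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.utils.TupleUtils;

/**
 * Buffers tuples into micro-batches (flushed by count, or by time via tick
 * tuples) while still acking/failing each tuple individually, so only the
 * tuples that actually failed are replayed.
 */
public class MicroBatchBolt extends BaseRichBolt {

    private static final int BATCH_SIZE = 1000;       // count-based flush threshold
    private static final int FLUSH_INTERVAL_SECS = 5; // time-based flush interval

    private OutputCollector collector;
    private List<Tuple> buffer;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.buffer = new ArrayList<>(BATCH_SIZE);
    }

    @Override
    public void execute(Tuple tuple) {
        if (TupleUtils.isTick(tuple)) { // time-based flush
            flush();
            return;
        }
        buffer.add(tuple);
        if (buffer.size() >= BATCH_SIZE) { // count-based flush
            flush();
        }
    }

    private void flush() {
        for (Tuple t : buffer) {
            try {
                processSingleTuple(t); // placeholder for your business logic
                collector.ack(t);      // ack just this tuple
            } catch (Exception e) {
                collector.fail(t);     // replay just this tuple, not the whole batch
            }
        }
        buffer.clear();
    }

    private void processSingleTuple(Tuple t) {
        // business logic goes here
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        // Ask Storm to send this bolt a tick tuple every FLUSH_INTERVAL_SECS.
        // The topology message timeout must comfortably exceed this interval,
        // since tuples stay un-acked while they sit in the buffer.
        Map<String, Object> conf = new HashMap<>();
        conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, FLUSH_INTERVAL_SECS);
        return conf;
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // nothing emitted downstream in this sketch
    }
}
```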

 

State, Idempotency and duplicate tolerance without Trident

The Trident framework’s offering for state management is good: it takes care of efficient state handling and guarantees that duplicated tuples (accidental or deliberate) from the same batch won’t affect the state. Trident offers Transactional and Opaque Transactional spouts for data sources like Kafka.

But if we decide to go with our own micro-batching abstraction for a more performant fault and failure recovery option, as described in the section above, Trident is out of the picture. So we are on our own when it comes to state handling and idempotency.

The straightforward way is to write additional logic on top of the business logic to handle idempotent state. But here’s a bit of wishful thinking: how about an abstraction or framework that handles idempotent state based on a tuple key you pass as input, so all you have to write is the business logic?
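In that spirit, here is a purely hypothetical sketch of the contract such a framework could expose (the interface and its methods are illustrations, not an existing API): the user supplies only a key and the business logic, and the framework owns batching, per-tuple ack/fail and duplicate suppression.

```java
import org.apache.storm.tuple.Tuple;

/**
 * Hypothetical user-facing contract: the user provides only the idempotency
 * key and the business logic; the framework handles micro-batching,
 * per-tuple ack/fail and suppression of already-seen keys.
 */
public interface IdempotentTupleProcessor {

    /** Stable key the framework uses to detect tuples it has already applied. */
    String idempotencyKey(Tuple tuple);

    /** Business logic; invoked at most once per key by the framework. */
    void process(Tuple tuple) throws Exception;
}
```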

 

Introducing FunnelCloud – A new Storm framework with the best of Storm, Trident and much more.


1. Offers custom micro-batching – by tuple count or by time.

2. Doesn’t replay the entire batch on fault or failure – so quicker, more efficient fault/failure recovery with minimal re-computation cost.
3. Custom idempotency logic with flexibility given to the users.
4. Users don’t have to write bolts and spouts – just conform to the framework’s contracts and let it do the job for you.
5. Pause/Play, Stop/Start (graceful) options for topologies.
6. Upgrade-while-active for topologies.

All other Storm features – guaranteed message processing, partitioning, shuffling, et cetera – remain intact; the proposal is to add all of the above features on top and deliver them as a framework.

I am working on FunnelCloud as I write this article; stay tuned for updates.