As the title suggests the scope of this multipart post is to evaluate how exactly-once processing is proposed in Google cloud data flow paper (link shared below) and hence implemented in the data flow service (which is the basis for Apache Beam).
Challenges related to tackling streaming data (a.k.a data-in-motion) comes in different flavours. Streaming data processing is also quickly becoming a very contented space polluted by half-assed open-source and closed-source tools. You’ve got a wide choice and it’s very easy to get lost in the technologies. But all the choices are NOT interchangeable here. I would like to make something very clear here, this is one of those problem spaces where the statement “There are can be many tools and frameworks, it is the responsibility of the users to carefully evaluate and choose based on the use-cases” is very true. Haven’t you heard mixed reviews about streaming frameworks like Storm and Spark streaming ? (A quick side-bar: If you knew the recent twitter release of twitter heron, apparently they were dissatisfied with performance of storm and stopped using it production. But since they have been using a lot of production code, to salvage the code they just worked on top of it to refine and fill gaps of storm with heron instead of jumping Spark streaming or Flink).
Anyways, saying that the streaming data processing space is polluted is a pretty bold statement to make, but you don’t have to take my word on this, please have a look at the recent progressive google paper The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing. This paper pretty much has called out all the existing frameworks stream, micro-batch, batch and pointed out how inadequate each one of them in one or other aspects of processing data-in-motion.
Googlers have captured all these issues as design requirements for data processing (streaming and batch) in their Dataflow service which is based on the google paper cited above. This paper promises an all-encompassing, unified framework for Batch (FlumeJava) and Stream processing (MillWheel). The streaming part of the framework is handled by another framework called MillWheel (which is based on the MillWheel paper). The below diagram NOT only covers all the requirements for a good streaming data processor but also batch processors.
While there are many aspects for streaming data processing as listed below,
- State management
- Out-of-order data / Late events
- Exactly-once/ duplicate event handling
- Fault tolerance
Let’s cut right back into the scope of this post, which exactly-once processing or duplicate-event processing and making your processing idempotent.Here is a excerpt from the MillWheel’s paper on solving exactly-once processing or duplicate-event processing. (and I am glad my thought process is very much in line with this proposal)
Full disclosure !
I am only going to state the proposal from the paper and draw some parallels to the proposal I made in one of my previous posts, but I will follow this up with an evaluation by building a tiny prototype and update this post (with a link to the code and so on).
Ok, here is the detailed step by step approach stated in the paper how this is achieved.
Upon receipt of an input record for a computation, the MillWheel framework performs the following steps:
While the approach is to generate a finger print for each event and use it for testing duplicates, the paper suggests the use of a Sketching / Probabilistic data structure, a Bloom filter to optimise the memory required for storing the finger prints, I have proposed to use a deterministic data structure, a Distributed Hash-map. I would still argue a Bloom filter wouldn’t make a good choice for this problem even though cheaper. Heres why
Bloom filters (BF) is used to probabilistically test subscriptions, the “No’s” are “Certain No” and the “Yes’s” are a “Probable Yes ” with the % of error rate. So there is always a possibility of “False positives” and “True Negatives”.
So the questions are,
- What if BF a “Probably Yes” and its a false positive, the system will falsely consider an event to be a duplicate and ignore it. In our case a“Probable Yes” is a “Probable duplicate”. If I have ignored the event as a “Probable duplicate” but its not a duplicate the second layer backed by a store can’t help, its an event loss. Most systems cannot tolerate that.
- BFs are solid when it comes to saying “Not found” So I can’t imagine the case of “Filter misses” needing a second layer backed by a persistent store.
- All in all Bloom filter needs N byte vector with M hashing functions to provide X% error rate after doing all this still I have scope for event loss and need a persistent store to fall back ? If you suggest tuning Bloom filter to 0% error rate then it will collapse back to a deterministic DS. I would rather not waste my compute power and do it with a deterministic DS like a distributed HashMap or may be think of something else.
Google cloud data flow paper in a way throws all the existing data processing frameworks under the bus but it clearly states the requirements for an all-in-one data-parallel processing framework and also shows how Google cloud data flow uses FlumeJava and MillWheel to achieve the same. While it is a solid data-parallel pipeline unifying batch and stream processing strategically speaking data flow’s late entry to the market needs to accounted as well. Hence google smartly goes one level up and puts together an Apache beam a meta framework specification for data processing. Apache beam abstraction implements all the primitives proposed by google cloud data flow paper. Now you can build data pipelines using user existing Flink or Spark or Google cloud data flow code. In Beam terminology user code can be ran with any of the above mentioned ‘runners’.
I will try and port one of my existing stream processing projects into apache Beam and report my experience. Stay tuned !.
Over to you now.