Someone asked me on Quora: “Should I use Gobblin or Spark Streaming to ingest data from Kafka to HDFS?”
Here is what I wrote. It introduces an architecture pattern I call continuous streaming integration (CSI), built on a streaming data platform (SDP), for solving app and data integration challenges.
Short answer:
If your data sink is always going to be Hadoop (and ONLY Hadoop), it makes sense to look at purpose-built frameworks like Gobblin, because Gobblin is an “anything to Hadoop” integrator. But you have to answer for yourself first: “Is it going to be only Hadoop? Will I be dealing with other related use cases where I have to land data into a sink other than Hadoop, say a search index to enable search, or a NewSQL store for in-memory analytics?” If your answer is “No”, or even “Not sure”, then you should NOT go with purpose-built data integration tools, as there can be long-term implications for your data architecture. (Read the longer version for the implications.)
You can’t go wrong with Spark Streaming. It is a nice general-purpose micro-batch data processing framework with which you could solve a bunch of use cases, and you could also land data into sinks. But while thinking about data integration (in this case, landing data into Hadoop) as a streaming problem is sexy, Spark Streaming is a bit heavy: it is a full-blown stream processing framework, which is overkill unless you have to transform the data in multiple stages before landing it in Hadoop. Secondly, Hadoop is optimised for batch writes and has a very strong bias towards larger, batched writes, so your Spark Streaming job should be designed around that. Finally, rolling your own with Spark Streaming for this type of data integration needs a lot of coding on your side, which is unwarranted. If I were you I would spend that time solving real streaming use cases like SSP, CEP, or some RTBI or RTOI, not just landing data.
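To make the “lot of coding” point concrete, here is a rough sketch of what rolling your own Kafka-to-HDFS landing job with Spark (Structured) Streaming might look like. The broker address, topic, paths and the 10-minute trigger are illustrative placeholders, not values from any real setup.

```python
# A minimal sketch of a Kafka -> HDFS landing job with Spark Structured
# Streaming. All names, addresses and paths below are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("kafka-to-hdfs-landing")   # hypothetical app name
         .getOrCreate())

# Read raw events from Kafka as a streaming DataFrame.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
          .option("subscribe", "events")                       # placeholder topic
          .load()
          .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)"))

# Land the data on HDFS as Parquet. The long trigger interval is one way to
# respect Hadoop's bias towards fewer, larger files, as discussed above.
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/landing/events")              # placeholder path
         .option("checkpointLocation", "hdfs:///data/checkpoints/events")
         .trigger(processingTime="10 minutes")
         .start())

query.awaitTermination()
```

And this is before you add the Spark-Kafka integration package to the classpath and build scheduling, monitoring and failure handling around the job; multiply that by every source-sink pair you have to support.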
So, there are simpler (read: simpler, not easier) ways to make data integration gentle on all the teams involved.
Longer version:
1. Point-to-point data integration: a spaghetti
While there is nothing wrong with Gobblin itself, your data integration architecture can, in the long run, become messy and unmanageable, especially if you have to bring in multiple purpose-built data integration tools to integrate newer data sources and data sinks as business challenges demand. More often than not, you won’t know what data sources are coming your way, simply because that is driven entirely by business needs. This is exactly the challenge “application integration” faced a decade ago, and it led to the emergence of a discipline (and a lot of white noise) called “Enterprise Application Integration”.
As new data sources and data sinks come into the mix, point-to-point data integration becomes harder to build and hence harder to maintain. The system-to-admin ratio is more than relevant for distributed systems.
2. A better alternative:
Say we made a smarter move and chose a Spark Streaming solution to generalise data integration. Things would get better, but you may have to write a lot of code just to land data.
3. Much better and cleaner solution: Streaming integration
A much better and cleaner way is to implement “continuous streaming integration” using the Kafka streaming data platform (i.e. Kafka Connect, Kafka and Kafka Streams) to integrate data sources with data sinks. Kafka Connect offers a bunch of connectors for various sources and sinks; if you can find a connector for your needs, it is more a matter of configuring than coding.
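To illustrate the “configuring rather than coding” point, here is a sketch of registering Confluent’s HDFS sink connector through the Kafka Connect REST API. The connector name, topic, HDFS URL and flush size are placeholders, and the exact config keys depend on the connector version you run.

```python
# Registering an HDFS sink connector with a Kafka Connect worker's REST API.
# Assumes a Connect worker is running on localhost:8083 (the default port);
# all values below are illustrative.
import requests

connector = {
    "name": "hdfs-sink-events",  # hypothetical connector name
    "config": {
        # Confluent's HDFS sink connector class
        "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
        "tasks.max": "2",
        "topics": "events",                  # placeholder topic
        "hdfs.url": "hdfs://namenode:8020",  # placeholder HDFS URL
        "flush.size": "10000",               # records per file; tune for larger HDFS writes
    },
}

resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()
print(resp.json())
```

That is the whole integration: a config payload and a running Connect cluster, instead of a custom streaming job per source-sink pair.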
There is a whole movement around this space; check out “ETL is dead, long live streams” by Neha. I wrote about this a long time back: “Please don’t call Kafka as a messaging system”. BTW, “Continuous Streaming Integration” (CSI) is a term I am proposing; Neha calls it “Streaming ETL”.
My agony is this: today even code is getting integrated continuously, with CI/CD. With the data deluge we are talking about, why can’t data and app integration be continuous too? So CSI = EAI (Enterprise Application Integration) + EDI (Enterprise Data Integration).
In summary:
- ETL, a flavour of legacy EDI, assumed both sources and sinks would be relational databases (and is now playing catch-up).
- Then came frameworks like Flume and Gobblin for landing data in Hadoop, the latest fad in EDI. While these tools accommodated multiple data sources, they still assumed the data sink would always be Hadoop.
- Continuous Streaming Integration as an architecture pattern makes no assumptions about data sources and data sinks. It is a simple implementation of MIMO (multi-in, multi-out), and the best part is that you can write your own source and sink connectors. It is built on top of the tried and tested Kafka streaming platform, which is to say anything and everything ever offered as a feature by Kafka is available to you. You can support near real-time, interactive and batch use cases with a single integration layer.
- This is NOT just a pretty presentation: I have built a solution based on the architecture pattern recommended in section 3, and part of it is in production. (I wanted to write it up as a paper for the Kafka Summit, but I missed the deadline 🙁)