Some context…
As the title of the post suggests, this is not a Spark streaming primer. Frankly, this post is written for an audience that wants to build on a foundation of Spark and Spark streaming knowledge they have already established. I also find a surprising number of developers programming in Spark streaming without knowing its inner workings; not just ‘hello world’, they are writing code for production systems. That is not wrong, but I believe a lot of design and performance decisions can be made well only when you know the internals of the open source framework you are using.
However, if you are a Spark newbie, don’t throw this post against the wall and curse me just yet. Here are a couple of good resources for you to start with:
- The Apache Spark programming guide is the go-to resource if you don’t have the foundation yet.
- The Apache Spark streaming guide is the best resource available and the closest thing to a primer.
Already got the basics and ready? Read on. Here we go: we are going to consider a Spark streaming example that reads data from Kafka and does some processing on it.
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
val ssc = new StreamingContext(new SparkConf, Seconds(60))
// hostname:port for Kafka brokers, not Zookeeper
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092,anotherhost:9092")
val topics = Set("sometopic", "anothertopic")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
// ... transformations and output operations on the stream ...
ssc.start()
ssc.awaitTermination()
I would like to show the chain of events that are set in motion once you start a Spark streaming context. It is a busy diagram; just follow the sequence numbers and spend some time with it.
Some of the details I brushed over intentionally
- I haven’t explicitly shown a cluster manager, as I am assuming a Spark standalone deployment.
- After step 4, the block metadata is written into a fault-tolerant log.
- Today a direct Kafka stream is available without the need for a WAL or in-memory block replication; I have chosen to show a regular Kafka receiver for simplicity (a minimal sketch of that receiver-based variant follows this list).
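For contrast with the direct stream above, here is a minimal sketch of the receiver-based variant with the WAL turned on, roughly what the diagram depicts. The app name, Zookeeper quorum, consumer group, topic-to-thread map, and checkpoint path are placeholder values of mine, and the API shown assumes the Spark 1.x spark-streaming-kafka receiver.
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
// Enable the write-ahead log so received blocks land in fault-tolerant storage
// before they are acknowledged (app name and checkpoint path are placeholders)
val conf = new SparkConf()
  .setAppName("receiver-with-wal")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
val ssc = new StreamingContext(conf, Seconds(60))
ssc.checkpoint("hdfs:///checkpoints/streaming-app") // the WAL lives under this directory
// Receiver-based stream: talks to Zookeeper, unlike the direct stream above
val topicsWithThreads = Map("sometopic" -> 1, "anothertopic" -> 1)
val receiverStream = KafkaUtils.createStream(
  ssc,
  "localhost:2181",       // Zookeeper quorum, not the Kafka brokers
  "my-consumer-group",
  topicsWithThreads,
  StorageLevel.MEMORY_AND_DISK_SER) // single in-memory copy; the WAL provides durability
With the WAL on, the default replicated storage level (MEMORY_AND_DISK_SER_2) can be dropped to a single copy, which is exactly the point made in the observations below.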
Some observations
- Write-ahead logs (WALs) are nothing but a fancy name for the journals that RDBMSs have been using for a long time now.
- Once the WAL is configured, all the received data is synchronously written into it. Depending on your fault-tolerant storage deployment (let’s assume HDFS) and the volume of the incoming streams, the throughput of the entire system can be impacted.
- Reliable receivers ACK ONLY after the data is written into the WAL, and by the way, it is then safe to turn off in-memory replication.
- Put together a cluster for Spark streaming, a cluster for fault-tolerant storage and a cluster manager – the hardware requirements for your Spark streaming data pipeline are not as modest as projected, or am I missing something?
- Upgrade options for a Spark streaming app: shut down gracefully (the option is available, see the sketch after this list), update the code, and start it again. But apps using checkpoints cannot leverage this upgrade option, at least not until 1.4; not very amusing.
- Monitoring options for a Spark streaming app: programmatic monitoring using a StreamingListener with events such as onBatchSubmitted, onBatchStarted, etc. sounds good (a minimal listener sketch also follows this list).
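On the upgrade path, the graceful shutdown looks roughly like this; where you trigger it from (a signal handler, an admin endpoint, a marker file watched by the driver) is up to your deployment and not shown here.
// Finish processing the data already received, then stop the streaming context
// and the underlying SparkContext
ssc.stop(stopSparkContext = true, stopGracefully = true)
// Later releases also added a config to do this automatically on JVM shutdown:
// sparkConf.set("spark.streaming.stopGracefullyOnShutdown", "true")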
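And for programmatic monitoring, a minimal StreamingListener sketch assuming a 1.4-era API; the printed fields are just a sample of what BatchInfo exposes, and how you ship these numbers (logs, a metrics system, alerts) is up to you.
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted, StreamingListenerBatchStarted}
// Prints a line per batch; register it on the context before ssc.start()
class BatchStatsListener extends StreamingListener {
  override def onBatchStarted(batchStarted: StreamingListenerBatchStarted): Unit =
    println(s"Batch started: ${batchStarted.batchInfo.batchTime}")
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    println(s"Batch ${info.batchTime}: ${info.numRecords} records, " +
      s"scheduling delay ${info.schedulingDelay.getOrElse(-1L)} ms, " +
      s"processing time ${info.processingDelay.getOrElse(-1L)} ms")
  }
}
ssc.addStreamingListener(new BatchStatsListener)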
Closing thoughts
Spark, and hence Spark streaming, works on top of data-at-rest. The data-at-rest is stored in memory in the form of RDDs, the primary abstraction that Spark has to offer. The processing is sent towards the data.