Internals of Spark Streaming

Some context…

As the title of the post suggests, this is not a Spark Streaming primer. Frankly, this post is written for an audience that wants to build on a foundation of Spark and Spark Streaming knowledge that is already in place. I also find a surprising number of developers programming in Spark Streaming without knowing its inner workings. Not just ‘hello world’ either: they are writing code for production systems. That is not wrong, but I believe many design and performance decisions are better made when you know the internals of any open source framework you depend on.

However, if you are a Spark newbie, don’t throw this post against the wall and curse me just yet. Here are a couple of good resources for you to start with:

  1. The Apache Spark programming guide is the go-to resource if you don’t have the foundation yet.
  2. The Apache Spark streaming guide is the best resource available and the closest thing to a primer.

Already got the basics and ready? Read on. Here we go: we are going to consider a Spark Streaming example that reads data from Kafka and does some processing on it.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf, Seconds(60))

// hostname:port for Kafka brokers, not Zookeeper
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092,anotherhost:9092")
val topics = Set("sometopic", "anothertopic")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
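The snippet above only creates the stream. As a hedged sketch of the “some processing” mentioned earlier, a per-batch word count over the Kafka message values might look like this (the logic here is illustrative, not part of the original example):

```scala
// Illustrative continuation: a per-batch word count over the message values.
// createDirectStream yields (key, value) pairs; we only need the values here.
val lines  = stream.map(_._2)
val words  = lines.flatMap(_.split("\\s+"))
val counts = words.map(word => (word, 1L)).reduceByKey(_ + _)
counts.print()           // prints the first few counts of each batch to stdout

ssc.start()              // start the receivers and the job scheduler
ssc.awaitTermination()   // block until the context is stopped
```

Note that nothing actually runs until `ssc.start()` is called; everything before it only builds the DStream lineage.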

I would like to show the chain of events that is set in motion once you start a Spark Streaming context. It’s a busy diagram; just follow the sequence numbers and spend some time with it.

Some details I have intentionally glossed over:

  • I haven’t explicitly shown a cluster manager, as I am assuming a Spark standalone deployment.
  • After step 4, the block metadata is written into a fault-tolerant log.
  • Today a direct Kafka stream is available that needs neither a WAL nor in-memory block replication; I have chosen to show a regular Kafka receiver for simplicity.
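The fault-tolerant log in step 4, like the WAL itself, follows the classic journal-then-apply pattern: record first, mutate state second, replay after a crash. Here is a toy, self-contained sketch of that idea; every name in it is invented for illustration and has nothing to do with Spark’s actual WAL implementation:

```scala
// Toy write-ahead log illustrating the journal-then-apply pattern.
// All names are made up for illustration; the real WAL writes to HDFS.
import scala.collection.mutable

class ToyWriteAheadLog {
  private val journal = mutable.ArrayBuffer.empty[String] // stands in for a durable file
  private val state   = mutable.Map.empty[String, Int]    // in-memory state

  // Journal first, then apply: the record is durable before the state changes.
  def put(key: String, value: Int): Unit = {
    journal += s"$key=$value"   // 1. synchronously append to the journal
    state(key) = value          // 2. only then mutate in-memory state
  }

  // After a crash, replaying the journal rebuilds the lost in-memory state.
  def recover(): Map[String, Int] =
    journal.foldLeft(Map.empty[String, Int]) { case (acc, record) =>
      val Array(k, v) = record.split("=")
      acc + (k -> v.toInt)
    }
}

val wal = new ToyWriteAheadLog
wal.put("offset", 42)
wal.put("offset", 43)
println(wal.recover())   // the replayed state keeps only the latest value per key
```

The price of this pattern is exactly the one discussed below: every write pays a synchronous trip to durable storage before the system can move on.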

Some observations

  • Write-ahead logs (WALs) are nothing but a fancy name for the journals that RDBMSs have been using for a long time now.
  • Once the WAL is configured (via spark.streaming.receiver.writeAheadLog.enable), all received data is synchronously written to it. Depending on your fault-tolerant storage deployment (let’s assume HDFS) and the volume of the incoming streams, the throughput of the entire system can be impacted.
  • Reliable receivers ACK ONLY after the data is written to the WAL; at that point it is safe to turn off in-memory replication.
  • Put together a cluster for Spark Streaming, a cluster for fault-tolerant storage, and a cluster manager: the hardware requirements for your Spark Streaming data pipeline are not as modest as often projected, or am I missing something?
  • Upgrade options for a Spark app: shut it down gracefully (the option is available), update the code, and start it again. But apps using checkpoints cannot leverage this upgrade option, at least not as of 1.4. Not very amusing.
  • Monitoring options for a Spark app: programmatic monitoring using a StreamingListener with events such as onBatchSubmitted, onBatchStarted, etc. sounds good.
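To make the monitoring point concrete, here is a minimal sketch of a StreamingListener that logs batch latencies. The listener class name and what it prints are my own choices; the trait, event types, and registration call are Spark’s:

```scala
// Minimal sketch: a StreamingListener that logs batch latencies.
import org.apache.spark.streaming.scheduler.{
  StreamingListener, StreamingListenerBatchCompleted, StreamingListenerBatchSubmitted}

class LatencyListener extends StreamingListener {
  override def onBatchSubmitted(submitted: StreamingListenerBatchSubmitted): Unit =
    println(s"batch ${submitted.batchInfo.batchTime} submitted")

  override def onBatchCompleted(completed: StreamingListenerBatchCompleted): Unit = {
    val info = completed.batchInfo
    // Both delays are Options, since a batch may not have been scheduled yet.
    println(s"batch ${info.batchTime}: scheduling delay = ${info.schedulingDelay} ms, " +
            s"processing delay = ${info.processingDelay} ms")
  }
}

ssc.addStreamingListener(new LatencyListener)   // register before ssc.start()
```

Watching the scheduling delay grow batch over batch is the usual first sign that the pipeline cannot keep up with the incoming stream.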

Closing thoughts

Spark, and hence Spark Streaming, works on top of data at rest. That data at rest is stored in memory in the form of RDDs, the primary abstraction Spark has to offer. The processing is sent toward the data.