Streaming applications need to be fault tolerant. They must be able to cope with system failures, JVM crashes and all that sort of horrible stuff. To cope with that, Spark Streaming uses the concept of checkpointing. That is, it stores metadata about which files it has processed so far, so that if it fails, it can pick back up where it left off.
When we checkpoint our stream, we save the configuration of the stream, the pending batches of data in the DStream, and the files that have already been processed to fault-tolerant storage such as HDFS, which Spark can use to recover from a failure.
The checkpoint stores some stream data too: the delta between the last checkpoint and now, so the data can be replayed to bring the application back up to speed. This can be quite a lot of data; how much really depends on the frequency of your checkpointing.
The more frequently you checkpoint, the less data you have to store, because fewer changes will have accumulated between the last checkpoint and a system failure, and the faster you'll be back up and running. On the other hand, frequent checkpointing adds I/O overhead to the disk and can hurt the performance of your job. So the frequency needs to be tuned to suit your needs.
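The checkpoint interval can be tuned per DStream. A minimal sketch, assuming an existing DStream called `lines` and that a checkpoint directory has already been set on the `StreamingContext`; the interval of `Seconds(50)` is an illustrative choice (a common rule of thumb is 5–10x the batch interval):

```scala
import org.apache.spark.streaming.Seconds

// Assumes `lines` is an existing DStream[String] and ssc.checkpoint(...)
// has already been called elsewhere.
val counts = lines.map(word => (word, 1)).reduceByKey(_ + _)

// Checkpoint this stream's data every 50 seconds; a longer interval means
// less I/O overhead but more data to replay after a failure.
counts.checkpoint(Seconds(50))
```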
Ultimately, checkpointing enables us to decouple our application, so that the failure of one component doesn't bring down the others. This is key in mission-critical applications that are processing data: you can't afford any data loss!
Implementing checkpointing couldn't be simpler. As below, you simply set the checkpoint location folder. In this case, I have called my checkpoint folder simply checkpoint (which I realize is a bit confusing, sorry 🙂).
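In the Scala API this amounts to a single call on the `StreamingContext`. A minimal sketch; the app name and the 10-second batch interval are my own assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("CheckpointExample")
val ssc = new StreamingContext(conf, Seconds(10))

// Point Spark Streaming at a fault-tolerant directory (ideally on HDFS).
// All checkpoint metadata and data will be written here.
ssc.checkpoint("checkpoint")
```

In production you'd normally pass an HDFS path (e.g. `hdfs://...`) rather than a local relative directory, since the checkpoint itself must survive a node failure.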
In a future post, I’ll talk about how we recover from failure using checkpointing!