In my last article, I showed an example of a Spark Streaming application for ingesting application log files.
Today, I am going to give an overview of DStreams in Spark Streaming. In subsequent articles, we’ll look in detail at some of the main Spark Streaming concepts.
We can ingest data from loads of sources in Apache Spark: a local drive, HDFS or S3, a Kafka stream, Flume, Kinesis and plenty more.
Spark then uses a concept of micro batching. You set a time window (e.g. 10 seconds) and Spark will batch together all the records it received in that time window and process them together.
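To make the batching idea concrete, here is a minimal plain-Python sketch. It is not Spark code: the function name `micro_batch` and the sample records are made up for illustration, and it only shows how records get grouped by the time window they arrived in.

```python
from collections import defaultdict


def micro_batch(records, batch_seconds):
    """Group (arrival_time_in_seconds, value) pairs into fixed-length batches."""
    batches = defaultdict(list)
    for arrival, value in records:
        # Records that arrive within the same window share a batch index.
        batch_index = int(arrival // batch_seconds)
        batches[batch_index].append(value)
    # Return the batches in time order.
    return [batches[i] for i in sorted(batches)]


# Hypothetical records: (seconds since the stream started, payload).
records = [(1.5, "a"), (4.0, "b"), (12.2, "c"), (19.9, "d"), (21.0, "e")]
print(micro_batch(records, batch_seconds=10))
# [['a', 'b'], ['c', 'd'], ['e']]
```

One simplification to be aware of: this sketch skips windows in which nothing arrived, whereas Spark still produces a (possibly empty) batch for every interval.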
Your data will now be in a Discretized Stream (or DStream for short). The DStream is a high level abstraction of a sequence of RDDs.
In the below, you can see a DStream example. Each micro batched record set produces an RDD. So, each RDD represents a particular time window (the time window you configure in your script).
RDD1 (time 0 to 1) -> RDD2 (time 1 to 2) -> RDD3 (time 2 to 3)
Operations on the DStream are operations on the underlying RDDs. We use a higher level API to interact with the DStream to make the management of all these RDDs that bit easier.
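A toy model may help here. The class below is purely illustrative (`ToyDStream` is an invented name, and plain Python lists stand in for RDDs), but it mirrors the idea that calling `map` or `filter` once on the stream applies that operation to every underlying batch.

```python
class ToyDStream:
    """A toy stand-in for a DStream: an ordered list of batches (plain lists)."""

    def __init__(self, batches):
        self.batches = [list(batch) for batch in batches]

    def map(self, fn):
        # Apply fn to every element of every batch.
        return ToyDStream([[fn(x) for x in batch] for batch in self.batches])

    def filter(self, predicate):
        # Keep only matching elements, batch by batch.
        return ToyDStream(
            [[x for x in batch if predicate(x)] for batch in self.batches]
        )


# Hypothetical log lines, already split into three time windows.
lines = ToyDStream([["INFO start", "ERROR disk full"], ["INFO ok"], ["ERROR timeout"]])
errors = lines.filter(lambda line: line.startswith("ERROR")).map(str.lower)
print(errors.batches)
# [['error disk full'], [], ['error timeout']]
```

The real DStream API in PySpark also exposes `map` and `filter`, so the calling code looks much the same; the difference is that Spark distributes each batch as an RDD rather than holding it in a local list.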
Give us an example?
Sure. In my previous post about the log monitoring script, I ingested text data into a stream. Let’s say we have the below files, with their modified times.
If I set my time window to be 2 minutes, I would have 3 RDDs in my DStream:
File1.log + File2.log
File3.log + File4.log
Okay, but how do those get picked up?
Spark Streaming monitors the directory path you provide. That can be either a full directory path or a glob pattern that uses wildcard characters to match particular files.
You apply a schema to the files on ingestion, so it’s important that they follow a consistent data format – avoiding any pesky errors.
The modification time of a file is what determines whether it is part of your time window or not. This poses an issue. When logs are written, they’re created at the beginning of the job and appended to until the end of the job. On some file systems, the modification time is set when the file is first written rather than when writing finishes, and once Spark has processed a file it does not reread it. So if the file's contents change after it has been picked up, it will not be reprocessed; the updates will be ignored.
There is a workaround for this. We should write our log files to an un-monitored directory. Once they’ve completed writing, we should move them to the directory that Spark is monitoring.
That is easier said than done – as there isn’t really a mechanism to check if the file is still being written. My approach is to use try and except mechanisms, in conjunction with traceback to write logs & then move them to the streaming monitor directory.
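As a rough illustration of that pattern (a sketch under my own assumptions, not the exact script from the previous post), the function below writes a log into a staging directory, records the traceback if the job fails, and moves the file into the monitored directory only once it is closed. The names `run_and_publish_log`, `staging_dir` and `monitored_dir` are hypothetical, and it assumes both directories sit on the same local file system so that the move is a single rename.

```python
import os
import tempfile
import traceback


def run_and_publish_log(job, log_name, staging_dir, monitored_dir):
    """Run job(), log to staging_dir, then move the finished log to monitored_dir."""
    staging_path = os.path.join(staging_dir, log_name)
    final_path = os.path.join(monitored_dir, log_name)

    with open(staging_path, "w") as log:
        log.write("job started\n")
        try:
            job()
            log.write("job succeeded\n")
        except Exception:
            # Record the full stack trace in the log rather than crashing.
            log.write("job failed\n")
            log.write(traceback.format_exc())

    # The file is closed at this point, so nothing more will be appended.
    # Assumption: refreshing the modification time to "now" helps the file
    # land in a current window instead of looking like an old file.
    os.utime(staging_path, None)
    # A rename within one file system makes the file appear in one step.
    os.replace(staging_path, final_path)
    return final_path


# Demo with temporary directories standing in for the real paths.
staging = tempfile.mkdtemp(prefix="staging_")
monitored = tempfile.mkdtemp(prefix="monitored_")
published = run_and_publish_log(lambda: 1 / 0, "File1.log", staging, monitored)
with open(published) as f:
    print(f.read())
```

Because the streaming job never sees the file until the rename, it only ever reads complete logs. On object stores such as S3, a rename is really a copy, so the timing and timestamps can behave differently; treat this as a local or HDFS-style sketch.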