Handling corrupted input files during ETL

Often, we develop ETL pipelines and they just work: they do exactly what we expect and produce the correct outputs. Then one day the script stops working, and the error points to a corrupted input file.

That’s pretty annoying – it pulls you away from something you enjoy, and now you’re troubleshooting an issue with a script you wrote months ago.

The easiest thing to do is to handle these issues proactively while you’re developing your ETL pipeline. This article looks at how to do that in Spark.

Handling this is quite simple. First, we add a setting to our session config or to our spark-submit command:

--conf spark.sql.files.ignoreCorruptFiles=true

This setting controls whether Spark ignores corrupted files. When it is set to true, the Spark jobs we submit will continue running and simply skip any corrupted files instead of failing.
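
For example, here is a minimal sketch of setting the same option in code when building the session rather than on spark-submit. The app name and input path are placeholders, not values from this article:

from pyspark.sql import SparkSession

# Build a session that skips corrupted files instead of failing the job.
spark = (
    SparkSession.builder
    .appName("etl-ignore-corrupt-files")  # hypothetical app name
    .config("spark.sql.files.ignoreCorruptFiles", "true")
    .getOrCreate()
)

# Any corrupted Parquet files under this (hypothetical) path are now skipped.
df = spark.read.parquet("/data/landing/events/")
df.count()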

The next thing is to capture the data that was ignored. To do that, we can add an option to our DataFrame reader:

.option('badRecordsPath', 'tmp/badRecs')

This writes a record of the corrupted files, along with the rows that could not be parsed, to the path we specify, so we can inspect them later.
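
Putting it together, here is a minimal sketch of a read that captures bad records. Note that badRecordsPath is a Databricks-specific reader option, so this assumes the job runs in an environment that supports it; the paths are placeholders, and the exact directory layout under the bad-records path may vary:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("etl-bad-records").getOrCreate()

# Read JSON input, routing unparseable records to a side location
# instead of failing the job. badRecordsPath is Databricks-specific;
# both paths here are hypothetical.
df = (
    spark.read
    .option("badRecordsPath", "/tmp/badRecs")
    .json("/data/landing/events/")
)
df.show()

# The bad records (and the source files they came from) are written as
# JSON under the badRecordsPath, so they can be inspected afterwards.
bad = spark.read.json("/tmp/badRecs/*/bad_records/*")
bad.show(truncate=False)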