In this article, we will talk about how we design ETL pipelines. There are two core approaches: functional and object-oriented.
In a functional design, we structure our code as small chunks (or functions). This gives us small, easily testable units of code for processing our data. These chunks can run one after another or in parallel, with potential efficiency gains from parallel execution.
Object-oriented approaches are better at modelling real-world scenarios. They let us create objects with a set of attributes and methods (the name for functions in object-oriented programming), and so take advantage of object-oriented concepts such as encapsulation and inheritance.
Functional Processing
First, let’s walk through an example of a functional processing pipeline. Here, we split our code into a number of chunks: Infrastructure (e.g. databases, file storage); Adapters (the connectivity to the infrastructure); and Application (the features the application requires).
So in a scenario where we want to read data from Amazon S3, process it in some way and write it back to a different S3 bucket, we might design our pipeline as below:
ADAPTERS:
The adapter functions handle connectivity to the data sources, both for reading from and writing to them.
| Function | Purpose |
| --- | --- |
| Extract from Amazon S3 | This function will take a number of arguments (e.g. bucket name, API keys) and will extract the data from the bucket and save it as a CSV. |
| Write to Amazon S3 | This function will write our transformed dataset back to Amazon S3 and will take a number of arguments (e.g. target bucket name). |
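As a rough illustration, the two adapter functions in the table above might be sketched as below. This assumes boto3 and pandas are available; the function names, argument lists and credential handling are illustrative choices rather than a prescribed interface.

```python
import boto3
import pandas as pd


def extract_from_s3(bucket_name: str, key: str, aws_access_key_id: str, aws_secret_access_key: str) -> pd.DataFrame:
    """Download a CSV object from S3 and return it as a DataFrame."""
    s3 = boto3.client(
        "s3",
        aws_access_key_id=aws_access_key_id,
        aws_secret_access_key=aws_secret_access_key,
    )
    response = s3.get_object(Bucket=bucket_name, Key=key)
    return pd.read_csv(response["Body"])


def write_to_s3(df: pd.DataFrame, target_bucket: str, key: str, aws_access_key_id: str, aws_secret_access_key: str) -> None:
    """Serialise the transformed DataFrame to CSV and upload it to the target bucket."""
    s3 = boto3.client(
        "s3",
        aws_access_key_id=aws_access_key_id,
        aws_secret_access_key=aws_secret_access_key,
    )
    s3.put_object(Bucket=target_bucket, Key=key, Body=df.to_csv(index=False).encode("utf-8"))
```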
APPLICATION:
These functions deal with the steps the application needs: extracting, transforming and loading the data.
| Function | Purpose |
| --- | --- |
| Extract | This calls the Extract from Amazon S3 adapter function, passing the required arguments. |
| Transform | This is a standalone function which takes the resulting data from the Extract function and manipulates it. |
| Load | This calls the Write to Amazon S3 adapter function, passing it the transformed dataset. |
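Assuming the adapter functions sketched above live in an adapters module, the application layer might look something like this. The transformation shown (dropping duplicate rows) is just a placeholder for whatever manipulation your pipeline actually needs.

```python
import pandas as pd

# Hypothetical module containing the adapter functions sketched earlier.
from adapters import extract_from_s3, write_to_s3


def extract(bucket_name: str, key: str, credentials: dict) -> pd.DataFrame:
    """Call the S3 adapter to pull the raw data."""
    return extract_from_s3(bucket_name, key, **credentials)


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Standalone transformation step; here we simply drop duplicate rows."""
    return df.drop_duplicates().reset_index(drop=True)


def load(df: pd.DataFrame, target_bucket: str, key: str, credentials: dict) -> None:
    """Hand the transformed dataset to the S3 writer adapter."""
    write_to_s3(df, target_bucket, key, **credentials)
```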
COORDINATION:
The final step is to run all of the required functions. For this, we will have a main function that passes the required arguments to the Extract, Transform and Load functions for execution.
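A minimal coordination function, assuming the extract, transform and load functions sketched above, might look like this. Bucket names, keys and credentials are placeholders.

```python
# Hypothetical import; the application functions sketched above.
from application import extract, transform, load


def main() -> None:
    # Placeholder credentials; in practice these would come from a secrets store or environment variables.
    credentials = {
        "aws_access_key_id": "YOUR_KEY_ID",
        "aws_secret_access_key": "YOUR_SECRET",
    }
    raw = extract("source-bucket", "raw/data.csv", credentials)
    cleaned = transform(raw)
    load(cleaned, "target-bucket", "processed/data.csv", credentials)


if __name__ == "__main__":
    main()
```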
Object Oriented Processing
Object-oriented programming is great when you’re working with real-world concepts, but classes are by no means a necessary construct in Python; they’re mostly used for code simplification, readability and reusability.
If we think about a social media ETL pipeline, we might create a class, here called Social, representing our connection to a social network. This class will have several methods (a full sketch of the class follows this list):
- Connect to the social network. The attributes of our class are username, password and API endpoint. This allows us to reuse the same connector class for Twitter, Facebook, Instagram and so on, creating a different instance of the class for each network:
facebook = Social('username', 'password', 'endpoint')
twitter = Social('username', 'password', 'endpoint')
- Clean up the social posts via a predefined method (e.g. removing stopwords, standardizing case, removing blacklisted words and so on…).
facebook.cleanup()
twitter.cleanup()
- Finally, we’ll have a method to save the output:
facebook.save()
twitter.save()
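Putting the pieces above together, a minimal sketch of such a class might look like the following. The connect step is stubbed out because every network's API differs, and the clean-up and save logic (lower-casing text, writing to a local CSV) is purely illustrative.

```python
import pandas as pd


class Social:
    """Hypothetical connector to a social network's post API."""

    def __init__(self, username: str, password: str, endpoint: str):
        self.username = username
        self.password = password
        self.endpoint = endpoint
        self.posts = pd.DataFrame(columns=["post"])

    def connect(self) -> None:
        """Authenticate against the API endpoint and fetch posts (stubbed here)."""
        # A real implementation would call the network's API using the credentials above.
        self.posts = pd.DataFrame({"post": ["  An EXAMPLE post  "]})

    def cleanup(self) -> None:
        """Illustrative clean-up: standardise case and strip surrounding whitespace."""
        self.posts["post"] = self.posts["post"].str.lower().str.strip()

    def save(self, path: str = "posts.csv") -> None:
        """Write the cleaned posts out, here simply to a local CSV file."""
        self.posts.to_csv(path, index=False)
```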
In this case, what is the object? Well, the object is the social network itself. We’ve used the OOP principles of classes and methods to define what we can do with the connection we’ve made to the API.
There are benefits to using an OO approach to ETL – like defining common interfaces. However, in my experience, it’s an overly complicated approach to what should be a functional pipeline.
My Optimal Design
I’ve designed lots of ETL pipelines over the years, some processing over 200 terabytes of data in batch and some processing a few gigabytes – we won’t talk about streaming pipelines just yet. My key takeaways from this experience are:
- Make your code modular for readability:
- Extract.py
- Transform.py
- Load.py
- ETL.py to co-ordinate
- Use functions and write unit tests for those functions. The best way I have found to do this is to produce sample data to pass into the function, where you already know the output you expect. You can then use pandas’ pd.testing.assert_frame_equal(expected, actual) to compare your expected result to the result the function actually produced – if they match, your test passes! You can then use unittest or pytest to automate all your testing (a sketch of this pattern follows this list).
- Bake error handling and monitoring into your pipeline.
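To illustrate the testing pattern from the unit-test bullet above, here is a small pytest-style test. It assumes a transform function like the one sketched earlier (dropping duplicate rows) lives in Transform.py; adjust the import to match your own module layout.

```python
import pandas as pd
import pandas.testing as pdt

# Hypothetical import; adjust to wherever your transform function lives.
from Transform import transform


def test_transform_drops_duplicates():
    # Sample input for which we know exactly what the output should be.
    sample = pd.DataFrame({"id": [1, 1, 2], "value": ["a", "a", "b"]})
    expected = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})

    actual = transform(sample)

    # assert_frame_equal raises with a descriptive diff if the frames differ.
    pdt.assert_frame_equal(expected, actual)
```

Run it with pytest and the test passes or fails depending on whether the transformed frame matches the expected one.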