As data engineers and data scientists, we spend a lot of time exploring data. When you’re working with huge datasets, you may need Apache Spark or something similar to conduct this exploratory analysis, but for the majority of use cases, Pandas is the de facto tool we choose.
The Pandas library has been so heavily adopted because of its simplicity: exploring your data is almost enjoyable, as the library is intuitive and easy to use. However, sometimes, just sometimes, it does become a bit tedious. Looking at relationships between many fields, filtering the dataframe time and time again to see the data from different angles, and creating tonnes of visualizations in Matplotlib or Seaborn is time consuming and a bit frustrating.
So here enters the world of Pandas GUIs. I’ve spoken about a couple of other GUIs in previous articles, both of which are open source. The option I am discussing today is a much more refined, commercialized equivalent to those libraries.
It’s called bamboolib and, as you can see from the snippet of code below, we don’t actually need to do anything to our normal Pandas code except import the bamboolib library at the beginning.
import bamboolib as bam
import pandas as pd
path = '/home/Datasets/diabetes2.csv'
df = pd.read_csv(path)
df
When you show your dataframe, the bamboolib interface will load within your Jupyter notebook. An example of that interface is below, using our Diabetes dataset which I’ve used in previous articles (sourced from Kaggle).
We can do quite a few things to our dataframe using the tool. We can select the columns we want, filter it, sort it, and group by / aggregate the data. Let’s test filtering it.
In the example below, I have chosen to select rows where the pregnancies column is greater than 1. We could instead drop all the rows where that condition is met.
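For reference, column selection and sorting look roughly like this in plain Pandas. This is only a sketch: the rows are made up for illustration, and the column names (Pregnancies, Glucose, Age) are assumed to match the Kaggle file.

```python
import pandas as pd

# Made-up rows for illustration; column names are assumed to match the Kaggle file.
df = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 0],
    "Glucose": [148, 85, 183, 137],
    "Age": [50, 31, 32, 33],
})

# Keep only the columns of interest.
selected = df[["Pregnancies", "Age"]]

# Sort by age, oldest first.
sorted_df = selected.sort_values(by="Age", ascending=False)
print(sorted_df)
```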
The most valuable piece of this tool by far is that it actually writes the code for you. When I hit execute in the above filter pane, my code is updated, as below, to include a filter condition. How cool is that… Imagine the time you will save, when doing your analysis!
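To give a feel for the kind of code involved, here is a minimal hand-written sketch of that filter in plain Pandas. It is not the exact code bamboolib writes, which may differ; the rows are made up and the Pregnancies column name is an assumption based on the Kaggle file.

```python
import pandas as pd

# Made-up rows for illustration; the column name is assumed from the Kaggle file.
df = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 0],
    "Age": [50, 31, 32, 33],
})

# Boolean mask: True where the pregnancy count is greater than 1.
mask = df["Pregnancies"] > 1

# Select the rows that meet the condition...
kept = df.loc[mask]

# ...or drop them instead by inverting the mask.
dropped = df.loc[~mask]

print(kept)
print(dropped)
```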
We can also choose to do some aggregations. In the example below, I’ve grouped by Age and calculated the average pregnancy count for each age.
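In plain Pandas, that aggregation could be sketched as follows. The rows are made up, and the output column name avg_pregnancies is my own choice for illustration rather than anything the tool produces.

```python
import pandas as pd

# Made-up rows for illustration; column names are assumed from the Kaggle file.
df = pd.DataFrame({
    "Age": [50, 31, 50, 31],
    "Pregnancies": [6, 1, 8, 0],
})

# Group by age and take the mean pregnancy count within each group.
# "avg_pregnancies" is a hypothetical name for the new column.
avg_by_age = (
    df.groupby("Age")
      .agg(avg_pregnancies=("Pregnancies", "mean"))
      .reset_index()
)
print(avg_by_age)
```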
Next, we have the graphing functionality and I really can’t express how impressed I am with this. Here, we choose our chart type, we can set a whole bunch of config options and as you can see, it spits out the code you’d need to use to generate that chart.
And quite frankly, it’s not a bad looking chart.
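If you were writing a chart like this by hand, a minimal Matplotlib sketch might look like the one below. The histogram of Age is only a stand-in, the rows are made up, and the code bamboolib generates may use a different plotting library altogether.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows for illustration; the column name is assumed from the Kaggle file.
df = pd.DataFrame({"Age": [21, 25, 33, 41, 50, 62, 29, 37]})

# Draw a histogram of the Age column with five bins.
fig, ax = plt.subplots()
ax.hist(df["Age"], bins=5)
ax.set_xlabel("Age")
ax.set_ylabel("Count")
ax.set_title("Distribution of Age")

# In a Jupyter notebook the figure renders at the end of the cell;
# in a standalone script you would call plt.show() here.
```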
The final thing I am going to look at is the data summaries. Here, you can see some high-level information about each field. Using the tabs, you can also see the predictive power of each of the features and the correlations.
You can also drill down into each field to see more detailed information about it, which again significantly reduces the time it takes to complete the analysis. We can even change the bin sizes of the data simply by using a slider.
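The closest plain-Pandas equivalents are describe() for the summary statistics and corr() for the correlations. The sketch below covers only those two and does not attempt the predictive power tab; the rows are made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration; column names are assumed from the Kaggle file.
df = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 0],
    "Glucose": [148, 85, 183, 137],
    "Age": [50, 31, 32, 33],
})

# Count, mean, standard deviation, min, quartiles and max for each numeric column.
summary = df.describe()

# Pairwise Pearson correlation between the numeric columns.
correlations = df.corr()

print(summary)
print(correlations)
```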
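Changing the bin size by hand means re-running the binning with a different bin count each time. A rough sketch using pd.cut is below; bin_counts is a hypothetical helper of my own and the ages are made up.

```python
import pandas as pd

# Made-up ages for illustration.
ages = pd.Series([21, 25, 33, 41, 50, 62], name="Age")


def bin_counts(series, bins):
    """Split the values into `bins` equal-width intervals and count each one."""
    return pd.cut(series, bins=bins).value_counts().sort_index()


# Re-running with a different bin count is the manual version of the slider.
print(bin_counts(ages, 2))
print(bin_counts(ages, 4))
```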
The art of data science and data engineering is not about being able to write every line of code. It’s about understanding your data, understanding relationships and what those relationships mean in relation to the questions you are trying to answer with the data.
This tool speeds up the process of answering all of those questions, giving the business more insight, faster. The real selling point for me is that it generates the code for you. That is a game changer.