ML Series: Exploring Our Data

In the previous article, I highlighted the phases below as part of the ML workflow.

  1. Data cleansing, formatting
  2. Data exploration
  3. Feature engineering and feature selection
  4. Initial machine learning model implementation
  5. Comparison of different models
  6. Hyperparameter tuning on the best model
  7. Evaluation of the model accuracy on testing data set
  8. Understanding the model

In today’s article, we’re going to look at data exploration in a bit of detail. There is some overlap here with the data cleaning and formatting tasks we discussed in the previous article. You can think of these two things as a single task, but for simplicity’s sake it’s easier to explain them in two smaller articles.

Data exploration is a critical part of the machine learning workflow. Without understanding exactly what you’re working with, it’s going to be tough to determine whether particular features are going to contribute heavily towards your model accuracy or not. Exploring the data also makes it evident very quickly if you have outliers or poor-quality data in certain areas.

The question then becomes: what are we looking for and what is the easiest way to find it? Well, we don’t know what we’re looking for exactly, that’s the point. We need to see what we have and determine if anything doesn’t quite look right.

Aggregation & Analysis

Let’s take the example of a school register. Each day, we have a list of all the students, the class they’re in and whether they attended or not. We then do some work in Pandas or in a visualization tool like Tableau to aggregate the data and calculate the total number of students in attendance for each day. We get the graph below.

Clearly, there is a problem here. Why would students not have turned up on Friday? We can then dig further into the data and determine whether this is a data quality issue, a missing data issue or a genuine reflection of reality.

Understanding the data you have and any rules around that data is key. For example, it may simply be the case that the school has an event every Friday, which means fewer students register for their classes. That is seasonality. As a data scientist, understanding these rules before building the model is super useful and will save you a whole bunch of headaches later down the line.
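To make the register example a little more concrete, here is a minimal sketch of that aggregation in Pandas. The data, the column names (day, student, class_name, attended) and the 50% threshold are all made up for illustration; they are not taken from a real register.

```python
import pandas as pd

# Hypothetical register: one row per student per day (made-up data).
register = pd.DataFrame({
    "day": ["Mon", "Mon", "Mon", "Tue", "Tue", "Tue", "Fri", "Fri", "Fri"],
    "student": ["Ann", "Ben", "Cat"] * 3,
    "class_name": ["A", "A", "B"] * 3,
    "attended": [True, True, True, True, True, False, False, False, False],
})

# Total number of students who attended on each day.
# sort=False keeps the days in the order they first appear.
daily_attendance = register.groupby("day", sort=False)["attended"].sum()

# Flag days whose attendance falls well below a typical (median) day.
# The 50% cut-off is an arbitrary choice for this sketch.
threshold = 0.5 * daily_attendance.median()
suspicious_days = daily_attendance[daily_attendance < threshold]

print(daily_attendance)
print(suspicious_days)
```

A flagged day is only a prompt to go and look. It could be a data quality issue, or it could be a genuine pattern like the Friday event described above.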

Finding relationships between features

So we can aggregate our data and visually check that it looks OK, but we can also determine the relationship between two features. Why is this useful? Well, if you have two features that are heavily correlated with each other and you have a dimensionality challenge, you may choose to drop one of them from the model, as its inclusion adds little value over the feature it so closely correlates with.

We can build this view with a correlation matrix like the one below, which I used in our Wine ML project post. It’s always valuable to keep the target (or truth, or label) value in this matrix, as you can see whether a particular feature has a very high correlation with the outcome by itself.

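From this correlation matrix you can identify the fields most strongly related to the target, which are likely to impact your model the most, and those which mostly add noise without adding much value. The latter can be dropped to assist with dimensionality reduction. Bear in mind that a standard correlation coefficient only measures linear relationships, so a low value does not prove a feature is useless.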
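As a rough sketch of how this could look in Pandas, the example below builds a correlation matrix, ranks features by their correlation with the target and lists features that are near-duplicates of another feature. The data is randomly generated, the column names (alcohol, alcohol_copy, acidity, quality) are hypothetical stand-ins rather than the actual Wine project data, and the 0.95 cut-off is an assumption you would tune for your own dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only (not the Wine project dataset).
rng = np.random.default_rng(0)
n = 200
alcohol = rng.normal(10, 1, n)
df = pd.DataFrame({
    "alcohol": alcohol,
    "alcohol_copy": alcohol * 2 + rng.normal(0, 0.01, n),  # near-duplicate feature
    "acidity": rng.normal(3, 0.5, n),                      # unrelated to the target
    "quality": alcohol + rng.normal(0, 0.5, n),            # target / label
})

# Pearson correlation matrix, keeping the target column in the matrix.
corr = df.corr()

# Absolute correlation of each feature with the target, strongest first.
target_corr = corr["quality"].drop("quality").abs().sort_values(ascending=False)

# Look only at feature-vs-feature correlations, upper triangle,
# so each pair is considered once.
features = corr.drop(index="quality", columns="quality")
mask = np.triu(np.ones(features.shape, dtype=bool), k=1)
upper = features.where(mask)

# Candidate features to drop: highly correlated with an earlier feature.
to_drop = [col for col in upper.columns if (upper[col].abs() > 0.95).any()]

print(target_corr)
print(to_drop)
```

Which feature of a correlated pair you drop is a judgment call. This sketch simply keeps whichever column appears first.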

Analysing a single feature in depth

The other key task is to go through each of those key features you identified from your correlation matrix and to determine a few things:

  1. What values live in the fields
  2. What is the range of values for the feature
  3. How is the data distributed
  4. What percent of the datapoints are nulls
  5. What percent of the datapoints are unique

Armed with all this information, you can really ensure that you fully understand your dataset and go into model creation with the best possible opportunity to make it a success.
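The five checks in the list above can be sketched as a small helper. The function name profile_feature and the example values are hypothetical, and the sketch assumes a non-empty column whose values can be ordered (numbers, dates or strings).

```python
import pandas as pd

def profile_feature(series: pd.Series) -> dict:
    """Summarise one feature: values, range, distribution, nulls, uniqueness."""
    non_null = series.dropna()
    return {
        # 1. What values live in the field (five most common).
        "top_values": non_null.value_counts().head(5).to_dict(),
        # 2. The range of values.
        "min": non_null.min(),
        "max": non_null.max(),
        # 3. How the data is distributed (count, mean, quartiles for numbers).
        "distribution": non_null.describe().to_dict(),
        # 4. Percentage of datapoints that are null.
        "pct_null": 100 * series.isna().mean(),
        # 5. Percentage of datapoints that are unique (nulls not counted as a value).
        "pct_unique": 100 * series.nunique() / len(series),
    }

# Made-up example column with one missing value.
example = pd.Series([1.0, 2.0, 2.0, None, 5.0])
profile = profile_feature(example)
print(profile)
```

For a numeric feature, a histogram is usually a quicker way to see the distribution than the summary numbers alone.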

Carrying out cleansing

Finding nulls, removing duplicates and all the other great things we discussed in the previous article are all part of exploration. Make sure you have a read and get your data cleaned as part of this project phase.
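As a quick reminder of what that looks like in practice, here is a minimal sketch of counting nulls and removing duplicates in Pandas. The table and its column names are made up, and whether you drop or fill rows with missing values depends on your own data.

```python
import pandas as pd

# Made-up register extract with one duplicated row and one missing value.
df = pd.DataFrame({
    "student": ["Ann", "Ben", "Ben", "Cat"],
    "day": ["Mon", "Mon", "Mon", "Mon"],
    "attended": [True, False, False, None],
})

# Number of nulls in each column.
null_counts = df.isna().sum()

# Number of rows that exactly repeat an earlier row.
duplicate_rows = int(df.duplicated().sum())

# Drop exact duplicates, then drop rows where attendance is unknown.
cleaned = df.drop_duplicates().dropna(subset=["attended"])

print(null_counts)
print(duplicate_rows)
print(cleaned)
```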