When we’re working through a data science problem, there are a few main steps we need to take. These are outlined below:
- Data cleansing, formatting
- Data exploration
- Feature engineering and feature selection
- Initial machine learning model implementation
- Comparison of different models
- Hyperparameter tuning on the best model
- Evaluation of model accuracy on the test dataset
- Understanding the model
Whoa! That’s a lot of stuff. Let’s go through each stage, one at a time. This article will focus on data cleansing and formatting. Tomorrow, we will look at data exploration.
So, data cleansing… the problem with most data sources is that they’re pretty messy. We’ll look at some of the most common problems and solutions to those problems.
Missing Values
First then, we could have missing values in our dataset. We need to decide how to handle these missing values. There are a number of options. We could:
- Simply delete all rows that contain one or more missing values
- Fill the value in the missing cell with the value immediately preceding it
- Fill the cell with a static value (e.g. 0)
- Fill the cell with a computed value (e.g. the mean of all non-null values in the column)
The way you handle missing values will be largely influenced by your specific use case and project. If you have so much data that dropping the handful of rows with missing values won’t harm your analysis, you could do that. If you need every possible record & the missing values are in a column which won’t heavily weight your algorithm, then maybe filling with the mean is OK.
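As a rough sketch, here’s how those options might look in pandas (the column names and values are made up purely for illustration):

```python
import pandas as pd

# A tiny example dataframe with some missing values (made-up data)
df = pd.DataFrame({
    "age": [34, None, 52, 41],
    "salary": [42000, 55000, None, 61000],
})

# Option 1: drop any row containing one or more missing values
dropped = df.dropna()

# Option 2: fill each missing cell with the value immediately preceding it
forward_filled = df.ffill()

# Option 3: fill with a static value
zero_filled = df.fillna(0)

# Option 4: fill with a computed value, e.g. the mean of the non-null values in each column
mean_filled = df.fillna(df.mean(numeric_only=True))
```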
Duplicate Data
The next problem is duplicate data. It’s annoying, but it happens. Sometimes we find that the same record has been input multiple times. This is not ideal: you don’t want to skew your model based on 100 customers with criteria X which turns out to be just 2 customers duplicated 50 times each. So, we can drop duplicate records from our input dataset.
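A quick sketch of how that might look in pandas (the example data is made up):

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "plan": ["basic", "basic", "premium", "basic"],
})

# Keep the first occurrence of each fully identical row
deduped = df.drop_duplicates()

# Or treat rows as duplicates based on a subset of columns (e.g. a customer ID)
deduped_by_id = df.drop_duplicates(subset=["customer_id"], keep="first")
```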
Anomalies
Next, we need to start actually exploring the data. In doing so, we may find some anomalies which are going to skew our outputs. We can remove anomalies in a number of ways, summarised below.
We could keep only values within X standard deviations of the mean. Statistically, 99.7% of normally distributed data falls within 3 standard deviations of the mean, so it is not unreasonable to treat values outside of 3 standard deviations as outliers.
Another method is to drop the top and bottom 5% of the data, which will also reduce skew.
Whichever method you use, you run the risk of eliminating valid data points, but sometimes this is a necessary evil.
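Here’s a rough sketch of both approaches in pandas, assuming a single numeric “sales” column purely for illustration:

```python
import pandas as pd

df = pd.DataFrame({"sales": [120, 135, 128, 5000, 131, 140, 2, 125]})

# Approach 1: keep values within 3 standard deviations of the mean
mean, std = df["sales"].mean(), df["sales"].std()
within_3_std = df[(df["sales"] - mean).abs() <= 3 * std]

# Approach 2: drop the top and bottom 5% of values
lower, upper = df["sales"].quantile([0.05, 0.95])
trimmed = df[df["sales"].between(lower, upper)]
```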
Dealing with Dates
Having a date in our dataset can really improve our predictions. Imagine an ice cream company: their data has a seasonal aspect (more sales in summer), so predicting sales without dates in our data would be inaccurate.
The problem is though, dates are horrible to work with. The first of June 2020 could be displayed as any of the below – and these are just a few examples!
- 01/06/2020
- 01/06/20
- 1/6/20
- 06/01/2020
- 06/01/20
- 6/1/20
- 01-06-2020
- 06-01-2020
- 01-Jun-2020
- 01-06-2020 08:00:22.11
Machine learning models don’t accept dates as features; to a model they’re just meaningless strings. So we need to split those dates out into separate features (e.g. create year, month and day columns).
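A minimal sketch of that in pandas (the “sale_date” column name and values are assumptions for the example):

```python
import pandas as pd

df = pd.DataFrame({"sale_date": ["01/06/2020", "15/07/2020", "03/08/2020"]})

# Parse the strings into proper datetimes (dayfirst=True for DD/MM/YYYY style dates)
df["sale_date"] = pd.to_datetime(df["sale_date"], dayfirst=True)

# Split the date out into separate numeric features
df["year"] = df["sale_date"].dt.year
df["month"] = df["sale_date"].dt.month
df["day"] = df["sale_date"].dt.day
```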
Structural Issues
Next, we need to look out for potential structural issues. This happened to me recently: I had a text field in the table, and in the text someone had written \t. Now, \t is the escape sequence for a tab character, so when ingesting my tab-delimited file, it mis-aligned a tonne of fields. Really not ideal, & it took a while to spot. Being aware of this kind of issue is absolutely essential.
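One cheap sanity check (a rough sketch only; the file name and delimiter here are assumptions) is to count the fields on each raw line before trusting the parsed output:

```python
# Count fields per line of a tab-delimited file to spot mis-aligned rows
expected_fields = None
with open("export.tsv", encoding="utf-8") as f:
    for line_number, line in enumerate(f, start=1):
        field_count = len(line.rstrip("\n").split("\t"))
        if expected_fields is None:
            expected_fields = field_count  # the header row sets the expectation
        elif field_count != expected_fields:
            print(f"Line {line_number} has {field_count} fields, expected {expected_fields}")
```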
Let’s look at our data
When we’re working with our ML models we will need to:
- Convert our categorical data to numeric values, as most machine learning models do not accept categorical values directly.
- Scale our numerical data.
Scaling is a method used to standardise the range of data. This is important because if one field stores age (between 18 and 90) and another stores salary (between 10,000 and 200,000), the machine learning algorithm might bias its results towards the larger numbers, as it may assume they’re more important. The scikit-learn documentation states that “If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.”
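As a rough example, scaling with scikit-learn’s StandardScaler might look like this (the columns and values are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "age": [18, 35, 52, 90],
    "salary": [10000, 45000, 120000, 200000],
})

# Fit on the training data only, then reuse the same scaler on the test set
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
```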
Many machine learning algorithms don’t accept or understand strings, and those that do still tend to perform better & produce more accurate results with numeric inputs. Let’s say we have this data in a column of our dataset: [‘red’, ‘blue’, ‘green’, ‘yellow’].
We want to encode these values as numbers to get the best possible outcome from our model. If we use label encoding, we can simply map each value to an integer: [1, 2, 3, 4].
The problem, though, is that numbers imply a relationship. Think about a salary field: the bigger it is, the better. Or a risk field: the higher the number, the higher the risk. But in our example, just because yellow is colour 4 doesn’t make it 4 times better than red, so this approach can lead to more weight being given to certain colours & hence can bias our model.
So the lesson here is this: if your values have no relationship to one another (just like colours; red has no relationship to yellow), then you should not use label encoding. However, if your values are ordered (e.g. risk, where risk 5 is higher priority and should be given more weight than risk 1), then label encoding is appropriate.
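If your values are ordered, label encoding is simple enough; here’s a minimal sketch in pandas (the “risk” column is assumed purely for illustration):

```python
import pandas as pd

df = pd.DataFrame({"risk": ["low", "medium", "high", "low"]})

# Map each ordered category to an integer that reflects its rank
risk_order = {"low": 1, "medium": 2, "high": 3}
df["risk_encoded"] = df["risk"].map(risk_order)
```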
If there is no relationship between your values, you should use one hot encoding instead. This creates a new binary column for each category, with a 1 where the row belongs to that category and a 0 otherwise.
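A quick sketch of one hot encoding the colour example with pandas:

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "blue", "green", "yellow"]})

# One new column per colour, with 1 where the row matches that colour and 0 elsewhere
one_hot = pd.get_dummies(df, columns=["colour"], dtype=int)
```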