## An introduction to terminology

Before we get into this section, I would like to introduce you to some terminology which will appear throughout.

**Cardinality** is a measure of the size of a set. In data terms, it is the number of unique values in a field. Categorical fields with very high cardinality can add noise to the data, especially when some values occur rarely.
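As a quick illustration, here is a minimal sketch of measuring the cardinality of a field using only the standard library; the field name and values are hypothetical examples.

```python
# Measure the cardinality of a hypothetical 'city' field.
from collections import Counter

city = ["London", "Paris", "London", "Berlin", "Paris", "London"]

cardinality = len(set(city))   # number of unique values in the field
frequencies = Counter(city)    # how often each value occurs

print(cardinality)             # 3
print(frequencies["Berlin"])   # 1 -- a rare value that may add noise
```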

**Classification** models seek to predict a class or category (e.g. YES or NO).

**Curse of Dimensionality** describes how distances behave as we add dimensions: two data points that are very close together in 1D space can split apart when we move to a multi-dimensional space. So two points that sit right next to one another in 1D may be very far apart in 3D. This makes identifying relationships between features harder.
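A minimal sketch of this effect: the same two points, padded with extra (entirely made-up) feature values, end up much further apart once more dimensions are added.

```python
# Distances between the same pair of points in 1D and in 3D.
import math

a_1d, b_1d = (1.0,), (1.2,)                      # very close in 1D
a_3d = (1.0, 0.0, 5.0)                           # same first feature,
b_3d = (1.2, 4.0, 0.0)                           # plus two more dimensions

print(math.dist(a_1d, b_1d))  # 0.2
print(math.dist(a_3d, b_3d))  # ~6.4 -- the points have drifted far apart
```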

**Features** are the columns we ingest into the machine learning model.

**Labels** are the ‘truth’ in our training dataset. For example, if we are using data to build a model to predict whether someone has diabetes, then whether the patient actually had diabetes is the label (the answer).

**Loss** describes how poor a model’s prediction was. If the model predicted a result of 10 and the real outcome was 4, the loss was 6. We train our model (by adjusting its weights and parameters) to reduce loss as much as possible.
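A minimal sketch of two common ways to aggregate loss over a set of predictions, using made-up predictions and outcomes purely for illustration:

```python
# Two common loss measures over a batch of predictions.
predictions = [10.0, 3.0, 7.0]
actuals     = [4.0, 3.0, 5.0]

# Mean Absolute Error: the average size of the mistakes.
mae = sum(abs(p - a) for p, a in zip(predictions, actuals)) / len(actuals)

# Mean Squared Error: penalises large mistakes more heavily.
mse = sum((p - a) ** 2 for p, a in zip(predictions, actuals)) / len(actuals)

print(mae)  # (6 + 0 + 2) / 3 ≈ 2.67
print(mse)  # (36 + 0 + 4) / 3 ≈ 13.33
```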

**Models** define the relationship between the features & the label. In other words, what happens to X to predict Y.

**Multicollinearity** is where our input variables (say x and z) are highly correlated with one another. Understanding the effect of x on y is then tough, because it’s hard to tell whether it was x or z that affected y. This is mostly a problem for feature selection and interpretability; it does not generally impact model accuracy.
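One common way to spot multicollinearity is to compute the correlation between input fields. A minimal sketch with a hand-rolled Pearson correlation and two made-up fields that track each other closely:

```python
# Pearson correlation between two (made-up) input fields.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]             # e.g. a hypothetical 'house size' field
z = [2.1, 3.9, 6.2, 8.0, 9.8]   # e.g. 'number of rooms' -- tracks x closely

print(pearson(x, z))  # close to 1.0: a warning sign for multicollinearity
```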

**Normalization** rescales a numeric field onto a common scale, typically 0 to 1. The closely related technique of standardization transforms the field so that it has a mean of 0 and a standard deviation of 1.
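A minimal sketch of both techniques on a made-up field of values:

```python
# Min-max normalization and z-score standardization of a small field.
values = [10.0, 20.0, 30.0, 40.0]

# Min-max normalization: rescale onto the 0-1 range.
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

# Standardization: subtract the mean, divide by the standard deviation.
mean = sum(values) / len(values)
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
standardized = [(v - mean) / std for v in values]

print(normalized)    # [0.0, 0.333..., 0.666..., 1.0]
print(standardized)  # has mean 0 and standard deviation 1
```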

**Overfitting** is where we create a model that fits the training data too perfectly, capturing its noise rather than the underlying pattern. This means that when it sees new data, it is unable to make accurate predictions.

**Regression** models seek to predict a continuous value.

**Reliability** is the level to which you can trust the data you have. If your data is unreliable (labels are incorrectly assigned, there are many anomalies, or a large proportion of values are null), it will be unable to provide useful predictions.

**Underfitting** is the opposite of overfitting: the model can’t make accurate predictions because it has not captured enough of the complexity of the training data.

## Machine Learning Introduction

Data analytics can give us incredibly useful insights about our businesses. If we consider the example of retail sales, we may notice a trend that shows that every October sales spike and every June they fall. This analysis and subsequent identification of seasonality leads to better stock planning by the business and hence ensures that they have enough stock to supply the demand in the busy season and ensures that they aren’t holding stock that they don’t need during the quieter business periods.

But what if we wanted to predict the number of sales we expect to see? This would require a machine learning algorithm. This process would begin by enriching your sales data with other datapoints which could influence the number of units sold. For example:

- Weather may influence the sales of your product. If we have an atypically warm October, could that lead to lower sales? Conversely, if we had a cold spring could that push up sales in an otherwise quieter period?
- If you are selling high value items, the availability of credit and the current interest rates may influence your sales significantly.
- Competitors’ special offers and sales may influence your revenue, as customers can find a better offer elsewhere.

By enriching your historical data with this information, our machine learning algorithm will be able to identify relationships between all these inputs and your sales values (the thing you are trying to predict), and so provide a good estimate of future sales.

Note: in machine learning, we call all the input data ‘features’. In our example: date, weather information, interest rates and competitors offers would be features. These are the bits of information we believe will influence the sales value for that given date. It is our hypothesis that the two below rows would result in different sales values:

In the above, the perception is that a sunny day, with low interest rates and no competing offers, would result in higher sales for our business. So, our algorithm will find the relationship between these four fields and the target field. The target is the thing we are trying to predict; in this case, it’s sales.

We can summarise by saying that: machine learning uses statistical techniques to give computer systems the ability to ‘learn’ rather than being explicitly programmed. By learning from historical inputs we’re able to achieve far greater accuracy in our predictions & constantly refine the model with new data.

Let’s look at the process to train a machine learning model. First, we split our dataset into training and test sets. In the below dataset, we would split 80% of it as a training set (blue) and 20% as a test set (yellow).

As below, the training set is supplied with the outcome (the target value), so the model can determine the relationship between Column1, Column2 and the outcome.

The test set does not include the outcome (as below); the model tries to predict what the outcome would be, based on the relationships it identified from your training data set.

The model then makes predictions based on the test dataset (as below). We then compare those predictions against what really happened to determine how accurate the model is. In the below example, the model has 50% accuracy, as it made one correct prediction and one incorrect prediction.
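The steps above can be sketched end-to-end in a few lines. This is a deliberately simplified, hand-rolled illustration: the dataset is made up and the "model" just predicts the majority outcome seen in training (a real project would use a proper algorithm, e.g. from scikit-learn).

```python
# An 80/20 train/test split, a trivial "model", and an accuracy check.
from collections import Counter

# Made-up rows of (feature1, feature2, outcome).
data = [
    (1, 5, "YES"), (2, 4, "YES"), (3, 3, "NO"), (4, 2, "YES"),
    (5, 1, "YES"), (6, 0, "NO"), (7, 2, "YES"), (8, 3, "NO"),
    (9, 4, "YES"), (10, 5, "NO"),
]

# Split: 80% training (outcomes included) / 20% test (outcomes held back).
split = int(len(data) * 0.8)
train, test = data[:split], data[split:]

# "Train": remember the most common outcome in the training set.
majority = Counter(row[2] for row in train).most_common(1)[0][0]

# Predict on the test set, then compare against what really happened.
predictions = [majority for _ in test]
actuals = [row[2] for row in test]
correct = sum(p == a for p, a in zip(predictions, actuals))

accuracy = correct / len(test)
print(accuracy)  # 0.5 -- one correct and one incorrect prediction
```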

If the model is not accurate enough, we work to improve it, otherwise we deploy it and start using it in the field.

The below diagram describes the process we just covered visually. Note that this diagram is designed to cover supervised learning models. The reason for this is that in practice, we’re almost always implementing supervised models.

We will delve into these concepts further later in the chapter and we will also review some of the most common machine learning algorithms that we can apply to our data to solve our business problem.

## Stages of a Machine Learning project

The first stage in any project, irrespective of whether it’s a machine learning initiative, is to fully understand the problem. This isn’t just about understanding the task (‘to build a model to predict the likelihood of X’); it’s also about understanding why we need the model and the business value of the output.

This is really critical, as without understanding why you’re doing it and who’s going to take action on the output, how do you know how best to deliver the model? It can only be your best guess. So, you need to apply your business domain expertise to understand the use-case for the model output and what the business hopes to achieve from it.

Next, we have a research stage. This is all about how you may go about solving the problem. If your industry has a well-documented and well-accepted ‘best practice’ or ‘industry standard’ to solve your problem, then you don’t need to reinvent the wheel. So, this is about understanding what’s already available and what existing research you could reuse. In the event that there are no off-the-shelf solutions, it’s about researching other methodologies you could use to solve the problem, understanding what data you would need to obtain to do so, and so on.

Data exploration is one of the most interesting phases of the project. There is no real framework or process for doing this – it’s looking through the data in a relatively unstructured manner to find patterns or characteristics which may be of interest. We can look at some initial correlations, data completeness, accuracy, duplication, distributions and so on to get a good understanding of what we have to work with.

Now we get into the data pre-processing. We speak about this in an upcoming section, but this is about getting your data ready for the machine learning model: handling missing data and duplicate data, creating new features upon which you’ll be able to make better correlations and predictions, and preparing your data to be ingested by the machine learning model you’ve chosen.

Based on the data available to you, you will have a view on which model you want to use. If you are predicting a continuous value, you want a regression model; if you are predicting a category, you’ll want a classifier. Each model (some of which are discussed below) has its advantages and disadvantages; based on these, you will choose the model most suitable to your problem statement.

Next we train, test and tune the model. This is the process of showing the model some historic data so that it can learn the relationships within it. We then provide the model with some new (unseen) data to test how accurately it predicts the outcome. If the outcome is not very good, we tune the model to make it better.

Finally, we release the model to the business to be used to enhance decision making.