In the ML Monday series, I aim to go through some of the theory behind machine learning & then move into practical application. Writing machine learning code is relatively straightforward, but understanding why we choose a particular model, and what that model actually does, is really important.

**Intro to linear regression**

Linear regression provides a rule which enables us to make predictions of Y based on X. You can think of linear regression as adding a line of best fit to a scatter plot, just like you may have done at school. The line's position is chosen so that the sum of the squares (the squared vertical distances between the datapoints & the line) is as small as possible.

The equation for the linear regression line is $\hat{Y} = bx + a$. Here:

- Y hat = the estimated value (i.e. the point on the line)
- b = the slope of the line
- x = the point on the x axis
- a = y intercept (where the line of best fit crosses the y axis)

We need to calculate b and a. To do that, we need x squared, y squared and x multiplied by y for each datapoint, along with the mean of X and the mean of Y (depicted below as X bar and Y bar).

Using the above data, we can calculate b by populating the formula below. The formula looks a little scary, but I’ve worked through it step by step below:
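For reference, one common way of writing the slope formula in terms of the quantities listed above is sketched here. This is an assumption about the form being used, since several algebraically equivalent versions exist. The symbol n (the number of datapoints) is introduced here for illustration.

```latex
b = \frac{\sum xy - n\,\bar{X}\,\bar{Y}}{\sum x^{2} - n\,\bar{X}^{2}}
```

The numerator measures how X and Y vary together, and the denominator measures how much X varies on its own.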

So now we know b. It’s 1.07. Let’s now work out a, which requires us to know b, so it must be done second. a is the y intercept: the value of Y when X is zero. Imagine we were talking about the effect of education on earning potential. The y intercept would be the minimum wage it’s possible to earn, so would not be zero.
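As a rough sketch of the two-step calculation (slope first, then intercept), the Python below uses a small made-up dataset, not the data from this post. The function name `fit_line` and the sample values are assumptions for illustration only.

```python
def fit_line(xs, ys):
    """Return (b, a) for the least squares line y_hat = b*x + a."""
    n = len(xs)
    x_bar = sum(xs) / n                          # mean of X
    y_bar = sum(ys) / n                          # mean of Y
    sum_xy = sum(x * y for x, y in zip(xs, ys))  # sum of x multiplied by y
    sum_x2 = sum(x * x for x in xs)              # sum of x squared

    # Step 1: the slope
    b = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
    # Step 2: the intercept, which needs b
    a = y_bar - b * x_bar
    return b, a


# Hypothetical data, made up for this example
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b, a = fit_line(xs, ys)
print(f"b = {b:.2f}, a = {a:.2f}")
```

The intercept line, `a = y_bar - b * x_bar`, shows why b has to be worked out first.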

So, b is 1.07 and a is -83.3. We can now fit this into our original formula.

So, if X were 114, the estimated value of Y would be 38.7.
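The arithmetic can be checked with a few lines of Python. The values of b and a come from the text above, while the helper name `predict` is just an illustrative choice.

```python
b = 1.07    # slope, from the worked example above
a = -83.3   # y intercept, from the worked example above


def predict(x, slope=b, intercept=a):
    """Estimated Y for a given x on the regression line."""
    return slope * x + intercept


y_hat = predict(114)
print(round(y_hat, 1))  # roughly 38.7
```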

The residual is what we call the difference between the estimated value and the actual outcome. Large residuals mean that the line doesn’t fit the data very well. In our machine learning models, we want the residuals to be as small as possible.
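To make the idea concrete, here is a minimal sketch that computes the residuals for a small made-up dataset and a line with b = 0.6 and a = 2.2 (the least squares line for that data). The data and names are assumptions for illustration, and the sign convention used is actual minus estimated.

```python
# Hypothetical data and line, made up for this example
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b, a = 0.6, 2.2

# Estimated value for each x
estimates = [b * x + a for x in xs]

# Residual = actual outcome minus estimated value
residuals = [y - y_hat for y, y_hat in zip(ys, estimates)]

# The sum of squared residuals is the quantity least squares minimises
sum_squared_residuals = sum(r ** 2 for r in residuals)

print([round(r, 2) for r in residuals])
print(round(sum_squared_residuals, 2))
```

Squaring each residual before summing stops positive and negative misses from cancelling each other out.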

**Coefficient of determination (r squared)**

R squared is also known as the coefficient of determination. It tells us how well the line fits the data & denotes the proportion of the variance in the dependent variable that is predictable from the independent variable.

In other words, if X increases by 4, how predictable is the impact on Y?

An r squared of zero means that we cannot predict Y from X.

r is the correlation coefficient of a dataset. r will be a value between -1 and +1. A value of -1 is a perfect negative correlation; +1 is a perfect positive correlation.

- Positive correlation means that when X goes up, Y goes up
- Negative correlation means that when X goes up, Y goes down
- No correlation (r close to zero) means there is no linear relationship between X and Y
- A perfect correlation almost never happens. It’s when the dots fall exactly on the line and there is a perfect relationship between X and Y.
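For readers who want to see how r is obtained, the sketch below computes the Pearson correlation coefficient by hand. The function name `pearson_r` and the sample data are made up for illustration and are not from this post.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # How X and Y vary together
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # How much X and Y each vary on their own
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical data, made up for this example
xs = [1, 2, 3, 4, 5]
r_mixed = pearson_r(xs, [2, 4, 5, 4, 5])      # positive, but not perfect
r_perfect = pearson_r(xs, [2, 4, 6, 8, 10])   # dots fall exactly on a rising line
r_negative = pearson_r(xs, [10, 8, 6, 4, 2])  # dots fall exactly on a falling line

print(round(r_mixed, 3), round(r_perfect, 3), round(r_negative, 3))
```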

To get r squared, we simply take the value of r and square it. This makes a negative value positive and gives a number between 0 and 1, which we can read as a percentage.

If r is 0.95 then r squared is about 0.90 (90%). This means that 90% of the variance of Y is predictable from X.
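As a hedged illustration, the sketch below computes r squared in two ways on a small made-up dataset: by squaring r, and as the share of the variance in Y that the fitted line accounts for. For a straight line with an intercept fitted by least squares, the two agree. All names and data here are assumptions for illustration.

```python
import math

# Hypothetical data, made up for this example
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)   # total variation in Y

# Way 1: square the correlation coefficient
r = s_xy / math.sqrt(s_xx * s_yy)
r_squared_from_r = r ** 2

# Way 2: 1 minus (unexplained variation / total variation)
b = s_xy / s_xx                # slope of the least squares line
a = y_bar - b * x_bar          # intercept
ss_residual = sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))
r_squared_from_fit = 1 - ss_residual / s_yy

print(round(r_squared_from_r, 3), round(r_squared_from_fit, 3))
```

On this data both routes give the same value, which is what lets us read r squared as a proportion of variance.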

We can compare models with r squared. If we have a low r squared, we can try adding more input variables to see whether the model fits the data better.