In the ML Monday series, I aim to go through some of the theory behind machine learning & then move into practical application. Writing machine learning code is relatively trivial, but understanding why we choose a particular model and what those models actually do is really important.
Intro to linear regression
Linear regression provides a rule which enables us to make predictions of Y based on X. You can think of linear regression as adding a line of best fit to a scatter plot, just like you may have done at school. The line's position is determined by where the sum of the squares (the squared vertical distances between the data points & the line) is smallest.
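For the more mathematically minded, that "sum of the squares" criterion can be written out explicitly; the line is chosen to minimise the standard ordinary least squares objective (shown here as a sketch in notation):

```latex
% Ordinary least squares: pick the intercept a and slope b that minimise
% the sum of squared vertical distances between each point and the line
\min_{a,\,b} \; \sum_{i=1}^{n} \bigl( y_i - (b x_i + a) \bigr)^2
```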
The equation to calculate the linear regression line is below. Here:
- Y hat = the estimated value of Y (i.e. the point on the line)
- b = the slope of the line
- x = the point on the x axis
- a = y intercept (where the line of best fit crosses the y axis)
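Putting those definitions together, the equation referred to above is the standard straight-line form:

```latex
\hat{Y} = bX + a
```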
We need to calculate b and a. To do that, we need to calculate x squared, y squared and x multiplied by y for each data point, as well as the mean of X and the mean of Y (depicted below as X bar and Y bar).
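As a minimal sketch of that bookkeeping in Python (the x and y lists here are placeholder values, not the data from the worked example):

```python
# Quantities needed for the slope (b), intercept (a) and, later, r.
# NOTE: x and y are placeholder values, not the article's dataset.
x = [95, 100, 105, 110, 115]
y = [18, 24, 29, 33, 40]

n = len(x)
sum_x = sum(x)
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))   # x multiplied by y
sum_x2 = sum(xi ** 2 for xi in x)               # x squared
sum_y2 = sum(yi ** 2 for yi in y)               # y squared (used for r later)
x_bar = sum_x / n                               # mean of X (X bar)
y_bar = sum_y / n                               # mean of Y (Y bar)
```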
Using the above data, we can calculate b by populating the formula below. The formula looks a little scary, but I’ve worked through it step by step below:
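The formula in question is, presumably, the usual least-squares slope formula, built from the sums above:

```latex
b = \frac{n\sum xy \;-\; \sum x \sum y}{n\sum x^2 \;-\; \left(\sum x\right)^2}
```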
So now we know b. It's 1.07. Let's now work out a, which requires us to know b, so it must be done second. a is the y intercept. Imagine we were talking about the effect of education on earning potential: the y intercept would be the earnings you'd expect with zero education, so it would not be zero.
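Once b is known, a falls straight out of the two means, which is why it has to be done second:

```latex
a = \bar{Y} - b\,\bar{X}
```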
So, b is 1.07 and a is -83.3. We can now plug these into our original formula.
So, if X were 114, Y would be about 38.7.
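Written out with the values above, that prediction is:

```latex
\hat{Y} = 1.07 \times 114 + (-83.3) = 121.98 - 83.3 = 38.68 \approx 38.7
```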
The residual is what we call the difference between the estimated value and the actual outcome. Large residuals mean that the line doesn't fit the data very well, so in our machine learning models we want to keep the residuals as small as possible.
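For a single data point, the residual is simply the observed value minus the predicted value:

```latex
e_i = y_i - \hat{y}_i
```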
Coefficient of determination (r squared)
R squared is also known as the coefficient of determination. It tells us how well the line fits the data & denotes the proportion of the variance in the dependent variable that is predictable from the independent variable.
In other words, if X increases by 4, how predictable is the impact on Y?
An r squared of zero means that we cannot predict Y from X.
r is the correlation coefficient of a dataset. r will be a value between -1 and +1. A value of -1 is a perfect negative correlation; +1 is a perfect positive correlation.
- Positive correlation means that when X goes up, Y goes up
- Negative correlation means that when X goes up, Y goes down
- No correlation simply means there is no relationship between X and Y
- A perfect correlation almost never happens; it's when the dots fall perfectly on the line and there is an exact relationship between X and Y.
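For reference, r can be calculated from the same sums we gathered for the slope; this is the standard Pearson correlation formula:

```latex
r = \frac{n\sum xy \;-\; \sum x \sum y}
         {\sqrt{\bigl(n\sum x^2 - \left(\sum x\right)^2\bigr)\,\bigl(n\sum y^2 - \left(\sum y\right)^2\bigr)}}
```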
To get r squared, we simply take the value of r and square it. This makes a negative value positive and gives a proportion that can be read as a percentage.
If r is 0.95 then r squared is roughly 0.90 (90%). This means that about 90% of the variance in Y is predictable from X.
We can compare models with r squared. If we have a low r squared, we can try adding more input variables to improve the fit.
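To close the loop, here's a short sketch of the whole calculation using scipy.stats.linregress, which returns the slope, intercept and r in one call (the x and y lists are placeholder values, not the data from the worked example):

```python
# Fit a simple linear regression and report b, a, r and r squared.
# NOTE: x and y are placeholder values, not the article's dataset.
from scipy import stats

x = [95, 100, 105, 110, 115]
y = [18, 24, 29, 33, 40]

result = stats.linregress(x, y)

print(f"slope (b):     {result.slope:.2f}")
print(f"intercept (a): {result.intercept:.2f}")
print(f"r:             {result.rvalue:.2f}")
print(f"r squared:     {result.rvalue ** 2:.2f}")

# Predict Y for a new X using Y hat = b * X + a
new_x = 114
print(f"prediction at X={new_x}: {result.slope * new_x + result.intercept:.1f}")
```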