The Ultimate Guide To Linear Regression For Aspiring Data Scientists

Regression is about finding the relationship between some input variable(s) and an outcome. Let’s think about a simple example of height and weight. We need to understand the relationship between the two – intuition can tell us that as height increases, so does weight. The idea of regression is to create a mathematical formula which describes that relationship. With this formula, we can then input a sample height value and use the formula to ‘predict’ how much that person weighs.

When you were in school, your teacher would at some point probably have asked you to draw the line of best fit onto a graph. You may have drawn it somewhere like the below. Why? Because it generalizes the data quite well and minimises the error (the distance between the line and the datapoint).

If we wanted to write a formula for this, we would get ŷ = bx + a. Where:

  • ŷ is the predicted value
  • b is the slope of the line (i.e. how much does Y increase when X increases by 1) – the coefficient
  • x is the x value – the predictor
  • a is the y-intercept (i.e. where the line of best fit crosses the Y axis) – the constant

Okay, so we know what the formula is. Suppose Y increases by 2 every time X goes up by 1 (b = 2) and the Y intercept is 10 (a = 10). Then the prediction where X = 6 would be: ŷ = 2(6) + 10 = 22
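
As a quick sketch, the formula can be written as a small Python function. The name predict and its default arguments are just illustrative choices, using the slope of 2 and the intercept of 10 from the example above.

```python
def predict(x, slope=2.0, intercept=10.0):
    """Return the predicted value y-hat = slope * x + intercept."""
    return slope * x + intercept


# Prediction for X = 6 with b = 2 and a = 10
y_hat = predict(6)
print(y_hat)  # 22.0
```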

When should we use regression?

We should use regression when:

The dependent variable is continuous: we can’t use regression to predict a class; it must be a continuous, numeric value.

The independent variables do not correlate: the goal of the regression analysis is to identify the relationship between the independent variable (the input variable) and the dependent variable (the thing we are trying to predict). Let’s say y is regressed against x and z, and x and z are highly, positively correlated with one another (multicollinearity). Understanding the effect of x on y is then tough to work out: it’s hard to tell whether x or z affected y, because when x goes up, so does z. So which one impacted y?

In most cases, having correlated independent variables won’t affect your model accuracy or your ability to predict. The stumbling blocks are that it can make choosing the correct predictors a headache and it makes it hard to find out the exact effect of each predictor on Y. However, it doesn’t affect the overall fit of the model or produce bad predictions: your R squared will not be impacted by multicollinearity.

So in short, if you want to know how well your group of input variables is able to predict Y and don’t care how much each predictor contributed to the prediction, you don’t need to worry about multicollinearity.
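
To make this concrete, here is a minimal sketch on synthetic data (not data from this guide). It assumes a made-up relationship y = 3x + 2z + noise, where z is almost a copy of x, and uses a small helper, fit_ols, written just for this illustration. The overall fit barely changes whether z is included or not, while the individual coefficients for x and z are hard to pin down.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n = 200

# Synthetic, assumed data: z is almost a copy of x, so the two are highly correlated
x = rng.normal(size=n)
z = x + rng.normal(scale=0.05, size=n)
y = 3 * x + 2 * z + rng.normal(size=n)


def fit_ols(features, target):
    """Least-squares fit with an intercept; returns (coefficients, r_squared)."""
    design = np.column_stack([np.ones(len(target)), features])
    coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coefs
    r_squared = 1 - np.sum(residuals ** 2) / np.sum((target - target.mean()) ** 2)
    return coefs, r_squared


corr_xz = np.corrcoef(x, z)[0, 1]
coefs_both, r2_both = fit_ols(np.column_stack([x, z]), y)
coefs_x_only, r2_x_only = fit_ols(x, y)

print("correlation between x and z:", corr_xz)
print("R squared with x and z:", r2_both)
print("R squared with x only:", r2_x_only)
print("coefficients for x and z:", coefs_both[1], coefs_both[2])
```

The two R squared values should be nearly identical, and the two coefficients should add up to roughly 5 even though each one on its own may land some way from the 3 and 2 used to generate the data.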

The data has some linear relationship: if we have a dataset like the above graph, linear regression makes sense. There is a relationship between X and Y which we can define with an equation. Whereas, on the below dataset, there probably is very little point – a linear equation is not going to give you the relationship between X and Y in this case.
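
One rough way to check this before fitting anything is to look at the correlation coefficient. The sketch below uses two made-up datasets (they are assumptions, not the data behind the graphs in this guide): one follows a straight line and one follows a curve.

```python
import numpy as np

x = np.linspace(-5, 5, 51)

y_linear = 2 * x + 10  # straight-line relationship
y_curved = x ** 2      # a clear relationship, but not a linear one

r_linear = np.corrcoef(x, y_linear)[0, 1]
r_curved = np.corrcoef(x, y_curved)[0, 1]

print("r for the linear data:", r_linear)
print("r for the curved data:", r_curved)
```

The curved data has a correlation of essentially zero even though Y depends entirely on X, which is why a straight line would tell us very little there. A scatter plot is still the best first check.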

What does all the stuff in the regression table mean?

Here is a regression table. Below, I will go through the key components.

Dep. Variable:                  Price   R-squared:                       0.898
Model:                            OLS   Adj. R-squared:                  0.888
Method:                 Least Squares   F-statistic:                     92.07
Date:                Sun, 13 Dec 2020   Prob (F-statistic):           4.04e-11
Time:                        18:54:05   Log-Likelihood:                -134.61
No. Observations:                  24   AIC:                             275.2
Df Residuals:                      21   BIC:                             278.8
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const              1798.4040    899.248      2.000      0.059     -71.685    3668.493
Inflation           345.5401    111.367      3.103      0.005     113.940     577.140
Employment_level   -250.1466    117.950     -2.121      0.046    -495.437      -4.856
=====================================================================================

R Squared: R squared is also known as the coefficient of determination. It tells us how well the line fits the data and denotes the proportion of the variance in the dependent variable that is predictable from the independent variable(s).
If we look at the price of pizza, the price is determined mostly by the price of dough and the price of toppings. So in this regression analysis, we would have a high R squared, as the change in Y is explainable by the X’s.
For a regression with a single input variable, to get R squared we simply take the value of r (the correlation coefficient) and square it. This makes a negative value positive and gives a proportion that can be read as a percentage.
If r is 0.95 then r squared is about 0.90 (90%). This means that 90% of the variance of Y is predictable from X.
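
Here is a small sketch of both routes to R squared, using the toy data from the sample implementation later in this guide. It assumes a single input variable, which is the case where squaring r and computing 1 - (residual variation / total variation) give the same answer.

```python
import numpy as np

x = np.array([5, 15, 25, 35, 45, 55])
y = np.array([5, 20, 14, 32, 22, 38])

# Route 1: square the correlation coefficient r
r = np.corrcoef(x, y)[0, 1]

# Route 2: fit the line of best fit and compare residual to total variation
slope, intercept = np.polyfit(x, y, 1)  # degree 1 returns [slope, intercept]
y_hat = slope * x + intercept

ss_res = np.sum((y - y_hat) ** 2)     # variation the line does not explain
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation in y
r_squared = 1 - ss_res / ss_tot

print("r squared from r:", r ** 2)
print("r squared from residuals:", r_squared)
```

Both numbers should agree (roughly 0.716 for this data).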

Adjusted R Squared: adjusted R squared is never higher than R squared. The problem with R squared is that it never decreases as you add more input variables, even if they are not useful.
Adjusted R squared compares the explanatory power of regression models that contain a different number of predictors. If you have a 10 predictor model with a higher R squared than a 1 predictor model, how do you know if it’s higher because it’s a better model, or it’s higher because there are more predictors? The adjusted R squared tells us this and we can compare the value across the two models to find out.
Adjusted R squared is adjusted to take account of the number of predictors in the model. It only goes up if the new predictor improves the model by more than would be expected by chance, and it decreases when the predictor does not.
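
As a sketch, the usual adjustment formula is 1 - (1 - R²) × (n - 1) / (n - p - 1), where n is the number of observations and p is the number of predictors. The function name below is an illustrative choice; the inputs are the R squared (0.898), observations (24) and predictors (2) from the regression table above.

```python
def adjusted_r_squared(r_squared, n_observations, n_predictors):
    """Adjusted R squared = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    penalty = (n_observations - 1) / (n_observations - n_predictors - 1)
    return 1 - (1 - r_squared) * penalty


adj = adjusted_r_squared(0.898, 24, 2)
print(round(adj, 3))  # 0.888
```

This matches the Adj. R-squared value shown in the table.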

R: r is the correlation coefficient of a dataset. r will be a value between -1 and +1. -1 is a perfect negative correlation; +1 is a perfect positive correlation.

P Value: the p-value is used to test the null hypothesis. Remember, the null hypothesis is a statement that says there is no relationship between X and Y. The p-value is the probability of seeing a relationship at least as strong as the one in our data if the null hypothesis were true. If the p-value is less than 0.05, the data provide evidence against the null hypothesis, which means there is likely to be a relationship between X and Y. A p-value higher than 0.05 means we cannot reject the null hypothesis; it does not prove that the null hypothesis is true. The p-value ranges between 0 and 1; the closer it gets to 1, the more consistent the data are with the null hypothesis. We get a p-value for every input variable.

In other words, the p-value is the probability that an effect of X on Y this large would show up by chance alone, and the lower the p-value, the greater the statistical significance.

I’ve included an example here.
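
To connect this to the regression table, here is a hedged sketch of how the t and P>|t| columns relate to the coefficient and its standard error. It assumes SciPy is installed and uses a two-sided t-test with the 21 residual degrees of freedom from the table; the numbers in rows are copied from the table above.

```python
from scipy import stats

# Values copied from the regression table above: (coef, std err)
df_residuals = 21
rows = {
    "const": (1798.4040, 899.248),
    "Inflation": (345.5401, 111.367),
    "Employment_level": (-250.1466, 117.950),
}

p_values = {}
for name, (coef, std_err) in rows.items():
    # t statistic: how many standard errors the coefficient is away from zero
    t_stat = coef / std_err
    # Two-sided p-value: chance of a |t| at least this large if the true coefficient were 0
    p_values[name] = 2 * stats.t.sf(abs(t_stat), df_residuals)
    print(f"{name}: t = {t_stat:.3f}, p = {p_values[name]:.3f}")
```

The printed values should line up with the t and P>|t| columns of the table, up to rounding.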

Coefficient: the value that represents the slope (b in our formula above), i.e. how much Y changes when that input variable increases by 1.

F-Statistic: the F-test provides us with a measure for the overall significance of the model. It evaluates the null hypothesis that all the regression coefficients are zero (there is no linear relationship between your input variables and the output, meaning none of the independent variables matter and the model has no merit). The alternative hypothesis is that at least one coefficient differs from zero. A significant F-test tells us that the relationship between Y and the X’s is statistically reliable. An F value close to 1 tells us that the model is not statistically significant. The further above 1 it gets, the better. That is to say, the lower the F statistic, the closer to a non-significant model we get.
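
The F statistic can also be written in terms of R squared: F = (R² / p) / ((1 - R²) / (n - p - 1)), with n observations and p predictors. The sketch below plugs in the rounded values from the regression table above; the function name is an illustrative choice.

```python
def f_statistic(r_squared, n_observations, n_predictors):
    """F = (R^2 / p) / ((1 - R^2) / (n - p - 1))."""
    explained = r_squared / n_predictors
    unexplained = (1 - r_squared) / (n_observations - n_predictors - 1)
    return explained / unexplained


f_value = f_statistic(0.898, 24, 2)
print(round(f_value, 2))
```

The result comes out close to, but not exactly, the 92.07 in the table, because the table only shows R squared rounded to three decimal places.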

Std error of estimate (SEE): measures the errors between the actual values and the estimated values. Calculated as sqrt(sum((Estimated_Value – Actual_Value)^2) / (n – 2)), where n is the number of observations (n – 2 applies to a model with one input variable).

It tells us the average distance that the observed values fall from the regression line, using the units of the Y variable.
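
A minimal sketch of that calculation, using the toy data from the sample implementation below and assuming a single input variable (hence n - 2):

```python
import numpy as np

x = np.array([5, 15, 25, 35, 45, 55])
y = np.array([5, 20, 14, 32, 22, 38])

# Fit the line of best fit and compute the estimated values
slope, intercept = np.polyfit(x, y, 1)  # degree 1 returns [slope, intercept]
estimated = slope * x + intercept

# Standard error of estimate: sqrt(sum of squared errors / (n - 2))
n = len(y)
see = np.sqrt(np.sum((estimated - y) ** 2) / (n - 2))
print("SEE:", see)
```

For this data the SEE is a little over 7, in the same units as Y.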

Sample Python Implementation

# Import libraries
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

# Generate some test data (x must be 2-D: one row per sample, one column per feature)
x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])

# Split our data into training & testing datasets
train_features, test_features, train_labels, test_labels = train_test_split(x, y, test_size=0.2, random_state=42)

# Create our model and fit it with our training data
regressor = LinearRegression()
regressor.fit(train_features, train_labels)

# Print the R squared (on the training data), intercept and slope
print('r_squared:', regressor.score(train_features, train_labels))
print('intercept:', regressor.intercept_)
print('slope:', regressor.coef_)

# Make predictions on our test dataset
predictions = regressor.predict(test_features)

# Determine the MAE on the held-out test data
print('Mean Absolute Error:', metrics.mean_absolute_error(test_labels, predictions))
Kodey
