The Ultimate Guide To Linear Regression For Aspiring Data Scientists

Regression is about finding the relationship between some input variable(s) and an outcome. Let’s think about a simple example of height and weight. We need to understand the relationship between the two – intuition can tell us that as height increases, so does weight. The idea of regression is to create a mathematical formula which describes that relationship. With this formula, we can then input a sample height value and use the formula to ‘predict’ how much that person weighs.

When you were in school, your teacher probably asked you at some point to draw the line of best fit onto a graph. Picture a scatter plot with a straight line running through the middle of the points. Why draw it there? Because it generalises the data quite well and minimises the error (the distance between the line and each data point).

If we wanted to write a formula for this, we would get ŷ = bx + a, where:

• ŷ is the predicted value
• b is the slope of the line (i.e. how much Y increases when X increases by 1) – the coefficient
• x is the x value – the predictor
• a is the y-intercept (i.e. where the line of best fit crosses the Y axis) – the constant

Okay, so we know what the formula is. Suppose Y increases by 2 every time X goes up by 1, and the Y-intercept is 10. Then the prediction where X = 6 would be: ŷ = 2(6) + 10 = 22.
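The worked example above can be sketched in a couple of lines of Python (the values are the ones from the text):

```python
# Linear prediction: y_hat = b*x + a
b = 2    # slope: Y increases by 2 every time X goes up by 1
a = 10   # y-intercept
x = 6    # the input we want a prediction for

y_hat = b * x + a
print(y_hat)  # 22
```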

When should we use regression?

We should use regression when:

The dependent variable is continuous: we can’t use regression to predict a class; it must be a continuous, numeric value.

The independent variables do not correlate: the goal of regression analysis is to identify the relationship between each independent variable (an input) and the dependent variable (the thing we are trying to predict). Suppose y is regressed against x and z, and x and z are highly, positively correlated with one another (multicollinearity). It is then hard to tell whether x or z affected y: when x goes up, so does z – so which one impacted y?

In most cases, having correlated independent variables won’t affect your model’s accuracy or your ability to predict. The stumbling blocks are that it can make choosing the correct predictors a headache, and it makes it hard to isolate the exact effect of each predictor on Y. However, it doesn’t affect the overall fit of the model or produce bad predictions – your R-squared will not be impacted by multicollinearity.

So in short, if you want to know how well your group of input variables is able to predict Y and don’t care how much each predictor contributed to the prediction, you don’t need to worry about multicollinearity.
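One quick way to spot multicollinearity before fitting is to inspect pairwise correlations between the predictors. This is a minimal sketch on synthetic data – the variables x, z and w are illustrative, not from the article:

```python
import numpy as np

# Illustrative data: z is almost a linear function of x (high collinearity),
# while w is generated independently of x
rng = np.random.default_rng(0)
x = rng.normal(size=100)
z = 2 * x + rng.normal(scale=0.1, size=100)  # nearly proportional to x
w = rng.normal(size=100)                     # unrelated predictor

corr_xz = np.corrcoef(x, z)[0, 1]
corr_xw = np.corrcoef(x, w)[0, 1]
print(f"corr(x, z) = {corr_xz:.2f}")  # close to 1 -> multicollinearity risk
print(f"corr(x, w) = {corr_xw:.2f}")  # close to 0 -> fine
```

For a more thorough check you could compute variance inflation factors, but a correlation matrix is usually the first pass.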

The data has some linear relationship: if a scatter plot of the data shows a roughly straight-line trend, linear regression makes sense – there is a relationship between X and Y which we can define with a linear equation. If instead the points follow a strongly curved or random pattern, there is very little point: a linear equation is not going to capture the relationship between X and Y in that case.
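A correlation coefficient can make this concrete: it is close to ±1 when the relationship is linear, but can be near zero even when a strong (non-linear) relationship exists. The data below is made up purely for illustration:

```python
import numpy as np

x = np.linspace(-5, 5, 50)
linear_y = 3 * x + 1   # perfectly linear relationship
curved_y = x ** 2      # strong relationship, but not a linear one

corr_linear = np.corrcoef(x, linear_y)[0, 1]
corr_curved = np.corrcoef(x, curved_y)[0, 1]
print(corr_linear)  # 1.0
print(corr_curved)  # ~0.0 -- a linear fit would miss this relationship entirely
```

This is why plotting the data first matters: a near-zero correlation doesn’t mean "no relationship", only "no linear one".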

What does all the stuff in the regression table mean?

Here is a regression table. Below, I will go through the key components.

```
Dep. Variable:                  Price   R-squared:                       0.898
Method:                 Least Squares   F-statistic:                     92.07
Date:                Sun, 13 Dec 2020   Prob (F-statistic):           4.04e-11
Time:                        18:54:05   Log-Likelihood:                -134.61
No. Observations:                  24   AIC:                             275.2
Df Residuals:                      21   BIC:                             278.8
Df Model:                           2
Covariance Type:            nonrobust
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const              1798.4040    899.248      2.000      0.059     -71.685    3668.493
Inflation           345.5401    111.367      3.103      0.005     113.940     577.140
Employment_level   -250.1466    117.950     -2.121      0.046    -495.437      -4.856
=====================================================================================
```
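The `coef` column is what turns the table into a prediction formula: Price = 1798.4040 + 345.5401 × Inflation − 250.1466 × Employment_level. A minimal sketch, reading the coefficients straight off the table (the input values passed in are hypothetical):

```python
# Fitted equation implied by the regression table above
const = 1798.4040
b_inflation = 345.5401
b_employment = -250.1466

def predict_price(inflation, employment_level):
    # Plug the inputs into: const + coef1*x1 + coef2*x2
    return const + b_inflation * inflation + b_employment * employment_level

# Hypothetical inputs, purely for illustration
print(round(predict_price(3.0, 5.0), 2))  # 1584.29
```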

Sample Python Implementation

```python
# Import libraries
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

# Generate some test data
x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])

# Split our data into training & testing datasets
train_features, test_features, train_labels, test_labels = train_test_split(
    x, y, test_size=0.2, random_state=42)

# Create our model and fit it with our training data
regressor = LinearRegression()
regressor.fit(train_features, train_labels)

# Print the R-squared (on the training data only), intercept and slope
print('r_squared: ', regressor.score(train_features, train_labels))
print('intercept:', regressor.intercept_)
print('slope: ', regressor.coef_)

# Make predictions on our test dataset
predictions = regressor.predict(test_features)

# Determine the MAE
print('Mean Absolute Error:', metrics.mean_absolute_error(test_labels, predictions))
```