As data scientists, we need to be comfortable with mathematics. If you Google what you need to know, you’ll find answers stating you need to fully understand linear algebra, calculus and how to calculate all of the algorithms we use by hand. I’m not going to downplay the importance of understanding how an algorithm works, but I dispute the need to be able to calculate it by hand. We didn’t create the computer so we could continue to do everything ourselves – we just need to understand what the algorithms are doing, so we can interpret the results correctly.
I once heard the quote that “a data scientist should be better at math than the average programmer and better at programming than the average mathematician”. The notion being – you don’t need to be an expert in both. Consider this while working through your own development.
With that in mind, I’ve collated a list of the key statistics I believe data scientists need:
- Data Types: Numerical: Discrete; Numerical: Continuous and Categorical
- Measures of central tendency: Mean, Median, Mode; Weighted Mean
- Point estimates & confidence intervals
- Percentiles / quartiles / box plots
- Distributions: Skew; Normal Distribution; Standard Normal Distribution; Central Limit Theorem; Standard Error
- Measures of variability: Range; Variance; Standard Deviation; Z-Score
- Relationships: Correlation; Covariance
- Probability: Addition Rule; Multiplication Rule; Conditional Probability; Testing For Independence; calculating possible permutations
- Hypothesis: Null Hypothesis; Alternative Hypothesis; p-value
- Regression Tables: R; R-Squared; Adjusted R-Squared; F-Statistic; Coefficient; Standard Error of Estimate
Let’s now discuss each in some detail:
DATA TYPES
Okay, let’s start simple. We can have numerical or categorical data types. A numerical data type is, well, numeric – but it can be split into two: discrete or continuous. A discrete data type is one which can accept only certain numbers – for example, when rolling a die, you can only get the numbers 1 to 6. A continuous data type can take any value within a range – you could have 1.4839 for example.
A categorical data type is one which contains categories, like ‘yes’ and ‘no’ or ‘high’, ‘medium’ or ‘low’.
MEASURES OF CENTRAL TENDENCY
There are three basic measures of central tendency which we need to be comfortable with: Mean, Median and Mode. The mean is often referred to as the average; it’s simply: sum(all_values)/count(all_values). The mean of 2, 3, 7, 4 would be 4, because the sum of all the values is 16 and there are 4 values – 16/4 = 4.
The median is the middle value. If we have 2, 3, 7, 5, 1; we would first make sure they’re in ascending order 1, 2, 3, 5, 7 and then we would take the central value, which is three. With an even number of values, we take the mean of the two central values. The median is not sensitive to outliers like the mean is (i.e. if there is one really huge value at the end of a dataset, the mean would be increased while the median would barely move).
The mode is simply the number that appears most frequently. In the data set, 1, 5, 6, 5, 4, 2, the mode would be five, as it has appeared twice.
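The three worked examples above can be checked with Python's built-in statistics module. This is a minimal sketch; the variable names are illustrative only.

```python
import statistics

# Example values taken from the three paragraphs above.
mean_values = [2, 3, 7, 4]
median_values = [2, 3, 7, 5, 1]
mode_values = [1, 5, 6, 5, 4, 2]

mean_result = statistics.mean(mean_values)        # sum of values / count of values
median_result = statistics.median(median_values)  # sorts first, then takes the middle value
mode_result = statistics.mode(mode_values)        # most frequently occurring value

print(mean_result, median_result, mode_result)
```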
A weighted mean or weighted average takes into account differing importances of our data points. If we consider a university course – exams will have a higher weighting than quizzes or homework, so should be weighted higher (a 70% score in an exam should be worth more than a 70% score in a quiz). The weighted mean is sum(score × weight) / sum(weights). In our example, we use the below weightings – this is to say, the overall grade is influenced most by exams, then by coursework and the category with the least influence is a quiz.
- Exam 1: 30%
- Exam 2: 30%
- Quiz1: 10%
- Quiz2: 10%
- Coursework: 20%
So, if a student got the below grades in each, let’s work out their weighted average score.
- Exam 1: 80
- Exam 2: 72
- Quiz1: 99
- Quiz2: 100
- Coursework: 60
We calculate the weighted mean in a simple manner. We take the score and multiply it by the weight. So in this case, we would have ((80*0.3)+(72*0.3)+(99*0.1)+(100*0.1)+(60*0.2)). We add all those up and get the weighted mean of 77.5%.
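Here is the same calculation as a short Python sketch. The dictionary layout and the function name weighted_mean are assumptions made for illustration; dividing by the total weight means the weights do not have to sum to exactly 1.

```python
# Weights and scores from the lists above (weights as fractions of the final grade).
weights = {"Exam 1": 0.30, "Exam 2": 0.30, "Quiz 1": 0.10, "Quiz 2": 0.10, "Coursework": 0.20}
scores = {"Exam 1": 80, "Exam 2": 72, "Quiz 1": 99, "Quiz 2": 100, "Coursework": 60}


def weighted_mean(scores, weights):
    """Sum of score * weight, divided by the sum of the weights."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total_weight


result = weighted_mean(scores, weights)
print(round(result, 1))
```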
The challenge with a weighted average is selecting the categories and the weightings for those categories.
POINT ESTIMATES AND CONFIDENCE INTERVALS
A point estimate is an estimate that we make based on a sample dataset. If you have sampled 1,000 students from a university that has a cohort size of 10,000, you have sampled 10% of the population. The sample mean grocery spend of the students is $60 per week, denoted as: X̄ = 60 (X̄ is the symbol we use for the sample mean). X̄ is a point estimate.
If we calculated the sample dataset variance, we would denote it as S² (rather than the population variance of σ²). This, again is a point estimate – it’s an inference we make about the population, based on a sample.
The problem with point estimates is that they are inherently unreliable. What if you surveyed 1,000 students on campus in the evening? Well, maybe we could assume that those students are very sociable and out with their friends, while the other students, whom you did not sample, are more introverted and at home. There is bias built into this sampling mechanism.
We therefore calculate confidence intervals, which give a range of values. We may say that the sample mean is 60 and the confidence interval runs from 50 to 70 dollars per week. We express confidence as a percentage: ‘we are 85% confident that the average student spend is between 50 and 70 dollars per week for the population’. Note that a confidence interval captures random sampling error; it does not correct for a biased sample.
To calculate a confidence interval, we use a lookup table, like the one below. If, for example, we were looking for a confidence interval of 85%, we would take 1.440 standard errors (s / √n) either side of the mean as our confidence window: from -1.440 to +1.440.
We can calculate this as X̄ ± Z (s / √n), where X̄ is the mean of our sample, Z is the z-score from the table below, s is the standard deviation of our sample dataset and n is the number of observations (datapoints). Hence, if we have a mean of 50, 100 datapoints, a standard deviation of 15 and wanted 85% confidence, we would have the formula: 50 ± 1.440 (15 / √100), which gives us an interval of 47.84 to 52.16.
Confidence Interval | Z |
80% | 1.282 |
85% | 1.440 |
90% | 1.645 |
95% | 1.960 |
99% | 2.576 |
99.5% | 2.807 |
99.9% | 3.291 |
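Rather than reading Z from the table, we can compute it. The sketch below uses NormalDist from Python's standard library; the function name confidence_interval and its parameters are illustrative, and it assumes a two-sided interval based on the normal distribution, as in the worked example above.

```python
from math import sqrt
from statistics import NormalDist


def confidence_interval(sample_mean, sample_sd, n, confidence):
    """Two-sided interval: sample_mean +/- z * (sample_sd / sqrt(n))."""
    # 85% confidence leaves 7.5% in each tail, so we need the 92.5th percentile.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sample_sd / sqrt(n)
    return sample_mean - margin, sample_mean + margin


low, high = confidence_interval(sample_mean=50, sample_sd=15, n=100, confidence=0.85)
print(round(low, 2), round(high, 2))
```

Calling NormalDist().inv_cdf(0.975) in the same way gives the familiar 1.960 for a 95% interval.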
PERCENTILES
Percentiles are 100 equal groups into which a population can be divided according to the distribution of values. A percentile can be between 1 and 99 – whatever percentile X you pick, X% of the observations should fall below it. For example, if you’re in the 60th percentile, you should be greater than 60% of all other observations. I’ve written about percentiles, quartiles & boxplots here.
SKEWNESS
Understanding our data is important – one of the basic things we can look out for is skew in our dataset. We have negative skew (also called left skew) – in this instance, the mean is less than the median. It’s called left skew because the outliers sit to the left of the graph.
In the centre, we see no skew – this is where the mean, median and mode are equal and the graph looks nice and symmetrical.
Finally, we have positive skew (also called right skew), where the outliers are to the right. The mean is greater than the median in this instance.
In all cases, the mode is the highest point of the chart.
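The mean-versus-median comparison described above can be turned into a rough check. This is only a rule of thumb, not a formal skewness statistic, and the function name and the three small datasets are made up for illustration.

```python
import statistics


def skew_direction(values):
    """Rule of thumb: compare the mean with the median."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "positive (right) skew"
    if mean < median:
        return "negative (left) skew"
    return "no skew indicated"


right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]      # one large outlier on the right
left_tailed = [1, 17, 18, 18, 19, 19, 19, 20]  # one small outlier on the left
symmetric = [1, 2, 3, 4, 5]

for data in (right_tailed, left_tailed, symmetric):
    print(skew_direction(data))
```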
DISTRIBUTIONS
A normal distribution has a mean, median and mode of equal values. The data looks symmetrical (the distribution is also known as the bell curve, symmetrical distribution and the gaussian distribution). The normal distribution has no skew (50% of values are less than the mean and 50% are greater). The highest point in the normally distributed dataset will be the mean, median and mode of the dataset.
We use standardization to achieve a standard normal distribution. Our distribution could be written as X ~ N(μ, σ²). Here ~ means ‘is distributed as’, N denotes the normal distribution, μ is the mean and σ² is the variance. The process of standardization converts this to a distribution with a mean of 0 and a standard deviation of 1. We achieve this by computing z = (x – μ) / σ for every value. The standard normal distribution is used to compare normal distributions that vary in spread and position.
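As a small sketch of standardization in Python (the scores are invented and the function name standardize is illustrative), we subtract the mean from every value and divide by the standard deviation. The result has a mean of 0 and a standard deviation of 1.

```python
import statistics


def standardize(values):
    """Apply (x - mean) / standard deviation to every value."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return [(x - mu) / sigma for x in values]


raw_scores = [55, 60, 65, 70, 75]  # hypothetical data
z_values = standardize(raw_scores)

print([round(z, 2) for z in z_values])
print(round(statistics.mean(z_values), 6), round(statistics.pstdev(z_values), 6))
```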
In the below example, we see three distributions, one with the mean of -2. This makes it hard to compare the overlays. We can standardize the distributions to support our analysis.
The empirical rule applies to a normal distribution. This rule states that almost all of the data falls within three standard deviations of the mean (approximately 68% within 1 standard deviation, 95% within 2 standard deviations and 99.7% within 3 standard deviations). This only works on a symmetrical, bell-shaped curve.
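The percentages in the empirical rule come straight from the standard normal curve. As a quick check, this sketch uses NormalDist from the standard library; the helper name share_within is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, f"{share_within(k):.4f}")
```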
Read more about normal distributions; the empirical rule and Chebyshev’s theorem here.
Read more about standard normal distributions here.
CENTRAL LIMIT THEOREM
If a population distribution is non-normal, we can still work with a normal distribution by taking the means of many samples drawn from the original dataset. When we plot the means of the samples (and the number of samples with that mean), the result is an approximately normally distributed dataset.
Wikipedia defines it as: “when independent random variables are added, their properly normalized sum tends toward a normal distribution (informally a “bell curve”) even if the original variables themselves are not normally distributed. The theorem is a key concept in probability theory because it implies that probabilistic and statistical methods that work for normal distributions can be applicable to many problems involving other types of distributions.”
Onlinemathlearning.com defines it as: “If samples of size n are drawn randomly from a population that has a mean of μ and a standard deviation of σ, the sample means are approximately normally distributed for sufficiently large sample sizes (n ≥ 30) regardless of the shape of the population distribution. If the population is normally distributed, the sample means are normally distributed for any size sample.”
The central limit theorem allows us to perform tests against a normal distribution. This is important because the normal distribution has the most applicability to many statistical models (such as confidence intervals).
In short, central limit theorem allows us to make inferences from a normally distributed dataset even if the original dataset is not normal.
Let’s say, we have income values. The original dataset looks like this:

This dataset has 52,800 observations. If we split this into 1,760 samples, have 30 observations per sample and take the mean salary of each, we should see a much more normal distribution.
The mean salary of the sample will be plotted on the X axis. The number of samples where that was the mean, will be the frequency on the Y axis.
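We can simulate this. The sketch below does not use the income dataset from this article; it generates a hypothetical right-skewed population of the same size (52,800 values from an exponential distribution, an assumption made purely for illustration), splits it into 1,760 samples of 30 and compares the shape of the sample means with the shape of the raw data.

```python
import random
import statistics
from math import sqrt

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical, heavily right-skewed "income" population (not the article's data).
population = [random.expovariate(1 / 30_000) for _ in range(52_800)]

# Shuffle, then cut into 1,760 non-overlapping samples of 30 observations each.
random.shuffle(population)
sample_size = 30
sample_means = [
    statistics.mean(population[i:i + sample_size])
    for i in range(0, len(population), sample_size)
]


def mean_median_gap(values):
    """(mean - median) in units of standard deviation; near 0 for symmetric data."""
    return (statistics.mean(values) - statistics.median(values)) / statistics.pstdev(values)


print(len(sample_means))
print(round(mean_median_gap(population), 3), round(mean_median_gap(sample_means), 3))

# The spread of the sample means should be close to (population sd) / sqrt(30).
ratio = statistics.pstdev(sample_means) / statistics.pstdev(population)
print(round(ratio, 3), round(1 / sqrt(sample_size), 3))
```

The gap between mean and median shrinks sharply for the sample means, which is the symmetry we would expect from an approximately normal distribution.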
STANDARD ERROR
Above, we saw that with the central limit theorem we take multiple samples from a population dataset and that, no matter what the original distribution, we can make it approximately normal by plotting the mean values of each sample, rather than each individual datapoint.
Standard error is the standard deviation of the means of our sample datasets. Let’s say we have 6 samples from our original dataset. The means of those samples are: 10, 12, 11, 10.5, 9.5, 11.1. The standard deviation of these sample means is 0.88 – this is also known as the standard error.
Our standard error tells us how close to the original mean our samples are. In other words, how representative of the population is the sample? In our case, a standard error of 0.88 is pretty low variability and hence shows that these samples are likely to be very representative of the population.
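The 0.88 figure can be reproduced with the standard library. Note that this sketch uses the sample standard deviation (dividing by n - 1), which is the version that matches the value quoted above.

```python
import statistics

# The six sample means from the example above.
sample_means = [10, 12, 11, 10.5, 9.5, 11.1]

# Sample standard deviation of the means (divides by n - 1).
standard_error = statistics.stdev(sample_means)
print(round(standard_error, 2))
```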
MEASURES OF VARIABILITY
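The range is simply the difference between the highest and lowest value in our dataset; it gives us a quick indication of spread, although it is driven entirely by the two most extreme values.
Variance measures how spread out the data is, in relation to the mean. It’s calculated as the average squared distance from the mean – it’s sum((observation - mean)²) / number_of_observations. The population variance is denoted as σ²; for a sample, we divide by (number_of_observations - 1) instead and denote it S².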
The standard deviation is calculated as the square root of the variance. This means, the standard deviation is expressed in the same units as the mean while variance is expressed in squared units. A distribution with a mean of 10 and a standard deviation of 3 is the same as a distribution with a mean of 10 and a variance of 9.
A z-score is how many sigmas (standard deviations) above (positive) or below (negative) the mean a datapoint is. I’ve discussed z-scores in detail here.
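Here is a small sketch of all four measures on a made-up dataset, treating the data as a full population (so the variance divides by the number of observations).

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data with a mean of 5

value_range = max(data) - min(data)
variance = statistics.pvariance(data)  # mean of squared distances from the mean
std_dev = statistics.pstdev(data)      # square root of the variance
z_score = (9 - statistics.mean(data)) / std_dev  # how many sigmas 9 is from the mean

print(value_range, variance, std_dev, z_score)
```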
MEASURE THE RELATIONSHIP BETWEEN FEATURES
Tech target defines correlation (correlation coefficient) as: “a statistical measure that indicates the extent to which two or more variables fluctuate together”. There are two major components to correlation: strength and type of correlation.
Let’s look at some types:
- Positive correlation means, when X goes up, Y goes up
- Negative correlation means, when X goes up, Y goes down
- None simply means, there is no correlation
- A perfect correlation is when there is a perfect relationship between X and Y.
To read more about correlation, read my article, here.
While correlation is a scaled measure of relationship (between -1 and +1), covariance is the unscaled equivalent. If it’s a positive number, X and Y tend to move in the same direction; if it’s negative, they move in opposite directions. If the covariance is zero, there is no linear relationship between X and Y.
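To see the difference between the scaled and unscaled measures, here is a sketch that computes both from first principles. The functions are written by hand for illustration (sample versions, dividing by n - 1) and the data is invented.

```python
from math import isclose, sqrt


def covariance(xs, ys):
    """Sample covariance: average product of deviations from the means (n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Covariance rescaled by both standard deviations, so it lies in [-1, +1]."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]             # moves with x
y_up_scaled = [v * 100 for v in y_up]  # same relationship, different units
y_down = [10, 8, 6, 4, 2]           # moves against x

for y in (y_up, y_up_scaled, y_down):
    print(covariance(x, y), round(correlation(x, y), 6))
```

Changing the units of Y multiplies the covariance by 100 but leaves the correlation untouched, which is why correlation is easier to compare across features.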
PROBABILITY
Probability is the likelihood of an event happening. If we think about the likelihood of rolling a 2 on a dice, you have 1/6 chance of that happening, because there is one possible way to get two and there are six sides on a dice. If we rolled two dice and wanted a total of two, you would have a 1/36 chance. This is because we can get the result of two in only 1 way (two ones) out of the 36 possible combinations.
If we want to find out the probability of event A or event B happening, we use the addition rule, whereas the multiplication rule, which we discuss a little later, is used to find the probability of both events happening. The addition rule is denoted as: P(A∪B) = P(A) + P(B) – P(A∩B). Let’s say we have two dice. The question is, ‘what is the likelihood that either dice 1 or dice 2 (or both) will equal 6?’
In the below, I have a comma separated view of all possible rolls. I have highlighted all instances where at least one of the dice is equal to 6 in orange. The formula then is: 6/36 + 6/36 – 1/36 = 11/36 (30.56%). Why did we subtract one? Because, in one instance (the bottom right), both dice were equal to 6. We don’t want to double count this overlap, so we subtract that instance.
1,1 (dice1, dice2) | 2,1 | 3,1 | 4,1 | 5,1 | 6,1 |
1,2 | 2,2 | 3,2 | 4,2 | 5,2 | 6,2 |
1,3 | 2,3 | 3,3 | 4,3 | 5,3 | 6,3 |
1,4 | 2,4 | 3,4 | 4,4 | 5,4 | 6,4 |
1,5 | 2,5 | 3,5 | 4,5 | 5,5 | 6,5 |
1,6 | 2,6 | 3,6 | 4,6 | 5,6 | 6,6 |
It’s not always the case that you have to subtract the overlapping instances. In the above, I have highlighted in pink all instances where the sum of the two rolls equals either 4 or 5. In this, there is no overlap (no instance where it equals both 4 and 5), so the formula is simply 3/36 + 4/36 = 7/36 (19.4%): three rolls total 4 and four rolls total 5.
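Both dice results can be confirmed by listing all 36 outcomes and counting. The sketch below uses exact fractions; the helper name probability is illustrative.

```python
from fractions import Fraction
from itertools import product

# All 36 (dice1, dice2) outcomes, as in the grid above.
rolls = list(product(range(1, 7), repeat=2))


def probability(event):
    """Share of the 36 equally likely rolls for which event(roll) is true."""
    return Fraction(sum(1 for roll in rolls if event(roll)), len(rolls))


# Addition rule with an overlap: at least one six.
p_first_six = probability(lambda r: r[0] == 6)
p_second_six = probability(lambda r: r[1] == 6)
p_both_six = probability(lambda r: r[0] == 6 and r[1] == 6)
p_any_six = p_first_six + p_second_six - p_both_six

# Addition rule with no overlap: the total is 4 or the total is 5.
p_sum_4_or_5 = probability(lambda r: sum(r) == 4) + probability(lambda r: sum(r) == 5)

print(p_any_six, p_sum_4_or_5)
```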
Conditional probability takes the formula of P(B|A) = P(A∩B) / P(A). This is used to find out the probability of B given that A has happened. In general, P(A∩B) is P(A)*P(B|A); only when A and B are independent does it reduce to the product P(A)*P(B), in which case P(B|A) = (P(A)*P(B)) / P(A) = P(B).
As an example, say we have 5 marbles in a bag: 2 are black and 3 are red. The chance of getting a black marble in the first selection (event A) is 2/5. When we select the next marble, we have now only got 4 marbles, of which 1 is black (because we have removed one in our first selection). So, the probability of selecting a second black (event B, given A) is 1/4. If we put this into the conditional probability formula: P(B|A) = (1/4 * 2/5) / (2/5) = (0.25 * 0.4)/0.4, which gives us 0.25 (1/4).
Now we know the probability of B happening, given that A has happened, we need to multiply that by the probability of A happening, so we know the probability of them happening in sequence. The multiplication rule is written like this: p(A∩B) = p(A) * p(B|A). This means, the probability of A and B happening is the probability of A multiplied by the probability of B given that A has already happened. This formula applies for independent and dependent events. In our marble example, we would have 0.4*0.25 = 0.1 as the likelihood of these happening in sequence.
If you were calculating this for more than two events, you would write it as P(A∩B∩C) = P(A) * P(B|A) * P(C|(A∩B)) – where the pipe (|) means ‘given’ (i.e. probability of B given A has occurred).
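The marble example can be checked by brute force: list every ordered way of drawing two marbles without replacement and count. This is a sketch for illustration; the variable names are not from any library.

```python
from fractions import Fraction
from itertools import permutations

bag = ["black", "black", "red", "red", "red"]

# Every ordered pair of distinct marbles (drawing without replacement).
draws = list(permutations(range(len(bag)), 2))

first_black = [d for d in draws if bag[d[0]] == "black"]         # event A
both_black = [d for d in first_black if bag[d[1]] == "black"]    # A and B

p_a = Fraction(len(first_black), len(draws))
p_a_and_b = Fraction(len(both_black), len(draws))
p_b_given_a = p_a_and_b / p_a  # conditional probability formula

print(p_a, p_b_given_a, p_a_and_b)
```

The counts agree with the multiplication rule: P(A) * P(B|A) equals the directly counted P(A∩B).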
If we have truly independent events (whereby event A has no bearing on event B), we can use the specific multiplication rule. This is simply p(A∩B) = p(A) * p(B). Independent events could be owning a dog & liking pizza. This is referred to as joint probability, that is, the probability of events A and B both occurring.
Sometimes, we will want to know the probability of event A or event B occurring. For that, the formula is P(A) + P(B) – P(A AND B). Remember, for independent events, P(A AND B) or P(A∩B) is equal to simply P(A)*P(B). The process of working out the probability of one or the other events happening is called the union of events and can be denoted as P(A∪B) = P(A) + P(B) – P(A∩B).
If we know the P(B|A) and want to find P(A|B), we can use Bayes Theorem. The formula for Bayes Theorem is: P(A|B) = P(B|A) * P(A) / P(B) .
Let’s say we have 2 dice, one fair and one weighted; where the weighted dice has a 50% chance of rolling a 6 and an equal chance of rolling any other number.
Let’s pose the question then: P(Choosing A Biased Dice GIVEN We rolled a 6)
So the formula expanded is: P(Choosing A Biased Dice GIVEN We rolled a 6) = P(rolling a 6 given we chose a biased dice) * p(choosing a biased dice) / p(rolling a 6).
The probability of choosing the biased die (p(A)) is 50%, because we have two dice & we choose one at random.
The probability of rolling a 6 (p(B)) is: 0.5 (probability of choosing biased dice) * 0.5 (probability of rolling a 6 on the biased dice) + 0.5 (probability of choosing a fair dice) * 0.17 (probability of rolling a 6 on the fair dice). This equals ⅓ probability.
We already know P(B|A), which is the probability of rolling a 6 given that we chose the biased dice: it is 50%, because that dice is weighted.
So we have P(A|B) = (0.5 * 0.5) / ⅓ = 0.75 (using the rounded 0.33 in the denominator would give 0.76). So there is a 75% chance that we have a biased dice, given that we have rolled a 6.
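Working with exact fractions (1/6 rather than the rounded 0.17) avoids rounding drift in the Bayes calculation. A minimal sketch, with illustrative variable names:

```python
from fractions import Fraction

p_biased = Fraction(1, 2)            # P(A): we pick one of two dice at random
p_six_given_biased = Fraction(1, 2)  # P(B|A): the weighted dice rolls a 6 half the time
p_six_given_fair = Fraction(1, 6)

# P(B): total probability of rolling a 6 across both dice.
p_six = p_biased * p_six_given_biased + (1 - p_biased) * p_six_given_fair

# Bayes Theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_biased_given_six = p_six_given_biased * p_biased / p_six

print(p_six, p_biased_given_six)
```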
The table below summarises the rules, taking P(A) = 0.4 and P(B) = 0.25 and treating A and B as independent:
Quantity | Formula | Calculation | Result |
Probability of A not occurring, P(A’) | 1 – P(A) | 1 – 0.4 | 0.6 |
Probability of B not occurring, P(B’) | 1 – P(B) | 1 – 0.25 | 0.75 |
Probability of A and B both occurring, P(A∩B) | P(A) * P(B) | 0.4 * 0.25 | 0.1 |
Probability that A or B occurs, P(A∪B) | P(A) + P(B) – P(A∩B) | 0.4 + 0.25 – 0.1 | 0.55 |
Probability that A or B occurs but not both, P(AΔB) | P(A) + P(B) – 2P(A∩B) | 0.4 + 0.25 – 2(0.4 * 0.25) | 0.45 |
Probability of neither A nor B occurring, P((A∪B)’) | 1 – (P(A) + P(B) – P(A∩B)) | 1 – (0.4 + 0.25 – 0.1) | 0.45 |
Probability of A occurring but not B | P(A) × (1 – P(B)) | 0.4 * (1 – 0.25) | 0.3 |
Probability of B occurring but not A | P(B) × (1 – P(A)) | 0.25 * (1 – 0.4) | 0.15 |
Probability of A given B, P(A|B) | P(A∩B) / P(B) | 0.1 / 0.25 | 0.4 |
Probability of B given A, P(B|A) | P(A∩B) / P(A) | 0.1 / 0.4 | 0.25 |
Probability of A and B where B occurs given A occurred, P(A∩B) | P(A) * P(B|A) | 0.4 * 0.25 | 0.1 |
How do we know if an event is truly independent? We can test for independence/dependence by simply using these formulae: p(A|B) = P(A) or p(B|A) = P(B). If either of these is true, then the events are independent. If we look at the probability of rain, given we tossed a heads, then the formula (as we discussed above) is: P(A∩B) / P(B). From the table below, rain and heads occurred together 15 times out of 100 and heads occurred 50 times out of 100, so P(A|B) = (15/100) / (50/100) = 30/100. So given the statement p(A|B) = P(A), we can say that it’s true – the probability of A (rain) given B (getting heads) is 30 out of 100 and the probability of getting rain (irrespective of the coin toss) is also 30/100.
 | Heads | Tails | TOTAL |
Rain | 15 | 15 | 30 |
Sun | 35 | 35 | 70 |
TOTAL | 50 | 50 | 100 |
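The independence test can be run directly on the counts in the table. In this sketch, the dictionary layout and variable names are assumptions made for illustration.

```python
from fractions import Fraction

# Counts from the rain/sun versus heads/tails table above.
counts = {
    ("rain", "heads"): 15, ("rain", "tails"): 15,
    ("sun", "heads"): 35, ("sun", "tails"): 35,
}
total = sum(counts.values())

p_rain = Fraction(sum(n for (weather, _), n in counts.items() if weather == "rain"), total)
p_heads = Fraction(sum(n for (_, coin), n in counts.items() if coin == "heads"), total)
p_rain_and_heads = Fraction(counts[("rain", "heads")], total)

# Conditional probability: P(rain | heads) = P(rain and heads) / P(heads)
p_rain_given_heads = p_rain_and_heads / p_heads

# Independent if knowing the coin result does not change the chance of rain.
independent = p_rain_given_heads == p_rain
print(p_rain_given_heads, p_rain, independent)
```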
Sometimes, we might not know the total number of permutations possible from a dataset. So, we need to calculate it. If we have five people, Bob, Sam, Joanne, Sarah and Clive, how many possible ways can they be ordered?
- Bob, Sam, Joanne, Sarah, Clive
- Clive, Sarah, Bob, Joanne, Sam
- ….
We’re not going to sit and work this out by hand. We simply have the formula N! where N is the number of objects. In this case, N is 5, as we have 5 objects (names). Now, we simply say 5! = (5x4x3x2x1) = 120 possible permutations.
If we take it a step further and say that those 5 people entered a competition, but only the top 2 get prizes, how many permutations are there for the top 2 spots when we have 5 entries? Here, we have N! / (N-x)! – where N! as we know is (5x4x3x2x1) and N-x is (5-2), as there are two prize positions. So the formula is (5x4x3x2x1)/(3x2x1) = 20 possible permutations for the top two positions in the competition.
Sometimes, order doesn’t matter, so we can use the combination rather than the permutation. For example, if we make a team of 3 from our five people listed above – we don’t care if the team is made up of Bob, Sam, Sarah or Sarah, Sam, Bob – it’s the same team, with the same members & position is irrelevant. For this, we use the formula N! / [(N-x)! * x!], where x is the number of people to be on the team and N is the total number of people. So we have 5! / [(5-3)! * 3!], which is (5x4x3x2x1)/[(2x1) * (3x2x1)] = 120/12 = 10 possible teams.
Now, what is the probability that Joanne and Sam will be on the same team? Here, we essentially have our team of 3 as Joanne, Sam, MemberX. So, we need to work out how many ways the remaining spot can be filled. We have 3 remaining people and 1 spot. So, we have 3! / [(3-1)! x 1!] which is, unsurprisingly in this instance, 3.
So now, we know that we had 10 combinations in total and 3 of them include both Joanne and Sam. So there is a 3/10 (30%) probability that they will both be on the team.
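These counts can be cross-checked in two ways: with the formulas (math.factorial, math.perm and math.comb, available from Python 3.8) and by brute-force enumeration with itertools. The sketch below does both; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import combinations
from math import comb, factorial, perm

people = ["Bob", "Sam", "Joanne", "Sarah", "Clive"]

orderings = factorial(len(people))     # N!
top_two = perm(len(people), 2)         # N! / (N - x)!
teams_of_three = comb(len(people), 3)  # N! / [(N - x)! * x!]

# Brute force: list every possible team of 3 and count those with both people.
all_teams = list(combinations(people, 3))
teams_with_both = [team for team in all_teams if "Joanne" in team and "Sam" in team]
p_same_team = Fraction(len(teams_with_both), len(all_teams))

print(orderings, top_two, teams_of_three)
print(len(teams_with_both), p_same_team)
```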
HYPOTHESIS TESTING & STATISTICAL SIGNIFICANCE
Let’s consider that we have calculated a sample mean of 50. We have a null hypothesis that says ‘the mean of the population is 50’ and an alternative hypothesis of ‘the mean of the population is not 50’. These are denoted as: H0 : μ = 50 and H1 : μ ≠ 50. We can consider the null hypothesis to be ‘what we think now’ and the alternative to be our opinion (we’re trying to disprove the null hypothesis, because we believe that the population mean is not equal to 50).
In the case of a linear regression, the null hypothesis is a statement that says there is no relationship between X and Y, while the alternative hypothesis argues that there is a relationship. The p-value is the probability of seeing a result at least as extreme as the one in our data if the null hypothesis were true. If the p-value is less than 0.05, it indicates strong evidence against the null hypothesis: data like ours would turn up less than 5% of the time if there were no relationship, so there is likely to be a relationship between X and Y. A p-value higher than 0.05 does not prove that the null hypothesis is true; it means we do not have enough evidence to reject it. The p-value can go between 0 and 1; the closer it gets to 1, the more consistent the data is with the null hypothesis. We get a p-value for every input variable.
Alpha (α) is the significance level: the probability of rejecting the null hypothesis when it is actually true. The 0.05 threshold used above is an alpha of 5%.
An example…
Hypothesis testing enables us to test whether our data is consistent with a given hypothesis. For example, let’s say that we have a random team selector at our local football club. We have 7 players, of which 5 will be selected each night at random. So, the chance of not being chosen on any given night is 2/7.
If, after 5 nights, you still haven’t been chosen, you may start to suspect, it’s not all that random. So you investigate.
- The chance of not being selected on night 1 is 2/7
- The chance of not being selected in the first 2 nights is (2/7)*(2/7)
- The chance of not being selected in the first 3 nights is (2/7)*(2/7)*(2/7)
- The chance of not being selected in the first 4 nights is (2/7)*(2/7)*(2/7)*(2/7)
- The chance of not being selected in the first 5 nights is (2/7)*(2/7)*(2/7)*(2/7)*(2/7)
Your null hypothesis is that this is random and you’ve just been unlucky. Your alternative hypothesis is that this isn’t random & someone is cheating you out of the game. The likelihood of not being chosen that many nights in a row is simply (2/7)^5 (5 nights) – which is 0.0019, or roughly 0.2%. Given that this is so low, we can reject the null hypothesis.
Based on your scenario, you will need to define the threshold at which you will accept or reject the null hypothesis.
In this scenario, 0.0019 is the p-value, which is the probability of this happening by random chance. If selection really were random, there is roughly a 0.2% probability (about 1 in 525) of not being selected 5 nights in a row. That is rare enough that our alternative hypothesis is likely correct.
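The football example is short enough to compute exactly. In this sketch, the 0.05 threshold (alpha) is an assumed choice for illustration; as noted above, you should set the threshold to suit your own scenario.

```python
from fractions import Fraction

p_not_picked = Fraction(2, 7)  # chance of being left out on any one night
nights = 5
alpha = 0.05                   # assumed significance threshold

# Under the null hypothesis (selection is random), nights are independent,
# so the probability of five misses in a row is (2/7) ** 5.
p_value_exact = p_not_picked ** nights
p_value = float(p_value_exact)

reject_null = p_value < alpha
print(p_value_exact, round(p_value, 4), reject_null)
```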
REGRESSION TABLES
Here is a regression table. Below, I will go through the key components.
Dep. Variable:      Price               R-squared:             0.898
Model:              OLS                 Adj. R-squared:        0.888
Method:             Least Squares       F-statistic:           92.07
Date:               Sun, 13 Dec 2020    Prob (F-statistic):    4.04e-11
Time:               18:54:05            Log-Likelihood:        -134.61
No. Observations:   24                  AIC:                   275.2
Df Residuals:       21                  BIC:                   278.8
Df Model:           2
Covariance Type:    nonrobust
=====================================================================================
                         coef    std err          t      P>|t|     [0.025     0.975]
-------------------------------------------------------------------------------------
const               1798.4040    899.248      2.000      0.059    -71.685   3668.493
Inflation            345.5401    111.367      3.103      0.005    113.940    577.140
Employment_level    -250.1466    117.950     -2.121      0.046   -495.437     -4.856
=====================================================================================
R Squared | R squared is also known as the coefficient of determination. It tells us how well the line fits the data & denotes the proportion of the variance in the dependent variable that is predictable from the independent variables. If we look at the price of pizza, the price is determined mostly by the price of dough and the price of toppings. So in this regression analysis, we would have a high R squared, as the change in Y is explainable by the X’s. With a single input variable, we get R squared by simply taking the value of r and squaring it. This makes a negative value positive and presents a percentage output. If r is 0.95 then r squared is 0.90 (90%). This means that 90% of the variance of Y is predictable. |
Adjusted R Squared | Adjusted R squared is never higher than R squared. The problem with R squared is that it never decreases as you add input variables, even when they add nothing useful. Adjusted R squared compares the explanatory power of regression models that contain a different number of predictors. If you have a 10 predictor model with a higher R squared than a 1 predictor model, how do you know if it’s higher because it’s a better model, or it’s higher because there are more predictors? The adjusted R squared tells us this and we can compare the value across the two models to find out. Adjusted R squared is adjusted to take account of the number of predictors in the model. It only goes up if the new predictor improves the model by more than we would expect by chance and it decreases when the predictor does not. |
R | r is the correlation coefficient of a dataset. r will be a value between -1 and +1. Negative one, is a perfect negative correlation; +1 is a perfect positive correlation. |
P Value | Tests whether the data is consistent with the null hypothesis. Remember, the null hypothesis is a statement that says there is no relationship between X and Y. The p-value is the probability of seeing a result at least as extreme as ours if the null hypothesis were true. If the p-value is less than 0.05, it indicates strong evidence against the null hypothesis, which means there is likely to be a relationship between X and Y. A p-value higher than 0.05 does not prove the null hypothesis; it means we do not have enough evidence to reject it. The p-value can go between 0 and 1; the closer it gets to 1, the more consistent the data is with the null hypothesis. We get a p-value for every input variable. In other words, the lower the p-value, the less likely it is that the observed effect of X on Y arose by chance alone and the greater the statistical significance. I’ve included an example here. |
Coefficient | The value that represents the slope for that input variable (b in the regression formula y = a + bx): the expected change in Y for a one-unit change in X, holding the other input variables constant. |
F-Statistic | The F-test provides us with a measure for the overall significance of the model. It evaluates the null hypothesis that all the regression coefficients are zero (there is no linear relationship between your input variables and the output, meaning none of the independent variables matter & the model has no merit). The alternative hypothesis is that at least one coefficient differs from zero. A significant F-test tells us that the relationship between Y and the X’s is statistically reliable. An F value close to 1 tells us that the model is not statistically significant. The further it gets above 1, the better. That is to say, the lower the F statistic, the closer to a non-significant model we get. |
Std error of estimate (SEE) | Summarises the errors between the actual values and the estimated values. Calculated as sqrt(sum((Estimated_Value – Actual_Value)^2) / (n – 2)), where n is the number of observations (with more than one input variable, divide by n minus the number of input variables minus 1). It tells us the typical distance that the observed values fall from the regression line, using the units of the Y variable. |
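Several numbers in the regression table above can be re-derived from each other, which is a useful way to check your understanding. The sketch below uses the standard formulas for adjusted R squared, the F-statistic and the t-statistic with the rounded values printed in the table, so the F value comes out close to, but not exactly, the reported 92.07. The variable names are illustrative.

```python
# Values read from the regression table above.
r_squared = 0.898
n_obs = 24        # No. Observations
n_predictors = 2  # Df Model (Inflation and Employment_level)

df_residuals = n_obs - n_predictors - 1  # matches "Df Residuals: 21"

# Adjusted R squared penalises R squared for the number of predictors.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_residuals

# F-statistic: explained variance per predictor / unexplained variance per residual df.
f_statistic = (r_squared / n_predictors) / ((1 - r_squared) / df_residuals)

# Each t value in the table is simply coefficient / standard error.
t_const = 1798.4040 / 899.248
t_inflation = 345.5401 / 111.367
t_employment = -250.1466 / 117.950

print(df_residuals, round(adj_r_squared, 3), round(f_statistic, 1))
print(round(t_const, 3), round(t_inflation, 3), round(t_employment, 3))
```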