# The Data Scientist Statistics Learning Plan For 2021

As data scientists, we need to be comfortable with mathematics. If you Google what you need to know, you’ll find answers stating that you need to fully understand linear algebra and calculus, and be able to calculate all of the algorithms we use by hand. I’m not going to downplay the importance of understanding how an algorithm works, but I dispute the need to be able to calculate it by hand. We didn’t create the computer so we could continue to do everything ourselves. We just need to understand what the algorithms are doing, so we can interpret the results correctly.

I once heard the quote that “a data scientist should be better at math than the average programmer and better at programming than the average mathematician”. The notion being that you don’t need to be an expert in both. Consider this while working through your own development.

With that in mind, I’ve collated a list of the key statistics I believe data scientists need:

• Data Types: Numerical (Discrete; Continuous) and Categorical
• Measures of central tendency: Mean, Median, Mode; Weighted Mean
• Point estimates & confidence intervals
• Percentiles / quartiles / box plots
• Distributions: Skew; Normal Distribution; Standard Normal Distribution; Central Limit Theorem; Standard Error
• Measures of variability: Range; Variance; Standard Deviation; Z-Score
• Relationships: Correlation; Covariance
• Probability: Addition Rule; Multiplication Rule; Conditional Probability; Testing For Independence; Calculating Possible Permutations
• Hypothesis: Null Hypothesis; Alternative Hypothesis; p-value
• Regression Tables: R; R-Squared; Adjusted R-Squared; F-Statistic; Coefficient; Standard Error of Estimate

Let’s now discuss each in some detail:

DATA TYPES

Okay, let’s start simple. We can have numerical or categorical data types. A numerical data type is, well, numeric, but it can be split into two: discrete or continuous. A discrete data type is one which can accept only certain values. For example, when rolling a die, you can only get the whole numbers 1 to 6. A continuous data type can take any value within a range, so you could have 1.4839, for example.

A categorical data type is one which contains categories, like ‘yes’ and ‘no’ or ‘high’, ‘medium’ or ‘low’.

MEASURES OF CENTRAL TENDENCY

There are three basic measures of central tendency which we need to be comfortable with: mean, median and mode. The mean is often referred to as the average. It’s simply sum(all_values)/count(all_values). The mean of 2, 3, 7, 4 would be 4, because the sum of all the values is 16 and there are 4 values: 16/4 = 4.

The median is the middle value. If we have 2, 3, 7, 5, 1, we would first make sure they’re in ascending order (1, 2, 3, 5, 7) and then we would take the central value, which is three. If there is an even number of values, the median is the mean of the two central values. The median is not sensitive to outliers like the mean is (i.e. if there is one really huge value at the end of a dataset, the mean would be increased while the median would barely move).

The mode is simply the number that appears most frequently. In the data set, 1, 5, 6, 5, 4, 2, the mode would be five, as it has appeared twice.
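As a quick check of the three measures, here is a minimal sketch using Python’s built-in `statistics` module on the same numbers as above. The `with_outlier` list is my own illustrative addition, there only to show how an extreme value moves the mean far more than the median.

```python
import statistics

# The same small datasets used in the text above.
mean_value = statistics.mean([2, 3, 7, 4])          # 16 / 4
median_value = statistics.median([2, 3, 7, 5, 1])   # middle of 1, 2, 3, 5, 7
mode_value = statistics.mode([1, 5, 6, 5, 4, 2])    # 5 appears twice

# Illustrative only: add one huge value and compare mean vs median.
with_outlier = [2, 3, 7, 5, 1, 1000]
mean_with_outlier = statistics.mean(with_outlier)
median_with_outlier = statistics.median(with_outlier)

print(mean_value, median_value, mode_value)
print(mean_with_outlier, median_with_outlier)
```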

A weighted mean or weighted average takes into account the differing importance of our data points. Consider a university course: exams will usually count for more than quizzes or homework, so they should be weighted higher (a 70% score in an exam should be worth more than a 70% score in a quiz). In our example we use the weightings below, which say that the overall grade is influenced most by exams, then by coursework, and least by quizzes.

• Exam 1: 30%
• Exam 2: 30%
• Quiz 1: 10%
• Quiz 2: 10%
• Coursework: 20%

So, if a student got the below grades in each, let’s work out their weighted average score.

• Exam 1: 80
• Exam 2: 72
• Quiz 1: 99
• Quiz 2: 100
• Coursework: 60

We calculate the weighted mean in a simple manner. We take each score, multiply it by its weight and add the results together. So in this case, we would have (80*0.3)+(72*0.3)+(99*0.1)+(100*0.1)+(60*0.2) = 77.5, a weighted mean of 77.5%. Because the weights add up to 1 (100%), no further division is needed. If they did not, we would divide by the sum of the weights.

The challenge with a weighted average is selecting the categories and the weightings for those categories.
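The weighted mean above can be sketched in a few lines of Python. The dictionary names are my own. Dividing by the sum of the weights is a no-op here (they sum to 1), but it keeps the sketch correct if you change the weightings.

```python
# Weights and scores from the example above (weights as fractions of 1).
weights = {"Exam 1": 0.30, "Exam 2": 0.30, "Quiz 1": 0.10, "Quiz 2": 0.10, "Coursework": 0.20}
scores = {"Exam 1": 80, "Exam 2": 72, "Quiz 1": 99, "Quiz 2": 100, "Coursework": 60}

# Sum of score * weight, divided by the total weight.
weighted_sum = sum(scores[name] * weights[name] for name in weights)
weighted_mean = weighted_sum / sum(weights.values())

# For comparison: the unweighted mean treats every item equally.
plain_mean = sum(scores.values()) / len(scores)

print(round(weighted_mean, 2), round(plain_mean, 2))
```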

POINT ESTIMATES AND CONFIDENCE INTERVALS

A point estimate is a single value, calculated from a sample dataset, that we use to estimate a population parameter. If you have sampled 1,000 students from a university that has a cohort size of 10,000, you have sampled 10% of the population. The sample mean grocery spend of the students is \$60 per week, denoted as X̄ = 60 (X̄ is the symbol we use for the sample mean). X̄ is a point estimate of the population mean, μ.

If we calculated the variance of the sample dataset, we would denote it as s² (rather than the population variance, σ²). This, again, is a point estimate: it’s an inference we make about the population, based on a sample.

The problem with point estimates is that, on their own, they can be unreliable. What if you surveyed 1,000 students on campus in the evening? Well, maybe those students are very sociable and out with their friends, while the students you did not sample are more introverted and at home. There is bias built into this sampling mechanism.

We therefore calculate confidence intervals, which give a range of values. We may say that the sample mean is 60 and the confidence interval runs from 50 to 70 dollars per week. We express our confidence as a percentage: ‘we are 85% confident that the average student spend is between 50 and 70 dollars per week for the population’. Note that a confidence interval captures the uncertainty that comes from random sampling. It does not correct for a biased sampling mechanism.

To calculate a confidence interval, we use a z-score lookup table, like the one below. If, for example, we were looking for a confidence interval of 85%, we would take 1.440 standard errors either side of the mean as our confidence window: from -1.440 to +1.440.

We can calculate this as X̄ ± Z × (s / √n), where X̄ is the mean of our sample, Z is the z-score from the table, s is the standard deviation of our sample dataset and n is the number of observations (datapoints). The term s / √n is the standard error of the mean. Hence, if we have a mean of 50, 100 datapoints, a standard deviation of 15 and want 85% confidence, we would have the formula 50 ± 1.440 × (15 / √100) = 50 ± 2.16, which gives us an interval of 47.84 to 52.16.
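If you don’t have a lookup table to hand, the z-score can be computed instead. The sketch below is one way to do it with the standard library, assuming a two-sided interval based on the normal distribution. The function name `confidence_interval` is my own, not a library call.

```python
from math import sqrt
from statistics import NormalDist


def confidence_interval(sample_mean, sample_sd, n, confidence):
    """Two-sided z-based interval: mean +/- z * (s / sqrt(n))."""
    # Half of the leftover probability sits in each tail,
    # so we look up the z-score at 0.5 + confidence / 2.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sample_sd / sqrt(n)
    return sample_mean - margin, sample_mean + margin, z


# The worked example: mean 50, sd 15, 100 datapoints, 85% confidence.
lower, upper, z = confidence_interval(50, 15, 100, 0.85)
print(round(z, 2), round(lower, 2), round(upper, 2))
```

For small samples, a t-distribution is normally used in place of the z-score. That is outside the scope of this sketch.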

PERCENTILES

Percentiles divide a population into 100 equal groups according to the distribution of values. A percentile can be between 1 and 99. For the Xth percentile, X% of observations fall below that value. For example, if you’re in the 60th percentile, your value is greater than 60% of all observations. I’ve written about percentiles, quartiles and box plots in a separate post.

SKEWNESS

Understanding our data is important, and one of the basic things we can look out for is skew in our dataset. We have negative skew (also called left skew), where the mean is less than the median. It’s called left skew because the long tail, along with any outliers, sits to the left of the graph.

In the centre, we see no skew. This is where the mean, median and mode are equal and the graph looks nice and symmetrical.

Finally, we have positive skew (also called right skew), where the outliers are to the right. The mean is greater than the median in this instance.

In all cases, the mode is the highest point of the chart.

DISTRIBUTIONS

A normal distribution has a mean, median and mode of equal values. The data looks symmetrical (the distribution is also known as the bell curve, symmetrical distribution and the Gaussian distribution). The normal distribution has no skew (50% of values are less than the mean and 50% are greater). The highest point in the normally distributed dataset will be the mean, median and mode of the dataset.

We use standardization to achieve a standard normal distribution. Our distribution could be written as X ~ N(μ, σ²). Here ~ means ‘is distributed as’, N denotes the normal distribution, μ is the mean and σ² is the variance. The process of standardization converts this to a distribution with a mean of 0 and a standard deviation of 1. We achieve this by calculating z = (x – μ) / σ for every value x. The standard normal distribution is used to compare normal distributions that vary in spread and position.
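Here is a minimal sketch of standardization. The `values` list is made up purely for illustration. After applying (x – μ) / σ, the standardized values should have a mean of 0 and a standard deviation of 1.

```python
import statistics

# Hypothetical values, for illustration only.
values = [4.0, 8.0, 6.0, 5.0, 3.0, 10.0]

mu = statistics.fmean(values)        # mean of the original values
sigma = statistics.pstdev(values)    # population standard deviation

# Standardize: subtract the mean, divide by the standard deviation.
z_scores = [(x - mu) / sigma for x in values]

print(round(statistics.fmean(z_scores), 6), round(statistics.pstdev(z_scores), 6))
```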

In the below example, we see three distributions, one with the mean of -2. This makes it hard to compare the overlays. We can standardize the distributions to support our analysis.

The empirical rule applies to a normal distribution. This rule states that almost all of the data falls within three standard deviations of the mean: roughly 68% within 1 standard deviation, 95% within 2 standard deviations and 99.7% within 3 standard deviations. This only works on a symmetrical, bell-shaped curve.
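The 68 / 95 / 99.7 figures can be reproduced from the cumulative distribution function of the standard normal distribution. This is a small sketch using the standard library’s `NormalDist`. The variable names are my own.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Share of a normal distribution lying within k standard deviations of the mean.
coverage = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, share in coverage.items():
    print(f"within {k} sd: {share:.1%}")
```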

Read more about normal distributions; the empirical rule and Chebyshev’s theorem here.

CENTRAL LIMIT THEOREM

If a population distribution is non-normal, we can still arrive at an approximately normal distribution by taking many samples from the original dataset and calculating the mean of each one. When we plot the means of the samples (and the number of samples with that mean), the result is an approximately normally distributed dataset.

Wikipedia defines it as: “when independent random variables are added, their properly normalized sum tends toward a normal distribution (informally a “bell curve”) even if the original variables themselves are not normally distributed. The theorem is a key concept in probability theory because it implies that probabilistic and statistical methods that work for normal distributions can be applicable to many problems involving other types of distributions.”

Onlinemathlearning.com defines it as: “If samples of size n are drawn randomly from a population that has a mean of μ and a standard deviation of σ, the sample means are approximately normally distributed for sufficiently large sample sizes (n ≥ 30) regardless of the shape of the population distribution. If the population is normally distributed, the sample means are normally distributed for any size sample.”

The central limit theorem allows us to perform tests against a normal distribution. This is important because many statistical methods (such as confidence intervals) assume a normal distribution.

In short, the central limit theorem allows us to make inferences from the approximately normal distribution of sample means, even if the original dataset is not normal.

Let’s say, we have income values. The original dataset looks like this:

This dataset has 52,800 observations. If we split this into 1,760 samples of 30 observations each and take the mean salary of each sample, we should see a much more normal distribution.

The mean salary of each sample is plotted on the X axis. The number of samples with that mean is the frequency on the Y axis.
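To see the effect without the original income data, here is a rough simulation. Everything about the population is an assumption of mine: I generate 52,800 right-skewed values from an exponential distribution with a mean of about 30,000, split them into 1,760 samples of 30 and take the mean of each. The gap between mean and median is used as a crude indicator of skew.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical right-skewed "income" population (not the dataset from this post).
population = [random.expovariate(1 / 30_000) for _ in range(52_800)]

# Split into consecutive samples of 30 and take the mean of each sample.
sample_size = 30
sample_means = [
    statistics.fmean(population[start:start + sample_size])
    for start in range(0, len(population), sample_size)
]


def mean_median_gap(values):
    """Crude skew indicator: how far the mean sits above the median."""
    return statistics.fmean(values) - statistics.median(values)


print(len(sample_means))
print(round(mean_median_gap(population)), round(mean_median_gap(sample_means)))
print(round(statistics.stdev(population)), round(statistics.stdev(sample_means)))
```

The sample means should show a far smaller mean-to-median gap and a much narrower spread than the raw values, which is what the histogram of sample means illustrates.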

STANDARD ERROR

Above, we saw that with the central limit theorem we take multiple samples from a population dataset and that, no matter what the original distribution, we can make it approximately normal by plotting the mean values of each sample, rather than each individual datapoint.

Standard error is the standard deviation of the means of our sample datasets. Let’s say we have 6 samples from our original dataset. The means of those samples are: 10, 12, 11, 10.5, 9.5, 11.1. The standard deviation of these sample means is approximately 0.88, and this is also known as the standard error. In practice we often have only one sample, so we estimate the standard error as s / √n, which is the term used in the confidence interval formula earlier.

Our standard error tells us how close to the population mean our sample means are likely to be. In other words, how representative of the population is the sample? In our case, a standard error of 0.88 is pretty low variability and hence suggests that these samples are likely to be representative of the population.
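The 0.88 figure can be checked directly. This sketch uses the sample standard deviation (dividing by n – 1), which is what reproduces the value quoted above.

```python
import statistics

# Means of the six samples from the example above.
sample_means = [10, 12, 11, 10.5, 9.5, 11.1]

# Standard deviation of the sample means (n - 1 in the denominator).
standard_error = statistics.stdev(sample_means)

print(round(standard_error, 2))
```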

MEASURES OF VARIABILITY

The range is simply the difference between the highest and lowest value in our dataset. It gives us a quick indication of how spread out the data is.

Variance measures how spread out the data is, in relation to the mean. It’s calculated as the average squared distance from the mean: sum((observation – mean)²) / number_of_observations. The population variance is denoted as σ². For a sample, we divide by (number_of_observations – 1) instead and denote it s².

The standard deviation is calculated as the square root of the variance. This means, the standard deviation is expressed in the same units as the mean while variance is expressed in squared units. A distribution with a mean of 10 and a standard deviation of 3 is the same as a distribution with a mean of 10 and a variance of 9.

A z-score is how many sigmas (standard deviations) above or below the mean a datapoint is. A negative z-score means the datapoint is below the mean. I’ve discussed z-scores in detail in a separate post.
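As a small worked example of these measures, the sketch below reuses the values 2, 3, 7, 4 from the mean example earlier and computes the range, the population and sample variance, the standard deviation and a z-score by hand. The variable names are my own.

```python
import math
import statistics

data = [2, 3, 7, 4]

data_range = max(data) - min(data)
mean = sum(data) / len(data)

# Sum of squared distances from the mean.
squared_distances = sum((x - mean) ** 2 for x in data)

population_variance = squared_distances / len(data)    # divide by N
sample_variance = squared_distances / (len(data) - 1)  # divide by N - 1
population_sd = math.sqrt(population_variance)

# z-score of the value 7: how many standard deviations above the mean it sits.
z_score_of_7 = (7 - mean) / population_sd

print(data_range, population_variance, round(sample_variance, 3))
print(round(population_sd, 3), round(z_score_of_7, 3))
```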

MEASURE THE RELATIONSHIP BETWEEN FEATURES

TechTarget defines correlation (correlation coefficient) as: “a statistical measure that indicates the extent to which two or more variables fluctuate together”. There are two major components to correlation: strength and type of correlation.

Let’s look at some types:

• Positive correlation means that when X goes up, Y goes up
• Negative correlation means that when X goes up, Y goes down
• None simply means that there is no correlation
• A perfect correlation is when there is a perfect linear relationship between X and Y (a coefficient of exactly +1 or -1).

While correlation is a scaled measure of relationship (between -1 and +1), covariance is the unscaled equivalent. If it’s a positive number, X and Y tend to move in the same direction; if it’s negative, they tend to move in opposite directions. If the covariance is zero, there is no linear relationship between X and Y (there may still be a non-linear one).
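To make the scaled vs unscaled point concrete, here is a hand-rolled sketch of sample covariance and the Pearson correlation coefficient. The function names and the tiny datasets are my own. Multiplying Y by 100 changes the covariance a hundredfold but leaves the correlation untouched.

```python
import math


def covariance(xs, ys):
    """Sample covariance: average product of deviations (n - 1 denominator)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


# Illustrative data only.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]        # rises with x
y_down = [10, 8, 6, 4, 2]      # falls as x rises
y_up_scaled = [value * 100 for value in y_up]

print(covariance(x, y_up), covariance(x, y_up_scaled))
print(correlation(x, y_up), correlation(x, y_up_scaled), correlation(x, y_down))
```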

PROBABILITY

Probability is the likelihood of an event happening. If we think about the likelihood of rolling a 2 on a die, you have a 1/6 chance of that happening, because there is one possible way to get a two and there are six sides on a die. If we rolled two dice and wanted a total of 2, you would have a 1/36 chance. This is because there are 6 x 6 = 36 possible outcomes and we can get a total of two in only 1 way (two ones).

If we want to find out the probability of event A or event B happening, we use the addition rule, whereas the multiplication rule, which we discuss a little later, is used to find the probability of both events happening. The addition rule is written as: P(A∪B) = P(A) + P(B) – P(A∩B). Let’s say we have two dice. The question is, ‘what is the likelihood that die 1 or die 2 (or both) will equal 6?’

In the below, I have a comma separated view of all possible rolls. I have highlighted all instances where at least one of the dice is equal to 6 in orange. The formula then is: 6/36 + 6/36 – 1/36 = 11/36 (30.56%). Why did we subtract one? Because in one instance (the bottom right), both dice were equal to 6. We don’t want to double count this overlap, so we subtract that instance.

It’s not always the case that you have to subtract the overlapping instances. In the above, I have highlighted in pink all instances where the sum of the two rolls equals either 4 or 5. There are 3 ways to total 4 and 4 ways to total 5, and there is no overlap (no instance where the total equals both 4 and 5), so the formula is simply 3/36 + 4/36 = 7/36 (19.4%).
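Both results can be checked by listing all 36 rolls and counting. This is a small sketch using exact fractions. The helper name `probability` is my own.

```python
from fractions import Fraction
from itertools import product

# Every possible (die 1, die 2) outcome: 36 equally likely rolls.
rolls = list(product(range(1, 7), repeat=2))


def probability(event):
    """Share of the 36 rolls for which the event function returns True."""
    favourable = sum(1 for roll in rolls if event(roll))
    return Fraction(favourable, len(rolls))


# Addition rule with overlap: at least one six.
p_first_six = probability(lambda roll: roll[0] == 6)
p_second_six = probability(lambda roll: roll[1] == 6)
p_both_six = probability(lambda roll: roll[0] == 6 and roll[1] == 6)
p_at_least_one_six = p_first_six + p_second_six - p_both_six

# Addition rule without overlap: total of 4 or total of 5.
p_sum_4 = probability(lambda roll: sum(roll) == 4)
p_sum_5 = probability(lambda roll: sum(roll) == 5)
p_sum_4_or_5 = p_sum_4 + p_sum_5

print(p_at_least_one_six, p_sum_4_or_5)
```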

Conditional probability takes the formula P(B|A) = P(A∩B) / P(A). This is used to find out the probability of B given that A has happened. If A and B are independent, P(A∩B) is simply the product P(A)*P(B), so the formula reduces to P(B|A) = (P(A)*P(B)) / P(A) = P(B).

As an example, say we have 5 marbles in a bag: 2 are black and 3 are red. Let A be ‘the first marble is black’ and B be ‘the second marble is black’. The chance of getting a black marble in the first selection is P(A) = 2/5. When we select the next marble, we have now only got 4 marbles, of which 1 is black (because we have removed one in our first selection). So, the probability of selecting a black is 1/4. The probability of both marbles being black is P(A∩B) = 2/5 * 1/4 = 1/10. If we put this into the conditional probability formula: P(B|A) = (1/10) / (2/5) = 0.1 / 0.4, which gives us 0.25 (1/4).

Now we know the probability of B happening, given that A has happened, we need to multiply that by the probability of A happening, so we know the probability of them happening in sequence. The multiplication rule is written like this: P(A∩B) = P(A) * P(B|A). This means the probability of A and B happening is the probability of A multiplied by the probability of B given that A has already happened. This formula applies for independent and dependent events. In our marble example, we would have 0.4*0.25 = 0.1 as the likelihood of these happening in sequence.
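The marble example can be verified by enumerating every ordered pair of draws without replacement. This is a sketch with my own variable names. The bag matches the example: 2 black and 3 red.

```python
from fractions import Fraction
from itertools import permutations

bag = ["black", "black", "red", "red", "red"]

# Every ordered way of drawing two different marbles (by position in the bag).
draws = list(permutations(range(len(bag)), 2))

first_black = [d for d in draws if bag[d[0]] == "black"]        # event A
both_black = [d for d in first_black if bag[d[1]] == "black"]   # event A and B

p_a = Fraction(len(first_black), len(draws))
p_a_and_b = Fraction(len(both_black), len(draws))

# Conditional probability: P(B|A) = P(A and B) / P(A).
p_b_given_a = p_a_and_b / p_a

print(p_a, p_b_given_a, p_a_and_b)
```

The multiplication rule falls out of the same counts: P(A) * P(B|A) equals the directly counted P(A∩B).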

If you were calculating this for more than two events, you would write it as P(A∩B∩C) = P(A) * P(B|A) * P(C|(A∩B)), where the pipe (|) means ‘given’ (i.e. the probability of B given A has occurred).

If we have truly independent events (whereby event A has no bearing on event B), we can use the specific multiplication rule. This is simply P(A∩B) = P(A) * P(B). Independent events could be owning a dog and liking pizza. This is referred to as joint probability, that is, the probability of events A and B both occurring.

Sometimes, we will want to know the probability of event A or event B occurring. For that, the formula is P(A) + P(B) – P(A AND B). Remember, P(A AND B), or P(A∩B), is equal to P(A)*P(B) only when the events are independent; otherwise it is P(A)*P(B|A). The process of working out the probability of one or the other event happening is called the union of events and can be denoted as P(A∪B) = P(A) + P(B) – P(A∩B).

If we know P(B|A) and want to find P(A|B), we can use Bayes’ Theorem. The formula for Bayes’ Theorem is: P(A|B) = P(B|A) * P(A) / P(B).

Let’s say we have 2 dice, one fair and one weighted, where the weighted die has a 50% chance of rolling a 6 and an equal chance of rolling any other number.

Let’s pose the question then: P(choosing the biased die GIVEN we rolled a 6).

So the formula expanded is: P(choosing the biased die GIVEN we rolled a 6) = P(rolling a 6 given we chose the biased die) * P(choosing the biased die) / P(rolling a 6).

The probability of choosing the biased die (P(A)) is 50%, because we have two dice and we choose one at random.

The probability of rolling a 6 (P(B)) is: 0.5 (probability of choosing the biased die) * 0.5 (probability of rolling a 6 on the biased die) + 0.5 (probability of choosing the fair die) * 1/6 (roughly 0.17, the probability of rolling a 6 on the fair die). This equals 1/4 + 1/12 = ⅓.

We already know P(B|A), the probability of rolling a 6 given that we chose the biased die. It is 50%, because that die is weighted.

So we have P(A|B) = 0.5 * 0.5 / (1/3) = 0.75. So there is a 75% chance that we have the biased die, given that we have rolled a 6.
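Here is the same Bayes’ Theorem calculation as a short sketch using exact fractions. The variable names are my own. Working in fractions gives exactly 3/4; rounding P(rolling a 6) to 0.33 part-way through would give roughly 0.76 instead.

```python
from fractions import Fraction

# Prior: we pick one of the two dice at random.
p_biased = Fraction(1, 2)
p_fair = 1 - p_biased

# Likelihood of a six on each die.
p_six_given_biased = Fraction(1, 2)
p_six_given_fair = Fraction(1, 6)

# Total probability of rolling a six, whichever die we picked.
p_six = p_biased * p_six_given_biased + p_fair * p_six_given_fair

# Bayes' Theorem: P(biased | six) = P(six | biased) * P(biased) / P(six).
p_biased_given_six = p_six_given_biased * p_biased / p_six

print(p_six, p_biased_given_six)
```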

How do we know if events are truly independent? We can test for independence/dependence by simply using these formulae: P(A|B) = P(A) or P(B|A) = P(B). If either of these is true, then the events are independent. Let A be rain, with P(A) = 30/100, and B be tossing heads, with P(B) = 50/100. The probability of rain given we tossed heads is, as we discussed above, P(A ∩ B) / P(B) = (30/100 * 50/100) / (50/100) = 30/100. So given the statement P(A|B) = P(A), we can say that it’s true: the probability of A (rain) given B (getting heads) is 30 out of 100, and the probability of rain (irrespective of how the coin landed) is also 30/100.

Sometimes, we might not know the total number of permutations possible from a dataset. So, we need to calculate it. If we have five people, Bob, Sam, Joanne, Sarah and Clive, how many possible ways can they be ordered?

• Bob, Sam, Joanne, Sarah, Clive
• Clive, Sarah, Bob, Joanne, Sam
• ….

We’re not going to sit and work this out by hand. We simply have the formula N!, where N is the number of objects. In this case, N is 5, as we have 5 objects (names). Now, we simply say 5! = (5x4x3x2x1) = 120 possible permutations.

If we take it a step further, say those 5 people entered a competition, but only the top 2 get prizes. How many permutations are there for the top 2 spots when we have 5 entries? Here, we have N! / (N-x)!, where N! as we know is (5x4x3x2x1) and N-x is (5-2), as there are two prizes. So the formula is (5x4x3x2x1)/(3x2x1) = 120/6 = 20 possible permutations for the top two positions in the competition.

Sometimes, order doesn’t matter, so we can use the combination rather than the permutation. For example, if we make a team of 3 from our five people listed above, we don’t care if the team is comprised of Bob, Sam, Sarah or Sarah, Sam, Bob. It’s the same team, with the same members, and position is irrelevant. For this, we use the formula N! / [(N-x)! * x!], where x is the number of people to be on the team and N is the total number of people. So we have 5! / [(5-3)! * 3!], which is (5x4x3x2x1) / [(2x1) * (3x2x1)] = 120 / 12 = 10 possible teams.

Now, what is the probability that Joanne and Sam will be on the same team? Here, we essentially have our team of 3 as Joanne, Sam, MemberX. So, we need to work out how many combinations of the remaining people can fill that last spot. We have 3 remaining people and 1 spot. So, we have 3! / [(3-1)! x 1!] which is, unsurprisingly in this instance, 3.

So now, we know that we had 10 combinations in total and that 3 of them contain both Joanne and Sam. So there is a 3/10 (30%) probability that they will both be on the team.
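These counts are easy to check in Python. The sketch below uses `math.factorial`, `math.perm` and `math.comb` (available from Python 3.8) and then lists the teams explicitly to count those containing both Joanne and Sam. Counting directly gives 10 possible teams of 3, of which 3 contain both of them.

```python
from fractions import Fraction
from itertools import combinations
from math import comb, factorial, perm

people = ["Bob", "Sam", "Joanne", "Sarah", "Clive"]

orderings = factorial(len(people))      # ways to order all five people: 5!
top_two = perm(len(people), 2)          # ordered top-two finishes: 5! / 3!
team_count = comb(len(people), 3)       # unordered teams of three: 5! / (2! * 3!)

# List every team of three and keep those with both Joanne and Sam.
teams = list(combinations(people, 3))
teams_with_both = [team for team in teams if "Joanne" in team and "Sam" in team]

p_both_on_team = Fraction(len(teams_with_both), len(teams))

print(orderings, top_two, team_count)
print(len(teams_with_both), p_both_on_team)
```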

HYPOTHESIS TESTING & STATISTICAL SIGNIFICANCE

Let’s consider that we have calculated a sample mean of 50. We have a null hypothesis that says ‘the mean of the population is 50’ and an alternative hypothesis of ‘the mean of the population is not 50’. These are denoted as H0: μ = 50 and H1: μ ≠ 50. We can consider the null hypothesis to be ‘what we think now’ and the alternative to be our opinion (we’re trying to disprove the null hypothesis, because we believe that the population mean is not equal to 50).

In the case of a linear regression, the null hypothesis is a statement that says there is no relationship between X and Y, and the alternative hypothesis argues that there is a relationship between X and Y. The p-value is the probability of seeing a result at least as extreme as the one we observed if the null hypothesis were true. If the p-value is less than 0.05, it is conventionally taken as strong evidence against the null hypothesis, which means there is likely to be a relationship between X and Y. A p-value higher than 0.05 means we do not have enough evidence to reject the null hypothesis; it does not prove that the null hypothesis is true. The p-value can range between 0 and 1; the closer it gets to 1, the more consistent the data is with the null hypothesis. We get a p-value for every input variable.

Alpha (α), the significance level, is the probability of rejecting the null hypothesis when it is actually true (a Type I error). The 0.05 threshold used above is a common choice of α.

An example…

Hypothesis testing enables us to test whether the data we observe is consistent with a given hypothesis. For example, let’s say that we have a random team selector at our local football club. We have 7 players, of which 5 will be selected each night at random. So, the probability of not being chosen on any given night is 2/7.

If, after 5 nights, you still haven’t been chosen, you may start to suspect, it’s not all that random. So you investigate.

• The chance of not being selected on night 1 is 2/7
• The chance of not being selected in the first 2 nights is (2/7)*(2/7)
• The chance of not being selected in the first 3 nights is (2/7)*(2/7)*(2/7)
• The chance of not being selected in the first 4 nights is (2/7)*(2/7)*(2/7)*(2/7)
• The chance of not being selected in the first 5 nights is (2/7)*(2/7)*(2/7)*(2/7)*(2/7)

Your null hypothesis is that this is random and you’ve just been unlucky. Your alternative hypothesis is that this isn’t random and someone is cheating you out of the game. The likelihood of not being chosen that many nights in a row is simply (2/7)^5 (5 nights), which is approximately 0.0019, or about 0.19%. Given that this is so low, we can reject the null hypothesis.

Based on your scenario, you will need to define the threshold at which you will accept or reject the null hypothesis.

In this scenario, 0.0019 is the p-value, which is the probability of this happening by random chance if the selector really is random. There is roughly a 0.19% probability (about 1 in 525) of not being selected 5 nights in a row by chance alone. That is well below a typical threshold of 0.05, so we reject the null hypothesis in favour of the alternative.
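The football example boils down to one line of arithmetic, sketched below with exact fractions. The `alpha` threshold of 0.05 is an assumption on my part (a common convention), and the variable names are my own. (2/7)^5 works out to 32/16807, which is about 0.0019, or 0.19%.

```python
from fractions import Fraction

# Probability of NOT being picked on a single night: 2 of 7 players sit out.
p_not_selected = Fraction(2, 7)
nights = 5

# Under the null hypothesis (selection is random and independent each night),
# the chance of sitting out every night is the product of the nightly chances.
p_value = p_not_selected ** nights

alpha = 0.05  # assumed significance threshold
reject_null = p_value < alpha

print(p_value, round(float(p_value), 4), reject_null)
```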

REGRESSION TABLES

Here is a regression table. Below, I will go through the key components.

```
Dep. Variable:                  Price   R-squared:                       0.898
Method:                 Least Squares   F-statistic:                     92.07
Date:                Sun, 13 Dec 2020   Prob (F-statistic):           4.04e-11
Time:                        18:54:05   Log-Likelihood:                -134.61
No. Observations:                  24   AIC:                             275.2
Df Residuals:                      21   BIC:                             278.8
Df Model:                           2
Covariance Type:            nonrobust
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const             1798.4040    899.248      2.000      0.059     -71.685    3668.493
Inflation          345.5401    111.367      3.103      0.005     113.940     577.140
Employment_level  -250.1466    117.950     -2.121      0.046    -495.437      -4.856
====================================================================================
```
