Percentiles are the 99 cut points that divide an ordered population into 100 equal groups. A percentile can be any number from 1 to 99: for the Xth percentile, X% of observations fall below that value. For example, if you’re in the 60th percentile, your value is greater than 60% of all observations.
In the below example, 20 people take a test, which has possible scores from 0 to 5. The test is pretty easy, so everybody scores 4 or above. That means the bottom 25% of the data (everyone at or below the 25th percentile) still includes people who scored 4/5. A percentile describes rank within the group, not how good the score is in absolute terms.
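To make this concrete, here is a minimal Python sketch. The scores are hypothetical (the original figure's data is not reproduced here), and the helper `nearest_rank_percentile` is an illustrative name using the nearest-rank convention, which is only one of several ways to define a percentile.

```python
import math

def nearest_rank_percentile(values, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based position in the ordered data
    return ordered[rank - 1]

# Hypothetical scores: 20 people, everyone scores 4 or 5 out of 5.
scores = [4] * 12 + [5] * 8

p25 = nearest_rank_percentile(scores, 25)
bottom_quarter = sorted(scores)[: len(scores) // 4]

print(p25)             # 4
print(bottom_quarter)  # [4, 4, 4, 4, 4]
```

Even though 4/5 is a high score, it is the lowest score in this group, so it makes up the whole bottom quarter.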
We often use quartiles rather than percentiles. The 1st quartile is the 25th percentile, the 2nd quartile is the 50th percentile (also the median), and the 3rd quartile is the 75th percentile.
In order to calculate the quartiles, we:
- Order our data from smallest to largest
- Calculate the median (equal to the 2nd quartile)
- Find the median of the lower half of the data (1st quartile)
- Find the median of the upper half of the data (3rd quartile)
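In the below example, the data has an even number of values, so the median falls between the two middle values, 3 and 4, and becomes 3.5. Because 3.5 is not itself a data point, 3 stays in the lower half and 4 stays in the upper half when we calculate the medians of those halves.
This is not the case when the data has an odd number of values. The median is then one of the data points, and that point is left out of both halves when calculating the 1st and 3rd quartiles.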
The interquartile range is the 3rd quartile minus the 1st quartile (in the above, 6 - 2 = 4).
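The quartile steps can be sketched in Python as follows. This is one possible implementation of the median-of-halves method described here, not a standard library routine. The function name `quartiles` and both datasets are hypothetical, with `even_data` chosen only so that it reproduces the numbers quoted in the text (median 3.5, quartiles 2 and 6).

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    Assumes at least two values. With an odd number of values,
    the median itself is left out of both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Even count: upper half starts right after the lower half.
    # Odd count: skip the middle value (the median).
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)

# Even number of values: the median falls between 3 and 4.
even_data = [1, 2, 3, 4, 6, 7]
q1, q2, q3 = quartiles(even_data)
print(q1, q2, q3)  # 2 3.5 6
print(q3 - q1)     # 4 (interquartile range)

# Odd number of values: the median (4) is excluded from both halves.
odd_data = [1, 2, 3, 4, 5, 6, 7]
print(quartiles(odd_data))  # (2, 4, 6)
```

Other conventions exist (some include the median in both halves, others interpolate), so library functions may return slightly different quartiles for the same data.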
So, what about box plots? Let’s start with an example. We have the below features in our dataset.
- Max datapoint: 364
- Min datapoint: 41
- Q1 = 90
- Q2 = 126
- Q3 = 254
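Before drawing anything, it can help to derive the quantities a box plot encodes from these five numbers. A minimal Python sketch (the variable names are illustrative and not taken from any library):

```python
# Five-number summary from the example above.
summary = {"min": 41, "q1": 90, "median": 126, "q3": 254, "max": 364}

# Box length: the spread of the central 50% of the data.
iqr = summary["q3"] - summary["q1"]

# Full extent of the data, from the lowest to the highest point.
data_range = summary["max"] - summary["min"]

# Where the median line sits inside the box.
median_to_q1 = summary["median"] - summary["q1"]
q3_to_median = summary["q3"] - summary["median"]

print(iqr)           # 164
print(data_range)    # 323
print(median_to_q1)  # 36
print(q3_to_median)  # 128
```

The median sits much closer to Q1 than to Q3, so the median line will appear near the lower edge of the box rather than in its centre.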
This leads us to produce the below box plot.
This shows us how closely clustered the central 50% of data is around the median and gives us a pictorial view of the entire data range. Other examples include: