A Naive Bayes algorithm is a classifier. It combines the probability of each feature value occurring with the probability of each target class (outcome) to produce an overall probability for every class. It then takes the class with the highest probability and returns that as its prediction.
The reason it’s naive is that it acts as though the features don’t depend on one another in any way, and calculates the probability of seeing them together as the product of their individual probabilities. In reality, this is not usually the case.
However, a Naive Bayes algorithm is easy to implement and blazing fast to compute, so it’s often used for real-time predictions. It doesn’t require very much training data, handles categorical inputs well, and with small training sets it can even be competitive with a logistic regression model.
But, as I said, in real life we don’t often get totally unrelated features; there is usually some relationship between them. Also, if the data you feed the model contains a category value that isn’t in the training set, the model will assign it a zero probability, because that value has a zero frequency in the training data, and a single zero wipes out the whole product.
Mostly, Naive Bayes is used for text classification: sentiment analysis, spam filtering and so on.
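As an illustration of that kind of use, here is a minimal sketch using scikit-learn’s MultinomialNB on a bag-of-words representation. The example texts and labels are made up purely for illustration, not taken from this post:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data: short texts with spam/ham labels
texts = [
    "win a free prize now", "cheap meds limited offer",
    "meeting moved to friday", "lunch at noon tomorrow",
]
labels = ["spam", "spam", "ham", "ham"]

# Turn the texts into word counts, then fit a multinomial Naive Bayes model
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
model = MultinomialNB()
model.fit(X, labels)

# Classify a new, unseen text
print(model.predict(vectorizer.transform(["free prize meeting"])))
```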
Let’s work through it
Below is an example where we are predicting, based on the weather, whether we will need to wear a hat. Here is the training data:
| Outlook | Temperature | Humidity | Wind | Hat |
|---|---|---|---|---|
| Sunny | Mild | Low | Yes | Yes |
| Rainy | Cold | High | Yes | Yes |
| Sunny | Hot | High | No | No |
| Overcast | Mild | Medium | No | No |
| Sunny | Mild | Medium | No | No |
| Rainy | Mild | Low | Yes | Yes |
| Sunny | Hot | High | Yes | Yes |
| Overcast | Cold | High | Yes | Yes |
First, determine the prior probability of each value of our target class (yes or no), denoted p(c):
p(yes) = 5/8
p(no) = 3/8
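If you want to check those priors in code, here is a small sketch in pure Python (the list below is simply the Hat column from the table above):

```python
from collections import Counter
from fractions import Fraction

# The "Hat" column from the training table above
hat = ["Yes", "Yes", "No", "No", "No", "Yes", "Yes", "Yes"]

# Prior probability of each class: count / total rows
counts = Counter(hat)
priors = {c: Fraction(n, len(hat)) for c, n in counts.items()}
print(priors)  # {'Yes': Fraction(5, 8), 'No': Fraction(3, 8)}
```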
Now we determine, for each feature, how often each value occurs alongside each class, and from that the conditional probability of the value given the class:

Outlook:

| | Hat = Yes | Hat = No | P(value \| yes) | P(value \| no) |
|---|---|---|---|---|
| Sunny | 2 | 2 | 2/5 | 2/3 |
| Rainy | 2 | 0 | 2/5 | 0/3 |
| Overcast | 1 | 1 | 1/5 | 1/3 |

Temperature:

| | Hat = Yes | Hat = No | P(value \| yes) | P(value \| no) |
|---|---|---|---|---|
| Hot | 1 | 1 | 1/5 | 1/3 |
| Mild | 2 | 2 | 2/5 | 2/3 |
| Cold | 2 | 0 | 2/5 | 0/3 |

Humidity:

| | Hat = Yes | Hat = No | P(value \| yes) | P(value \| no) |
|---|---|---|---|---|
| Low | 2 | 0 | 2/5 | 0/3 |
| Medium | 0 | 2 | 0/5 | 2/3 |
| High | 3 | 1 | 3/5 | 1/3 |

Wind:

| | Hat = Yes | Hat = No | P(value \| yes) | P(value \| no) |
|---|---|---|---|---|
| Yes | 5 | 0 | 5/5 | 0/3 |
| No | 0 | 3 | 0/5 | 3/3 |

(The denominators are 5 and 3 because 5 of the 8 training rows have hat = yes and 3 have hat = no.)
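If you would rather produce these tables programmatically, here is a sketch using pandas crosstabs. Using pandas is my own choice for illustration, not part of the original walkthrough:

```python
import pandas as pd

# Training data from the table above, column by column
df = pd.DataFrame({
    "Outlook": ["Sunny", "Rainy", "Sunny", "Overcast", "Sunny", "Rainy", "Sunny", "Overcast"],
    "Temperature": ["Mild", "Cold", "Hot", "Mild", "Mild", "Mild", "Hot", "Cold"],
    "Humidity": ["Low", "High", "High", "Medium", "Medium", "Low", "High", "High"],
    "Wind": ["Yes", "Yes", "No", "No", "No", "Yes", "Yes", "Yes"],
    "Hat": ["Yes", "Yes", "No", "No", "No", "Yes", "Yes", "Yes"],
})

# For each feature: raw counts per class, then column-normalised P(value | class)
for feature in ["Outlook", "Temperature", "Humidity", "Wind"]:
    print(pd.crosstab(df[feature], df["Hat"]))                        # counts
    print(pd.crosstab(df[feature], df["Hat"], normalize="columns"))   # P(value | class)
```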
Now suppose we need to classify this new piece of input data:

| Outlook | Temperature | Humidity | Wind |
|---|---|---|---|
| Sunny | Mild | High | Yes |
First we calculate the score for needing a hat, using the prior and the conditional probabilities from the tables above:
- P(outlook = sunny | hat = yes) = 2/5
- P(temperature = mild | hat = yes) = 2/5
- P(humidity = high | hat = yes) = 3/5
- P(wind = yes | hat = yes) = 5/5
- P(hat = yes) = 5/8
The calculation then multiplies everything together: (2/5) * (2/5) * (3/5) * (5/5) * (5/8) = 0.06
Now we work out the score for not needing a hat, given the same inputs:
- P(outlook = sunny | hat = no) = 2/3
- P(temperature = mild | hat = no) = 2/3
- P(humidity = high | hat = no) = 1/3
- P(wind = yes | hat = no) = 0/3
- P(hat = no) = 3/8
The calculation then multiplies everything together: (2/3) * (2/3) * (1/3) * (0/3) * (3/8) = 0
Notice the zero: wind = yes never appears together with hat = no in the training data, so that single zero-frequency term wipes out the whole product, exactly the issue mentioned earlier.
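As a quick check of those two products, here is a short sketch using exact fractions; the numbers are simply the ones from the tables above:

```python
from fractions import Fraction as F

# Score for hat = yes: prior * product of the conditionals
score_yes = F(5, 8) * F(2, 5) * F(2, 5) * F(3, 5) * F(5, 5)
# Score for hat = no: the wind term is 0/3, which zeroes the whole product
score_no = F(3, 8) * F(2, 3) * F(2, 3) * F(1, 3) * F(0, 3)

print(float(score_yes), float(score_no))  # 0.06 0.0
```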
To turn these scores into probabilities, we also need p(x): the probability of seeing this combination of input values at all. Under the naive independence assumption, it’s just the product of the individual feature-value probabilities (counted over all 8 rows):
- p(x) = P(outlook = sunny) * P(temperature = mild) * P(humidity = high) * P(wind = yes)
- p(x) = (4/8) * (4/8) * (4/8) * (5/8)
- p(x) = 0.078125
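And the same kind of quick check for p(x):

```python
from fractions import Fraction as F

# p(x): product of the marginal probabilities of each input value
p_x = F(4, 8) * F(4, 8) * F(4, 8) * F(5, 8)
print(float(p_x))  # 0.078125
```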
Finally, we divide each score by p(x):
- P(hat = yes | x) = 0.06 / 0.078125
- P(hat = no | x) = 0 / 0.078125
Hence…
- P(hat = yes | x) = 0.768
- P(hat = no | x) = 0
The highest value wins, so yes, we do need a hat. That matches intuition: in the training data, every windy day was a hat day.
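Putting the whole walkthrough together, here is a minimal end-to-end sketch in pure Python that reproduces the numbers above. The function and variable names are just illustrative choices of mine:

```python
from collections import defaultdict
from fractions import Fraction

FEATURES = ["Outlook", "Temperature", "Humidity", "Wind"]

# Training data from the table above: (Outlook, Temperature, Humidity, Wind, Hat)
ROWS = [
    ("Sunny", "Mild", "Low", "Yes", "Yes"),
    ("Rainy", "Cold", "High", "Yes", "Yes"),
    ("Sunny", "Hot", "High", "No", "No"),
    ("Overcast", "Mild", "Medium", "No", "No"),
    ("Sunny", "Mild", "Medium", "No", "No"),
    ("Rainy", "Mild", "Low", "Yes", "Yes"),
    ("Sunny", "Hot", "High", "Yes", "Yes"),
    ("Overcast", "Cold", "High", "Yes", "Yes"),
]


def train(rows):
    """Count class frequencies, (feature, value, class) pairs, and feature-value totals."""
    class_counts = defaultdict(int)
    pair_counts = defaultdict(int)
    value_counts = defaultdict(int)
    for *values, label in rows:
        class_counts[label] += 1
        for feature, value in zip(FEATURES, values):
            pair_counts[(feature, value, label)] += 1
            value_counts[(feature, value)] += 1
    return class_counts, pair_counts, value_counts


def predict(x, class_counts, pair_counts, value_counts, total):
    """Score each class as prior * product of P(value | class), divided by p(x)."""
    # p(x): product of the individual feature-value probabilities
    p_x = Fraction(1)
    for feature in FEATURES:
        p_x *= Fraction(value_counts[(feature, x[feature])], total)

    scores = {}
    for label, n_label in class_counts.items():
        score = Fraction(n_label, total)  # the prior P(class)
        for feature in FEATURES:
            score *= Fraction(pair_counts[(feature, x[feature], label)], n_label)
        scores[label] = score / p_x
    return scores


class_counts, pair_counts, value_counts = train(ROWS)
new_day = {"Outlook": "Sunny", "Temperature": "Mild", "Humidity": "High", "Wind": "Yes"}
scores = predict(new_day, class_counts, pair_counts, value_counts, total=len(ROWS))
print({k: float(v) for k, v in scores.items()})  # {'Yes': 0.768, 'No': 0.0}
print(max(scores, key=scores.get))               # Yes
```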