ML Series: An Introduction To Naive Bayes

A Naive Bayes algorithm is a classifier. It takes into account the probability of each feature occurring and uses those to work out the overall probability of each target class (outcome). It then returns the class with the highest probability as its prediction.

It’s called naive because it assumes the features are independent of one another, so it calculates the probability of seeing them together as the product of their individual probabilities. In reality, this is rarely the case.
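
Written as a formula, the naive assumption looks roughly like this. The symbols are my own notation: x_1 to x_n are the feature values and c is a class. Strictly speaking, the independence is assumed within each class, which is why every term is conditioned on c.

```latex
P(x_1, x_2, \dots, x_n \mid c) = \prod_{i=1}^{n} P(x_i \mid c)
\qquad\Longrightarrow\qquad
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```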

However, a Naive Bayes algorithm is easy to implement and very fast to compute, so it’s often used for real-time predictions. It doesn’t require much training data, it performs well with categorical inputs, and on small datasets it can be as accurate as, or more accurate than, a logistic regression model.

But, as I said, in real life we rarely get totally unrelated features; there is often a relationship between them. Also, if the data you score contains a category that wasn’t in the training set, the model will assign it a zero probability, because its frequency in the training data is zero. Since the probabilities are multiplied together, that single zero wipes out the whole prediction for that class.
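
To make the zero-frequency issue concrete, here is a minimal sketch. The data, the `category_probability` helper and the "Foggy" category are all made up for illustration. The `alpha` parameter applies add-one (Laplace) smoothing, which is one common remedy and an assumption on my part rather than something covered in this post.

```python
from collections import Counter


def category_probability(observed, category, categories, alpha=0.0):
    """Estimate P(category) from a list of observed values.

    alpha=0 is the plain frequency estimate. alpha=1 is add-one (Laplace)
    smoothing, included here only as an illustrative assumption.
    """
    counts = Counter(observed)
    return (counts[category] + alpha) / (len(observed) + alpha * len(categories))


# Hypothetical training values for one feature; "Foggy" never appears.
training_outlook = ["Sunny", "Rainy", "Sunny", "Overcast"]
all_outlooks = ["Sunny", "Rainy", "Overcast", "Foggy"]

unsmoothed = category_probability(training_outlook, "Foggy", all_outlooks)
smoothed = category_probability(training_outlook, "Foggy", all_outlooks, alpha=1.0)

print(unsmoothed)  # 0.0
print(smoothed)    # 0.125
```

With smoothing, the unseen category gets a small non-zero probability, so it no longer zeroes out the product.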

Naive Bayes is mostly used for text classification, such as sentiment analysis and spam filtering.

Let’s work through it

Below is an example where we are predicting, based on the weather, whether we will need to wear a hat. The input data is as below.

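| Outlook | Temperature | Humidity | Wind | Hat |
| --- | --- | --- | --- | --- |
| Sunny | Mild | Low | Yes | Yes |
| Rainy | Cold | High | Yes | Yes |
| Sunny | Hot | High | No | No |
| Overcast | Mild | Medium | No | No |
| Sunny | Mild | Medium | No | No |
| Rainy | Mild | Low | Yes | Yes |
| Sunny | Hot | High | Yes | Yes |
| Overcast | Cold | High | Yes | Yes |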

First, determine the probability of each target class (yes or no), denoted p(c). Five of the eight rows have Hat = Yes and three have Hat = No:

p(yes) = 5/8 

p(no) = 3/8
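
The same counting can be sketched in a few lines of Python. The `hat` list is just the Hat column of the training data typed out by hand, and the variable names are my own.

```python
from collections import Counter
from fractions import Fraction

# Hat column from the eight training rows.
hat = ["Yes", "Yes", "No", "No", "No", "Yes", "Yes", "Yes"]

counts = Counter(hat)
priors = {label: Fraction(n, len(hat)) for label, n in counts.items()}

for label, p in priors.items():
    print(f"p({label.lower()}) = {p}")
```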

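Now we determine the probability of each feature value. In each table, YES counts the rows where the feature takes that value, NO counts the rows where it doesn’t, and P(yes) and P(no) are those counts divided by the 8 rows:

Outlook:

| Value | YES | NO | P(yes) | P(no) |
| --- | --- | --- | --- | --- |
| Sunny | 4 | 4 | 4/8 | 4/8 |
| Rainy | 2 | 6 | 2/8 | 6/8 |
| Overcast | 2 | 6 | 2/8 | 6/8 |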

Temperature:

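| Value | YES | NO | P(yes) | P(no) |
| --- | --- | --- | --- | --- |
| Hot | 2 | 6 | 2/8 | 6/8 |
| Mild | 4 | 4 | 4/8 | 4/8 |
| Cold | 2 | 6 | 2/8 | 6/8 |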

Humidity:

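| Value | YES | NO | P(yes) | P(no) |
| --- | --- | --- | --- | --- |
| Low | 2 | 6 | 2/8 | 6/8 |
| Medium | 2 | 6 | 2/8 | 6/8 |
| High | 4 | 4 | 4/8 | 4/8 |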

Wind:

| Value | YES | NO | P(yes) | P(no) |
| --- | --- | --- | --- | --- |
| Yes | 5 | 3 | 5/8 | 3/8 |
| No | 3 | 5 | 3/8 | 5/8 |
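
If you’d rather not count by hand, the four tables can be rebuilt with a short script. This is only a sketch: `rows`, `feature_names` and `value_frequencies` are names I’ve made up, and the rows are the training data retyped. Note that `Fraction` reduces its output, so 4/8 prints as 1/2.

```python
from collections import Counter
from fractions import Fraction

# Training rows: (outlook, temperature, humidity, wind).
rows = [
    ("Sunny", "Mild", "Low", "Yes"),
    ("Rainy", "Cold", "High", "Yes"),
    ("Sunny", "Hot", "High", "No"),
    ("Overcast", "Mild", "Medium", "No"),
    ("Sunny", "Mild", "Medium", "No"),
    ("Rainy", "Mild", "Low", "Yes"),
    ("Sunny", "Hot", "High", "Yes"),
    ("Overcast", "Cold", "High", "Yes"),
]
feature_names = ["Outlook", "Temperature", "Humidity", "Wind"]


def value_frequencies(rows, column):
    """Return {value: fraction of rows where the column takes that value}."""
    counts = Counter(row[column] for row in rows)
    return {value: Fraction(n, len(rows)) for value, n in counts.items()}


frequencies = {
    name: value_frequencies(rows, i) for i, name in enumerate(feature_names)
}

for name, table in frequencies.items():
    print(name, {value: str(p) for value, p in table.items()})
```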

If we need to classify this new piece of input data…

| Outlook | Temperature | Humidity | Wind |
| --- | --- | --- | --- |
| Sunny | Mild | High | Yes |

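We need to calculate the probability that we do need a hat. Naive Bayes multiplies the class probability by the probability of each feature value given that class, so we count within the 5 rows where Hat = Yes:

P(outlook = sunny | hat = yes) = 2/5
P(temperature = mild | hat = yes) = 2/5
P(humidity = high | hat = yes) = 3/5
P(wind = yes | hat = yes) = 5/5
P(hat = yes) = 5/8

The calculation then multiplies everything together: (2/5) * (2/5) * (3/5) * (5/5) * (5/8) = 0.06

Now, we work out the probability of not needing a hat, given the same input, by counting within the 3 rows where Hat = No.

P(outlook = sunny | hat = no) = 2/3
P(temperature = mild | hat = no) = 2/3
P(humidity = high | hat = no) = 1/3
P(wind = yes | hat = no) = 0/3
P(hat = no) = 3/8

The calculation then multiplies everything together: (2/3) * (2/3) * (1/3) * (0/3) * (3/8) = 0

The product is zero because no row in the training data has Wind = Yes together with Hat = No. This is the zero-frequency problem described earlier: a single unseen combination wipes out the whole product.

We then need the probability of seeing this input at all, p(x). Treating the features as independent, it is the product of the feature-value probabilities from the tables above.

p(x) = P(outlook = sunny) * p(temperature = mild) * p(humidity = high) * p(wind = yes)

p(x) = (4/8) * (4/8) * (4/8) * (5/8)

p(x) = 0.078125

Finally, we divide the outputs from above by p(x):

P(hat = yes | x) = 0.06 / 0.078125
P(hat = no | x) = 0 / 0.078125

Hence…
P(hat = yes | x) = 0.768
P(hat = no | x) = 0

The two values don’t sum to exactly 1 because p(x) was itself estimated under the independence assumption. The denominator is the same for both classes, so it doesn’t change which one wins.

The highest value wins, so yes, we do need a hat.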
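
For anyone who wants to check the arithmetic, here is a minimal sketch of the textbook Naive Bayes score: the class probability multiplied by the probability of each feature value counted within that class, i.e. P(feature value | class). The names `rows`, `class_score` and `new_day` are my own, no smoothing is applied (so an unseen combination gives a score of zero), and the shared denominator p(x) is skipped because it doesn’t affect which class wins.

```python
from fractions import Fraction

# Training rows: (outlook, temperature, humidity, wind, hat).
rows = [
    ("Sunny", "Mild", "Low", "Yes", "Yes"),
    ("Rainy", "Cold", "High", "Yes", "Yes"),
    ("Sunny", "Hot", "High", "No", "No"),
    ("Overcast", "Mild", "Medium", "No", "No"),
    ("Sunny", "Mild", "Medium", "No", "No"),
    ("Rainy", "Mild", "Low", "Yes", "Yes"),
    ("Sunny", "Hot", "High", "Yes", "Yes"),
    ("Overcast", "Cold", "High", "Yes", "Yes"),
]


def class_score(rows, features, label):
    """Return P(label) times the product of P(feature value | label)."""
    in_class = [row for row in rows if row[-1] == label]
    score = Fraction(len(in_class), len(rows))  # class probability p(c)
    for column, value in enumerate(features):
        matches = sum(1 for row in in_class if row[column] == value)
        score *= Fraction(matches, len(in_class))  # P(value | label)
    return score


# The new input to classify: (outlook, temperature, humidity, wind).
new_day = ("Sunny", "Mild", "High", "Yes")

scores = {label: class_score(rows, new_day, label) for label in ("Yes", "No")}
prediction = max(scores, key=scores.get)

for label, score in scores.items():
    print(f"score(hat={label.lower()}) = {score} = {float(score)}")
print("prediction:", prediction)
```

Using `Fraction` keeps the intermediate values exact, which makes it easier to compare them against a hand calculation.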