Naive Bayes Part 1
Naive Bayes algorithms are based on Bayes’ Rule, which we discussed in the previous lessons, and they work very well for natural language problems like document classification and spam filtering. We’ll uncover more of the details behind them in this lesson.
Naive Bayes
The Naive Bayes algorithm is based on Bayes’ Rule, which is stated as below.
“Bayes’ theorem (alternatively Bayes’ law or Bayes’ rule) describes the probability of an event, based on prior knowledge of conditions that might be related to the event.”
Bayes’ theorem is expressed as:
$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$

$P(B)$ is the probability of $B$. It is called the Evidence.

$P(A \mid B)$ is the conditional probability of $A$, given $B$ has occurred. It is called the Posterior Probability, meaning the probability of an event after evidence is seen.

$P(B \mid A)$ is the conditional probability of $B$, given $A$ has occurred. It is called the Likelihood.

$P(A)$ is the probability of $A$. It is called the Prior Probability, meaning the probability of an event before evidence is seen.
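As a quick numeric sketch (the probabilities below are made up purely for illustration), Bayes’ Rule can be applied directly:

```python
# A minimal sketch of Bayes' Rule with illustrative (made-up) numbers.
def bayes_posterior(prior_a, likelihood_b_given_a, evidence_b):
    """Posterior P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood_b_given_a * prior_a / evidence_b

# Suppose P(A) = 0.3, P(B|A) = 0.8, and P(B) = 0.5.
posterior = bayes_posterior(0.3, 0.8, 0.5)
print(posterior)  # 0.48
```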
Naive Bayes methods make the “naive” assumption of conditional independence between every pair of features given the value of the class variable.
Mathematical intuition
We will use a fictional dataset about playing golf, as seen below.
| Outlook  | Temperature | Humidity | Windy | Play Golf |
|----------|-------------|----------|-------|-----------|
| Rainy    | Hot         | High     | False | No        |
| Rainy    | Hot         | High     | True  | No        |
| Overcast | Hot         | High     | False | Yes       |
| Sunny    | Mild        | High     | False | Yes       |
| Sunny    | Cool        | Normal   | False | Yes       |
| Sunny    | Cool        | Normal   | True  | No        |
| Overcast | Cool        | Normal   | True  | Yes       |
| Rainy    | Mild        | High     | False | No        |
| Rainy    | Cool        | Normal   | False | Yes       |
| Sunny    | Mild        | Normal   | False | Yes       |
| Rainy    | Mild        | Normal   | True  | Yes       |
| Overcast | Mild        | High     | True  | Yes       |
| Overcast | Hot         | Normal   | False | Yes       |
| Sunny    | Mild        | High     | True  | No        |

In the above dataset, the independent features ($X$) are Outlook, Temperature, Humidity, and Windy.

In the above dataset, the dependent feature ($y$) is Play Golf.
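For concreteness, here is one way to represent the dataset above in plain Python (the variable names are ours, chosen to mirror the $X$/$y$ notation):

```python
# The golf dataset from the table above, one tuple per row:
# (Outlook, Temperature, Humidity, Windy, Play Golf)
golf_data = [
    ("Rainy",    "Hot",  "High",   False, "No"),
    ("Rainy",    "Hot",  "High",   True,  "No"),
    ("Overcast", "Hot",  "High",   False, "Yes"),
    ("Sunny",    "Mild", "High",   False, "Yes"),
    ("Sunny",    "Cool", "Normal", False, "Yes"),
    ("Sunny",    "Cool", "Normal", True,  "No"),
    ("Overcast", "Cool", "Normal", True,  "Yes"),
    ("Rainy",    "Mild", "High",   False, "No"),
    ("Rainy",    "Cool", "Normal", False, "Yes"),
    ("Sunny",    "Mild", "Normal", False, "Yes"),
    ("Rainy",    "Mild", "Normal", True,  "Yes"),
    ("Overcast", "Mild", "High",   True,  "Yes"),
    ("Overcast", "Hot",  "Normal", False, "Yes"),
    ("Sunny",    "Mild", "High",   True,  "No"),
]

X = [row[:4] for row in golf_data]  # independent features
y = [row[4] for row in golf_data]   # dependent feature (Play Golf)
print(y.count("Yes"), y.count("No"))  # 9 5
```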
Assumption of Naive Bayes
Naive Bayes algorithms assume that each input feature is independent and that each makes an equal contribution to the outcome (Play Golf). These assumptions are generally not true in real-world examples, but they work well in practice.
Applying Bayes’ Theorem
Applying Bayes’ Theorem, we get the following representation.
$P(y \mid X) = \frac{P(X \mid y)\,P(y)}{P(X)}$
where $y$ is the class variable and $X$ is the feature vector (of size $n$):
$X = (x_1, x_2, x_3, ..., x_n)$
From the above table, taking the third row:
$X = (Overcast, Hot, High, False)$
$y = Yes$
$P(y \mid X)$ here means the probability of “Playing golf” given that the weather conditions are “Overcast outlook”, “Hot temperature”, “High humidity”, and “No wind”.
Applying the Independence Assumption
If one event $A$ does not depend on another event $B$, the events are said to be independent, and their joint probability is the product of the individual probabilities: $P(A, B) = P(A)\,P(B)$.
Applying the independence assumption to the above equation, we proceed as follows.
$P(y \mid X) = \frac{P(X \mid y)\,P(y)}{P(X)}$
$P(y \mid x_1, ..., x_n) = \frac{P(y)\,P(x_1 \mid y)\,P(x_2 \mid y) \cdots P(x_n \mid y)}{P(x_1)\,P(x_2) \cdots P(x_n)}$
which can be written as
$P(y \mid x_1, ..., x_n) = \frac{P(y)\prod_{i=1}^{n}P(x_i \mid y)}{P(x_1)\,P(x_2) \cdots P(x_n)}$
Since the denominator remains constant for a given input, we can remove that term.
$P(y \mid x_1, ..., x_n) \propto P(y)\prod_{i=1}^{n}P(x_i \mid y)$
Now, we need to create a classifier model. For this, we find the probability of a given set of inputs for all possible values of the class variable $y$ and pick the output with the maximum probability. This can be expressed mathematically as:
$y = \operatorname{argmax}_y P(y)\prod_{i=1}^{n}P(x_i \mid y)$
So, finally, we are left with the task of calculating $P(y)$ and $P(x_i \mid y)$. We can use Maximum A Posteriori (MAP) estimation to estimate $P(y)$ and $P(x_i \mid y)$.
Please note that $P(y)$ is also called the class probability and $P(x_i \mid y)$ is called the conditional probability.
The different Naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of $P(x_i \mid y)$.
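The argmax rule above can be sketched from scratch on the golf dataset. This is a minimal illustration for categorical features, using raw frequency counts for $P(y)$ and $P(x_i \mid y)$ and no smoothing; the function name `predict` is our own:

```python
from collections import Counter

# Golf dataset rows: (Outlook, Temperature, Humidity, Windy, Play Golf)
rows = [
    ("Rainy", "Hot", "High", False, "No"),
    ("Rainy", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),
    ("Sunny", "Mild", "High", False, "Yes"),
    ("Sunny", "Cool", "Normal", False, "Yes"),
    ("Sunny", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"),
    ("Rainy", "Mild", "High", False, "No"),
    ("Rainy", "Cool", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", False, "Yes"),
    ("Rainy", "Mild", "Normal", True, "Yes"),
    ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"),
    ("Sunny", "Mild", "High", True, "No"),
]

def predict(x):
    """Return argmax_y P(y) * prod_i P(x_i | y), by counting."""
    class_counts = Counter(r[-1] for r in rows)
    best_label, best_score = None, -1.0
    for label, n_label in class_counts.items():
        score = n_label / len(rows)  # prior P(y)
        for i, value in enumerate(x):
            # conditional P(x_i | y), estimated from frequencies
            n_match = sum(1 for r in rows if r[-1] == label and r[i] == value)
            score *= n_match / n_label
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(predict(("Overcast", "Hot", "High", False)))  # Yes
```

For the query `("Overcast", "Hot", "High", False)`, the “No” class scores zero (Overcast never co-occurs with “No” in the data), so “Yes” wins; in practice, smoothing is used to avoid such zero probabilities.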
Applying the Mathematical Intuition
We can construct the following tables from the above dataset to ease the calculations.
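A frequency table of this kind is just a count of how often each feature value co-occurs with each class. As a sketch (assuming the Outlook column from the dataset above), it can be built by counting:

```python
from collections import Counter

# Outlook and Play Golf columns from the golf dataset, row by row.
outlook = ["Rainy", "Rainy", "Overcast", "Sunny", "Sunny", "Sunny", "Overcast",
           "Rainy", "Rainy", "Sunny", "Rainy", "Overcast", "Overcast", "Sunny"]
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

# Frequency table: (Outlook value, class) -> count
freq = Counter(zip(outlook, play))
for value in ("Sunny", "Overcast", "Rainy"):
    print(value, freq[(value, "Yes")], freq[(value, "No")])
```

Dividing each count by the class total (9 for “Yes”, 5 for “No”) then gives the conditional probabilities $P(x_i \mid y)$ used above.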