Naive Bayes - Part 1

Mathematical Theory

The Naive Bayes classifier is a simple method for classifying data points.

A very common example is when the data point is the contents of an email. A Naive Bayes classifier can be used to decide whether to label the email as spam or not spam (also known as ham), based on its contents. This is actually a very popular use case that has been around since 1996!

Despite its simplicity, there is some interesting math behind its derivation, and it has surprisingly good performance in real-world applications.

The Non-Naive Bayes Classifier

Let us start with the (non-naive) Bayes classifier, and the fact that it is actually possible to derive a theoretical lowest possible error for any classification method!

Let $ Y $ be a random variable representing the class labels, and $ X $ be a random variable representing the data point. The Bayes classifier $ b: X \to Y $ assigns a data point $ x = (x_1,...,x_n) $ to the class $ b(x) $ using the rule


$$ b(x) = \underset{y}{\text{argmax}} \,\, P(Y = y \mid X = x). $$

Note that this rule requires us to know the probability $ P(Y \mid X) $, which is usually impossible to know in practical settings. For the binary case $ Y = 0 \text{ or } 1 $, this rule is equivalent to the following two conditions, illustrated in the short sketch after the list:

  • $ b(x) = 0 $ if $ P(Y = 1 \mid X) < 0.5 $
  • $ b(x) = 1 $ if $ P(Y = 1 \mid X) \geq 0.5 $
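
To make the rule concrete, here is a minimal sketch in Python. It assumes we somehow have access to the true posterior $ P(Y = 1 \mid X = x) $ for a data point, which, as noted above, is rarely the case in practice; the posterior values passed in are purely illustrative.

```python
def bayes_classifier(posterior_y1: float) -> int:
    """Binary Bayes classifier, given the true posterior P(Y = 1 | X = x)."""
    return 1 if posterior_y1 >= 0.5 else 0

# Purely illustrative posterior values for two hypothetical data points.
print(bayes_classifier(0.3))  # 0, since P(Y = 1 | X = x) < 0.5
print(bayes_classifier(0.8))  # 1, since P(Y = 1 | X = x) >= 0.5
```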

One way of writing the error/misclassification rate of the Bayes classifier is


$$ P(Y \neq b(X)). $$

This is known as the Bayes error rate. It is simply the probability that the Bayes classifier assigns a data point to the wrong class.
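
As a sanity check, we can simulate a toy distribution where the posterior is known exactly and estimate this error rate empirically. The generative model below (equal priors, with $ X \mid Y = y \sim N(2y, 1) $) is made up purely for illustration; for this particular model, the Bayes rule reduces to thresholding $ x $ at $ 1 $, the midpoint between the two class means.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model, chosen only for illustration:
#   P(Y = 1) = 0.5 and X | Y = y ~ Normal(2y, 1).
n = 200_000
y = rng.binomial(1, 0.5, size=n)
x = rng.normal(loc=2.0 * y, scale=1.0, size=n)

# For this model, the Bayes classifier is equivalent to thresholding x
# at the midpoint between the two class means.
b_of_x = (x >= 1.0).astype(int)

# Empirical estimate of the Bayes error rate P(Y != b(X)).
print((y != b_of_x).mean())  # roughly 0.159, close to Phi(-1)
```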

Lowest Possible Classification Error

We can actually show that the Bayes error rate is the lowest misclassification error achievable by any classification rule $ h: X \to Y $!

The proof is actually not difficult, but there are a lot of details to work through, and the clever tricks used to rewrite various quantities certainly do not help. Please refer to this set of lecture notes for the full derivation.

The key step is to rewrite the misclassification rate as a sum,


$$ P(Y \neq h(X)) = P(Y = 1, \, h(X) = 0) + P(Y = 0, \, h(X) = 1). $$

Then, using very clever tricks, this sum can be rewritten as


$$ \mathbb{E}_X [ \,\, \mathbb{1}_{h(X)=0} \cdot P(Y = 1 \mid X) \,\, + \,\, \mathbb{1}_{h(X)=1} \cdot (1 - P(Y = 1 \mid X)) \,\, ]. $$

The $ \mathbb{1} $ refers to the indicator function,


$$ \mathbb{1}_{h(X)=k} = \begin{cases} 0 \text{,} \,\,\,\, \text{if } h(X) \neq k \\[.5em] 1 \text{,} \,\,\,\, \text{if } h(X) = k. \end{cases} $$

It should be clear from this formulation that the rule for the Bayes classifier in the previous section minimizes this quantity. For example, if $ P(Y = 1 \mid X) < 0.5 $, then having $ h(X) = 0 $ minimizes the term inside the expectation, which is exactly what the Bayes classifier recommends.
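
In fact, since for every value of $ X $ the Bayes classifier picks whichever of the two terms is smaller, substituting $ h = b $ into the expression above gives a compact formula for the Bayes error rate,

$$ P(Y \neq b(X)) = \mathbb{E}_X \left[ \, \min \left\{ P(Y = 1 \mid X), \,\, 1 - P(Y = 1 \mid X) \right\} \, \right]. $$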

Setting up Naive Bayes with Bayes' Theorem

I am going to derive the Naive Bayes classifier using an approach similar to the one in the Scikit-learn documentation.

Given a data point $ x = (x_1,...,x_n) $ containing $ n $ features, the probability that the class takes a particular value $ y $, given these features, is


$$ P(Y = y \mid X = x). $$

We want to pick the label $ y $ that maximizes this quantity. First, we use Bayes’ Theorem to write this probability in a different way.


$$ P(Y = y \mid X = x) = \frac{ P(Y = y) \cdot P(X = x \mid Y = y) } { P(X = x) }. $$

For a fixed data point $ x = (x_1,...,x_n) $ , the denominator is a constant over all the $ y $ classes. Therefore, it is not relevant for maximizing $ P(Y = y \mid X = x) $ . Thus, we will focus on the $ P(Y = y) \cdot P(X = x \mid Y = y) $ term in the numerator.
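
The following small sketch illustrates why dropping the denominator is harmless: dividing by $ P(X = x) $ rescales all the class scores by the same constant, so the argmax does not change. The priors and likelihoods below are made-up numbers for a single hypothetical data point $ x $.

```python
import numpy as np

# Made-up values for a two-class problem and one fixed data point x.
prior = np.array([0.7, 0.3])         # P(Y = 0), P(Y = 1)
likelihood = np.array([0.02, 0.09])  # P(X = x | Y = 0), P(X = x | Y = 1)

# Numerator of Bayes' Theorem: P(Y = y) * P(X = x | Y = y).
scores = prior * likelihood

# Full posterior: divide by P(X = x) = sum over y of the scores.
posterior = scores / scores.sum()

print(scores)     # [0.014 0.027]
print(posterior)  # [0.341... 0.658...]

# The maximizing class is the same with or without the denominator.
assert scores.argmax() == posterior.argmax()
```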

The Naive Assumption

The word “naive” in the name of the algorithm refers to the key assumption that the $ n $ features $ (x_1,...,x_n) $ are conditionally independent, given the class $ y $ .

Of course, this assumption is most probably not going to hold in the real world. Members of a class will almost always have dependencies among their features. For example, suppose we have a dataset of physical features (body fat, height, weight, foot size, etc.) and two classes, “male” and “female”. There will certainly be correlations among the features within each class.

However, this “naive” assumption makes the math a lot easier, because it allows us to factor the numerator, via the chain rule for probability, into a product of per-feature conditional probabilities,


$$ P(Y = y) \cdot P(X = x \mid Y = y) = P(Y = y) \cdot \prod_{i = 1}^{n} P(X_i = x_i \mid Y = y). $$

We can then focus on selecting the label $ y $ that maximizes this quantity. Note that $ P(X_i = x_i \mid Y = y) $ is the probability that the $ i $-th feature has value $ x_i $, given that the class label is $ y $.
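
As a concrete (hypothetical) illustration with $ n = 3 $ binary features, the factorization says

$$ P(X = (1, 0, 1) \mid Y = y) = P(X_1 = 1 \mid Y = y) \cdot P(X_2 = 0 \mid Y = y) \cdot P(X_3 = 1 \mid Y = y), $$

so we only need to estimate one number per feature per class, rather than a probability for every one of the $ 2^3 $ possible feature vectors.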

Decision Rule and Model Training

Under the “naive” assumption, the decision rule for labeling a data point $ x $ with the label $ \hat{y} $ is


$$ \hat{y} = \underset{y}{\text{argmax}} \, P(Y = y) \cdot \prod_{i = 1}^{n} P(X_i = x_i \mid Y = y). $$

We have to make yet another assumption, this time about $ P(X_i = x_i \mid Y = y) $ , in order to fit the model to data. There are a variety of ways to do so depending on the kind of data we have.

In our spam email example, each data point $ x = ( x_1, x_2, \dots, x_n ) $ could consist of binary variables, where $ x_i $ indicates whether the $ i $-th word of a vocabulary occurs in the email. We can then assume that, given the label $ Y $, each $ X_i $ follows a Bernoulli distribution. That is,


$$ P(X_i = x_i \mid Y = y) = p_{y,i} \, x_i + (1-p_{y,i})(1-x_i). $$

We then train the model, or fit the model to the data, by estimating, for each class $ y $, the probability $ p_{y,i} $ that the $ i $-th word appears in an email of that class. Note that multiplying many small numbers can be numerically unstable, but turning the product into a sum by taking logarithms would avoid this problem.
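
Here is a minimal from-scratch sketch of this Bernoulli variant for the spam example, using sums of log-probabilities for numerical stability. The four toy “emails”, the four-word vocabulary, and the labels are all made up for illustration. A pseudocount of 1 is also added when estimating each $ p_{y,i} $, which is the additive smoothing discussed in a later section.

```python
import numpy as np

# Toy binary word-occurrence data over a made-up vocabulary:
#   columns = ["free", "winner", "meeting", "report"]
X = np.array([
    [1, 1, 0, 0],   # spam
    [1, 0, 1, 0],   # spam
    [0, 0, 1, 1],   # ham
    [0, 1, 1, 1],   # ham
])
y = np.array([1, 1, 0, 0])   # 1 = spam, 0 = ham
classes = np.unique(y)

# Fit: estimate log P(Y = y) and the Bernoulli parameters p_{y,i},
# adding a pseudocount of 1 so that no probability is exactly 0 or 1.
log_prior, log_p, log_1mp = {}, {}, {}
for c in classes:
    Xc = X[y == c]
    p = (Xc.sum(axis=0) + 1) / (Xc.shape[0] + 2)
    log_prior[c] = np.log(Xc.shape[0] / X.shape[0])
    log_p[c], log_1mp[c] = np.log(p), np.log(1 - p)

def predict(x_new):
    """argmax_y [ log P(Y = y) + sum_i log P(X_i = x_i | Y = y) ],
    computed as a sum of logs instead of a product of small numbers."""
    scores = {
        c: log_prior[c] + np.sum(x_new * log_p[c] + (1 - x_new) * log_1mp[c])
        for c in classes
    }
    return max(scores, key=scores.get)

# A new "email" containing the words "free" and "winner".
print(predict(np.array([1, 1, 0, 0])))   # 1, i.e. classified as spam
```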

If the $ X_i $ are categorical variables, then we can instead estimate the conditional probability using


$$ P(X_i = x_i \mid Y = y) = \frac{ \text{number of times category } x_i \text{ appears in class } y } { \text{total number of data points with class label } y }. $$
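
A minimal sketch of this count-based estimate, using a made-up categorical feature column and made-up class labels:

```python
import numpy as np

# Made-up values of a single categorical feature X_i and the class labels.
xi_values = np.array(["red", "blue", "red", "green", "red"])
labels    = np.array([1, 0, 1, 1, 0])

def categorical_conditional(category, label):
    """Estimate P(X_i = category | Y = label) by counting."""
    in_class = labels == label
    return np.sum(xi_values[in_class] == category) / np.sum(in_class)

print(categorical_conditional("red", 1))  # 2/3: two of the three class-1 points are "red"
```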

Finally, if the $ X_i $ are continuous variables, a common choice is to assume that, given the label $ Y $, the variables are normally distributed.
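
Concretely, under this assumption each feature gets its own mean and variance per class (this is the parameterization used by Gaussian Naive Bayes in Scikit-learn), and the conditional density is

$$ P(X_i = x_i \mid Y = y) = \frac{1}{\sqrt{2 \pi \sigma_{y,i}^2}} \exp \left( - \frac{ (x_i - \mu_{y,i})^2 }{ 2 \sigma_{y,i}^2 } \right), $$

where $ \mu_{y,i} $ and $ \sigma_{y,i}^2 $ are estimated from the training points belonging to class $ y $.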

Pseudo-count and Additive Smoothing

There is yet another problem that we have to overcome, but thankfully it is also the last one. The problem is that the product $ \prod_{i = 1}^{n} P(X_i = x_i \mid Y = y) $ becomes zero the moment a single factor $ P(X_j = x_j \mid Y = y) = 0 $.

This means throwing out all of the information given by the other factors $ P(X_i = x_i \mid Y = y) > 0 $, just because a single one is zero. For example, this can happen in practice if, in a dataset of emails, a particular word occurs in non-spam emails but never in spam emails.

In the case when the features $ X_i $ are discrete, we can simply add a “pseudocount” to make sure that each possible feature value is counted as appearing at least once for each class label. These kinds of methods are known as additive smoothing. In the normal distribution case, when the $ x_i $ are continuous, we just need to be careful not to use features with zero variance.
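
For the Bernoulli spam model, one common form of this smoothed estimate, with pseudocount $ \alpha > 0 $ (where $ \alpha = 1 $ corresponds to classic Laplace smoothing, exposed in Scikit-learn as the alpha parameter of BernoulliNB), is

$$ \hat{p}_{y,i} = \frac{ (\text{number of class-} y \text{ emails containing word } i) + \alpha }{ (\text{number of class-} y \text{ emails}) + 2\alpha }. $$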

Surprising Real World Performance

So far, we have seen that the Bayes classifier has the lowest possible error among all classification rules. We also saw that the Naive Bayes classifier assumes the features are conditionally independent, which allows us to factor $ P(X = x \mid Y = y) $ into the product $ \prod_{i = 1}^{n} P(X_i = x_i \mid Y = y) $.

Since features in real-world data tend to be dependent, we might expect the naive assumption to lead to significant performance degradation. However, the Naive Bayes classifier is known to perform surprisingly well in real-world applications!

An explanation for this is given by the paper titled “The Optimality of Naive Bayes”. It shows that the dependencies among feature variables can cancel each other out, or be distributed so evenly among the classes that they do not have a big impact.


Work in progress: using Naive Bayes for sentiment analysis on a set of real-world data.