
Naive Bayes


Theory

Conditional Probability

In probability theory, the conditional probability of an event B is the probability that the event will occur, given the knowledge that an event A has already occurred. $P(B \mid A)$ is the notation for the conditional probability of event B, given event A. If events A and B are not independent, then the probability of the intersection of A and B (the probability that both events occur) is defined by $P(A \cap B) = P(A)\,P(B \mid A)$. That is to say, from this definition we can derive the conditional probability as follows (assuming the probability of event A occurring is greater than 0):

$$P(B \mid A) = \frac{P(A \cap B)}{P(A)}$$

For the probability of the intersection of three events A, B, and C, we can follow the definition above and express it as $P(A \cap B \cap C) = P(A)\,P(B \mid A)\,P(C \mid A \cap B)$.
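As a concrete check of the definition (the numbers here are illustrative, not from the original page), suppose a fair six-sided die is rolled, A is the event "the roll is even" and B is the event "the roll is greater than 3":

$$P(A) = \tfrac{3}{6}, \qquad P(A \cap B) = P(\{4, 6\}) = \tfrac{2}{6}, \qquad P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{2/6}{3/6} = \tfrac{2}{3}.$$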

Bayes Theorem

According to the conditional probability definition, Bayes theorem describes the probability of an event, based on prior knowledge of conditions that might be related to the event. Bayes theorem is stated mathematically as follows:

$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)}$$

where A and B are events and $P(A)$ is greater than 0.

  • $P(B \mid A)$ is the conditional probability defined above: the probability that event B occurs, given that event A has already occurred.
  • $P(A \mid B)$ is the probability that event A occurs, given that event B has already occurred.
  • $P(A)$ and $P(B)$ are the marginal probabilities of event A and event B.

The theorem is based on the fact that the probability of event A is the sum of the conditional probabilities of event A given that event B has or has not occurred, which can be expressed as follows:

$$P(A) = P(A \mid B)\,P(B) + P(A \mid B^{c})\,P(B^{c})$$

If event A and event B are independent, then:

$$P(A \cap B) = P(A)\,P(B)$$

Therefore, Bayes theorem can also be written as the following equation:

$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A \mid B)\,P(B) + P(A \mid B^{c})\,P(B^{c})}$$
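As a quick numerical illustration (again with made-up numbers rather than anything from the original page), suppose $P(B) = 0.3$, $P(A \mid B) = 0.8$, and $P(A \mid B^{c}) = 0.2$. Then

$$P(B \mid A) = \frac{0.8 \times 0.3}{0.8 \times 0.3 + 0.2 \times 0.7} = \frac{0.24}{0.38} \approx 0.63.$$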


Naive Bayes Classifier

The naive Bayes classifier is a probabilistic classifier based on applying Bayes theorem with strong (naive) independence assumptions between the features.

Naive Bayes is a technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. It is a family of algorithms based on the principle of Bayes Theorem: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable.

Naive Bayes classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for naive Bayes models uses the method of maximum likelihood; that is to say, one can work with the naive Bayes model without accepting Bayesian probability or using any Bayesian methods. Despite their naive design and oversimplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. An advantage of naive Bayes is that it only requires a small amount of training data to estimate the parameters necessary for classification.
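As a minimal illustration of how quickly such a model can be trained and used, the sketch below fits a multinomial naive Bayes classifier with scikit-learn. Note that scikit-learn is not mentioned on this page; it is used here purely as an assumed, off-the-shelf stand-in for the classifier described above, and the data are invented.

```python
# A minimal sketch: training a multinomial naive Bayes classifier on tiny,
# made-up word-count data. Assumes scikit-learn is installed.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Rows are documents, columns are word counts for a 4-word vocabulary.
X_train = np.array([
    [3, 0, 1, 0],   # documents of class 0
    [2, 1, 0, 0],
    [0, 0, 2, 3],   # documents of class 1
    [0, 1, 1, 4],
])
y_train = np.array([0, 0, 1, 1])

# Fitting only requires counting word frequencies per class (maximum likelihood
# with additive smoothing), so training is a single pass over the data.
clf = MultinomialNB(alpha=1.0)  # alpha=1.0 corresponds to Laplace smoothing
clf.fit(X_train, y_train)

print(clf.predict(np.array([[1, 0, 0, 5]])))  # expected to predict class 1
```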

Method

Based on Bayes theorem, the conditional probability of class $c_k$, given the data $x$, can be written as follows:

$$P(y = c_k \mid x) = \frac{P(x \mid y = c_k)\,P(y = c_k)}{P(x)}$$

where:

  • $X$: the data matrix, an $n \times d$ matrix

    ($n$ instances, each with $d$ dimensions (features); each instance belongs to one of $K$ possible classes (levels))

  • $x = (x_1, x_2, \dots, x_d)$: an instance in the data

  • $y$: the label vector, an $n \times 1$ vector

  • $c_k$: the label for level $k$, $k = 1, \dots, K$
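To make the intractability discussed next concrete (this expansion is a standard identity, not something spelled out on the original page), the class-conditional likelihood above can always be factored by the chain rule without any assumptions:

$$P(x \mid y = c_k) = P(x_1 \mid y = c_k)\,P(x_2 \mid x_1, y = c_k)\cdots P(x_d \mid x_1, \dots, x_{d-1}, y = c_k),$$

and estimating each factor requires modeling interactions among up to $d$ features, which quickly becomes infeasible as $d$ grows.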


However, in practice, it is intractable to deal with such joint probabilities directly, especially when the dimensionality of the data is extremely high (that is to say, when $d$ is very large).

Hence, "naive" Bayes assumes that each word conditionally independent of the others, given the label . Therefore, the conditional probabilities of each word, given the label , will not be affected by other words: . Then the original problem above will be:




where:

  • $X$: the data matrix, an $n \times d$ matrix

    ($n$ instances, each with $d$ dimensions (features); each instance belongs to one of $K$ possible classes (levels))

  • $x = (x_1, x_2, \dots, x_d)$: an instance in the data

  • $x_j$: the value for feature $j$, $j = 1, \dots, d$

  • $y$: the label vector, an $n \times 1$ vector

  • $c_k$: the label for level $k$, $k = 1, \dots, K$


Since the denominator $P(x)$ is the same for every label, we can effectively ignore it as a constant, thereby giving us a decision rule: we compute the (unnormalized) probability for each label $c_k$, then perform the classification and return the label with the highest probability as our prediction, which is obtained as follows:

$$\hat{y} = \arg\max_{c_k}\; P(y = c_k) \prod_{j=1}^{d} P(x_j \mid y = c_k)$$
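Below is a minimal Python sketch of this decision rule, assuming the priors and per-feature conditional probabilities have already been estimated. The numbers are invented placeholders for binary features, not values from the original project, and the log transform it uses is discussed further in the next section.

```python
import math

# Hypothetical, made-up parameters for a 2-class, 3-feature problem:
# priors[k] = P(y = c_k); cond[k][j] = P(x_j = 1 | y = c_k) for binary features.
priors = [0.6, 0.4]
cond = [
    [0.8, 0.1, 0.3],   # class 0
    [0.2, 0.7, 0.6],   # class 1
]

def predict(x):
    """Return the class index maximizing log P(c_k) + sum_j log P(x_j | c_k)."""
    scores = []
    for k, prior in enumerate(priors):
        score = math.log(prior)
        for j, value in enumerate(x):
            p = cond[k][j] if value == 1 else 1.0 - cond[k][j]
            score += math.log(p)
        scores.append(score)
    return max(range(len(scores)), key=lambda k: scores[k])

print(predict([0, 1, 1]))  # class 1 with these placeholder numbers
```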
NB in Document Classification

In document classification, we are interested in the question: "What is the probability of the document's class, given the document itself?" If we are using individual word counts as features, then we consider a document as a single instance with $V$ features, where $V$ is the number of unique words in the vocabulary. Conditional independence in naive Bayes imposes the bag-of-words model, where word order does not matter, only word frequency.

Then, the word probabilities can be computed from simple term frequencies. Specifically, the probability of a particular word $w_i$ given a label $c_k$, $P(w_i \mid c_k)$, is the number of times word $w_i$ appears in documents of label $c_k$ divided by the total number of words appearing in documents of label $c_k$.

For example, to obtain the label with the largest probability, we have to calculate the following two probabilities for each label $c_k$:

  • The conditional probability of word $i$, given label $k$, is the word count of word $i$ in documents of label $k$ divided by the total word count in documents of label $k$:

    $$P(w_i \mid c_k) = \frac{\operatorname{count}(w_i, c_k)}{\sum_{w} \operatorname{count}(w, c_k)}$$

  • The prior probability of label $k$ is the number of documents of label $k$ divided by the total number of documents in the whole training set:

    $$P(c_k) = \frac{N_{c_k}}{N}$$
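A small Python sketch of these two estimates is given below; the token lists and label names are invented for illustration, and the original project's data and code are not shown on this page.

```python
from collections import Counter, defaultdict

# Hypothetical toy training set: (tokenized document, label) pairs.
train_docs = [
    (["win", "cash", "now"], "spam"),
    (["cash", "prize", "now"], "spam"),
    (["meeting", "at", "noon"], "ham"),
    (["lunch", "meeting", "tomorrow"], "ham"),
]

# count(w_i, c_k): word counts per label, and N_{c_k}: document counts per label.
word_counts = defaultdict(Counter)
doc_counts = Counter()
for tokens, label in train_docs:
    word_counts[label].update(tokens)
    doc_counts[label] += 1

def cond_prob(word, label):
    """P(w_i | c_k): count of the word in this label / total words in this label."""
    return word_counts[label][word] / sum(word_counts[label].values())

def prior(label):
    """P(c_k): documents of this label / total documents."""
    return doc_counts[label] / sum(doc_counts.values())

print(cond_prob("cash", "spam"))  # 2/6
print(prior("ham"))               # 2/4
```

Note that this unsmoothed estimate returns 0 for any word never seen with a given label, which is exactly the problem discussed below.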

Since many probabilities are multiplied, this can result in floating-point underflow. It is therefore better to perform the computation by applying a logarithm transformation to the probabilities: the multiplication of probabilities becomes a summation of log-probabilities. The class with the highest log-probability score is still the most probable.
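A quick, self-contained illustration of why the log transform matters (the probability value 1e-5 and the count of 500 words are arbitrary, illustrative choices):

```python
import math

p = 1e-5          # an arbitrary small per-word probability
n_words = 500     # an arbitrary document length

product = 1.0
log_sum = 0.0
for _ in range(n_words):
    product *= p            # underflows to 0.0 long before the loop ends
    log_sum += math.log(p)  # stays at a finite, representable value

print(product)   # 0.0 due to floating-point underflow
print(log_sum)   # approximately -5756.46
```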

Notice that if a word $j$ appears in the test data but not in the training data, the training count of word $j$ will be 0, and the conditional probability of word $j$ given label $k$ will be 0. Based on the equation above, the product of all the probabilities will then be 0. Since we cannot guarantee that every word appears in both the training data and the test data, this zero-count problem is solved by Laplace smoothing.

Drawbacks

The conditional independence assumption of naive Bayes is critical: it substantially reduces the number of parameters that must be estimated. It works well in practice; however, it has some important drawbacks to be aware of:

  1. Not Conditionally Independent

    In practice, replicates are common in the data; they are clearly dependent entities, but naive Bayes will treat them as independent of each other, given the class label.

  2. Continuous Attributes

    Most data, including the data used in document classification, are not boolean; rather, they are continuous. We can modify naive Bayes to Gaussian naive Bayes, where each attribute is assumed to be conditionally Gaussian given the class, but then we assume our data are Gaussian, which may not be true in practice (a minimal form of this model is sketched after this list).

  3. Observed data in testing but not training

    In document classification, the parameters of the naive Bayes classifier are estimated from word counts. However, when a word appears in the test set but not in the training set, the corresponding conditional probability will be 0 by default. The problem is that a single probability of 0 in the product makes the entire score for that document 0.
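For the continuous-attribute case mentioned in drawback 2, Gaussian naive Bayes replaces the word-count estimate of $P(x_j \mid c_k)$ with a normal density whose mean and variance are estimated per feature and per class. This is the standard form of the model, not something taken from the original page:

$$P(x_j \mid y = c_k) = \frac{1}{\sqrt{2\pi\sigma_{jk}^{2}}}\exp\!\left(-\frac{(x_j - \mu_{jk})^{2}}{2\sigma_{jk}^{2}}\right)$$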

Optimization

Laplace Smoothing

Laplace smoothing is also called additive smoothing or Lidstone smoothing (but it is different from Laplacian smoothing); it is a technique used to smooth categorical data.

In the naive Bayes document classifier, Laplace smoothing is used to fix the problem of zero word counts. For words that appear in the test data but not in the training data, Laplace smoothing solves the problem of a 0 conditional probability as follows:

$$P(w_i \mid c_k) = \frac{\operatorname{count}(w_i, c_k) + 1}{\sum_{w} \operatorname{count}(w, c_k) + V}$$
Therefore, when a word count is zero, the numerator will be 1 instead of 0, and the conditional probability will not be 0 either. $V$ in the denominator is the number of distinct words in the whole training data. We applied Laplace smoothing to all the conditional probabilities in our naive Bayes classifier.
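Building on the counting sketch earlier, a Laplace-smoothed version of the conditional probability could look like the following; this is a hedged illustration with invented counts, not the project's actual implementation.

```python
from collections import Counter

# Hypothetical word counts for one label, e.g. count(w, "spam") from a toy corpus.
spam_counts = Counter({"win": 1, "cash": 2, "now": 2, "prize": 1})
vocab = {"win", "cash", "now", "prize", "meeting", "lunch"}  # V = 6 distinct words

def smoothed_cond_prob(word, counts, vocab_size):
    """Laplace-smoothed P(w_i | c_k): (count + 1) / (total count + V)."""
    return (counts[word] + 1) / (sum(counts.values()) + vocab_size)

print(smoothed_cond_prob("cash", spam_counts, len(vocab)))     # (2 + 1) / (6 + 6) = 0.25
print(smoothed_cond_prob("meeting", spam_counts, len(vocab)))  # unseen word: (0 + 1) / 12
```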

