Document Classification Using Multinomial Naive Bayes Classifier
Document classification is a classical machine learning problem. If there is a set of documents that is already categorized/labeled in existing categories, the task is to automatically categorize a new document into one of the existing categories. In this blog, I will elaborate upon the machine learning technique to do this.
We have an existing set of documents (D1-D5) that are categorized into Auto, Sports, and Computer.
Document # | Content | Category |
D1 | Saturn Dealer’s Car | Auto |
D2 | Toyota Car Tercel | Auto |
D3 | Baseball Game Play | Sports |
D4 | Pulled Muscle Game | Sports |
D5 | Colored GIFs Root | Computer |
Now the task is to categorize the new D6 and D7 into Auto, Sports, or Computer.
Document # | Content | Category |
D6 | Home Runs Game | ? |
D7 | Car Engine Noises | ? |
In machine learning, the given set of documents used to train the probabilistic model is called the training set.
The problem can be solved by the classification technique of machine learning. There are several machine learning algorithms that can be tried out, including:
- Pipeline
- BernoulliNB
- MultinomialNB
- NearestCentroid
- SGD Classifier
- LinearSVC
- RandomForestClassifier
- KNeighborsClassifier
- PassiveAggressiveClassifier
- Perceptron
- RidgeClassifier
Feel free to try out these algorithms for yourself; I found Multinomial Naive Bayes to be one of the most effective algorithms for this purpose.
In this blog, I will also provide an application of Multinomial Naive Bayes. I recommend going through the following topics to build a strong foundation of this concept.
Applying Multinomial Bayes Classification
Step 1
Calculate prior probabilities. These are the probability of a document being in a specific category from the given set of documents.
P(Category) = (No. of documents classified into the category) divided by (Total number of documents)
P(Auto) = (No of documents classified into Auto)divided by (Total number of documents) = 2/5 = 0.4
P(Sports) = 2/5 = 0.4
P(Computer) = 1/5 = 0.2
Step 2
Calculate Likelihood. Likelihood is the conditional probability of a word occurring in a document given that the document belongs to a particular category.
P(Word/Category) = (Number of occurrence of the word in all the documents from a category+1) divided by (All the words in every document from a category + Total number of unique words in all the documents)
P(Saturn/Auto) = (Number of occurrence of the word “SATURN” in all the documents in “AUTO”+1) divided by (All the words in every document from “AUTO” + Total number of unique words in all the documents)
= (1+1)/(6+13) = 2/19 = 0.105263158
The tables below provide conditional probabilities for each word in Auto, Sports, and Computer.
Auto
Word | # of Occurrences of Word in Auto | Total Words in Auto | Conditional Probability of Given Word in Auto | # of Total Unique Words in All Documents |
Saturn | 1 | 6 | 0.105263158 | 13 |
Dealers | 1 | 6 | 0.105263158 | 13 |
Car | 2 | 6 | 0.157894737 | 13 |
Toyota | 1 | 6 | 0.105263158 | 13 |
Tercel | 1 | 6 | 0.105263158 | 13 |
Baseball | 0 | 6 | 0.052631579 | 13 |
Game | 0 | 6 | 0.052631579 | 13 |
Play | 0 | 6 | 0.052631579 | 13 |
Pulled | 0 | 6 | 0.052631579 | 13 |
Muscle | 0 | 6 | 0.052631579 | 13 |
Colored | 0 | 6 | 0.052631579 | 13 |
GIFs | 0 | 6 | 0.052631579 | 13 |
Root | 0 | 6 | 0.052631579 | 13 |
Home | 0 | 6 | 0.052631579 | 13 |
Runs | 0 | 6 | 0.052631579 | 13 |
Engine | 0 | 6 | 0.052631579 | 13 |
Noises | 0 | 6 | 0.052631579 | 13 |
Sports
Word | # of Occurrences of Word in Sports |
Total Words in Sports | Conditional Probability of Given Word | # of Total Unique Words in All Documents |
Saturn | 0 | 6 | 0.052631579 | 13 |
Dealers | 0 | 6 | 0.052631579 | 13 |
Car | 0 | 6 | 0.052631579 | 13 |
Toyota | 0 | 6 | 0.052631579 | 13 |
Tercel | 0 | 6 | 0.052631579 | 13 |
Baseball | 1 | 6 | 0.105263158 | 13 |
Game | 2 | 6 | 0.157894737 | 13 |
Play | 1 | 6 | 0.105263158 | 13 |
Pulled | 1 | 6 | 0.105263158 | 13 |
Muscle | 1 | 6 | 0.105263158 | 13 |
Colored | 1 | 6 | 0.105263158 | 13 |
GIFs | 1 | 6 | 0.105263158 | 13 |
Root | 1 | 6 | 0.105263158 | 13 |
Home | 0 | 6 | 0.052631579 | 13 |
Runs | 0 | 6 | 0.052631579 | 13 |
Engine | 0 | 6 | 0.052631579 | 13 |
Noises | 0 | 6 | 0.052631579 | 13 |
Computer
Word | # of Occurrences of Word in Computer | Total Words in Computer | Conditional Probability of Given Word in Computer | # of Total Unique Words in All Documents |
Saturn | 0 | 3 | 0.0625 | 13 |
Dealers | 0 | 3 | 0.0625 | 13 |
Car | 0 | 3 | 0.0625 | 13 |
Toyota | 0 | 3 | 0.0625 | 13 |
Tercel | 0 | 3 | 0.0625 | 13 |
Baseball | 0 | 3 | 0.0625 | 13 |
Game | 0 | 3 | 0.0625 | 13 |
Play | 0 | 3 | 0.0625 | 13 |
Pulled | 0 | 3 | 0.0625 | 13 |
Muscle | 0 | 3 | 0.0625 | 13 |
Colored | 1 | 3 | 0.125 | 13 |
GIFs | 1 | 3 | 0.125 | 13 |
Root | 1 | 3 | 0.125 | 13 |
Home | 0 | 3 | 0.0625 | 13 |
Runs | 0 | 3 | 0.0625 | 13 |
Engine | 0 | 3 | 0.0625 | 13 |
Noises | 0 | 3 | 0.0625 | 13 |
Step 3
Calculate P(Category/Document) = P(Category) * P(Word1/Category) * P(Word2/Category) * P(Word3/Category)
P(Auto/D6) = P(Auto) * P(Engine/Auto) * P(Noises/Auto) * P(Car/Auto)
= (0.4) * (0.052631579) * (0.157894737)
= (0.00005831754)
P(Sports/D6) = 0.000174953
P(Computers/D6) = 0.00004882813
The most probable category for D6 to fall into is Sports, because it has the highest probability among its peers.
P(Auto/D7) = 0.00017495262
P(Sports/D7) = 0.0000583175
P(Computers/D7) = 0.00004882813
The most probable category for D7 to fall into is Auto, because it has the highest probability among its peers.
The Multinomial Naive Bayes technique is pretty effective for document classification.
Before concluding, I would recommend exploring following Python Packages, which provide great resources to learn classification techniques along with the implementation of several classification algorithms.
- http://scikit-learn.org/stable/
- http://www.nltk.org/
- href=”http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html” target=”_blank” rel=”noopener noreferrer”>http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
I hope you enjoyed reading this. If you have any questions or queries, please leave a comment below. I highly appreciate your feedback!