Bayesian Classification Basics
Understand how the Bayesian classifier works and when to use it.
The Bayesian classifier uses Bayes’ theorem to calculate the probability that a piece of text belongs to each category. It’s simple, fast, and surprisingly effective for many text classification tasks.
How It Works
Naive Bayes classification works in three steps:
- Training: Count word frequencies for each category
- Classification: Calculate probability of each category given the words
- Decision: Return the category with highest probability
The Math (Simplified)
For a document with words w1, w2, w3, the probability of category C is:
P(C | w1, w2, w3) ∝ P(C) × P(w1|C) × P(w2|C) × P(w3|C)
Where:
- P(C) is the prior probability of category C
- P(w|C) is the probability of seeing word w in category C
The “naive” assumption is that words are independent of each other, which isn’t true but works well in practice.
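To make the three steps concrete, here is a minimal toy implementation in plain Ruby. It is independent of the classifier gem, and the add-one (Laplace) smoothing is an assumption to avoid zero probabilities; the gem's internals may differ:
class ToyBayes
  def initialize(*categories)
    @counts = categories.to_h { |c| [c, Hash.new(0)] }  # word counts per category
    @totals = Hash.new(0)                               # total words per category
    @docs   = Hash.new(0)                               # documents seen per category
  end

  # Step 1, training: count word frequencies for each category
  def train(category, text)
    @docs[category] += 1
    text.downcase.scan(/\w+/).each do |word|
      @counts[category][word] += 1
      @totals[category] += 1
    end
  end

  # Steps 2 and 3, classification and decision
  def classify(text)
    words      = text.downcase.scan(/\w+/)
    vocab_size = @counts.values.flat_map(&:keys).uniq.size
    total_docs = @docs.values.sum.to_f
    scores = @counts.keys.to_h do |c|
      score = Math.log(@docs[c] / total_docs)  # log P(C)
      words.each do |w|
        # log P(w|C) with add-one smoothing (an assumption, not the gem's exact formula)
        score += Math.log((@counts[c][w] + 1.0) / (@totals[c] + vocab_size))
      end
      [c, score]
    end
    scores.max_by { |_, s| s }.first  # highest (least negative) score wins
  end
end
Computing in log space (summing Math.log terms instead of multiplying raw probabilities) avoids floating-point underflow on longer documents, which is also why the real classifier reports log probabilities (see Understanding Scores below).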
Creating a Classifier
require 'classifier'
# Create with any number of categories
classifier = Classifier::Bayes.new 'Tech', 'Sports', 'Politics'
Training
Train the classifier by providing examples for each category:
# Method 1: Using the train method
classifier.train 'Tech', 'New JavaScript framework released'
classifier.train 'Sports', 'Team wins championship game'
# Method 2: Dynamic methods generated from the category names (more readable)
classifier.train_tech 'Apple announces new MacBook'
classifier.train_sports 'Soccer player signs new contract'
classifier.train_politics 'Senate passes new legislation'
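In practice you will usually train from a collection of labeled examples rather than line by line. A simple pattern (the training_data hash here is just an illustration):
training_data = {
  'Tech'     => ['New JavaScript framework released', 'Apple announces new MacBook'],
  'Sports'   => ['Team wins championship game', 'Soccer player signs new contract'],
  'Politics' => ['Senate passes new legislation']
}
training_data.each do |category, examples|
  examples.each { |text| classifier.train category, text }
end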
Training Tips
- More data is better: Accuracy improves significantly with more training examples
- Balance categories: Try to provide similar amounts of data for each category
- Use representative examples: Train with text similar to what you’ll classify
Classification
# Get the best category
result = classifier.classify 'The new iPhone has amazing features'
# => "Tech"
# Get scores for all categories
scores = classifier.classifications 'Congress debates tax reform'
# => {"Tech" => -15.2, "Sports" => -18.4, "Politics" => -8.1}
Understanding Scores
The scores returned by classifications are log probabilities:
- Scores are always negative
- Higher (less negative) = more likely
- Differences matter more than absolute values
To convert to relative probabilities:
scores = classifier.classifications(text)
# Normalize to get percentages
max_score = scores.values.max
normalized = scores.transform_values { |s| Math.exp(s - max_score) }
total = normalized.values.sum
percentages = normalized.transform_values { |v| (v / total * 100).round(1) }
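Applied to the example scores above, this gives roughly the following (subtracting max_score before calling Math.exp keeps the exponentials from underflowing):
# With scores = {"Tech" => -15.2, "Sports" => -18.4, "Politics" => -8.1}
percentages
# => {"Tech"=>0.1, "Sports"=>0.0, "Politics"=>99.9}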
When to Use Bayes
Good for:
- Spam detection
- Sentiment analysis (positive/negative)
- Topic categorization
- Language detection
- Any task with clear category boundaries
Not ideal for:
- Finding related documents (use LSI instead)
- Semantic similarity
- When word order matters significantly
Configuration Options
# Enable automatic stemming (on by default)
classifier = Classifier::Bayes.new 'A', 'B', enable_stemmer: true
# Use custom language for stemming
classifier = Classifier::Bayes.new 'A', 'B', language: 'fr'
# Disable threshold (classify everything, even low confidence)
classifier = Classifier::Bayes.new 'A', 'B', enable_threshold: false
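These options can be combined in a single call (a sketch, assuming the keyword arguments shown above may be passed together):
classifier = Classifier::Bayes.new 'A', 'B',
                                   language: 'fr',
                                   enable_stemmer: true,
                                   enable_threshold: false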
Example: Sentiment Analyzer
sentiment = Classifier::Bayes.new 'Positive', 'Negative'
# Train with examples
sentiment.train_positive "I love this product!"
sentiment.train_positive "Excellent service, highly recommend"
sentiment.train_positive "Best purchase I've ever made"
sentiment.train_negative "Terrible experience, avoid"
sentiment.train_negative "Waste of money"
sentiment.train_negative "Disappointing and frustrating"
# Classify new reviews
sentiment.classify "This is amazing!"
# => "Positive"
sentiment.classify "Complete garbage, don't buy"
# => "Negative"
Next Steps
- Training Strategies - Best practices for training data
- Persistence - Save and load trained classifiers
- Performance - Optimize for production use