Ensemble Classifier
Combine Bayes, LSI, and kNN classifiers with weighted voting to get predictions that are often more accurate than those of a single model.
Combine multiple classifiers into an ensemble that can outperform any individual model. By combining the strengths of Bayes (fast, probabilistic), LSI (semantic understanding), and kNN (interpretable neighbors), you get more robust predictions.
What You’ll Learn
- Building an ensemble from multiple classifier types
- Weighted voting strategies
- Confidence-based model selection
- When ensembles help (and when they don’t)
Why Ensembles Work
Different classifiers have different strengths:
| Classifier | Strength | Weakness |
|---|---|---|
| Bayes | Fast, handles large vocab | Assumes word independence |
| LSI | Semantic similarity | Slower, needs tuning |
| kNN | Interpretable, no training | Slower at scale |
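When the classifiers disagree, the ensemble resolves the conflict using their weights and confidence scores instead of trusting one model blindly. When they agree, you can place more trust in the prediction.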
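To make the disagreement case concrete, here is a minimal sketch in plain Ruby that does not use the classifier gem. The prediction values are made up for illustration, and `count_votes`, `sum_confidence`, and `winner` are hypothetical helper names, not part of any library.

```ruby
# Made-up predictions for a single text: two lukewarm votes for "tech",
# one confident vote for "finance". Confidence is a percentage (0-100).
EXAMPLE_PREDICTIONS = {
  bayes: { category: "tech",    confidence: 40.0 },
  lsi:   { category: "finance", confidence: 95.0 },
  knn:   { category: "tech",    confidence: 35.0 }
}.freeze

# One vote per classifier, regardless of how sure it is.
def count_votes(predictions)
  predictions.values.each_with_object(Hash.new(0)) do |pred, votes|
    votes[pred[:category]] += 1
  end
end

# Each classifier adds its confidence (scaled to 0..1) to the category it picked.
def sum_confidence(predictions)
  predictions.values.each_with_object(Hash.new(0.0)) do |pred, totals|
    totals[pred[:category]] += pred[:confidence] / 100.0
  end
end

# Category with the highest score in a { category => score } hash.
def winner(scores)
  scores.max_by { |_, score| score }.first
end

puts "Majority vote:       #{winner(count_votes(EXAMPLE_PREDICTIONS))}"
puts "Confidence-weighted: #{winner(sum_confidence(EXAMPLE_PREDICTIONS))}"
```

With these numbers the majority vote picks tech (two votes to one), while confidence weighting picks finance, because one confident vote outweighs two hesitant ones.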
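API Consistency: Bayes, LogisticRegression, and kNN all share the same train() method, so most of the training code is uniform. LSI is the exception in this ensemble: it takes documents through add() and, with auto_rebuild disabled, needs a build_index call afterwards.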
Project Setup
mkdir ensemble_classifier && cd ensemble_classifier
# Gemfile
source 'https://rubygems.org'
gem 'classifier'
The Ensemble Classifier
Create ensemble_classifier.rb:
require 'classifier'
require 'json'

class EnsembleClassifier
  STRATEGIES = [:majority_vote, :weighted_vote, :confidence_weighted, :best_confidence]

  def initialize(strategy: :confidence_weighted)
    @bayes = nil
    @lsi = nil
    @knn = nil
    @strategy = strategy
    @categories = []
    @weights = { bayes: 1.0, lsi: 1.0, knn: 1.0 }
  end

  attr_accessor :weights

  # Train all classifiers with the same data
  def train(data_by_category)
    @categories = data_by_category.keys.map(&:to_s)

    # Initialize classifiers
    @bayes = Classifier::Bayes.new(*@categories)
    @lsi = Classifier::LSI.new(auto_rebuild: false)
    @knn = Classifier::KNN.new(k: 5, weighted: true)

    # Feed every classifier the same examples.
    # Bayes and kNN share train() with keyword arguments; LSI uses add().
    data_by_category.each do |category, items|
      items = Array(items)
      @bayes.train(category.to_sym => items)
      @lsi.add(category.to_s => items)     # LSI uses add (not a classifier per se)
      @knn.train(category.to_sym => items) # kNN supports train() too
    end

    # auto_rebuild is off, so build the LSI index once after all items are added
    @lsi.build_index
    self
  end

  # Classify using the ensemble
  def classify(text)
    predictions = get_all_predictions(text)
    result = combine_predictions(predictions)
    {
      category: result[:category],
      confidence: result[:confidence],
      strategy: @strategy,
      individual_predictions: predictions,
      agreement: calculate_agreement(predictions)
    }
  end

  # Get detailed breakdown
  def classify_with_details(text)
    result = classify(text)
    result.merge(
      explanation: explain_decision(result),
      recommendation: recommend_action(result)
    )
  end

  # Evaluate ensemble vs individual classifiers
  def evaluate(test_data)
    results = { ensemble: 0, bayes: 0, lsi: 0, knn: 0, total: 0 }

    test_data.each do |item|
      text = item[:text]
      expected = item[:category].to_s
      ensemble_result = classify(text)
      predictions = ensemble_result[:individual_predictions]

      results[:total] += 1
      results[:ensemble] += 1 if ensemble_result[:category] == expected
      results[:bayes] += 1 if predictions[:bayes][:category] == expected
      results[:lsi] += 1 if predictions[:lsi][:category] == expected
      results[:knn] += 1 if predictions[:knn][:category] == expected
    end

    # Calculate accuracies
    total = results[:total].to_f
    {
      ensemble: (results[:ensemble] / total * 100).round(1),
      bayes: (results[:bayes] / total * 100).round(1),
      lsi: (results[:lsi] / total * 100).round(1),
      knn: (results[:knn] / total * 100).round(1),
      total_samples: results[:total]
    }
  end

  def save(path)
    Dir.mkdir(path) unless Dir.exist?(path)

    @bayes.storage = Classifier::Storage::File.new(path: "#{path}/bayes.json")
    @bayes.save
    @lsi.storage = Classifier::Storage::File.new(path: "#{path}/lsi.json")
    @lsi.save
    @knn.storage = Classifier::Storage::File.new(path: "#{path}/knn.json")
    @knn.save

    File.write("#{path}/meta.json", {
      strategy: @strategy,
      weights: @weights,
      categories: @categories
    }.to_json)
  end

  def self.load(path)
    meta = JSON.parse(File.read("#{path}/meta.json"), symbolize_names: true)
    ensemble = new(strategy: meta[:strategy].to_sym)
    ensemble.weights = meta[:weights]
    ensemble.instance_variable_set(:@categories, meta[:categories])

    bayes_storage = Classifier::Storage::File.new(path: "#{path}/bayes.json")
    lsi_storage = Classifier::Storage::File.new(path: "#{path}/lsi.json")
    knn_storage = Classifier::Storage::File.new(path: "#{path}/knn.json")

    ensemble.instance_variable_set(:@bayes, Classifier::Bayes.load(storage: bayes_storage))
    ensemble.instance_variable_set(:@lsi, Classifier::LSI.load(storage: lsi_storage))
    ensemble.instance_variable_set(:@knn, Classifier::KNN.load(storage: knn_storage))
    ensemble
  end

  private

  def get_all_predictions(text)
    {
      bayes: get_bayes_prediction(text),
      lsi: get_lsi_prediction(text),
      knn: get_knn_prediction(text)
    }
  end

  def get_bayes_prediction(text)
    category = @bayes.classify(text)
    scores = @bayes.classifications(text)

    # Convert log probabilities to confidence. Subtract the largest score
    # before exponentiating so Math.exp does not underflow to 0.0 for every
    # category, which would leave a zero total to divide by.
    max_score = scores.values.max
    exp_scores = scores.transform_values { |s| Math.exp(s - max_score) }
    total = exp_scores.values.sum
    confidence = (exp_scores[category] / total * 100).round(1)

    { category: category, confidence: confidence, scores: scores }
  end

  def get_lsi_prediction(text)
    result = @lsi.classify_with_confidence(text)
    category = result[0]&.to_s
    confidence = ((result[1] || 0) * 100).round(1)
    { category: category, confidence: confidence }
  end

  def get_knn_prediction(text)
    result = @knn.classify_with_neighbors(text)
    category = result[:category]&.to_s
    # Treat a missing confidence as 0, matching the LSI helper
    confidence = ((result[:confidence] || 0) * 100).round(1)
    { category: category, confidence: confidence, neighbors: result[:neighbors] }
  end

  def combine_predictions(predictions)
    case @strategy
    when :majority_vote
      majority_vote(predictions)
    when :weighted_vote
      weighted_vote(predictions)
    when :confidence_weighted
      confidence_weighted(predictions)
    when :best_confidence
      best_confidence(predictions)
    else
      raise "Unknown strategy: #{@strategy}"
    end
  end

  # One vote per classifier; confidence is the winner's share of all classifiers
  def majority_vote(predictions)
    # Ignore classifiers that returned no category, as the weighted strategies do
    votes = predictions.values.map { |p| p[:category] }.compact
    return { category: nil, confidence: 0 } if votes.empty?

    winner = votes.group_by(&:itself).max_by { |_, v| v.size }.first
    vote_count = votes.count(winner)
    confidence = (vote_count.to_f / predictions.size * 100).round(1)
    { category: winner, confidence: confidence }
  end

  # Each classifier's vote counts as much as its configured weight
  def weighted_vote(predictions)
    scores = Hash.new(0.0)
    predictions.each do |classifier, pred|
      next unless pred[:category]
      scores[pred[:category]] += @weights[classifier]
    end

    winner = scores.max_by { |_, v| v }&.first
    total_weight = @weights.values.sum
    confidence = (scores[winner] / total_weight * 100).round(1)
    { category: winner, confidence: confidence }
  end

  # Each vote counts as weight times the classifier's own confidence
  def confidence_weighted(predictions)
    scores = Hash.new(0.0)
    predictions.each do |classifier, pred|
      next unless pred[:category]
      weight = @weights[classifier] * (pred[:confidence] / 100.0)
      scores[pred[:category]] += weight
    end

    winner = scores.max_by { |_, v| v }&.first
    total = scores.values.sum
    confidence = total.positive? ? (scores[winner] / total * 100).round(1) : 0
    { category: winner, confidence: confidence }
  end

  # Trust whichever classifier reports the highest confidence
  def best_confidence(predictions)
    best = predictions.max_by { |_, pred| pred[:confidence] }
    { category: best[1][:category], confidence: best[1][:confidence], chosen_by: best[0] }
  end

  def calculate_agreement(predictions)
    categories = predictions.values.map { |p| p[:category] }.compact
    return 0 if categories.empty?

    most_common = categories.group_by(&:itself).max_by { |_, v| v.size }
    (most_common[1].size.to_f / categories.size * 100).round(1)
  end

  def explain_decision(result)
    preds = result[:individual_predictions]
    agreement = result[:agreement]

    if agreement == 100
      "All classifiers agree on '#{result[:category]}'"
    elsif agreement >= 66
      "Majority (#{agreement.round}%) agree on '#{result[:category]}'"
    else
      disagreements = preds.map { |c, p| "#{c}=#{p[:category]}" }.join(", ")
      "Classifiers disagree (#{disagreements}), resolved by #{@strategy}"
    end
  end

  def recommend_action(result)
    if result[:confidence] >= 80 && result[:agreement] >= 66
      :auto_classify
    elsif result[:confidence] >= 50
      :suggest_with_review
    else
      :manual_review
    end
  end
end
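Converting Bayes log scores into a percentage involves exponentiating them. Log scores for longer texts can be large negative numbers, and Math.exp of such values underflows to 0.0, which leaves nothing to divide by. A common remedy is to subtract the largest score before exponentiating; this leaves the ratios between categories unchanged. The sketch below shows that idea in isolation. `log_scores_to_confidence` is a hypothetical helper name and the scores are invented.

```ruby
# Convert a { category => log_score } hash into percentages.
# Subtracting the maximum first keeps Math.exp from underflowing to 0.0
# for every category, without changing the resulting ratios.
def log_scores_to_confidence(log_scores)
  return {} if log_scores.empty?

  max = log_scores.values.max
  shifted = log_scores.transform_values { |s| Math.exp(s - max) }
  total = shifted.values.sum
  shifted.transform_values { |v| (v / total * 100).round(1) }
end

# Invented log scores, far too negative for a naive Math.exp.
scores = { "tech" => -1050.0, "sports" => -1052.0, "finance" => -1060.0 }

naive_total = scores.values.sum { |s| Math.exp(s) }
puts "Naive sum of exp: #{naive_total}" # underflows to 0.0
p log_scores_to_confidence(scores)
```

The best-scoring category always maps to exp(0) = 1 after the shift, so the total can never be zero.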
Training the Ensemble
Create train.rb:
require_relative 'ensemble_classifier'

ensemble = EnsembleClassifier.new(strategy: :confidence_weighted)

# Training data
training_data = {
  tech: [
    "New JavaScript framework released for frontend development",
    "Python machine learning library updated with GPU support",
    "Kubernetes deployment best practices for microservices",
    "React hooks tutorial for state management",
    "Database optimization techniques for PostgreSQL",
    "API design patterns for RESTful services",
    "Docker container security best practices",
    "TypeScript generics explained with examples",
  ],
  sports: [
    "Team wins championship after overtime victory",
    "Star player signs record-breaking contract",
    "Coach announces new training strategy for season",
    "League announces rule changes for next year",
    "Athlete breaks world record at competition",
    "Team trades draft pick for veteran player",
    "Stadium renovations completed before opener",
    "Player returns from injury ahead of schedule",
  ],
  finance: [
    "Stock market reaches all-time high amid earnings",
    "Federal Reserve announces interest rate decision",
    "Cryptocurrency volatility concerns investors",
    "Company reports quarterly earnings beat expectations",
    "Merger announcement drives stock price surge",
    "Economic indicators suggest recession concerns",
    "Investment strategies for volatile markets",
    "Banking sector faces regulatory changes",
  ],
  entertainment: [
    "New streaming series breaks viewership records",
    "Award show announces nominees for best picture",
    "Celebrity announces upcoming concert tour dates",
    "Movie sequel announced for summer release",
    "Album debuts at top of music charts",
    "TV show renewed for additional seasons",
    "Director reveals plans for franchise reboot",
    "Festival lineup announced with headliners",
  ]
}

ensemble.train(training_data)
ensemble.save('ensemble_model')

puts "Trained ensemble with #{training_data.keys.length} categories"
puts "Total examples: #{training_data.values.flatten.length}"
Classifying with the Ensemble
Create classify.rb:
require_relative 'ensemble_classifier'

ensemble = EnsembleClassifier.load('ensemble_model')

test_texts = [
  "The new React 19 release includes server components and improved hooks",
  "Lakers defeat Celtics in thrilling game seven overtime",
  "Fed raises rates as inflation concerns persist in economy",
  "Oscar nominations announced for best picture category",
  "Startup raises funding to build quantum computing platform", # Ambiguous
]

puts "=" * 70
puts "ENSEMBLE CLASSIFIER"
puts "=" * 70

test_texts.each do |text|
  puts "\nText: #{text[0..60]}..."
  puts "-" * 50

  result = ensemble.classify_with_details(text)
  puts "Result: #{result[:category]} (#{result[:confidence]}% confidence)"
  puts "Agreement: #{result[:agreement]}%"
  puts "Explanation: #{result[:explanation]}"
  puts "Recommendation: #{result[:recommendation]}"

  puts "\nIndividual predictions:"
  result[:individual_predictions].each do |classifier, pred|
    puts "  #{classifier.to_s.ljust(6)}: #{pred[:category]} (#{pred[:confidence]}%)"
  end
end
Comparing Strategies
Create compare_strategies.rb:
require_relative 'ensemble_classifier'

# Test data (separate from training!)
test_data = [
  { text: "Python library for data science released", category: "tech" },
  { text: "Team wins playoff series in seven games", category: "sports" },
  { text: "Stock prices fall amid market uncertainty", category: "finance" },
  { text: "New movie breaks box office records", category: "entertainment" },
  { text: "JavaScript framework simplifies web development", category: "tech" },
  { text: "Player traded to rival team for picks", category: "sports" },
  { text: "Central bank holds interest rates steady", category: "finance" },
  { text: "Concert tour announced for summer dates", category: "entertainment" },
  # Add more test cases...
]

strategies = [:majority_vote, :weighted_vote, :confidence_weighted, :best_confidence]

puts "=" * 60
puts "STRATEGY COMPARISON"
puts "=" * 60

strategies.each do |strategy|
  # Reload the saved model and switch only the combination strategy
  ensemble = EnsembleClassifier.load('ensemble_model')
  ensemble.instance_variable_set(:@strategy, strategy)

  accuracy = ensemble.evaluate(test_data)
  puts "\n#{strategy}:"
  puts "  Ensemble: #{accuracy[:ensemble]}%"
  puts "  vs Bayes: #{accuracy[:bayes]}% | LSI: #{accuracy[:lsi]}% | kNN: #{accuracy[:knn]}%"
end
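Once you have accuracy numbers, it is worth checking whether the ensemble actually beats its best member. The sketch below does that for one accuracy hash in the shape returned by evaluate. `ensemble_gain` is a hypothetical helper, and the sample numbers are placeholders for illustration, not measured results.

```ruby
# Given one accuracy hash in the shape returned by EnsembleClassifier#evaluate
# (keys :ensemble, :bayes, :lsi, :knn, values in percent), report the best
# individual classifier and how far the ensemble is ahead of it (or behind).
def ensemble_gain(accuracy)
  individual = accuracy.slice(:bayes, :lsi, :knn)
  best_name, best_score = individual.max_by { |_, score| score }

  {
    best_individual: best_name,
    best_individual_accuracy: best_score,
    gain: (accuracy[:ensemble] - best_score).round(1)
  }
end

# Placeholder numbers for illustration only, not measured results.
sample = { ensemble: 87.5, bayes: 75.0, lsi: 62.5, knn: 87.5, total_samples: 8 }
summary = ensemble_gain(sample)
puts "Best single model: #{summary[:best_individual]} (#{summary[:best_individual_accuracy]}%)"
puts "Ensemble gain: #{summary[:gain]} points"
```

A gain of zero or less suggests that, on this test set, the ensemble adds cost without adding accuracy. With only a handful of test cases the difference is noisy, so treat it as a hint, not a verdict.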
Tuning Weights
# Give more weight to classifiers that perform better on your domain
ensemble.weights = {
  bayes: 1.2, # Boost Bayes (fast, good for distinct categories)
  lsi: 0.8,   # Lower LSI (if semantic similarity less important)
  knn: 1.0    # Keep kNN normal
}

# Or tune based on evaluation
def auto_tune_weights(ensemble, validation_data)
  best_weights = ensemble.weights.dup
  best_accuracy = ensemble.evaluate(validation_data)[:ensemble]

  # Simple grid search over 5 x 5 x 5 = 125 weight combinations
  [0.5, 0.8, 1.0, 1.2, 1.5].each do |bayes_w|
    [0.5, 0.8, 1.0, 1.2, 1.5].each do |lsi_w|
      [0.5, 0.8, 1.0, 1.2, 1.5].each do |knn_w|
        ensemble.weights = { bayes: bayes_w, lsi: lsi_w, knn: knn_w }
        accuracy = ensemble.evaluate(validation_data)[:ensemble]

        if accuracy > best_accuracy
          best_accuracy = accuracy
          best_weights = ensemble.weights.dup
        end
      end
    end
  end

  ensemble.weights = best_weights
  { weights: best_weights, accuracy: best_accuracy }
end
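auto_tune_weights needs validation data that the classifiers were not trained on. One simple way to get it is to hold back a share of the labeled examples before training. The sketch below assumes the same { category => [texts] } shape used in train.rb. `split_by_category` and its parameters are hypothetical, and the fixed seed only makes the shuffle repeatable.

```ruby
# Split { category => [texts] } into a training hash (same shape) and a
# validation array of { text:, category: } hashes, the shape evaluate expects.
# Splitting per category keeps every category represented on both sides.
def split_by_category(data_by_category, validation_ratio: 0.25, seed: 42)
  rng = Random.new(seed)
  training = {}
  validation = []

  data_by_category.each do |category, texts|
    shuffled = texts.shuffle(random: rng)
    held_out = (shuffled.size * validation_ratio).round
    # Never hold out everything: leave at least one example for training.
    held_out = [[held_out, shuffled.size - 1].min, 0].max

    validation.concat(
      shuffled.first(held_out).map { |text| { text: text, category: category.to_s } }
    )
    training[category] = shuffled.drop(held_out)
  end

  [training, validation]
end

# Tiny stand-in data set: four examples per category.
data = {
  tech: ["a", "b", "c", "d"],
  sports: ["e", "f", "g", "h"]
}
training, validation = split_by_category(data)
puts "Training examples:   #{training.values.sum(&:size)}"
puts "Validation examples: #{validation.size}"
```

You would then train on the first return value and pass the second to auto_tune_weights, keeping a third, untouched set for the final accuracy check.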
When to Use Ensembles
Good for:
- High-stakes classification where accuracy matters
- Ambiguous text that might confuse single classifiers
- When you need confidence scoring for manual review routing
Not ideal for:
- Simple, clear-cut categories (single classifier is enough)
- Latency-sensitive applications (every prediction runs all three classifiers)
- Very large scale (all three models must be held in memory)
Best Practices
- Use validation data for tuning: Don’t tune on training data
- Monitor individual classifier performance: If one is always wrong, lower its weight
- Consider the agreement score: High disagreement = uncertain prediction
- Route low-confidence to humans: Use the recommendation field
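The routing advice in the last point can be sketched as a small dispatcher. It relies only on the :recommendation values that classify_with_details returns (:auto_classify, :suggest_with_review, :manual_review). The queue names and the `route` helper are hypothetical.

```ruby
# Map a result hash from classify_with_details to a (hypothetical) queue.
def route(result)
  case result[:recommendation]
  when :auto_classify
    { queue: :published, label: result[:category] }
  when :suggest_with_review
    { queue: :review, label: result[:category] } # pre-fill the suggestion
  else
    { queue: :manual, label: nil }               # a human decides from scratch
  end
end

# Stub result containing only the fields the router reads.
stub = { category: "finance", recommendation: :suggest_with_review }
routed = route(stub)
puts "Queue: #{routed[:queue]}, suggested label: #{routed[:label].inspect}"
```

Falling through to the manual queue in the else branch means an unexpected or missing recommendation is treated as the cautious case.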
Next Steps
- Bayes Basics - Understand probabilistic classification
- LSI Basics - Semantic similarity under the hood
- kNN Basics - Instance-based classification