Logistic Regression Basics
Learn how the Logistic Regression classifier works for accurate, well-calibrated text classification.
Logistic Regression Classification
The Logistic Regression classifier uses gradient descent to learn discriminative decision boundaries between categories. Its softmax output produces probabilities that always sum to 1.0 and tend to be well-calibrated, making it well suited to confidence-based decision making.
How It Works
Logistic Regression classification works in three steps:
- Training: Accumulate training examples with their word features
- Fitting: Use Stochastic Gradient Descent (SGD) to learn optimal weights
- Classification: Apply learned weights and softmax to get probabilities
The Math (Simplified)
For a document's word counts and a category C, the score is:
score(C) = bias(C) + Σ(weight(C, word) × count(word))
Scores are converted to probabilities using the softmax function:
P(C) = exp(score(C)) / Σ exp(score(all categories))
This ensures probabilities are always between 0 and 1 and sum to 1.0.
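As a rough illustration of these formulas (a sketch only, not the gem's internal implementation), here is one SGD update for a single training document in plain Ruby:
# Illustrative sketch -- NOT the library's internal code. One SGD update for
# multinomial logistic regression using the score and softmax formulas above.
# `weights` is {category => {word => Float}}, `biases` is {category => Float},
# `counts` is {word => Integer} for one training document.
def sgd_step(weights, biases, counts, true_category, learning_rate, l2)
  # Forward pass: score(C) = bias(C) + Σ(weight(C, word) × count(word))
  scores = weights.keys.to_h do |c|
    [c, biases[c] + counts.sum { |word, n| weights[c].fetch(word, 0.0) * n }]
  end
  # Softmax, shifted by the max score for numerical stability
  max_score = scores.values.max
  exps = scores.transform_values { |s| Math.exp(s - max_score) }
  total = exps.values.sum
  # Gradient step: push each category's probability toward its 0/1 target
  weights.each_key do |c|
    error = exps[c] / total - (c == true_category ? 1.0 : 0.0)
    biases[c] -= learning_rate * error
    counts.each do |word, n|
      old = weights[c].fetch(word, 0.0)
      weights[c][word] = old - learning_rate * (error * n + l2 * old)
    end
  end
end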
Creating a Classifier
require 'classifier'
# Create with two or more categories
classifier = Classifier::LogisticRegression.new([:spam, :ham])
# With custom hyperparameters
classifier = Classifier::LogisticRegression.new(
  [:spam, :ham],
  learning_rate: 0.1,    # Step size for gradient descent
  regularization: 0.01,  # L2 regularization strength
  max_iterations: 100,   # Maximum training iterations
  tolerance: 1e-4        # Convergence threshold
)
Training
Train the classifier by providing examples for each category:
# Keyword arguments (recommended)
classifier.train(spam: 'Buy cheap viagra now!!!')
classifier.train(ham: 'Meeting tomorrow at 3pm')
# Batch training with arrays
classifier.train(
  spam: ['You won $1M!', 'Free money instantly'],
  ham: ['Project update', 'Lunch tomorrow?']
)
# Legacy APIs (still work)
classifier.train :spam, 'Click here for free stuff'
classifier.train_spam 'Limited time offer!'
# Stream training for large datasets
classifier.train_from_stream(:spam, File.open('spam_corpus.txt'), batch_size: 500)
Lazy Fitting
The model is not fitted during training. It automatically fits when you first call classify or probabilities. You can also fit manually:
# Manual fitting (optional)
classifier.fit
# Check if fitted
classifier.fitted? # => true
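Putting the lifecycle together (the returned values follow from the lazy-fitting behavior described above):
classifier = Classifier::LogisticRegression.new([:spam, :ham])
classifier.train(spam: 'Free money now', ham: 'See you at the meeting')
classifier.fitted?                 # => false -- training only accumulates data
classifier.classify('free offer')  # first classification triggers an automatic fit
classifier.fitted?                 # => true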
Classification
# Get the best category
result = classifier.classify 'Claim your free prize now'
# => "Spam"
# Get well-calibrated probabilities (always sum to 1.0)
probs = classifier.probabilities 'Limited time offer'
# => {"Spam" => 0.92, "Ham" => 0.08}
# Get raw log-odds scores
scores = classifier.classifications 'Quarterly review scheduled'
# => {"Spam" => -2.1, "Ham" => 1.4}
Understanding Probabilities
Unlike Naive Bayes, Logistic Regression produces true probabilities:
- Values are always between 0 and 1
- All probabilities sum to exactly 1.0
- Well-calibrated: when the model reports 80% confidence, it tends to be right about 80% of the time
This makes threshold-based decisions reliable:
probs = classifier.probabilities(email_text)
if probs['Spam'] > 0.95
  # High confidence - auto-filter
  move_to_spam(email)
elsif probs['Spam'] > 0.5
  # Medium confidence - flag for review
  flag_for_review(email)
else
  # Low confidence - deliver normally
  deliver(email)
end
Feature Weights
Inspect which words are most predictive for each category:
# Get all weights for a category (sorted by importance)
weights = classifier.weights(:spam)
# => {:free => 2.3, :buy => 1.8, :money => 1.5, :meeting => -1.2, ...}
# Get top 10 most important features
top_features = classifier.weights(:spam, limit: 10)
Weight interpretation:
- Positive weights: Features that support this category
- Negative weights: Features that contradict this category
- Higher absolute value: More predictive power
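Since weights returns a plain Hash of word => weight (as shown above), you can slice it however you like. For example:
# Assumes weights(:spam) returns a Hash of word => Float, as shown above
spam_weights = classifier.weights(:spam)
supporting      = spam_weights.select { |_word, w| w.positive? }  # evidence for :spam
contradicting   = spam_weights.select { |_word, w| w.negative? }  # evidence against :spam
most_predictive = spam_weights.max_by(5) { |_word, w| w.abs }     # strongest either way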
When to Use Logistic Regression
Good for:
- When you need well-calibrated probabilities
- Confidence-based decision making (threshold filtering)
- When interpretability matters (inspectable weights)
- Multi-class classification
- When accuracy is more important than training speed
Not ideal for:
- Incremental training (requires re-fitting for new data; see the sketch after this list)
- Very large vocabularies (memory for weight matrix)
- When you need untraining support (use Bayes)
- Semantic similarity (use LSI)
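A minimal sketch of that re-fit cycle, assuming (as described under Lazy Fitting) that newly accumulated examples only affect predictions after the next fit:
# New examples accumulate but do not change predictions until the next fit
classifier.train(spam: 'Brand new spam pattern')
classifier.fit  # re-fit so the new example influences the learned weights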
Comparison with Other Classifiers
| Feature | Logistic Regression | Naive Bayes | kNN |
|---|---|---|---|
| Training | Batch (accumulate then fit) | Incremental | Instance-based |
| Probabilities | Well-calibrated (sum to 1.0) | Log probabilities | Confidence scores |
| Untraining | Not supported | Supported | Remove instances |
| Speed | Slower training, fast inference | Very fast | Slow inference |
| Interpretability | Feature weights | Word frequencies | Similar neighbors |
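To make the untraining row concrete, here is a sketch assuming the gem's Bayes classifier follows its usual train_/untrain_ method convention:
# Bayes can remove a previously trained example; Logistic Regression cannot
bayes = Classifier::Bayes.new('Spam', 'Ham')
bayes.train_spam 'free money now'
bayes.untrain_spam 'free money now'  # supported: removes the example's counts
# Logistic Regression has no untrain -- retrain and re-fit the model instead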
Multi-Class Classification
Logistic Regression handles multiple categories naturally:
classifier = Classifier::LogisticRegression.new(
  [:tech, :sports, :politics, :entertainment]
)
classifier.train(
  tech: ['New iPhone announced', 'Python 4.0 released'],
  sports: ['Lakers win championship', 'World Cup finals'],
  politics: ['Senate passes bill', 'Election results'],
  entertainment: ['Oscar nominations', 'New movie premiere']
)
probs = classifier.probabilities 'Breaking: Major tech company IPO'
# => {"Tech" => 0.72, "Sports" => 0.05, "Politics" => 0.15, "Entertainment" => 0.08}
Thread Safety
The classifier is thread-safe for concurrent access:
# Safe to classify from multiple threads
threads = 10.times.map do |i|
  Thread.new { classifier.classify(texts[i]) }
end
results = threads.map(&:value)  # join each thread and collect its result
Streaming & Batch Training
For large datasets, use batch training with progress callbacks:
classifier = Classifier::LogisticRegression.new([:spam, :ham])
# Batch training with progress tracking
classifier.train_batch(:spam, spam_documents, batch_size: 1000) do |progress|
  puts "#{progress.percent}% complete (#{progress.rate.round} docs/sec)"
end
# Train multiple categories at once
classifier.train_batch(
  spam: spam_documents,
  ham: ham_documents,
  batch_size: 500
) do |progress|
  puts "Processed #{progress.completed} documents"
end
# Fit after batch training so the cost is paid now, not on the first classify
classifier.fit
For files too large to load into memory, stream line-by-line:
File.open('spam_corpus.txt', 'r') do |file|
  classifier.train_from_stream(:spam, file, batch_size: 1000) do |progress|
    puts "Processed #{progress.completed} lines"
  end
end
File.open('ham_corpus.txt', 'r') do |file|
  classifier.train_from_stream(:ham, file, batch_size: 1000)
end
# Call fit after streaming so fitting happens at a predictable time
classifier.fit
Unlike Bayes, Logistic Regression accumulates training data during streaming and only fits the model when fit is called (or lazily, on the first classification). This keeps streaming efficient for large datasets, but calling fit explicitly is recommended so the fitting cost is paid at a predictable point rather than on the first classify.
See the Streaming Training Tutorial for checkpoints and resumable training.
Example: Spam Filter with Confidence Levels
spam_filter = Classifier::LogisticRegression.new([:spam, :ham])
# Train with examples
spam_filter.train(
  spam: [
    'Buy cheap viagra now!!!',
    'You won $1 million dollars!',
    'Click here for free iPhone',
    'Limited time offer - act now!'
  ],
  ham: [
    'Meeting tomorrow at 3pm',
    'Quarterly report attached',
    'Can you review this document?',
    'Lunch next week?'
  ]
)
# Classify with confidence-based handling. The classifier is passed in
# explicitly because a top-level def cannot see the spam_filter local variable.
def process_email(email, spam_filter)
  probs = spam_filter.probabilities(email.body)
  case
  when probs['Spam'] > 0.95
    { action: :delete, reason: 'High confidence spam' }
  when probs['Spam'] > 0.7
    { action: :quarantine, reason: 'Likely spam' }
  when probs['Spam'] > 0.4
    { action: :flag, reason: 'Suspicious content' }
  else
    { action: :deliver, reason: 'Appears legitimate' }
  end
end
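Hypothetical usage, assuming a simple Email struct with a body field:
# Email is a stand-in type for illustration only
Email = Struct.new(:body)
decision = process_email(Email.new('Claim your free prize now'), spam_filter)
puts "#{decision[:action]}: #{decision[:reason]}"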
# Inspect what the model learned
puts "Top spam indicators:"
spam_filter.weights(:spam, limit: 5).each do |word, weight|
  puts "  #{word}: #{weight.round(2)}"
end
Next Steps
- Streaming Training - Train on large datasets with progress tracking
- Persistence - Save and load trained classifiers
- Real-time Pipeline - Build a production-ready classification pipeline