kNN Basics

Instance-based classification with k-Nearest Neighbors for interpretable results.

k-Nearest Neighbors (kNN) is an instance-based classifier that stores examples and classifies new text by finding the most similar ones. Unlike Bayes, there’s no training phase—just add examples and classify.

How It Works

  1. Store examples: Each example is stored with its category
  2. Find neighbors: For new text, find the k most similar stored examples
  3. Vote: The category with the most neighbors wins (see the sketch below)
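
In plain Ruby, the decision rule looks roughly like this. This is an illustrative sketch, not the gem's actual internals; similarity stands in for the semantic measure described below:

# Illustrative sketch of the kNN decision rule, not the gem's actual code.
# Assumes `similarity(a, b)` returns a score in 0..1.
def knn_classify(examples, new_text, k)
  neighbors = examples.max_by(k) { |ex| similarity(ex[:text], new_text) }
  votes = Hash.new(0)
  neighbors.each { |ex| votes[ex[:category]] += 1 }
  votes.max_by { |_category, count| count }.first
end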

kNN uses LSI (Latent Semantic Indexing) under the hood for semantic similarity, so “dog” and “puppy” are recognized as related.
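
For example (illustrative; whether a relation like dog/puppy is actually picked up depends on the breadth of your training corpus):

knn = Classifier::KNN.new(k: 1)
knn.train(pets: "My dog loves playing fetch in the park")
knn.train(finance: "Quarterly earnings beat analyst estimates")

knn.classify "The puppy chased the ball"
# => "pets" (matched on meaning, not on a shared keyword)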

Creating a Classifier

require 'classifier'

knn = Classifier::KNN.new(k: 3)  # Use 3 nearest neighbors

Adding Examples

kNN supports two equivalent methods for adding labeled examples: train (consistent with Bayes and LogisticRegression) and add (kNN-specific terminology).

# Using train (consistent API across all classifiers)
knn.train(spam: "Buy now! Limited offer!")
knn.train(ham: "Meeting tomorrow at 3pm")

# Using add (equivalent, kNN-specific terminology)
knn.add(spam: "Buy now! Limited offer!")
knn.add(ham: "Meeting tomorrow at 3pm")

# Dynamic train methods (like Bayes)
knn.train_spam "You've won a million dollars!"
knn.train_ham "Thanks for your email"

# Add multiple examples at once
knn.train(
  spam: ["Click here for free stuff", "Limited time offer!"],
  ham: ["Please review the document", "See you tomorrow"]
)

The train method is an alias for add, so they work identically. Use train when you want consistent code that works with any classifier type.

Classification

# Get the best category
result = knn.classify "Congratulations! Claim your prize!"
# => "spam"

# Get detailed results with neighbors
result = knn.classify_with_neighbors "Free money offer"

result[:category]    # => "spam"
result[:confidence]  # => 0.85
result[:neighbors]   # => [{item: "Buy now!...", category: "spam", similarity: 0.92}, ...]
result[:votes]       # => {"spam" => 2.0, "ham" => 1.0}

Distance-Weighted Voting

By default, each neighbor gets one vote. With weighted voting, closer neighbors have more influence:

knn = Classifier::KNN.new(k: 5, weighted: true)

knn.add(
  positive: ["Great product!", "Loved it!", "Excellent service"],
  negative: ["Terrible experience", "Would not recommend"]
)

# Closer neighbors influence the result more
knn.classify "This was amazing!"
# => "positive"

Updating the Classifier

# Add more examples anytime
knn.add(neutral: "It was okay, nothing special")

# Remove specific examples
knn.remove_item "Buy now! Limited offer!"

# Adjust k
knn.k = 7

# List all categories
knn.categories
# => ["spam", "ham", "neutral"]

When to Use kNN

Good for:

  • Small to medium datasets (<1000 examples)
  • When you need interpretable results (see which examples influenced the decision)
  • Incremental learning (easy to add/remove examples)
  • Multi-label classification

Not ideal for:

  • Large datasets (stores all examples, compares against all during classification)
  • When speed is critical
  • Very high-dimensional feature spaces

kNN vs Bayes vs LSI

Feature         kNN                  Bayes             LSI
Storage         All examples         Word counts only  All examples
Best size       <1000 examples       Any size          <1000 documents
Interpretable   Yes (see neighbors)  No                No
Speed           Slower               Very fast         Medium

Why the size difference? Bayes stores aggregate statistics—adding 10,000 documents just increments counters. kNN stores every example and compares against all of them during classification.
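
A conceptual sketch of that difference (not the actual internals of either classifier):

# Bayes-style: aggregate counts; memory grows with vocabulary, not corpus.
word_counts = Hash.new(0)
"Buy now! Limited offer!".split.each { |word| word_counts[word] += 1 }

# kNN-style: every example is retained; memory grows with the corpus,
# and classification compares the input against all stored examples.
stored_examples = []
stored_examples << { text: "Buy now! Limited offer!", category: :spam }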

Choosing k

  • Small k (3-5): More sensitive to noise, but captures local patterns
  • Large k (10+): More stable, but may miss subtle distinctions
  • Rule of thumb: Start with k = sqrt(n) where n is your dataset size
  • Odd k: Avoids ties in binary classification (see the sketch below)
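
Combining the last two rules (plain Ruby; the gem does not pick k for you, and n is your own count of stored examples):

n = 200                  # number of stored examples
k = Math.sqrt(n).round   # rule of thumb: k = sqrt(n), here 14
k += 1 if k.even?        # prefer odd k for binary classification
knn.k = k                # => 15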

Example: Product Review Classifier

knn = Classifier::KNN.new(k: 5, weighted: true)

# Add training examples
knn.add(
  positive: [
    "Amazing quality, exceeded expectations!",
    "Best purchase I've made this year",
    "Fast shipping and great customer service"
  ],
  negative: [
    "Complete waste of money",
    "Broke after one week",
    "Terrible quality, avoid"
  ],
  neutral: [
    "It's okay, nothing special",
    "Does what it says, no more no less",
    "Average product for the price"
  ]
)

# Classify with explanation
result = knn.classify_with_neighbors "Excellent product, highly recommend!"

puts "Category: #{result[:category]}"
puts "Confidence: #{(result[:confidence] * 100).round}%"
puts "Top neighbors:"
result[:neighbors].first(3).each do |n|
  puts "  - #{n[:category]}: #{n[:item][0..50]}... (#{(n[:similarity] * 100).round}%)"
end

Streaming & Batch Training

For larger datasets, use batch training with progress callbacks:

knn = Classifier::KNN.new(k: 5)

# Batch training with progress tracking
knn.train_batch(
  spam: spam_documents,
  ham: ham_documents,
  batch_size: 500
) do |progress|
  puts "#{progress.percent}% complete (#{progress.rate.round} docs/sec)"
end

For files too large to load into memory, stream line-by-line:

File.open('spam_corpus.txt', 'r') do |file|
  knn.train_from_stream(:spam, file, batch_size: 500) do |progress|
    puts "Processed #{progress.completed} lines"
  end
end

See the Streaming Training Tutorial for checkpoints and resumable training.

Persistence

Save and load your classifier:

# Save to file
knn.save_to_file("reviews_classifier.json")

# Load later
loaded = Classifier::KNN.load_from_file("reviews_classifier.json")

# Or use storage backends
knn.storage = Classifier::Storage::File.new(path: "classifier.json")
knn.save

API Consistency

kNN shares a consistent API with Bayes and LogisticRegression, making it easy to swap classifiers:

Method                        Bayes  LogisticRegression  kNN
train(cat: text)              ✓      ✓                   ✓
train_category(text)          ✓      ✓                   ✓
classify(text) → String      ✓      ✓                   ✓
categories → Array[String]   ✓      ✓                   ✓

This means you can write classifier-agnostic code:

def train_classifier(classifier, data)
  data.each { |category, texts| classifier.train(category => texts) }
end

def evaluate(classifier, test_cases)
  test_cases.count { |text, expected| classifier.classify(text) == expected }
end

# Works with any classifier
[Classifier::Bayes.new(:a, :b), Classifier::KNN.new, Classifier::LogisticRegression.new([:a, :b])].each do |c|
  train_classifier(c, training_data)
  puts evaluate(c, test_data)
end

Next Steps