Topic Discovery with TF-IDF and LSI
Automatically discover topics in unlabeled documents using TF-IDF, then use LSI for semantic classification of new content.
You have thousands of documents but no categories. How do you organize them? This tutorial shows how to use TF-IDF to discover natural topics in your corpus, then feed those topics into LSI for semantic classification.
What You’ll Learn
- Extracting topic signatures using TF-IDF
- Clustering documents by similarity
- Building an LSI index from discovered categories
- Classifying new documents into discovered topics
The Pipeline
Unlabeled Corpus → TF-IDF → Topic Clusters → LSI → Semantic Classification
- TF-IDF identifies which terms make each document distinctive
- Clustering groups similar documents into topics
- LSI learns the semantic relationships between topics
- New documents get classified by semantic similarity
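In code, the full workflow reads like this (a preview of the TopicDiscoverer class built below; corpus stands in for any array of strings):
discoverer = TopicDiscoverer.new
discoverer.add_documents(corpus)           # unlabeled strings
discoverer.discover_topics(num_topics: 4)  # TF-IDF weighting + clustering
result = discoverer.classify("new text")   # LSI-backed classification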
Project Setup
mkdir topic_discovery && cd topic_discovery
# Gemfile
source 'https://rubygems.org'
gem 'classifier'
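Then install the gem:
bundle install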
The Topic Discoverer
Create topic_discoverer.rb:
require 'classifier'
require 'json'
class TopicDiscoverer
attr_reader :topics, :topic_documents
def initialize(min_cluster_size: 2, similarity_threshold: 0.3)
@tfidf = Classifier::TFIDF.new(min_df: 2, sublinear_tf: true)
@lsi = Classifier::LSI.new(auto_rebuild: false)
@min_cluster_size = min_cluster_size
@similarity_threshold = similarity_threshold
@documents = []
@vectors = []
@topics = {} # topic_name => [doc_indices]
@topic_documents = {} # topic_name => [documents]
end
# Step 1: Add documents to analyze
def add_documents(docs)
@documents.concat(docs)
end
# Step 2: Discover topics from the corpus
def discover_topics(num_topics: 5)
return if @documents.empty?
# Fit TF-IDF on entire corpus
@vectors = @tfidf.fit_transform(@documents)
# Find cluster centers using k-means-like approach
clusters = cluster_documents(num_topics)
# Name topics based on top terms
clusters.each do |doc_indices|
next if doc_indices.length < @min_cluster_size
# Get top terms for this cluster
topic_name = generate_topic_name(doc_indices)
@topics[topic_name] = doc_indices
@topic_documents[topic_name] = doc_indices.map { |idx| @documents[idx] }
# Add documents to LSI with topic as category
doc_indices.each do |idx|
@lsi.add_item(@documents[idx], topic_name)
end
end
# Build LSI index
@lsi.build_index
@topics
end
# Step 3: Classify new documents
def classify(text)
return nil if @topics.empty?
# Get LSI classification with confidence (returns [category, confidence])
category, confidence = @lsi.classify_with_confidence(text)
return nil unless category
{
topic: category,
confidence: (confidence * 100).round(1),
sample_docs: @topic_documents[category]&.first(2)
}
end
# Get detailed topic info
def topic_summary
@topics.map do |name, indices|
top_terms = extract_top_terms(indices, 5)
{
name: name,
document_count: indices.length,
top_terms: top_terms,
sample: @documents[indices.first][0..100] + "..."
}
end
end
def save(path)
@lsi.build_index unless @topics.empty?
data = {
documents: @documents,
topics: @topics,
topic_documents: @topic_documents
}
File.write("#{path}.json", data.to_json)
File.write("#{path}.tfidf", @tfidf.to_json)
end
def self.load(path)
discoverer = new
data = JSON.parse(File.read("#{path}.json"), symbolize_names: true)
discoverer.instance_variable_set(:@documents, data[:documents])
discoverer.instance_variable_set(:@topics, data[:topics].transform_keys(&:to_s))
discoverer.instance_variable_set(:@topic_documents, data[:topic_documents].transform_keys(&:to_s))
discoverer.instance_variable_set(:@tfidf, Classifier::TFIDF.from_json(File.read("#{path}.tfidf")))
# Rebuild LSI from saved data
lsi = Classifier::LSI.new(auto_rebuild: false)
data[:topics].each do |topic_name, indices|
indices.each do |idx|
lsi.add_item(data[:documents][idx], topic_name.to_s)
end
end
lsi.build_index
discoverer.instance_variable_set(:@lsi, lsi)
discoverer
end
private
def cluster_documents(k)
return [] if @vectors.empty?
# Initialize cluster centers randomly
indices = (0...@documents.length).to_a.shuffle
centers = indices.first(k).map { |i| @vectors[i] }
clusters = Array.new(k) { [] }
# Simple k-means iteration
3.times do
# Assign documents to nearest cluster
clusters = Array.new(k) { [] }
@vectors.each_with_index do |vec, idx|
best_cluster = 0
best_similarity = -1
centers.each_with_index do |center, cluster_idx|
sim = cosine_similarity(vec, center)
if sim > best_similarity
best_similarity = sim
best_cluster = cluster_idx
end
end
clusters[best_cluster] << idx if best_similarity >= @similarity_threshold
end
# Update centers; keep a cluster's previous center if it emptied out
centers = clusters.each_with_index.map do |doc_indices, i|
next centers[i] if doc_indices.empty?
centroid(doc_indices.map { |j| @vectors[j] })
end
end
clusters
end
def generate_topic_name(doc_indices)
top_terms = extract_top_terms(doc_indices, 3)
top_terms.join("-")
end
def extract_top_terms(doc_indices, n)
# Aggregate TF-IDF scores across cluster
term_scores = Hash.new(0.0)
doc_indices.each do |idx|
@vectors[idx].each do |term, score|
term_scores[term] += score
end
end
# Return top n terms
term_scores
.sort_by { |_, score| -score }
.first(n)
.map { |term, _| term.to_s }
end
# Alternative confidence measure based on raw TF-IDF similarity;
# not called by #classify above, which uses LSI's built-in confidence
def calculate_confidence(text, topic)
vector = @tfidf.transform(text)
return 0.0 if vector.empty?
# Average similarity to documents in this topic
topic_vectors = @topics[topic].map { |i| @vectors[i] }
return 0.0 if topic_vectors.empty?
similarities = topic_vectors.map { |tv| cosine_similarity(vector, tv) }
(similarities.sum / similarities.length * 100).round(1)
end
# Dot product over shared terms; this equals true cosine similarity only
# when both vectors are L2-normalized (centroids are; the TF-IDF vectors
# are assumed to be)
def cosine_similarity(v1, v2)
shared = v1.keys & v2.keys
return 0.0 if shared.empty?
shared.sum { |k| v1[k] * v2[k] }
end
def centroid(vectors)
return {} if vectors.empty?
result = Hash.new(0.0)
vectors.each do |vec|
vec.each { |term, score| result[term] += score }
end
# Normalize
magnitude = Math.sqrt(result.values.sum { |v| v * v })
return result if magnitude.zero?
result.transform_values { |v| v / magnitude }
end
end
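The private helpers operate on sparse hash vectors (term => weight). Here is a standalone check of the similarity math, assuming both vectors are L2-normalized, as centroid guarantees for cluster centers:
a = { ruby: 0.8, rails: 0.6 }  # magnitude 1.0
b = { ruby: 0.6, python: 0.8 } # magnitude 1.0
shared = a.keys & b.keys
puts shared.sum { |k| a[k] * b[k] } # => 0.48, the cosine similarity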
Discovering Topics
Create discover.rb:
require_relative 'topic_discoverer'
discoverer = TopicDiscoverer.new(min_cluster_size: 2, similarity_threshold: 0.2)
# Sample corpus - unlabeled documents
documents = [
# Technology cluster
"Ruby on Rails is a web framework for building applications quickly",
"Python Django provides rapid web development with clean design",
"JavaScript React creates interactive user interfaces",
"Node.js enables server-side JavaScript programming",
"TypeScript adds static typing to JavaScript projects",
# Finance cluster
"Stock market indices reached record highs today",
"Investment portfolios should be diversified across sectors",
"Bond yields are inversely related to prices",
"Cryptocurrency trading volumes increased sharply",
"Interest rates affect borrowing costs for businesses",
# Health cluster
"Regular exercise improves cardiovascular health",
"Nutrition plays a key role in disease prevention",
"Sleep quality affects cognitive function and memory",
"Meditation reduces stress and anxiety levels",
"Vaccines provide immunity against infectious diseases",
# Sports cluster
"The championship game drew millions of viewers",
"Team training focuses on strength and conditioning",
"Players signed multi-year contracts worth millions",
"The tournament bracket was released yesterday",
"Coaches emphasized defensive strategies",
]
discoverer.add_documents(documents)
topics = discoverer.discover_topics(num_topics: 4)
puts "Discovered #{topics.length} topics:\n\n"
discoverer.topic_summary.each do |summary|
puts "Topic: #{summary[:name]}"
puts " Documents: #{summary[:document_count]}"
puts " Key terms: #{summary[:top_terms].join(', ')}"
puts " Sample: \"#{summary[:sample]}\""
puts
end
discoverer.save('topics')
puts "Saved to topics.json"
Run it:
ruby discover.rb
Output (results vary based on random initialization):
Discovered 4 topics:
Topic: web-for-provid
Documents: 3
Key terms: web, for, provid
Sample: "Ruby on Rails is a web framework for building applications quickly..."
Topic: javascript
Documents: 3
Key terms: javascript
Sample: "JavaScript React creates interactive user interfaces..."
Topic: diseas-provid
Documents: 2
Key terms: diseas, provid
Sample: "Nutrition plays a key role in disease prevention..."
Topic: million
Documents: 2
Key terms: million
Sample: "The championship game drew millions of viewers..."
Saved to topics.json
Note: Topic names are generated from stemmed terms (e.g., “provid” from “provide/provides”). With small corpora, clustering may produce mixed or incomplete topics; larger datasets with more shared vocabulary yield better results.
Classifying New Documents
Create classify.rb:
require_relative 'topic_discoverer'
discoverer = TopicDiscoverer.load('topics')
new_documents = [
"Learning Vue.js for frontend web development",
"Portfolio rebalancing strategies for retirement",
"Marathon training requires proper hydration",
"The playoffs start next week with home advantage",
"Machine learning models require large datasets",
]
puts "Classifying new documents:\n\n"
new_documents.each do |doc|
puts "Document: \"#{doc}\""
result = discoverer.classify(doc)
if result.nil?
puts " No matching topic found"
else
puts " → #{result[:topic]} (#{result[:confidence]}% confidence)"
end
puts
end
Output:
Classifying new documents:
Document: "Learning Vue.js for frontend web development"
→ web-for-provid (100.0% confidence)
Document: "Portfolio rebalancing strategies for retirement"
→ web-for-provid (100.0% confidence)
Document: "Marathon training requires proper hydration"
No matching topic found
Document: "The playoffs start next week with home advantage"
No matching topic found
Note: Classification quality depends on topic coverage. Documents using vocabulary outside the training corpus may not match any topic. This is expected with small training sets.
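One practical mitigation is a confidence cutoff with a holding bucket, so low-confidence documents can seed a future discovery run. A minimal sketch (classify_or_queue and the 30% cutoff are hypothetical choices, not part of the class above):
def classify_or_queue(discoverer, text, pending, min_confidence: 30.0)
  result = discoverer.classify(text)
  if result.nil? || result[:confidence] < min_confidence
    pending << text # revisit once enough new documents accumulate
    nil
  else
    result
  end
end
Re-run discover_topics once the pending list is large enough to form clusters of its own.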
Refining Topics
Sometimes automatic discovery needs guidance. You can seed topics with example documents:
class TopicDiscoverer
# Add seed documents to guide topic formation
def seed_topic(name, documents)
@documents.concat(documents)
# Pre-assign these to the named topic
@topics[name] ||= []
start_idx = @documents.length - documents.length
documents.length.times do |i|
@topics[name] << (start_idx + i)
end
end
end
# Usage
discoverer = TopicDiscoverer.new
discoverer.seed_topic("machine-learning", [
"Neural networks learn patterns from training data",
"Deep learning models require GPU acceleration",
])
discoverer.add_documents(other_documents)
discoverer.discover_topics(num_topics: 5)
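Note that discover_topics only indexes the clusters it finds, so seeded assignments need one extra step to reach the LSI index. A minimal sketch (index_seeds is a hypothetical helper, not part of the class above):
class TopicDiscoverer
  # Register pre-assigned (seeded) documents with LSI so they take part
  # in classification alongside discovered clusters. Call it between
  # seed_topic and discover_topics; the final build_index then covers both.
  def index_seeds
    @topics.each do |name, indices|
      indices.each { |idx| @lsi.add_item(@documents[idx], name) }
    end
  end
end
Seeded documents may still land in a discovered cluster as well; exclude their indices from cluster_documents if you need strict separation.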
Hierarchical Topics
For large corpora, discover topics at multiple levels:
class HierarchicalDiscoverer
def initialize
@root = TopicDiscoverer.new
@subtopics = {}
end
def discover(documents, levels: 2, topics_per_level: 4)
# First level: broad topics
@root.add_documents(documents)
@root.discover_topics(num_topics: topics_per_level)
return if levels < 2
# Second level: subtopics within each broad topic
@root.topic_documents.each do |topic, docs|
next if docs.length < topics_per_level * 2
sub = TopicDiscoverer.new(min_cluster_size: 2)
sub.add_documents(docs)
sub.discover_topics(num_topics: topics_per_level)
@subtopics[topic] = sub
end
end
def classify(text)
# Classify at root level
root_result = @root.classify(text)
return nil unless root_result
# Check for subtopic
if @subtopics[root_result[:topic]]
sub_result = @subtopics[root_result[:topic]].classify(text)
return {
topic: root_result[:topic],
subtopic: sub_result&.dig(:topic),
confidence: root_result[:confidence]
}
end
root_result
end
end
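Usage mirrors the flat discoverer (documents is any array of strings; the subtopic key is present only when the broad topic had enough documents for a second pass):
hier = HierarchicalDiscoverer.new
hier.discover(documents, levels: 2, topics_per_level: 4)
result = hier.classify("Bond yields fell after the rate announcement")
puts "#{result[:topic]} / #{result[:subtopic] || 'no subtopic'}" if result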
Integration Example
Use discovered topics to organize a document library:
class DocumentLibrary
def initialize
@discoverer = TopicDiscoverer.load('topics')
@documents = {} # id => {content:, topic:, ...}
end
def add(id, content, metadata = {})
# Auto-classify
classification = @discoverer.classify(content)
@documents[id] = {
content: content,
topic: classification&.dig(:topic) || "uncategorized",
confidence: classification&.dig(:confidence) || 0,
metadata: metadata
}
end
def browse_by_topic
@documents.group_by { |_, doc| doc[:topic] }
end
def find_similar(id)
doc = @documents[id]
return {} unless doc
# Other documents assigned to the same topic
@documents.select do |other_id, other|
other_id != id && other[:topic] == doc[:topic]
end
end
end
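A quick usage sketch (the IDs and metadata are arbitrary; topics.json must already exist from the discovery step):
library = DocumentLibrary.new
library.add(1, "Kubernetes simplifies container orchestration", source: "blog")
library.add(2, "Dividend stocks provide steady income")
library.browse_by_topic.each do |topic, docs|
  puts "#{topic}: #{docs.length} document(s)"
end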
Tips for Better Topics
- Clean your corpus: Remove boilerplate, headers, footers
- Tune min_df: Higher values (3-5) filter rare noise terms for cleaner topics (see the sketch after this list)
- Adjust cluster count: Start with fewer topics, increase if too broad
- Review and merge: Some topics may need manual merging
- Iterate: Re-run discovery after adding more documents
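min_df is hardcoded in the constructor above; a small change exposes it for tuning (the min_df: keyword argument is our addition, not part of the original class):
class TopicDiscoverer
  def initialize(min_cluster_size: 2, similarity_threshold: 0.3, min_df: 2)
    @tfidf = Classifier::TFIDF.new(min_df: min_df, sublinear_tf: true)
    @lsi = Classifier::LSI.new(auto_rebuild: false)
    @min_cluster_size = min_cluster_size
    @similarity_threshold = similarity_threshold
    @documents = []
    @vectors = []
    @topics = {}
    @topic_documents = {}
  end
end

# Stricter term filtering for a larger, noisier corpus
discoverer = TopicDiscoverer.new(min_df: 4)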
When to Use This Approach
Good for:
- Organizing large document collections
- Discovering themes in user feedback
- Building taxonomy from scratch
- Content recommendation systems
Consider alternatives when:
- You already have well-defined categories (use Bayes)
- Documents are very short (tweets, titles)
- You need real-time classification of streaming data
Next Steps
- TF-IDF Basics - Understanding term weighting
- LSI Basics - Semantic similarity deep dive
- Duplicate Detector - Combine TF-IDF + LSI for similarity