beginner
Keyword Extractor
Build a keyword extraction tool that identifies the most important terms in documents using TF-IDF weighting.
Build a tool that automatically extracts the most important keywords from documents. TF-IDF identifies terms that are distinctive to each document, which makes it a good fit for SEO, content tagging, and document summarization.
What You’ll Learn
- Using TF-IDF for keyword extraction
- Comparing documents by their key terms
- Building a practical content analysis tool
Why TF-IDF for Keywords?
TF-IDF naturally surfaces important terms:
- High TF-IDF: Words that appear often in this document but rarely in others
- Low TF-IDF: Common words that appear everywhere (filtered out)
This means “machine learning” in a tech article gets a high score, while “the” and “is” get near zero.
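To make the scoring concrete, here is a minimal plain-Ruby sketch of one common textbook form of TF-IDF. It is an illustration only: the helper names `tokenize` and `tfidf_scores` are hypothetical, and the classifier gem also stems words and may apply different smoothing or normalization, so its numbers will differ.

```ruby
# Minimal TF-IDF sketch (illustrative; not the classifier gem's exact formula).

# Split text into lowercase word tokens.
def tokenize(text)
  text.downcase.scan(/[a-z]+/)
end

# Score each term in `document` against `corpus` (an array of strings).
# tf  = term count / total terms in the document
# idf = ln(number of documents / number of documents containing the term)
def tfidf_scores(document, corpus)
  tokens = tokenize(document)
  corpus_terms = corpus.map { |doc| tokenize(doc).uniq }

  tokens.tally.to_h do |term, count|
    tf = count.to_f / tokens.length
    df = corpus_terms.count { |terms| terms.include?(term) }
    idf = df.zero? ? 0.0 : Math.log(corpus.length.to_f / df)
    [term, (tf * idf).round(4)]
  end
end

corpus = [
  "the cat sat on the mat",
  "the dog chased the ball",
  "the model uses machine learning"
]

scores = tfidf_scores(corpus[2], corpus)
scores.sort_by { |_, s| -s }.each { |term, s| puts "#{term.ljust(10)} #{s}" }
```

Because "the" occurs in every document, its IDF is ln(1) = 0 and its score drops to zero, while words found in only one document keep a positive score.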
Project Setup
mkdir keyword_extractor && cd keyword_extractor
# Gemfile
source 'https://rubygems.org'
gem 'classifier'
Install the dependency:
bundle install
The Keyword Extractor
Create keyword_extractor.rb:
require 'classifier'
require 'json'
class KeywordExtractor
def initialize(corpus = [])
@tfidf = Classifier::TFIDF.new(
min_df: 1,
max_df: 0.85, # Ignore terms in >85% of docs
sublinear_tf: true # Dampen high-frequency terms
)
@corpus = corpus
@fitted = false
end
# Learn vocabulary from a corpus
def fit(documents)
@corpus = documents
@tfidf.fit(documents)
@fitted = true
self
end
# Add documents to corpus and refit
def add_documents(documents)
@corpus += Array(documents)
@tfidf.fit(@corpus)
@fitted = true # refitting makes the extractor usable even if fit() was never called
self
end
# Extract top keywords from a document
def extract(document, top_n: 10)
ensure_fitted!
vector = @tfidf.transform(document)
return [] if vector.empty?
vector
.sort_by { |_, score| -score }
.first(top_n)
.map { |term, score| { term: term, score: score.round(4) } }
end
# Extract keywords with context (surrounding text)
def extract_with_context(document, top_n: 10)
keywords = extract(document, top_n: top_n)
keywords.map do |kw|
term = kw[:term].to_s
# Find the term in the original document
context = find_context(document, term)
kw.merge(context: context)
end
end
# Compare two documents by their keywords
def compare(doc1, doc2, top_n: 10)
kw1 = extract(doc1, top_n: top_n).map { |k| k[:term] }
kw2 = extract(doc2, top_n: top_n).map { |k| k[:term] }
shared = kw1 & kw2
union = kw1 | kw2
{
shared: shared,
unique_to_first: kw1 - kw2,
unique_to_second: kw2 - kw1,
# Jaccard similarity: shared keywords / all distinct keywords (0.0 when both lists are empty)
similarity: union.empty? ? 0.0 : shared.length.to_f / union.length
}
end
# Generate a keyword cloud (term => weight)
def keyword_cloud(document, top_n: 20)
keywords = extract(document, top_n: top_n)
return {} if keywords.empty?
# Normalize scores to 1-10 scale for cloud sizing
max_score = keywords.first[:score]
min_score = keywords.last[:score]
range = max_score - min_score
keywords.to_h do |kw|
weight = range.zero? ? 5 : ((kw[:score] - min_score) / range * 9 + 1).round
[kw[:term], weight]
end
end
# Find documents similar to a query based on keyword overlap
def find_similar(query, top_n: 5)
ensure_fitted!
query_vector = @tfidf.transform(query)
return [] if query_vector.empty?
similarities = @corpus.map.with_index do |doc, idx|
doc_vector = @tfidf.transform(doc)
sim = cosine_similarity(query_vector, doc_vector)
{ index: idx, document: doc[0..100], similarity: sim.round(4) }
end
similarities
.sort_by { |s| -s[:similarity] }
.reject { |s| s[:similarity] < 0.1 }
.first(top_n)
end
def vocabulary_size
@tfidf.vocabulary.size
end
def save(path)
@tfidf.storage = Classifier::Storage::File.new(path: path)
@tfidf.save
File.write("#{path}.corpus", @corpus.to_json)
end
def self.load(path)
extractor = new
storage = Classifier::Storage::File.new(path: path)
extractor.instance_variable_set(:@tfidf, Classifier::TFIDF.load(storage: storage))
extractor.instance_variable_set(:@corpus, JSON.parse(File.read("#{path}.corpus")))
extractor.instance_variable_set(:@fitted, true)
extractor
end
private
def ensure_fitted!
raise "Must call fit() with a corpus first" unless @fitted
end
def find_context(document, term)
# Find a sentence or phrase containing the term
sentences = document.split(/[.!?]+/)
match = sentences.find { |s| s.downcase.include?(term.downcase) }
match&.strip&.slice(0, 100)
end
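def cosine_similarity(vec1, vec2)
shared = vec1.keys & vec2.keys
return 0.0 if shared.empty?
dot = shared.sum { |k| vec1[k] * vec2[k] }
# Divide by both vector lengths so the result is a cosine even if the vectors are not L2-normalized
norm1 = Math.sqrt(vec1.values.sum { |v| v * v })
norm2 = Math.sqrt(vec2.values.sum { |v| v * v })
return 0.0 if norm1.zero? || norm2.zero?
dot / (norm1 * norm2)
end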
end
Building a Corpus
Create train.rb:
require_relative 'keyword_extractor'
# Sample corpus - in production, load from your database
corpus = [
"Ruby is a dynamic programming language focused on simplicity and productivity. It has an elegant syntax that is natural to read and easy to write.",
"Python is a high-level programming language known for its clear syntax and readability. It's widely used in data science, machine learning, and web development.",
"JavaScript is the language of the web, running in browsers and on servers via Node.js. Modern frameworks like React and Vue have made it even more popular.",
"Machine learning algorithms learn from data to make predictions. Deep learning uses neural networks with many layers to learn complex patterns.",
"Web development involves creating websites and web applications. Frontend focuses on user interface while backend handles server logic and databases.",
"Database management systems store and retrieve data efficiently. SQL databases use structured query language while NoSQL databases offer flexible schemas.",
"Cloud computing provides on-demand computing resources over the internet. AWS, Google Cloud, and Azure are the major cloud providers.",
"DevOps practices combine development and operations to improve deployment speed and reliability. CI/CD pipelines automate testing and deployment.",
"Agile methodology emphasizes iterative development and collaboration. Scrum and Kanban are popular frameworks for managing agile projects.",
"API design focuses on creating clear, consistent interfaces for software components. REST and GraphQL are common approaches for web APIs.",
]
extractor = KeywordExtractor.new
extractor.fit(corpus)
extractor.save('extractor.json')
puts "Trained on #{corpus.length} documents"
puts "Vocabulary size: #{extractor.vocabulary_size} terms"
Extracting Keywords
Create extract.rb:
require_relative 'keyword_extractor'
extractor = KeywordExtractor.load('extractor.json')
# Test document
document = <<~DOC
Building machine learning models with Python has become increasingly popular.
Libraries like TensorFlow and PyTorch make it easy to create neural networks
for deep learning applications. Data scientists use these tools for everything
from image recognition to natural language processing.
DOC
puts "=" * 60
puts "KEYWORD EXTRACTION"
puts "=" * 60
puts "\nDocument:"
puts document
puts
# Extract keywords
puts "Top Keywords:"
puts "-" * 40
keywords = extractor.extract(document, top_n: 10)
keywords.each.with_index(1) do |kw, i|
puts "#{i.to_s.rjust(2)}. #{kw[:term].to_s.ljust(20)} (score: #{kw[:score]})"
end
# Keywords with context
puts "\nKeywords in Context:"
puts "-" * 40
extractor.extract_with_context(document, top_n: 5).each do |kw|
puts "#{kw[:term]}: \"...#{kw[:context]}...\""
puts
end
# Keyword cloud
puts "Keyword Cloud (term => size 1-10):"
puts "-" * 40
cloud = extractor.keyword_cloud(document)
cloud.each { |term, size| puts " #{'█' * size} #{term}" }
Run it:
ruby train.rb
ruby extract.rb
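Sample output (abbreviated). Terms are stemmed, so “machine” appears as “machin” and “libraries” as “librari”: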
============================================================
KEYWORD EXTRACTION
============================================================
Document:
Building machine learning models with Python has become...
Top Keywords:
----------------------------------------
1. learn (score: 0.3842)
2. machin (score: 0.3842)
3. python (score: 0.3156)
4. neural (score: 0.2891)
5. deep (score: 0.2891)
6. data (score: 0.2234)
7. network (score: 0.2156)
8. librari (score: 0.1987)
9. model (score: 0.1876)
10. process (score: 0.1654)
Keyword Cloud (term => size 1-10):
----------------------------------------
██████████ learn
██████████ machin
████████ python
███████ neural
███████ deep
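The bar lengths in the cloud come from min-max scaling: the highest-scoring keyword maps to 10, the lowest to 1, and everything else falls in between. A standalone sketch of that scaling, assuming made-up scores (the method name `cloud_weights` and the sample values are hypothetical, not output from the gem):

```ruby
# Map raw keyword scores onto integer sizes 1..10 using min-max scaling.
def cloud_weights(scored_terms)
  return {} if scored_terms.empty?

  min, max = scored_terms.values.minmax
  range = max - min

  scored_terms.to_h do |term, score|
    # When every score is identical there is no spread, so use a middle size.
    weight = range.zero? ? 5 : ((score - min) / range * 9 + 1).round
    [term, weight]
  end
end

# Hypothetical scores for illustration only.
sample = { "learn" => 0.40, "python" => 0.30, "data" => 0.20, "model" => 0.10 }
cloud_weights(sample).each { |term, size| puts "#{'#' * size} #{term}" }
```

Note that the sizes depend on the lowest score among the requested keywords, so changing top_n changes the scale for every term.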
Comparing Documents
Create compare.rb:
require_relative 'keyword_extractor'
extractor = KeywordExtractor.load('extractor.json')
doc1 = "Ruby on Rails is a web framework that makes it easy to build database-backed web applications following the MVC pattern."
doc2 = "Django is a Python web framework that encourages rapid development. It follows the model-template-view architectural pattern."
doc3 = "Machine learning models can predict outcomes based on historical data. Training requires large datasets and significant computing power."
puts "=" * 60
puts "DOCUMENT COMPARISON"
puts "=" * 60
puts "\nDocument 1: #{doc1[0..60]}..."
puts "Document 2: #{doc2[0..60]}..."
puts
comparison = extractor.compare(doc1, doc2)
puts "Comparison Results:"
puts "-" * 40
puts "Shared keywords: #{comparison[:shared].join(', ')}"
puts "Unique to Doc 1: #{comparison[:unique_to_first].join(', ')}"
puts "Unique to Doc 2: #{comparison[:unique_to_second].join(', ')}"
puts "Similarity: #{(comparison[:similarity] * 100).round(1)}%"
puts "\n" + "=" * 60
puts "\nComparing Doc 1 vs Doc 3 (very different topics):"
comparison2 = extractor.compare(doc1, doc3)
puts "Shared keywords: #{comparison2[:shared].join(', ').then { |s| s.empty? ? '(none)' : s }}"
puts "Similarity: #{(comparison2[:similarity] * 100).round(1)}%"
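The similarity percentage printed by these scripts is the share of distinct keywords that both documents have in common, often called the Jaccard index. A small worked sketch, assuming hypothetical stemmed keyword lists (the method name `jaccard` and the lists are illustrative, not real extractor output):

```ruby
# Jaccard similarity between two keyword lists: |A ∩ B| / |A ∪ B|.
def jaccard(list1, list2)
  union = list1 | list2
  return 0.0 if union.empty? # avoid 0/0 when neither document yields keywords

  (list1 & list2).length.to_f / union.length
end

# Hypothetical keyword lists for two web-framework descriptions.
rails_keywords  = %i[web framework rubi rail databas]
django_keywords = %i[web framework python django pattern]

puts "Similarity: #{(jaccard(rails_keywords, django_keywords) * 100).round(1)}%"
```

Two shared terms out of eight distinct ones gives 25%, which shows why even closely related documents rarely score high on this measure when top_n is small.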
SEO Keyword Analyzer
class SEOAnalyzer
def initialize(extractor)
@extractor = extractor
end
def analyze(content, target_keywords: [])
extracted = @extractor.extract(content, top_n: 20)
extracted_terms = extracted.map { |k| k[:term].to_s }
{
top_keywords: extracted.first(10),
target_keyword_coverage: check_coverage(extracted_terms, target_keywords),
keyword_density: calculate_density(content, extracted_terms),
suggestions: generate_suggestions(extracted_terms, target_keywords)
}
end
private
def check_coverage(extracted, targets)
targets.map do |target|
stem = target.downcase.stem
found = extracted.any? { |e| e.to_s.include?(stem) || stem.include?(e.to_s) }
{ keyword: target, found: found }
end
end
def calculate_density(content, keywords)
words = content.downcase.split(/\W+/)
total = words.length
keywords.first(5).to_h do |kw|
count = words.count { |w| w.stem == kw.to_s }
[kw, (count.to_f / total * 100).round(2)]
end
end
def generate_suggestions(extracted, targets)
# Reuse check_coverage so suggestions always agree with the coverage report
missing = check_coverage(extracted, targets).reject { |c| c[:found] }
missing.map { |c| "Consider adding more content about '#{c[:keyword]}'" }
end
end
# Usage (assumes `extractor` is a fitted KeywordExtractor and `article_content` holds the text to analyze)
analyzer = SEOAnalyzer.new(extractor)
result = analyzer.analyze(
article_content,
target_keywords: ["ruby", "web development", "rails", "tutorial"]
)
puts "Target keyword coverage:"
result[:target_keyword_coverage].each do |kw|
status = kw[:found] ? "✓" : "✗"
puts " #{status} #{kw[:keyword]}"
end
Integration Example
# In a Rails app
class Article < ApplicationRecord
after_save :extract_keywords
def self.extractor
@extractor ||= KeywordExtractor.load('extractor.json')
end
def extract_keywords
keywords = self.class.extractor.extract("#{title} #{body}", top_n: 10)
# Assumes `keywords` is an array or JSON column; store terms as plain strings
update_column(:keywords, keywords.map { |k| k[:term].to_s })
end
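def similar_articles(limit: 5)
text = "#{title} #{body}"
results = self.class.extractor.find_similar(text, top_n: limit + 1)
# r[:index] is a position in the extractor's corpus, not an Article id,
# so exclude this article by comparing the stored text preview instead
results.reject { |r| r[:document] == text[0..100] }.first(limit)
end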
end
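find_similar reports positions in the corpus array rather than database ids. One way to turn those positions back into records is to keep a parallel list of ids in the same order as the corpus. The sketch below assumes such a list exists; `similar_record_ids`, `corpus_ids`, and the sample values are hypothetical and not part of the classifier gem or of Rails.

```ruby
# Translate find_similar results (corpus positions) into record ids.
# corpus_ids[i] must be the id of the record whose text is corpus[i].
def similar_record_ids(results, corpus_ids, exclude_id:, limit: 5)
  results
    .map { |r| { id: corpus_ids[r[:index]], similarity: r[:similarity] } }
    .reject { |r| r[:id] == exclude_id } # drop the article being compared
    .first(limit)
end

# Hypothetical data: three corpus documents and their article ids.
corpus_ids = [101, 205, 309]
results = [
  { index: 1, similarity: 0.91 },
  { index: 0, similarity: 0.42 },
  { index: 2, similarity: 0.15 }
]

p similar_record_ids(results, corpus_ids, exclude_id: 205)
```

In a Rails app the returned ids could then be passed to a query such as Article.where(id: ids), provided the id list is saved alongside the corpus whenever the extractor is retrained.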
N-gram Keywords
Extract multi-word phrases by passing an ngram_range that includes bigrams to Classifier::TFIDF:
class KeywordExtractor
def initialize(corpus = [], ngram_range: [1, 1])
@tfidf = Classifier::TFIDF.new(
min_df: 1,
max_df: 0.85,
sublinear_tf: true,
ngram_range: ngram_range # [1, 2] for unigrams + bigrams
)
@corpus = corpus
@fitted = false
end
# ... rest of class
end
# Enable bigrams for phrase extraction
extractor = KeywordExtractor.new(ngram_range: [1, 2])
extractor.fit(corpus)
# Now extracts phrases like:
# - machine_learning
# - deep_learning
# - neural_network
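For intuition about what an n-gram range produces, here is a plain-Ruby sketch that builds unigrams and bigrams from a token list. The underscore join mirrors the phrase format listed above; the method name `ngrams` is hypothetical, and the gem's own tokenizer (stemming, stop-word removal) is assumed to run before this step and may change the exact terms.

```ruby
# Build all n-grams for n in min_n..max_n, joining words with underscores.
def ngrams(tokens, min_n, max_n)
  (min_n..max_n).flat_map do |n|
    tokens.each_cons(n).map { |gram| gram.join("_") }
  end
end

tokens = %w[deep learning uses neural networks]
p ngrams(tokens, 1, 2)
```

With a range of 1 to 2 the result contains every single word followed by every adjacent pair, which is why the vocabulary grows quickly once bigrams are enabled.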
Best Practices
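- Build a domain-specific corpus: Scores are relative to the corpus, so a term that stands out in general text may be unremarkable in a corpus about the same topic
- Tune min_df/max_df: Raise min_df to drop rare typos and lower max_df to drop overly common terms
- Use sublinear_tf: Dampens repeated terms so a word that appears 10 times does not count 10 times as much
- Consider n-grams: Bigrams capture phrases like “machine learning”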
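The two tuning knobs above can be sketched in plain Ruby. This is an illustration under assumptions, not the gem's implementation: it treats min_df as an absolute document count and max_df as a fraction of the corpus (matching the values used in this tutorial), and it uses the common 1 + ln(count) form of sublinear scaling. The method names `filter_vocabulary` and `sublinear_tf` are hypothetical.

```ruby
# Keep only terms whose document frequency lies between min_df (a count)
# and max_df (a fraction of the number of documents).
def filter_vocabulary(tokenized_docs, min_df:, max_df:)
  doc_freq = Hash.new(0)
  tokenized_docs.each { |tokens| tokens.uniq.each { |t| doc_freq[t] += 1 } }

  limit = max_df * tokenized_docs.length
  doc_freq.select { |_, df| df >= min_df && df <= limit }.keys
end

# Sublinear TF: replace a raw count with 1 + ln(count).
def sublinear_tf(count)
  count.zero? ? 0.0 : 1.0 + Math.log(count)
end

docs = [
  %w[the ruby gem],
  %w[the python package],
  %w[the ruby typo_xyz],
  %w[the cloud]
]

p filter_vocabulary(docs, min_df: 2, max_df: 0.85)
puts sublinear_tf(1).round(4)
puts sublinear_tf(10).round(4)
```

Here "the" is dropped for appearing in every document and the one-off terms are dropped by min_df, while a count of 10 is dampened to a weight of roughly 3.3 instead of 10.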
Next Steps
- TF-IDF Guide - Deep dive into TF-IDF
- LSI Basics - Semantic analysis for related content
- Duplicate Detector Tutorial - Combine TF-IDF with LSI