TF-IDF Basics
Transform text into weighted feature vectors with Term Frequency-Inverse Document Frequency.
TF-IDF (Term Frequency-Inverse Document Frequency) transforms text into numerical feature vectors. It’s the foundation for most classic text classification and is useful for feature extraction, document similarity, and search.
How It Works
TF-IDF combines two metrics:
- Term Frequency (TF): How often a word appears in a document
- Inverse Document Frequency (IDF): How rare a word is across all documents
Words that appear frequently in one document but rarely in others get high scores. Common words like “the” get low scores because they appear everywhere.
TF-IDF = TF × IDF
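As a rough illustration of the arithmetic, here is the textbook formula worked out by hand (the exact IDF variant and smoothing the library applies may differ):
# Hand-rolled illustration of the textbook formula; the exact IDF variant
# (smoothing, log base) used by the gem is an assumption here.
def tf_idf(term_count, doc_length, num_docs, docs_with_term)
  tf  = term_count.to_f / doc_length             # how often the term appears in this document
  idf = Math.log(num_docs.to_f / docs_with_term) # rarer terms get a larger IDF
  tf * idf
end
# "dog": 2 of 10 words in a document, found in 1 of 3 documents
tf_idf(2, 10, 3, 1) # => 0.2 * 1.099 ≈ 0.22
# "the": 3 of 10 words, but found in all 3 documents
tf_idf(3, 10, 3, 3) # => 0.3 * 0.0 = 0.0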
Creating a Vectorizer
require 'classifier'
tfidf = Classifier::TFIDF.new
Fitting and Transforming
The vectorizer needs to learn the vocabulary from your corpus first:
# Fit: learn vocabulary and IDF weights
tfidf.fit([
  "Dogs are great pets",
  "Cats are independent",
  "Birds can fly"
])
# Transform: convert new text to TF-IDF vector
vector = tfidf.transform("Dogs are loyal")
# => {:dog=>0.7071..., :loyal=>0.7071...}
# Fit and transform in one step
vectors = tfidf.fit_transform(documents)
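fit_transform returns one vector per input document, in the same order as the input array (the same shape used in the similarity example later on this page):
vectors = tfidf.fit_transform([
  "Dogs are great pets",
  "Cats are independent"
])
vectors.length # => 2, one TF-IDF hash per document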
Understanding the Output
The transform method returns a hash mapping stemmed terms to their TF-IDF weights:
vector = tfidf.transform("Dogs are loyal pets")
# => {:dog=>0.5, :loyal=>0.7, :pet=>0.5}
- Keys are stemmed words (e.g., “dogs” → :dog)
- Values are L2-normalized TF-IDF weights
- Common words (stopwords) are filtered out
- The vector magnitude is always 1.0
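Since the weights are L2-normalized, you can verify the unit magnitude yourself:
vector = tfidf.transform("Dogs are loyal pets")
Math.sqrt(vector.values.sum { |v| v**2 }).round(4) # => 1.0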
Configuration Options
Vocabulary Filtering
Filter terms by how often they appear across documents:
tfidf = Classifier::TFIDF.new(
  min_df: 2,   # Must appear in at least 2 documents
  max_df: 0.95 # Must appear in at most 95% of documents
)
Use integers for absolute counts, floats for proportions:
min_df: 5 # At least 5 documents
min_df: 0.01 # At least 1% of documents
max_df: 100 # At most 100 documents
max_df: 0.90 # At most 90% of documents
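For example, with min_df: 2 any term that occurs in only one document should be dropped from the learned vocabulary (a sketch assuming Porter-style stemming as in the examples above):
tfidf = Classifier::TFIDF.new(min_df: 2)
tfidf.fit([
  "ruby programming language",
  "python programming language",
  "dogs are pets"
])
# Only terms found in at least 2 documents survive,
# e.g. :program and :languag; single-document terms are pruned.
tfidf.vocabulary.keys # => e.g. [:program, :languag]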
Sublinear TF Scaling
Use logarithmic term frequency to reduce the impact of very frequent terms:
tfidf = Classifier::TFIDF.new(sublinear_tf: true)
# Uses 1 + log(tf) instead of raw tf
This helps when a word that appears 10 times in a document shouldn't be treated as 10 times more important than a word that appears once.
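For example (natural log assumed here), a 10-fold jump in raw count only adds a constant to the weight:
1 + Math.log(1)  # => 1.0
1 + Math.log(10) # => ~3.3, far less than 10x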
N-grams
Extract word pairs (bigrams) or longer sequences:
# Unigrams and bigrams
tfidf = Classifier::TFIDF.new(ngram_range: [1, 2])
tfidf.fit(["quick brown fox", "lazy brown dog"])
tfidf.vocabulary.keys
# => [:quick, :brown, :fox, :lazi, :dog, :quick_brown, :brown_fox, :lazi_brown, :brown_dog]
# Bigrams only
tfidf = Classifier::TFIDF.new(ngram_range: [2, 2])
# Unigrams through trigrams
tfidf = Classifier::TFIDF.new(ngram_range: [1, 3])
Inspecting the Vectorizer
tfidf.fit(documents)
tfidf.vocabulary # => {:dog=>0, :cat=>1, :bird=>2, ...}
tfidf.idf # => {:dog=>1.405, :cat=>1.405, ...}
tfidf.feature_names # => [:dog, :cat, :bird, ...] (in index order)
tfidf.num_documents # => 3
tfidf.fitted? # => true
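Because vocabulary maps each term to a column index, you can densify a sparse vector when a downstream tool expects fixed-length arrays. The helper below is a convenience sketch, not a method provided by the gem:
# Build a fixed-length dense array from a sparse TF-IDF hash
def to_dense(tfidf, sparse_vector)
  dense = Array.new(tfidf.vocabulary.size, 0.0)
  sparse_vector.each do |term, weight|
    index = tfidf.vocabulary[term]
    dense[index] = weight if index # skip out-of-vocabulary terms
  end
  dense
end
to_dense(tfidf, tfidf.transform("Dogs are loyal"))
# => [0.7071..., 0.0, 0.0, ...]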
Streaming from Files
For large corpora that don’t fit in memory, fit from a file stream:
tfidf = Classifier::TFIDF.new
# Fit vocabulary from stream (one document per line)
File.open('corpus.txt', 'r') do |file|
  tfidf.fit_from_stream(file, batch_size: 1000) do |progress|
    puts "Processed #{progress.completed} documents (#{progress.rate.round}/sec)"
  end
end
# Now transform new documents
vector = tfidf.transform("new document text")
The streaming API processes the file line-by-line, building the vocabulary and IDF weights without loading the entire corpus into memory.
See the Streaming Training Tutorial for more details on streaming and progress tracking.
When to Use TF-IDF
Good for:
- Feature extraction for machine learning
- Document similarity and search
- Keyword extraction
- Text preprocessing for other classifiers
Not ideal for:
- When word order matters (use n-grams or other methods)
- Very short texts (tweets, titles)
- When you need semantic understanding (use LSI instead)
Example: Document Similarity
tfidf = Classifier::TFIDF.new
documents = [
  "Ruby is a programming language",
  "Python is also a programming language",
  "Dogs are great pets",
  "Cats are independent animals"
]
vectors = tfidf.fit_transform(documents)
# Calculate cosine similarity between documents
# Because the vectors are L2-normalized, the dot product over shared
# terms is already the cosine similarity.
def cosine_similarity(v1, v2)
  shared_keys = v1.keys & v2.keys
  return 0.0 if shared_keys.empty?
  shared_keys.sum { |k| v1[k] * v2[k] }
end
# Compare first two documents (both about programming)
similarity = cosine_similarity(vectors[0], vectors[1])
# => ~0.7 (high similarity)
# Compare programming doc with pets doc
similarity = cosine_similarity(vectors[0], vectors[2])
# => ~0.0 (no similarity)
Example: Keyword Extraction
tfidf = Classifier::TFIDF.new(sublinear_tf: true)
# Fit on your corpus
tfidf.fit(all_documents)
# Extract keywords from a specific document
vector = tfidf.transform(target_document)
# Top 5 keywords by TF-IDF weight
keywords = vector.sort_by { |_, weight| -weight }.first(5).map(&:first)
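To keep the weights alongside the terms for display, sort the hash and take the top pairs:
vector.sort_by { |_, weight| -weight }.first(5).each do |term, weight|
  puts "#{term}: #{weight.round(3)}"
end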
Serialization
Save and load your fitted vectorizer:
# Save to JSON
json = tfidf.to_json
File.write("vectorizer.json", json)
# Load from JSON
loaded = Classifier::TFIDF.from_json(File.read("vectorizer.json"))
loaded.transform("new document")
# Or use Marshal
data = Marshal.dump(tfidf)
loaded = Marshal.load(data)
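A quick sanity check for either round trip is to compare vectors from the original and the reloaded vectorizer:
original = tfidf.transform("dogs are loyal")
restored = loaded.transform("dogs are loyal")
original == restored # => true if the fitted state survived intact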
Using with Other Classifiers
TF-IDF vectors can be used as features for other classifiers:
# Extract features
tfidf = Classifier::TFIDF.new(min_df: 2, sublinear_tf: true)
tfidf.fit(training_documents)
# Use vectors as input to your classifier
training_vectors = training_documents.map { |doc| tfidf.transform(doc) }
test_vector = tfidf.transform(new_document)
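As one illustration (a sketch, not part of the library's API), a nearest-neighbour classifier can work directly on these sparse hashes, because the dot product of two L2-normalized vectors is their cosine similarity. The labels array below is hypothetical:
labels = [:tech, :tech, :pets] # one hypothetical label per training document
def predict(test_vector, training_vectors, labels)
  scores = training_vectors.map do |vec|
    shared = test_vector.keys & vec.keys
    shared.sum { |k| test_vector[k] * vec[k] } # dot product = cosine similarity here
  end
  labels[scores.index(scores.max)]
end
predict(test_vector, training_vectors, labels) # => label of the most similar training document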
Next Steps
- Streaming Training - Train on large datasets with progress tracking
- LSI Basics - Semantic analysis using SVD
- Persistence - Save and load fitted vectorizers