LSI Advanced

Incremental LSI, SVD tuning, and advanced patterns for high-performance semantic indexing.

This guide covers advanced LSI features for production use: incremental updates, SVD tuning, and performance optimization.

Incremental LSI

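Standard LSI recomputes the entire SVD whenever you add documents, which is expensive for large indices. Incremental mode uses Brand’s algorithm to add a document in O(k²) time instead of the O(mn²) of a full rebuild, where m and n are the dimensions of the term-document matrix and k is the number of retained dimensions: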

lsi = Classifier::LSI.new(incremental: true)

# Add initial documents and build the index
lsi.add(tech: ["Ruby is elegant", "Python is popular"])
lsi.build_index

# These use Brand's algorithm—no full rebuild
lsi.add(tech: "Go is fast")
lsi.add(tech: "Rust is safe")

After the first build_index, new documents are projected onto the existing semantic space and the SVD is updated incrementally.

When to Use Incremental Mode

Good for:

Streaming data (logs, feeds, user content)
Growing document collections
Real-time indexing requirements
Memory-constrained environments

Not ideal for:

Small, static document sets (full SVD is fast enough)
Documents that change the vocabulary significantly
Workloads that need maximum precision

How It Works

Brand’s algorithm maintains the U matrix (left singular vectors) from the SVD. When a new document arrives:

Project: Compute how the document maps to existing topics
Residual: Find the component orthogonal to known topics
Update: If there’s a new direction, grow the rank; otherwise, update in place
Truncate: Keep only the top-k singular values

This avoids recomputing the full SVD, which makes adds roughly 400x faster on large indices (see the timings under Performance Comparison).
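The project and residual steps can be sketched with plain arrays. This is an illustration of the idea only, not the library's implementation: the helper names (dot, norm, project_with_residual) and the toy basis are assumptions made for the example.

```ruby
# Illustrative only: plain-array linear algebra, not the gem's internals.

# Dot product of two equal-length vectors.
def dot(a, b)
  a.zip(b).sum { |x, y| x * y }
end

# Euclidean length of a vector.
def norm(v)
  Math.sqrt(dot(v, v))
end

# Project `doc` onto the orthonormal vectors in `basis` and return the
# coefficients plus the residual that is orthogonal to that basis.
def project_with_residual(basis, doc)
  coeffs = basis.map { |u| dot(u, doc) }
  projection = Array.new(doc.size, 0.0)
  basis.each_with_index do |u, i|
    u.each_with_index { |value, j| projection[j] += coeffs[i] * value }
  end
  residual = doc.zip(projection).map { |d, q| d - q }
  [coeffs, residual]
end

# Two known "topics" spanning the first two axes of a 3-term vocabulary.
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

_, inside = project_with_residual(basis, [2.0, 1.0, 0.0])
puts norm(inside)   # zero residual: the document fits the known space

_, outside = project_with_residual(basis, [1.0, 0.0, 3.0])
puts norm(outside)  # non-zero residual: a new direction, so rank would grow
```

A residual norm near zero corresponds to the "update in place" case; a clearly non-zero norm corresponds to growing the rank before truncating back to the top k singular values.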

Checking Incremental Status

lsi = Classifier::LSI.new(incremental: true, auto_rebuild: false)
lsi.add(dogs: ["Dogs bark", "Puppies play"])

lsi.incremental_enabled?  # => false (not yet built)

lsi.build_index

lsi.incremental_enabled?  # => true (ready for incremental adds)
lsi.current_rank          # => 2 (number of semantic dimensions)

Controlling SVD Rank

The max_rank parameter limits how many semantic dimensions to keep:

# Keep at most 50 dimensions
lsi = Classifier::LSI.new(incremental: true, max_rank: 50)

Lower rank means:

Faster operations
Less memory
More aggressive dimensionality reduction (may lose nuance)

Higher rank means:

Better precision
More memory
Slower incremental updates
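
To get a feel for the memory side of this trade-off, here is a rough estimate of the U matrix size at different ranks. It assumes dense storage at 8 bytes per value, which may not match how the library stores its matrices; u_matrix_megabytes is a hypothetical helper, and the 5,000-term vocabulary is just an example figure.

```ruby
BYTES_PER_FLOAT = 8 # assumption: dense double-precision storage

# Approximate size of a terms x rank matrix in megabytes.
def u_matrix_megabytes(terms, rank)
  terms * rank * BYTES_PER_FLOAT / (1024.0 * 1024.0)
end

[25, 50, 100, 200].each do |rank|
  puts format('max_rank %3d: about %.1f MB for 5,000 terms',
              rank, u_matrix_megabytes(5_000, rank))
end
```

The estimate grows linearly with rank, so halving max_rank roughly halves the extra memory that incremental mode needs for U.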

Inspecting Singular Values

Use singular_value_spectrum to understand your semantic space:

lsi.build_index

spectrum = lsi.singular_value_spectrum
spectrum.each do |entry|
  puts "Dim #{entry[:dimension]}: #{(entry[:cumulative_percentage] * 100).round}% variance"
end

# Find how many dimensions capture 90% of variance
dims_90 = spectrum.find_index { |e| e[:cumulative_percentage] >= 0.90 }
puts "#{dims_90 + 1} dimensions capture 90% of variance"

This helps you tune max_rank: if 20 dimensions capture 95% of the variance, setting max_rank: 25 leaves a small margin with minimal overhead.
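
That rule of thumb can be wrapped in a small helper. The sketch below is hypothetical (suggest_max_rank, target, and headroom are not part of the library) and runs on a made-up spectrum with the same :dimension and :cumulative_percentage keys used above.

```ruby
# Hypothetical helper: choose a max_rank from a spectrum shaped like the
# output of singular_value_spectrum.
def suggest_max_rank(spectrum, target: 0.95, headroom: 5)
  index = spectrum.find_index { |e| e[:cumulative_percentage] >= target }
  # If nothing reaches the target, fall back to every available dimension.
  return spectrum.size if index.nil?

  [index + 1 + headroom, spectrum.size].min
end

# Made-up cumulative values for illustration; real ones come from your index.
cumulative = [0.40, 0.62, 0.78, 0.88, 0.93, 0.96, 0.98, 0.99, 1.0]
sample = cumulative.each_with_index.map do |value, i|
  { dimension: i, cumulative_percentage: value }
end

# Six dimensions reach 96% here, so with two spare the suggestion is 8.
puts suggest_max_rank(sample, target: 0.95, headroom: 2)
```

The nil check matters: find_index returns nil when no entry reaches the target, which would otherwise raise on the arithmetic that follows.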

Mode Management

Enabling Incremental Mode Later

Start without incremental mode, then enable it:

lsi = Classifier::LSI.new(auto_rebuild: false)

# Bulk load
documents.each { |doc| lsi.add_item(doc, :category) }
lsi.build_index

# Switch to incremental for future adds
lsi.enable_incremental_mode!(max_rank: 100)
lsi.build_index(force: true)  # Rebuild to capture U matrix

# Now adds are incremental
lsi.add(category: "New document")

Disabling Incremental Mode

If classification quality degrades, switch back to full rebuilds:

lsi.disable_incremental_mode!

# Next add triggers full SVD
lsi.add(category: "Document requiring full rebuild")

Vocabulary Growth

Incremental mode automatically falls back to a full rebuild when the vocabulary grows by more than 20%. This prevents quality degradation from too many out-of-vocabulary terms:

lsi = Classifier::LSI.new(incremental: true, auto_rebuild: false)
lsi.add(tech: ["Ruby code", "Python code"])
lsi.build_index

lsi.incremental_enabled?  # => true

# Add document with many new words
lsi.add(tech: "Quantum computing uses qubits for superposition entanglement")

# Vocabulary grew significantly—fell back to full rebuild
lsi.incremental_enabled?  # => false

Re-enable with enable_incremental_mode! and build_index(force: true) if needed.
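
If you want to anticipate a fallback before adding a document, a growth check along these lines may help. How the library actually measures vocabulary growth is not shown here, so treat this as an assumption: vocabulary_growth is a hypothetical helper that works on pre-tokenized words and ignores any stemming or stop-word filtering the library may apply.

```ruby
require 'set'

GROWTH_LIMIT = 0.20 # the 20% threshold described above

# Hypothetical check: share of unseen terms relative to the vocabulary
# captured at the last full build.
def vocabulary_growth(known_terms, document_terms)
  return 0.0 if known_terms.empty?

  new_terms = document_terms.to_set - known_terms.to_set
  new_terms.size.to_f / known_terms.size
end

known = %w[ruby code python]
incoming = %w[quantum computing uses qubits code]

growth = vocabulary_growth(known, incoming)
puts format('growth: %.0f%%', growth * 100)
puts 'expect a full rebuild' if growth > GROWTH_LIMIT
```

With a tiny starting vocabulary, as in the example above, almost any new document crosses the threshold; larger corpora absorb new terms far more easily.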

Streaming with Incremental Mode

Combine streaming ingestion with incremental updates for live data:

lsi = Classifier::LSI.new(incremental: true, auto_rebuild: false)

# Initial corpus
File.open('initial_corpus.txt') do |file|
  lsi.train_from_stream(:documents, file, batch_size: 1000)
end
lsi.build_index

# Process live stream incrementally
live_feed.each do |message|
  lsi.add(documents: message.text)

  # Classify in real-time
  category = lsi.classify(message.text)
  route_message(message, category)
end

Periodic Full Rebuilds

For long-running systems, schedule periodic full rebuilds to maintain quality:

class LSIManager
  def initialize
    @lsi = Classifier::LSI.new(incremental: true)
    @adds_since_rebuild = 0
  end

  def add_document(text, category)
    @lsi.add(category => text)
    @adds_since_rebuild += 1

    # Full rebuild every 10,000 documents
    if @adds_since_rebuild >= 10_000
      rebuild!
    end
  end

  def rebuild!
    @lsi.disable_incremental_mode!
    @lsi.build_index(force: true)
    @lsi.enable_incremental_mode!
    @adds_since_rebuild = 0
  end
end

Build Index Cutoff

The cutoff parameter controls what fraction of the singular values to keep during SVD:

# Keep top 50% of singular values (more aggressive reduction)
lsi.build_index(0.50)

# Keep top 90% of singular values (preserve more detail)
lsi.build_index(0.90)

# Default is 0.75
lsi.build_index

A lower cutoff keeps fewer dimensions, which is faster but less precise.
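
To make the cutoff concrete, here is one plausible reading of "keep the top fraction of singular values", applied to a toy list. The rounding rule and the truncate_singular_values helper are assumptions for illustration; the library's exact behavior may differ.

```ruby
# Illustration only: keep the largest `cutoff` fraction of the values
# and zero out the rest.
def truncate_singular_values(values, cutoff = 0.75)
  keep = [(values.size * cutoff).round, 1].max
  threshold = values.sort.reverse[keep - 1]
  # Anything smaller than the smallest kept value is dropped.
  values.map { |v| v >= threshold ? v : 0.0 }
end

singular_values = [9.0, 5.0, 3.0, 1.0]

p truncate_singular_values(singular_values, 0.50) # keeps 9.0 and 5.0
p truncate_singular_values(singular_values, 0.90) # 3.6 rounds to 4, keeps all
p truncate_singular_values(singular_values)       # default 0.75 keeps the top three
```

Zeroing a singular value removes that dimension from the semantic space, which is why a lower cutoff trades precision for speed.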

Performance Comparison

Operation	Standard LSI	Incremental LSI
Initial build	O(mn²)	O(mn²)
Add document	O(mn²) rebuild	O(k²) update
Memory	Term-doc matrix	Term-doc + U matrix
Classification	Same	Same
Search	Same	Same

For a 10,000-document index with 5,000 terms and k=100:

Standard add: ~250ms (full SVD)
Incremental add: ~0.6ms (Brand’s update)
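
As a quick sanity check on those figures, the arithmetic below derives the speedup and the total time for a burst of adds. The helper names are hypothetical, and the per-add timings are simply the approximate numbers quoted above, so results on your hardware will differ.

```ruby
# How many times faster one add is than the other.
def speedup(standard_ms, incremental_ms)
  standard_ms / incremental_ms
end

# Total wall-clock seconds for `adds` operations at `per_add_ms` each.
def total_seconds(per_add_ms, adds)
  per_add_ms * adds / 1000.0
end

standard_ms = 250.0
incremental_ms = 0.6

puts format('speedup: about %dx', speedup(standard_ms, incremental_ms).round)
puts format('10,000 standard adds: %.0f s', total_seconds(standard_ms, 10_000))
puts format('10,000 incremental adds: %.0f s', total_seconds(incremental_ms, 10_000))
```

At these rates, 10,000 standard adds take about 42 minutes of SVD time, while the same adds in incremental mode take about 6 seconds.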

Best Practices

Start with full SVD: Build your initial index without incremental mode for best quality
Enable incremental for growth: Switch to incremental mode after the initial build
Monitor quality: Track classification accuracy; rebuild if it degrades
Tune max_rank: Use singular_value_spectrum to find the right balance
Handle vocabulary growth: Expect automatic fallbacks when content changes significantly
Schedule rebuilds: For production systems, rebuild periodically (daily/weekly)
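
For the "monitor quality" practice, one option is a sliding-window accuracy tracker fed with labelled samples. AccuracyMonitor is a hypothetical class, not part of the library, and the window size and threshold are placeholder values to tune for your data.

```ruby
# Hypothetical monitor: tracks accuracy over the most recent labelled
# samples and flags when a full rebuild looks worthwhile.
class AccuracyMonitor
  def initialize(window: 500, threshold: 0.85)
    @window = window
    @threshold = threshold
    @results = []
  end

  # Record whether a prediction matched the known label.
  def record(predicted, expected)
    @results << (predicted == expected)
    @results.shift if @results.size > @window
  end

  def accuracy
    return 1.0 if @results.empty?

    @results.count(true).to_f / @results.size
  end

  # Only recommend a rebuild once the window is full, to avoid noise.
  def rebuild_recommended?
    @results.size >= @window && accuracy < @threshold
  end
end

monitor = AccuracyMonitor.new(window: 4, threshold: 0.75)
samples = [[:tech, :tech], [:tech, :sports], [:sports, :sports], [:tech, :sports]]
samples.each { |predicted, expected| monitor.record(predicted, expected) }

puts monitor.accuracy             # 2 of 4 predictions were correct
puts monitor.rebuild_recommended? # accuracy is below the 0.75 threshold
```

In practice you would call record with the result of lsi.classify on documents whose category you already know, then trigger a full rebuild when rebuild_recommended? turns true.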

Next Steps

LSI Basics - Core LSI concepts and API
Streaming Training - Process large datasets efficiently
Persistence - Save and load trained indices