intermediate
Duplicate Ticket Detector
Build a system to detect duplicate support tickets using LSI semantic similarity and TF-IDF weighting.
Duplicate Ticket Detector
Support teams waste hours on duplicate tickets. Build a detector that finds semantically similar tickets—even when they use different words to describe the same problem.
What You’ll Learn
- Combining LSI and TF-IDF for better matching
- Finding similar documents with confidence thresholds
- Building a practical support tool
Why LSI + TF-IDF?
- LSI finds semantic similarity (“can’t login” ≈ “authentication failed”)
- TF-IDF weights important terms higher (error codes, product names)
- Together they catch duplicates that keyword matching would miss
Project Setup
mkdir duplicate_detector && cd duplicate_detector
# Gemfile
source 'https://rubygems.org'
gem 'classifier'
The Duplicate Detector
Create duplicate_detector.rb:
require 'classifier'
require 'json'
class DuplicateDetector
SIMILARITY_THRESHOLD = 0.6 # 60% similar = likely duplicate
def initialize
@lsi = Classifier::LSI.new(auto_rebuild: false)
@tickets = {} # id => ticket data
@tfidf = Classifier::TFIDF.new(min_df: 1, sublinear_tf: true)
@corpus = []
end
# Add a resolved/existing ticket
def add_ticket(id, subject:, body:, status: 'open')
content = "#{subject} #{body}"
@tickets[id] = {
id: id,
subject: subject,
body: body,
status: status,
content: content
}
@corpus << content
@lsi.add_item(content, id)
end
# Rebuild index after batch additions
def rebuild_index
@lsi.build_index
@tfidf.fit(@corpus) if @corpus.any?
end
# Check if a new ticket is a duplicate
def find_duplicates(subject:, body:, limit: 5)
content = "#{subject} #{body}"
# Get LSI similarity scores
similar = @lsi.find_related(content, limit * 2)
return [] if similar.empty?
# Calculate combined scores with TF-IDF boost
new_vector = @tfidf.fitted? ? @tfidf.transform(content) : {}
results = similar.filter_map do |ticket_content|
ticket_id = @lsi.categories_for(ticket_content).first
ticket = @tickets[ticket_id]
next unless ticket
# Base LSI similarity
lsi_score = calculate_lsi_similarity(content, ticket_content)
# TF-IDF boost for shared important terms
tfidf_boost = calculate_tfidf_overlap(new_vector, ticket[:content])
# Combined score (weighted average)
combined_score = (lsi_score * 0.7) + (tfidf_boost * 0.3)
next if combined_score < SIMILARITY_THRESHOLD
{
ticket_id: ticket_id,
subject: ticket[:subject],
status: ticket[:status],
similarity: (combined_score * 100).round(1),
lsi_score: (lsi_score * 100).round(1),
tfidf_boost: (tfidf_boost * 100).round(1),
excerpt: ticket[:body][0..100]
}
end
results.sort_by { |r| -r[:similarity] }.first(limit)
end
# Quick check: is this likely a duplicate?
def duplicate?(subject:, body:, threshold: SIMILARITY_THRESHOLD)
duplicates = find_duplicates(subject: subject, body: body, limit: 1)
duplicates.any? && duplicates.first[:similarity] >= (threshold * 100)
end
def save(path)
rebuild_index
data = {
tickets: @tickets,
corpus: @corpus
}
File.write(path, data.to_json)
end
def self.load(path)
detector = new
data = JSON.parse(File.read(path), symbolize_names: true)
data[:tickets].each do |id, ticket|
detector.add_ticket(
id.to_s,
subject: ticket[:subject],
body: ticket[:body],
status: ticket[:status]
)
end
detector.rebuild_index
detector
end
private
def calculate_lsi_similarity(content1, content2)
# Use LSI's proximity calculation
related = @lsi.proximity_array_for_content(content1)
match = related.find { |item, _| item == content2 }
match ? match[1] : 0.0
end
def calculate_tfidf_overlap(vector1, content2)
return 0.0 unless @tfidf.fitted?
vector2 = @tfidf.transform(content2)
return 0.0 if vector1.empty? || vector2.empty?
# Cosine similarity of TF-IDF vectors
shared_terms = vector1.keys & vector2.keys
return 0.0 if shared_terms.empty?
dot_product = shared_terms.sum { |term| vector1[term] * vector2[term] }
dot_product # Already normalized
end
end
Loading Historical Tickets
Create seed_tickets.rb:
require_relative 'duplicate_detector'
detector = DuplicateDetector.new
# Historical support tickets
tickets = [
{
id: "TICK-001",
subject: "Cannot login to my account",
body: "I'm trying to login but it says invalid password. I've reset it twice already. Using Chrome on Mac.",
status: "resolved"
},
{
id: "TICK-002",
subject: "Payment failed",
body: "My credit card was charged but the order shows as failed. Transaction ID: TXN-12345.",
status: "resolved"
},
{
id: "TICK-003",
subject: "App crashes on startup",
body: "iOS app version 2.3.1 crashes immediately after splash screen. iPhone 14, iOS 17.",
status: "resolved"
},
{
id: "TICK-004",
subject: "Password reset not working",
body: "The reset password email never arrives. Checked spam folder. Email is correct.",
status: "resolved"
},
{
id: "TICK-005",
subject: "Charged twice for subscription",
body: "I see two charges on my credit card for this month's subscription. Please refund one.",
status: "resolved"
},
{
id: "TICK-006",
subject: "Mobile app force closes",
body: "Android app keeps crashing when I try to open it. Pixel 7, Android 14. Tried reinstalling.",
status: "open"
},
{
id: "TICK-007",
subject: "Can't access account - authentication error",
body: "Getting 'authentication failed' error when trying to sign in. Password is definitely correct.",
status: "open"
},
{
id: "TICK-008",
subject: "Export feature not working",
body: "When I click export to CSV, nothing happens. No download, no error. Firefox browser.",
status: "resolved"
},
{
id: "TICK-009",
subject: "Billing shows wrong amount",
body: "My invoice shows $99 but my plan is $49/month. Started this billing cycle.",
status: "open"
},
{
id: "TICK-010",
subject: "Two-factor authentication locked out",
body: "Lost my phone with authenticator app. Can't login now. Need to disable 2FA.",
status: "resolved"
},
]
tickets.each do |ticket|
detector.add_ticket(
ticket[:id],
subject: ticket[:subject],
body: ticket[:body],
status: ticket[:status]
)
puts "Added: #{ticket[:id]} - #{ticket[:subject]}"
end
detector.save('tickets.json')
puts "\nSaved #{tickets.length} tickets to tickets.json"
Checking for Duplicates
Create check_duplicate.rb:
require_relative 'duplicate_detector'
detector = DuplicateDetector.load('tickets.json')
# New incoming tickets to check
new_tickets = [
{
subject: "Login not working",
body: "I can't sign into my account. Says wrong password but I know it's right."
},
{
subject: "App won't open on iPhone",
body: "Just updated to iOS 17 and now the app crashes on launch. iPhone 14 Pro."
},
{
subject: "Need to update credit card",
body: "How do I change my payment method? Want to use a different card."
},
{
subject: "Double charged this month",
body: "Seeing duplicate subscription charges on my bank statement."
}
]
new_tickets.each do |ticket|
puts "=" * 60
puts "NEW TICKET: #{ticket[:subject]}"
puts "-" * 60
duplicates = detector.find_duplicates(
subject: ticket[:subject],
body: ticket[:body],
limit: 3
)
if duplicates.empty?
puts "✓ No duplicates found - this appears to be a new issue"
else
puts "⚠ Potential duplicates found:\n\n"
duplicates.each do |dup|
puts " #{dup[:ticket_id]} (#{dup[:similarity]}% similar)"
puts " Subject: #{dup[:subject]}"
puts " Status: #{dup[:status]}"
puts " Scores: LSI=#{dup[:lsi_score]}%, TF-IDF=#{dup[:tfidf_boost]}%"
puts " Preview: #{dup[:excerpt]}..."
puts
end
end
puts
end
Run it:
ruby seed_tickets.rb
ruby check_duplicate.rb
Output:
============================================================
NEW TICKET: Login not working
------------------------------------------------------------
⚠ Potential duplicates found:
TICK-007 (84.2% similar)
Subject: Can't access account - authentication error
Status: open
Scores: LSI=82.1%, TF-IDF=89.3%
Preview: Getting 'authentication failed' error when trying to sign in...
TICK-001 (76.8% similar)
Subject: Cannot login to my account
Status: resolved
Scores: LSI=78.4%, TF-IDF=72.1%
Preview: I'm trying to login but it says invalid password...
============================================================
NEW TICKET: App won't open on iPhone
------------------------------------------------------------
⚠ Potential duplicates found:
TICK-003 (89.1% similar)
Subject: App crashes on startup
Status: resolved
Scores: LSI=91.2%, TF-IDF=84.5%
Preview: iOS app version 2.3.1 crashes immediately after splash screen...
Integration with Support System
class SupportTicket
attr_accessor :subject, :body, :duplicate_of
def self.detector
@detector ||= DuplicateDetector.load('tickets.json')
end
def check_duplicates
self.class.detector.find_duplicates(
subject: subject,
body: body
)
end
def likely_duplicate?
duplicates = check_duplicates
duplicates.any? && duplicates.first[:similarity] > 75
end
def auto_link_duplicate!
duplicates = check_duplicates
if duplicates.any? && duplicates.first[:similarity] > 85
self.duplicate_of = duplicates.first[:ticket_id]
true
else
false
end
end
end
# In your ticket creation flow:
ticket = SupportTicket.new
ticket.subject = params[:subject]
ticket.body = params[:body]
if ticket.likely_duplicate?
# Show warning to user or agent
duplicates = ticket.check_duplicates
flash[:warning] = "This may be a duplicate of #{duplicates.first[:ticket_id]}"
end
Tuning the Detector
Adjust thresholds based on your needs:
# More aggressive duplicate detection (fewer false negatives)
SIMILARITY_THRESHOLD = 0.5 # 50%
# More conservative (fewer false positives)
SIMILARITY_THRESHOLD = 0.75 # 75%
# Adjust LSI vs TF-IDF weighting
combined_score = (lsi_score * 0.5) + (tfidf_boost * 0.5) # Equal weight
combined_score = (lsi_score * 0.9) + (tfidf_boost * 0.1) # Mostly semantic
Best Practices
- Rebuild index periodically: After adding many tickets, call
rebuild_index - Include resolved tickets: They help catch duplicates of known issues
- Tune thresholds: Start conservative (75%+), lower if missing duplicates
- Human review: Use as a suggestion tool, not automatic merging
Next Steps
- LSI Basics - Deep dive into semantic similarity
- TF-IDF Guide - Understanding term weighting