Code Snippet Classifier
Build a classifier that identifies programming languages and detects code patterns (tests, APIs, data processing). Uses LSI to understand code structure semantically, not just through keywords.
What You’ll Learn
- Tokenizing code for classification
- Training on programming language patterns
- Multi-level classification (language + purpose)
- Building a practical code analysis tool
Why This Works
Code has recognizable patterns:
- Syntax markers: def, function, fn, and func all mean "function definition"
- Structural patterns: indentation, brackets, semicolons
- Domain vocabulary: describe, it, and expect signal tests
- Import patterns: require, import, use, include
LSI captures these patterns semantically, so it recognizes Ruby even without seeing def if the overall structure matches.
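For example, once trained (see train.rb below), a snippet containing no def at all can still come back as Ruby, because tokens like ruby_keyword_do, ruby_keyword_end, and ruby_keyword_puts co-occur with the Ruby training samples:
# Hypothetical check after training:
classifier.detect_language("items.each do |item|\n  puts item\nend")
# => likely "ruby" - recognized from do/end/puts tokens, not from def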
Project Setup
mkdir code_classifier && cd code_classifier
# Gemfile
source 'https://rubygems.org'
gem 'classifier'
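Then install the dependency:
bundle install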
Code Tokenizer
Create code_tokenizer.rb:
# Custom tokenizer for source code
class CodeTokenizer
# Patterns that identify languages/constructs
SYNTAX_PATTERNS = {
# Function definitions
ruby_def: /\bdef\s+\w+/,
python_def: /\bdef\s+\w+\s*\(/,
js_function: /\bfunction\s+\w+|const\s+\w+\s*=.*=>/,
go_func: /\bfunc\s+\w+/,
rust_fn: /\bfn\s+\w+/,
# Class/type definitions
class_def: /\bclass\s+[A-Z]\w*/,
struct_def: /\bstruct\s+\w+/,
interface_def: /\binterface\s+\w+/,
# Control flow
if_statement: /\bif\s+/,
for_loop: /\bfor\s+/,
while_loop: /\bwhile\s+/,
match_case: /\bmatch\s+|\bcase\s+/,
# Imports
require_stmt: /\brequire\s+['"]|require_relative/,
import_stmt: /\bimport\s+/,
use_stmt: /\buse\s+/,
include_stmt: /\binclude\s+/,
# Testing
test_describe: /\bdescribe\s+['"]|RSpec\.describe/,
test_it: /\bit\s+['"]|test\s+['"]/,
test_expect: /\bexpect\(|assert[A-Z_]/,
# Type annotations
type_annotation: /:\s*\w+\s*[,\)=]|<\w+>/,
# Comments
line_comment: /#\s|\/\//,
block_comment: /\/\*|"""|'''/,
}
# Language-specific keywords
LANGUAGE_KEYWORDS = {
ruby: %w[end do elsif unless yield puts attr_accessor attr_reader module],
python: %w[elif pass lambda self __init__ print None True False],
javascript: %w[const let var async await null undefined console],
typescript: %w[interface type enum namespace readonly private public],
go: %w[package defer chan goroutine make nil fmt],
rust: %w[let mut impl pub mod crate unsafe Option Result Some None],
java: %w[public private static void extends implements throws final],
cpp: %w[include namespace std cout cin template virtual override],
}
def initialize(code)
@code = code
@tokens = []
end
def tokenize
extract_syntax_patterns
extract_keywords
extract_operators
extract_structure_features
@tokens.join(' ')
end
private
def extract_syntax_patterns
SYNTAX_PATTERNS.each do |name, pattern|
count = @code.scan(pattern).length
count.times { @tokens << name.to_s } if count > 0
end
end
def extract_keywords
words = @code.downcase.scan(/\b[a-z_][a-z0-9_]*\b/)
LANGUAGE_KEYWORDS.each do |lang, keywords|
keywords.each do |kw|
if words.include?(kw)
@tokens << "#{lang}_keyword_#{kw}"
@tokens << "lang_#{lang}"
end
end
end
end
def extract_operators
# Significant operators by language
@tokens << 'op_arrow' if @code.include?('=>') || @code.include?('->')
@tokens << 'op_rocket' if @code.include?('<=>')
@tokens << 'op_pipe' if @code.match?(/\|>|\|/)
@tokens << 'op_double_colon' if @code.include?('::')
@tokens << 'op_triple_equals' if @code.include?('===')
@tokens << 'op_spread' if @code.include?('...')
@tokens << 'op_null_coalesce' if @code.match?(/\?\?|&\./)
end
def extract_structure_features
lines = @code.split("\n")
# Indentation style
if lines.any? { |l| l.start_with?('  ') && !l.start_with?('    ') }
@tokens << 'indent_2space'
elsif lines.any? { |l| l.start_with?('    ') }
@tokens << 'indent_4space'
elsif lines.any? { |l| l.start_with?("\t") }
@tokens << 'indent_tab'
end
# Bracket style
@tokens << 'bracket_curly' if @code.include?('{')
@tokens << 'bracket_significant_whitespace' unless @code.include?('{') || @code.include?(';')
# Line endings
@tokens << 'semicolon_terminated' if @code.count(';') > lines.length / 2
end
end
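To sanity-check the tokenizer in isolation, feed it a small snippet. With the patterns above, a short Ruby method should produce something like the following (note that both ruby_def and python_def fire; the keyword and structure tokens are what tip the balance toward Ruby):
require_relative 'code_tokenizer'
tokenizer = CodeTokenizer.new("def add(a, b)\n  a + b\nend")
puts tokenizer.tokenize
# => "ruby_def python_def ruby_keyword_end lang_ruby indent_2space bracket_significant_whitespace"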
The Code Classifier
Create code_classifier.rb:
require 'classifier'
require 'json'
require_relative 'code_tokenizer'
class CodeClassifier
def initialize
@language_lsi = Classifier::LSI.new(auto_rebuild: false)
@purpose_knn = Classifier::KNN.new(k: 3, weighted: true)
@languages = []
@purposes = []
end
# Train language detection
def train_language(language, code_samples)
@languages << language.to_s unless @languages.include?(language.to_s)
Array(code_samples).each do |code|
tokenized = CodeTokenizer.new(code).tokenize
@language_lsi.add_item(tokenized, language.to_s)
end
end
# Train purpose detection
def train_purpose(purpose, code_samples)
@purposes << purpose.to_s unless @purposes.include?(purpose.to_s)
Array(code_samples).each do |code|
tokenized = CodeTokenizer.new(code).tokenize
@purpose_knn.add(purpose.to_sym => tokenized)
end
end
def build_index
@language_lsi.build_index
end
# Classify a code snippet
def classify(code)
tokenized = CodeTokenizer.new(code).tokenize
language = @language_lsi.classify(tokenized)
lang_confidence = calculate_language_confidence(tokenized)
purpose_result = @purpose_knn.classify_with_neighbors(tokenized)
purpose = purpose_result[:category]
purpose_confidence = (purpose_result[:confidence] * 100).round(1)
{
language: {
detected: language,
confidence: lang_confidence,
alternatives: get_language_alternatives(tokenized)
},
purpose: {
detected: purpose,
confidence: purpose_confidence
},
tokens_used: tokenized.split.uniq.first(10)
}
end
# Quick language detection
def detect_language(code)
tokenized = CodeTokenizer.new(code).tokenize
@language_lsi.classify(tokenized)
end
# Quick purpose detection
def detect_purpose(code)
tokenized = CodeTokenizer.new(code).tokenize
@purpose_knn.classify(tokenized)
end
def save(path)
build_index
@language_lsi.storage = Classifier::Storage::File.new(path: "#{path}_language.json")
@language_lsi.save
@purpose_knn.storage = Classifier::Storage::File.new(path: "#{path}_purpose.json")
@purpose_knn.save
File.write("#{path}_meta.json", { languages: @languages, purposes: @purposes }.to_json)
end
def self.load(path)
classifier = new
lsi_storage = Classifier::Storage::File.new(path: "#{path}_language.json")
knn_storage = Classifier::Storage::File.new(path: "#{path}_purpose.json")
classifier.instance_variable_set(:@language_lsi, Classifier::LSI.load(storage: lsi_storage))
classifier.instance_variable_set(:@purpose_knn, Classifier::KNN.load(storage: knn_storage))
meta = JSON.parse(File.read("#{path}_meta.json"), symbolize_names: true)
classifier.instance_variable_set(:@languages, meta[:languages])
classifier.instance_variable_set(:@purposes, meta[:purposes])
classifier
end
private
def calculate_language_confidence(tokenized)
result = @language_lsi.classify_with_confidence(tokenized)
((result[1] || 0) * 100).round(1)
end
def get_language_alternatives(tokenized)
proximity = @language_lsi.proximity_array_for_content(tokenized)
return [] if proximity.empty?
# Group by language and get top alternatives
lang_scores = Hash.new { |h, k| h[k] = [] }
proximity.first(10).each do |content, score|
lang = @language_lsi.categories_for(content).first
lang_scores[lang] << score
end
lang_scores
.transform_values { |scores| (scores.sum / scores.length * 100).round(1) }
.sort_by { |_, score| -score }
.first(3)
.map { |lang, score| { language: lang, score: score } }
end
end
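Before building the full training set, a quick in-memory smoke test confirms the pieces fit together (two toy samples only, so treat any confidence numbers as noise):
require_relative 'code_classifier'
clf = CodeClassifier.new
clf.train_language(:ruby, ["def hi\n  puts 'hi'\nend"])
clf.train_language(:python, ["def hi():\n    print('hi')"])
clf.build_index
puts clf.detect_language("items.each do |i|\n  puts i\nend") # expect "ruby"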
Training Data
Create train.rb:
require_relative 'code_classifier'
classifier = CodeClassifier.new
# Ruby samples
classifier.train_language(:ruby, [
<<~'RUBY',
class User
attr_accessor :name, :email
def initialize(name, email)
@name = name
@email = email
end
def greet
puts "Hello, #{name}!"
end
end
RUBY
<<~RUBY,
module Enumerable
def my_map
result = []
each { |item| result << yield(item) }
result
end
end
RUBY
<<~RUBY,
require 'json'
def parse_config(path)
JSON.parse(File.read(path), symbolize_names: true)
rescue Errno::ENOENT
{}
end
RUBY
])
# Python samples
classifier.train_language(:python, [
<<~PYTHON,
class User:
def __init__(self, name, email):
self.name = name
self.email = email
def greet(self):
print(f"Hello, {self.name}!")
PYTHON
<<~PYTHON,
import json
from pathlib import Path
def parse_config(path):
try:
return json.loads(Path(path).read_text())
except FileNotFoundError:
return {}
PYTHON
<<~PYTHON,
def fibonacci(n):
if n <= 1:
return n
return fibonacci(n-1) + fibonacci(n-2)
PYTHON
])
# JavaScript samples
classifier.train_language(:javascript, [
<<~JS,
class User {
constructor(name, email) {
this.name = name;
this.email = email;
}
greet() {
console.log(`Hello, ${this.name}!`);
}
}
JS
<<~JS,
const parseConfig = async (path) => {
try {
const data = await fs.readFile(path, 'utf8');
return JSON.parse(data);
} catch (e) {
return {};
}
};
JS
<<~JS,
function fibonacci(n) {
if (n <= 1) return n;
return fibonacci(n - 1) + fibonacci(n - 2);
}
JS
])
# Go samples
classifier.train_language(:go, [
<<~GO,
package main
import "fmt"
type User struct {
Name string
Email string
}
func (u *User) Greet() {
fmt.Printf("Hello, %s!", u.Name)
}
GO
<<~GO,
package config
import (
"encoding/json"
"os"
)
func ParseConfig(path string) (map[string]interface{}, error) {
data, err := os.ReadFile(path)
if err != nil {
return nil, err
}
var config map[string]interface{}
json.Unmarshal(data, &config)
return config, nil
}
GO
])
# Rust samples
classifier.train_language(:rust, [
<<~RUST,
struct User {
name: String,
email: String,
}
impl User {
fn new(name: &str, email: &str) -> Self {
User {
name: name.to_string(),
email: email.to_string(),
}
}
fn greet(&self) {
println!("Hello, {}!", self.name);
}
}
RUST
<<~RUST,
use std::fs;
use serde_json::Value;
fn parse_config(path: &str) -> Result<Value, Box<dyn std::error::Error>> {
let data = fs::read_to_string(path)?;
let config: Value = serde_json::from_str(&data)?;
Ok(config)
}
RUST
])
# Purpose: Test code
classifier.train_purpose(:test, [
<<~TEST,
RSpec.describe User do
describe '#greet' do
it 'returns a greeting message' do
user = User.new('Alice', 'alice@example.com')
expect(user.greet).to eq('Hello, Alice!')
end
end
end
TEST
<<~TEST,
describe('User', () => {
test('greet returns greeting', () => {
const user = new User('Alice', 'alice@example.com');
expect(user.greet()).toBe('Hello, Alice!');
});
});
TEST
<<~TEST,
import pytest
def test_user_greet():
user = User('Alice', 'alice@example.com')
assert user.greet() == 'Hello, Alice!'
TEST
])
# Purpose: API endpoint
classifier.train_purpose(:api, [
<<~API,
get '/users/:id' do
content_type :json
user = User.find(params[:id])
user.to_json
end
post '/users' do
user = User.create(JSON.parse(request.body.read))
status 201
user.to_json
end
API
<<~API,
app.get('/users/:id', async (req, res) => {
const user = await User.findById(req.params.id);
res.json(user);
});
app.post('/users', async (req, res) => {
const user = await User.create(req.body);
res.status(201).json(user);
});
API
])
# Purpose: Data processing
classifier.train_purpose(:data_processing, [
<<~DATA,
users
.filter { |u| u.active? }
.map { |u| { name: u.name, email: u.email } }
.sort_by { |u| u[:name] }
.each { |u| process(u) }
DATA
<<~DATA,
users
.filter(u => u.active)
.map(u => ({ name: u.name, email: u.email }))
.sort((a, b) => a.name.localeCompare(b.name))
.forEach(u => process(u));
DATA
])
# Purpose: Configuration
classifier.train_purpose(:config, [
<<~CONFIG,
Rails.application.configure do
config.cache_classes = true
config.eager_load = true
config.log_level = :info
end
CONFIG
<<~CONFIG,
module.exports = {
entry: './src/index.js',
output: {
path: path.resolve(__dirname, 'dist'),
filename: 'bundle.js'
},
plugins: [new HtmlWebpackPlugin()]
};
CONFIG
])
classifier.build_index
classifier.save('code_classifier')
puts "Trained on #{classifier.instance_variable_get(:@languages).length} languages"
puts "Trained on #{classifier.instance_variable_get(:@purposes).length} purposes"
Using the Classifier
Create classify.rb:
require_relative 'code_classifier'
classifier = CodeClassifier.load('code_classifier')
test_snippets = [
{
label: "Ruby with RSpec",
code: <<~CODE
describe Calculator do
it 'adds two numbers' do
expect(Calculator.add(2, 3)).to eq(5)
end
end
CODE
},
{
label: "Python function",
code: <<~CODE
def process_data(items):
result = []
for item in items:
if item.is_valid():
result.append(transform(item))
return result
CODE
},
{
label: "JavaScript API",
code: <<~CODE
router.get('/api/products', async (req, res) => {
const products = await Product.findAll();
res.json({ data: products });
});
CODE
},
{
label: "Go struct",
code: <<~CODE
type Config struct {
Host string `json:"host"`
Port int `json:"port"`
}
func LoadConfig(path string) (*Config, error) {
data, err := os.ReadFile(path)
if err != nil {
return nil, err
}
var config Config
json.Unmarshal(data, &config)
return &config, nil
}
CODE
},
{
label: "Rust with Result",
code: <<~CODE
fn divide(a: f64, b: f64) -> Result<f64, String> {
if b == 0.0 {
Err("Cannot divide by zero".to_string())
} else {
Ok(a / b)
}
}
CODE
}
]
puts "=" * 70
puts "CODE SNIPPET CLASSIFIER"
puts "=" * 70
test_snippets.each do |snippet|
puts "\n#{"-" * 70}"
puts "Sample: #{snippet[:label]}"
puts snippet[:code].lines.first(5).map { |l| " #{l}" }.join
puts " ..." if snippet[:code].lines.length > 5
puts
result = classifier.classify(snippet[:code])
puts "Language: #{result[:language][:detected]} (#{result[:language][:confidence]}%)"
if result[:language][:alternatives].any?
alts = result[:language][:alternatives].map { |a| "#{a[:language]}=#{a[:score]}%" }.join(", ")
puts " Alternatives: #{alts}"
end
puts "Purpose: #{result[:purpose][:detected]} (#{result[:purpose][:confidence]}%)"
puts "Key tokens: #{result[:tokens_used].join(', ')}"
end
Run it:
ruby train.rb
ruby classify.rb
Output:
======================================================================
CODE SNIPPET CLASSIFIER
======================================================================
----------------------------------------------------------------------
Sample: Ruby with RSpec
describe Calculator do
it 'adds two numbers' do
expect(Calculator.add(2, 3)).to eq(5)
end
end
Language: ruby (85.1%)
Alternatives: ruby=74.5%, javascript=24.1%, python=13.9%
Purpose: test (56.4%)
Key tokens: test_it, test_expect, ruby_keyword_end, lang_ruby, ruby_keyword_do, indent_2space, bracket_significant_whitespace
IDE Integration
Create a simple CLI tool:
#!/usr/bin/env ruby
# detect_language.rb
require_relative 'code_classifier'
classifier = CodeClassifier.load('code_classifier')
# Read from stdin or file
code = ARGV[0] ? File.read(ARGV[0]) : $stdin.read
result = classifier.classify(code)
puts result[:language][:detected]
Usage:
# From file
ruby detect_language.rb mystery_file.txt
# From clipboard (macOS)
pbpaste | ruby detect_language.rb
# Output just the language for scripting
ruby detect_language.rb file.txt # => "ruby"
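The same model works for batch jobs. Here is a sketch of a hypothetical batch_classify.rb that labels every file under a directory (adjust the glob, and add error handling for binary files as needed):
#!/usr/bin/env ruby
# batch_classify.rb - hypothetical batch labeller
require_relative 'code_classifier'
classifier = CodeClassifier.load('code_classifier')
Dir.glob("#{ARGV.fetch(0, '.')}/**/*").select { |f| File.file?(f) }.each do |file|
  puts "#{file}: #{classifier.detect_language(File.read(file))}"
end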
Extending to More Languages
# Add TypeScript
classifier.train_language(:typescript, [
<<~TS,
interface User {
id: string;
name: string;
email: string;
}
class UserService {
private users: User[] = [];
async findById(id: string): Promise<User | undefined> {
return this.users.find(u => u.id === id);
}
}
TS
])
# Add Java
classifier.train_language(:java, [
<<~JAVA,
public class User {
private String name;
private String email;
public User(String name, String email) {
this.name = name;
this.email = email;
}
public String getName() {
return name;
}
}
JAVA
])
Best Practices
- More samples = better accuracy: aim for 5-10 samples per language (verify with the hold-out check sketched below)
- Diverse samples: Include different coding styles and patterns
- Clean samples: Remove comments that mention the language name
- Real code: Use actual project code, not artificial examples
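A minimal sketch of that hold-out check, assuming a handful of labeled snippets (held_out below is hypothetical) that were excluded from training:
held_out = [
  [:ruby,   "def hello\n  puts 'hi'\nend"],
  [:python, "def hello():\n    print('hi')"]
]
correct = held_out.count { |lang, code| classifier.detect_language(code) == lang.to_s }
puts "Hold-out accuracy: #{(correct * 100.0 / held_out.length).round(1)}%"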
Next Steps
- LSI Basics - Deep dive into semantic analysis
- kNN Basics - Understanding nearest neighbors
- TF-IDF Guide - Term weighting for code analysis