Code Snippet Classifier
Build a classifier that identifies programming languages and detects code patterns (tests, APIs, data processing). Uses LSI to understand code structure semantically, not just through keywords.
What You’ll Learn
- Tokenizing code for classification
- Training on programming language patterns
- Multi-level classification (language + purpose)
- Building a practical code analysis tool
Why This Works
Code has recognizable patterns:
- Syntax markers: def, function, fn, and func all mean "function definition"
- Structural patterns: indentation, brackets, semicolons
- Domain vocabulary: describe, it, and expect signal tests
- Import patterns: require, import, use, include
LSI captures these patterns semantically, so it recognizes Ruby even without seeing def if the overall structure matches.
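For example, once trained (see train.rb below), a snippet containing no def at all can still come back as Ruby, because tokens like ruby_keyword_do, ruby_keyword_end, and ruby_keyword_puts co-occur with the Ruby training samples:
# Hypothetical check after training:
classifier.detect_language("items.each do |item|\n  puts item\nend")
# => likely "ruby" - recognized from do/end/puts tokens, not from def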
Project Setup
mkdir code_classifier && cd code_classifier
# Gemfile
source 'https://rubygems.org'
gem 'classifier'
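Then install the dependency:
bundle install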
Code Tokenizer
Create code_tokenizer.rb:
# Custom tokenizer for source code
class CodeTokenizer
# Patterns that identify languages/constructs
SYNTAX_PATTERNS = {
# Function definitions
ruby_def: /\bdef\s+\w+/,
python_def: /\bdef\s+\w+\s*\(/,
js_function: /\bfunction\s+\w+|const\s+\w+\s*=.*=>/,
go_func: /\bfunc\s+\w+/,
rust_fn: /\bfn\s+\w+/,
# Class/type definitions
class_def: /\bclass\s+[A-Z]\w*/,
struct_def: /\bstruct\s+\w+/,
interface_def: /\binterface\s+\w+/,
# Control flow
if_statement: /\bif\s+/,
for_loop: /\bfor\s+/,
while_loop: /\bwhile\s+/,
match_case: /\bmatch\s+|\bcase\s+/,
# Imports
require_stmt: /\brequire\s+['"]|require_relative/,
import_stmt: /\bimport\s+/,
use_stmt: /\buse\s+/,
include_stmt: /\binclude\s+/,
# Testing
test_describe: /\bdescribe\s+['"]|RSpec\.describe/,
test_it: /\bit\s+['"]|test\s+['"]/,
test_expect: /\bexpect\(|assert[A-Z_]/,
# Type annotations
type_annotation: /:\s*\w+\s*[,\)=]|<\w+>/,
# Comments
line_comment: /#\s|\/\//,
block_comment: /\/\*|"""|'''/,
}
# Language-specific keywords
LANGUAGE_KEYWORDS = {
ruby: %w[end do elsif unless yield puts attr_accessor attr_reader module],
python: %w[elif pass lambda self __init__ print None True False],
javascript: %w[const let var async await null undefined console],
typescript: %w[interface type enum namespace readonly private public],
go: %w[package defer chan goroutine make nil fmt],
rust: %w[let mut impl pub mod crate unsafe Option Result Some None],
java: %w[public private static void extends implements throws final],
cpp: %w[include namespace std cout cin template virtual override],
}
def initialize(code)
@code = code
@tokens = []
end
def tokenize
extract_syntax_patterns
extract_keywords
extract_operators
extract_structure_features
@tokens.join(' ')
end
private
def extract_syntax_patterns
SYNTAX_PATTERNS.each do |name, pattern|
count = @code.scan(pattern).length
count.times { @tokens << name.to_s } if count > 0
end
end
def extract_keywords
words = @code.downcase.scan(/\b[a-z_][a-z0-9_]*\b/)
LANGUAGE_KEYWORDS.each do |lang, keywords|
keywords.each do |kw|
if words.include?(kw)
@tokens << "#{lang}_keyword_#{kw}"
@tokens << "lang_#{lang}"
end
end
end
end
def extract_operators
# Significant operators by language
@tokens << 'op_arrow' if @code.include?('=>') || @code.include?('->')
@tokens << 'op_rocket' if @code.include?('<=>')
@tokens << 'op_pipe' if @code.match?(/\|>|\|/)
@tokens << 'op_double_colon' if @code.include?('::')
@tokens << 'op_triple_equals' if @code.include?('===')
@tokens << 'op_spread' if @code.include?('...')
@tokens << 'op_null_coalesce' if @code.match?(/\?\?|&\./)
end
def extract_structure_features
lines = @code.split("\n")
# Indentation style
if lines.any? { |l| l.start_with?('  ') && !l.start_with?('    ') }
@tokens << 'indent_2space'
elsif lines.any? { |l| l.start_with?('    ') }
@tokens << 'indent_4space'
elsif lines.any? { |l| l.start_with?("\t") }
@tokens << 'indent_tab'
end
# Bracket style
@tokens << 'bracket_curly' if @code.include?('{')
@tokens << 'bracket_significant_whitespace' unless @code.include?('{') || @code.include?(';')
# Line endings
@tokens << 'semicolon_terminated' if @code.count(';') > lines.length / 2
end
end
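To sanity-check the tokenizer in isolation, feed it a small snippet. With the patterns above, a short Ruby method should produce something like the following (note that both ruby_def and python_def fire; the keyword and structure tokens are what tip the balance toward Ruby):
require_relative 'code_tokenizer'
tokenizer = CodeTokenizer.new("def add(a, b)\n  a + b\nend")
puts tokenizer.tokenize
# => "ruby_def python_def ruby_keyword_end lang_ruby indent_2space bracket_significant_whitespace"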
The Code Classifier
Create code_classifier.rb:
require 'classifier'
require 'json'
require_relative 'code_tokenizer'
class CodeClassifier
def initialize
@language_lsi = Classifier::LSI.new(auto_rebuild: false)
@purpose_knn = Classifier::KNN.new(k: 3, weighted: true)
@languages = []
@purposes = []
end
# Train language detection
def train_language(language, code_samples)
@languages << language.to_s unless @languages.include?(language.to_s)
Array(code_samples).each do |code|
tokenized = CodeTokenizer.new(code).tokenize
@language_lsi.add_item(tokenized, language.to_s)
end
end
# Train purpose detection
def train_purpose(purpose, code_samples)
@purposes << purpose.to_s unless @purposes.include?(purpose.to_s)
Array(code_samples).each do |code|
tokenized = CodeTokenizer.new(code).tokenize
@purpose_knn.add(purpose.to_sym => tokenized)
end
end
def build_index
@language_lsi.build_index
end
# Classify a code snippet
def classify(code)
tokenized = CodeTokenizer.new(code).tokenize
language = @language_lsi.classify(tokenized)
lang_confidence = calculate_language_confidence(tokenized)
purpose_result = @purpose_knn.classify_with_neighbors(tokenized)
purpose = purpose_result[:category]
purpose_confidence = (purpose_result[:confidence] * 100).round(1)
{
language: {
detected: language,
confidence: lang_confidence,
alternatives: get_language_alternatives(tokenized)
},
purpose: {
detected: purpose,
confidence: purpose_confidence
},
tokens_used: tokenized.split.uniq.first(10)
}
end
# Quick language detection
def detect_language(code)
tokenized = CodeTokenizer.new(code).tokenize
@language_lsi.classify(tokenized)
end
# Quick purpose detection
def detect_purpose(code)
tokenized = CodeTokenizer.new(code).tokenize
@purpose_knn.classify(tokenized)
end
def save(path)
build_index
@language_lsi.storage = Classifier::Storage::File.new(path: "#{path}_language.json")
@language_lsi.save
@purpose_knn.storage = Classifier::Storage::File.new(path: "#{path}_purpose.json")
@purpose_knn.save
File.write("#{path}_meta.json", { languages: @languages, purposes: @purposes }.to_json)
end
def self.load(path)
classifier = new
lsi_storage = Classifier::Storage::File.new(path: "#{path}_language.json")
knn_storage = Classifier::Storage::File.new(path: "#{path}_purpose.json")
classifier.instance_variable_set(:@language_lsi, Classifier::LSI.load(storage: lsi_storage))
classifier.instance_variable_set(:@purpose_knn, Classifier::KNN.load(storage: knn_storage))
meta = JSON.parse(File.read("#{path}_meta.json"), symbolize_names: true)
classifier.instance_variable_set(:@languages, meta[:languages])
classifier.instance_variable_set(:@purposes, meta[:purposes])
classifier
end
private
def calculate_language_confidence(tokenized)
result = @language_lsi.classify_with_confidence(tokenized)
((result[1] || 0) * 100).round(1)
end
def get_language_alternatives(tokenized)
proximity = @language_lsi.proximity_array_for_content(tokenized)
return [] if proximity.empty?
# Group by language and get top alternatives
lang_scores = Hash.new { |h, k| h[k] = [] }
proximity.first(10).each do |content, score|
lang = @language_lsi.categories_for(content).first
lang_scores[lang] << score
end
lang_scores
.transform_values { |scores| (scores.sum / scores.length * 100).round(1) }
.sort_by { |_, score| -score }
.first(3)
.map { |lang, score| { language: lang, score: score } }
end
end
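Before building the full training set, a quick in-memory smoke test confirms the pieces fit together (two toy samples only, so treat any confidence numbers as noise):
require_relative 'code_classifier'
clf = CodeClassifier.new
clf.train_language(:ruby, ["def hi\n  puts 'hi'\nend"])
clf.train_language(:python, ["def hi():\n    print('hi')"])
clf.build_index
puts clf.detect_language("items.each do |i|\n  puts i\nend") # expect "ruby"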
Training Data
Create train.rb:
require_relative 'code_classifier'
classifier = CodeClassifier.new
# Ruby samples
classifier.train_language(:ruby, [
<<~'RUBY',
class User
attr_accessor :name, :email
def initialize(name, email)
@name = name
@email = email
end
def greet
puts "Hello, #{name}!"
end
end
RUBY
<<~RUBY,
module Enumerable
def my_map
result = []
each { |item| result << yield(item) }
result
end
end
RUBY
<<~RUBY,
require 'json'
def parse_config(path)
JSON.parse(File.read(path), symbolize_names: true)
rescue Errno::ENOENT
{}
end
RUBY
])
# Python samples
classifier.train_language(:python, [
<<~PYTHON,
class User:
def __init__(self, name, email):
self.name = name
self.email = email
def greet(self):
print(f"Hello, {self.name}!")
PYTHON
<<~PYTHON,
import json
from pathlib import Path
def parse_config(path):
try:
return json.loads(Path(path).read_text())
except FileNotFoundError:
return {}
PYTHON
<<~PYTHON,
def fibonacci(n):
if n <= 1:
return n
return fibonacci(n-1) + fibonacci(n-2)
PYTHON
])
# JavaScript samples
classifier.train_language(:javascript, [
<<~JS,
class User {
constructor(name, email) {
this.name = name;
this.email = email;
}
greet() {
console.log(`Hello, ${this.name}!`);
}
}
JS
<<~JS,
const parseConfig = async (path) => {
try {
const data = await fs.readFile(path, 'utf8');
return JSON.parse(data);
} catch (e) {
return {};
}
};
JS
<<~JS,
function fibonacci(n) {
if (n <= 1) return n;
return fibonacci(n - 1) + fibonacci(n - 2);
}
JS
])
# Go samples
classifier.train_language(:go, [
<<~GO,
package main
import "fmt"
type User struct {
Name string
Email string
}
func (u *User) Greet() {
fmt.Printf("Hello, %s!", u.Name)
}
GO
<<~GO,
package config
import (
"encoding/json"
"os"
)
func ParseConfig(path string) (map[string]interface{}, error) {
data, err := os.ReadFile(path)
if err != nil {
return nil, err
}
var config map[string]interface{}
json.Unmarshal(data, &config)
return config, nil
}
GO
])
# Rust samples
classifier.train_language(:rust, [
<<~RUST,
struct User {
name: String,
email: String,
}
impl User {
fn new(name: &str, email: &str) -> Self {
User {
name: name.to_string(),
email: email.to_string(),
}
}
fn greet(&self) {
println!("Hello, {}!", self.name);
}
}
RUST
<<~RUST,
use std::fs;
use serde_json::Value;
fn parse_config(path: &str) -> Result<Value, Box<dyn std::error::Error>> {
let data = fs::read_to_string(path)?;
let config: Value = serde_json::from_str(&data)?;
Ok(config)
}
RUST
])
# Purpose: Test code
classifier.train_purpose(:test, [
<<~TEST,
RSpec.describe User do
describe '#greet' do
it 'returns a greeting message' do
user = User.new('Alice', 'alice@example.com')
expect(user.greet).to eq('Hello, Alice!')
end
end
end
TEST
<<~TEST,
describe('User', () => {
test('greet returns greeting', () => {
const user = new User('Alice', 'alice@example.com');
expect(user.greet()).toBe('Hello, Alice!');
});
});
TEST
<<~TEST,
import pytest
def test_user_greet():
user = User('Alice', 'alice@example.com')
assert user.greet() == 'Hello, Alice!'
TEST
])
# Purpose: API endpoint
classifier.train_purpose(:api, [
<<~API,
get '/users/:id' do
content_type :json
user = User.find(params[:id])
user.to_json
end
post '/users' do
user = User.create(JSON.parse(request.body.read))
status 201
user.to_json
end
API
<<~API,
app.get('/users/:id', async (req, res) => {
const user = await User.findById(req.params.id);
res.json(user);
});
app.post('/users', async (req, res) => {
const user = await User.create(req.body);
res.status(201).json(user);
});
API
])
# Purpose: Data processing
classifier.train_purpose(:data_processing, [
<<~DATA,
users
.filter { |u| u.active? }
.map { |u| { name: u.name, email: u.email } }
.sort_by { |u| u[:name] }
.each { |u| process(u) }
DATA
<<~DATA,
users
.filter(u => u.active)
.map(u => ({ name: u.name, email: u.email }))
.sort((a, b) => a.name.localeCompare(b.name))
.forEach(u => process(u));
DATA
])
# Purpose: Configuration
classifier.train_purpose(:config, [
<<~CONFIG,
Rails.application.configure do
config.cache_classes = true
config.eager_load = true
config.log_level = :info
end
CONFIG
<<~CONFIG,
module.exports = {
entry: './src/index.js',
output: {
path: path.resolve(__dirname, 'dist'),
filename: 'bundle.js'
},
plugins: [new HtmlWebpackPlugin()]
};
CONFIG
])
classifier.build_index
classifier.save('code_classifier')
puts "Trained on #{classifier.instance_variable_get(:@languages).length} languages"
puts "Trained on #{classifier.instance_variable_get(:@purposes).length} purposes"
Using the Classifier
Create classify.rb:
require_relative 'code_classifier'
classifier = CodeClassifier.load('code_classifier')
test_snippets = [
{
label: "Ruby with RSpec",
code: <<~CODE
describe Calculator do
it 'adds two numbers' do
expect(Calculator.add(2, 3)).to eq(5)
end
end
CODE
},
{
label: "Python function",
code: <<~CODE
def process_data(items):
result = []
for item in items:
if item.is_valid():
result.append(transform(item))
return result
CODE
},
{
label: "JavaScript API",
code: <<~CODE
router.get('/api/products', async (req, res) => {
const products = await Product.findAll();
res.json({ data: products });
});
CODE
},
{
label: "Go struct",
code: <<~CODE
type Config struct {
Host string `json:"host"`
Port int `json:"port"`
}
func LoadConfig(path string) (*Config, error) {
data, err := os.ReadFile(path)
if err != nil {
return nil, err
}
var config Config
json.Unmarshal(data, &config)
return &config, nil
}
CODE
},
{
label: "Rust with Result",
code: <<~CODE
fn divide(a: f64, b: f64) -> Result<f64, String> {
if b == 0.0 {
Err("Cannot divide by zero".to_string())
} else {
Ok(a / b)
}
}
CODE
}
]
puts "=" * 70
puts "CODE SNIPPET CLASSIFIER"
puts "=" * 70
test_snippets.each do |snippet|
puts "\n#{"-" * 70}"
puts "Sample: #{snippet[:label]}"
puts snippet[:code].lines.first(5).map { |l| " #{l}" }.join
puts " ..." if snippet[:code].lines.length > 5
puts
result = classifier.classify(snippet[:code])
puts "Language: #{result[:language][:detected]} (#{result[:language][:confidence]}%)"
if result[:language][:alternatives].any?
alts = result[:language][:alternatives].map { |a| "#{a[:language]}=#{a[:score]}%" }.join(", ")
puts " Alternatives: #{alts}"
end
puts "Purpose: #{result[:purpose][:detected]} (#{result[:purpose][:confidence]}%)"
puts "Key tokens: #{result[:tokens_used].join(', ')}"
end
Run it:
ruby train.rb
ruby classify.rb
Output:
======================================================================
CODE SNIPPET CLASSIFIER
======================================================================
----------------------------------------------------------------------
Sample: Ruby with RSpec
describe Calculator do
it 'adds two numbers' do
expect(Calculator.add(2, 3)).to eq(5)
end
end
Language: ruby (85.1%)
Alternatives: ruby=74.5%, javascript=24.1%, python=13.9%
Purpose: test (56.4%)
Key tokens: test_it, test_expect, ruby_keyword_end, lang_ruby, ruby_keyword_do, indent_2space, bracket_significant_whitespace
IDE Integration
Create a simple CLI tool:
#!/usr/bin/env ruby
# detect_language.rb
require_relative 'code_classifier'
classifier = CodeClassifier.load('code_classifier')
# Read from stdin or file
code = ARGV[0] ? File.read(ARGV[0]) : $stdin.read
result = classifier.classify(code)
puts result[:language][:detected]
Usage:
# From file
ruby detect_language.rb mystery_file.txt
# From clipboard (macOS)
pbpaste | ruby detect_language.rb
# Output just the language for scripting
ruby detect_language.rb file.txt # => "ruby"
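The same model works for batch jobs. Here is a sketch of a hypothetical batch_classify.rb that labels every file under a directory (adjust the glob, and add error handling for binary files as needed):
#!/usr/bin/env ruby
# batch_classify.rb - hypothetical batch labeller
require_relative 'code_classifier'
classifier = CodeClassifier.load('code_classifier')
Dir.glob("#{ARGV.fetch(0, '.')}/**/*").select { |f| File.file?(f) }.each do |file|
  puts "#{file}: #{classifier.detect_language(File.read(file))}"
end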
Extending to More Languages
# Add TypeScript
classifier.train_language(:typescript, [
<<~TS,
interface User {
id: string;
name: string;
email: string;
}
class UserService {
private users: User[] = [];
async findById(id: string): Promise<User | undefined> {
return this.users.find(u => u.id === id);
}
}
TS
])
# Add Java
classifier.train_language(:java, [
<<~JAVA,
public class User {
private String name;
private String email;
public User(String name, String email) {
this.name = name;
this.email = email;
}
public String getName() {
return name;
}
}
JAVA
])
Best Practices
- More samples = better accuracy: aim for 5-10 samples per language (verify with the hold-out check sketched below)
- Diverse samples: Include different coding styles and patterns
- Clean samples: Remove comments that mention the language name
- Real code: Use actual project code, not artificial examples
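A minimal sketch of that hold-out check, assuming a handful of labeled snippets (held_out below is hypothetical) that were excluded from training:
held_out = [
  [:ruby,   "def hello\n  puts 'hi'\nend"],
  [:python, "def hello():\n    print('hi')"]
]
correct = held_out.count { |lang, code| classifier.detect_language(code) == lang.to_s }
puts "Hold-out accuracy: #{(correct * 100.0 / held_out.length).round(1)}%"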
Next Steps
- LSI Basics - Deep dive into semantic analysis
- kNN Basics - Understanding nearest neighbors
- TF-IDF Guide - Term weighting for code analysis