team-headstart/cursor-interview-challenge
Cursor Character Prefix Conditioning Challenge

A technical interview template for implementing efficient character prefix conditioning algorithms for language model code completion.

🎯 Challenge Overview

This project simulates a real technical interview at Cursor, focusing on a core algorithmic challenge in language model-based code completion: character prefix conditioning.

The Problem

When using language models for code completion, we need to produce completions that begin with what the user has typed. Language models, however, operate on tokens, not characters. If the user's cursor doesn't lie on a token boundary, naively tokenizing the typed text yields a token sequence the model would rarely produce on its own, and completion quality suffers.

Your task: Design and implement an algorithm that samples token sequences conditional on a character prefix, ensuring the concatenated token representations start with the given prefix.
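For example, if the user has typed `codeco`, no single vocabulary token may match it, yet the sequence `["code", "completion"]` still satisfies the constraint because the concatenated text starts with `codeco`. A minimal sketch of the constraint check (the helper name is illustrative, not from this repo):

```typescript
// Does the concatenated text of a token sequence start with the
// user's character prefix?
function satisfiesPrefix(tokens: string[], prefix: string): boolean {
  return tokens.join("").startsWith(prefix);
}
```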

Mathematical Formulation

Sample sequence s = t₁, t₂, ..., tₙ from distribution p(s) where:

  • p(s) = p(t₁, t₂, ..., tₙ) = ∏ₖ₌₁ⁿ p(tₖ | t₁, ..., tₖ₋₁)
  • Constraint: P is a prefix of repr(t₁) + repr(t₂) + ... + repr(tₙ)
  • Goal: Sample from q(s) = p(s | s starts with P)
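As a concrete illustration of q(s), here is brute-force conditioning over a tiny enumerated sequence space (a sketch for intuition only; the point of the challenge is to sample from this distribution without enumerating it):

```typescript
type Seq = { tokens: string[]; p: number };

// Keep only sequences whose concatenated text starts with the prefix,
// then renormalize so the kept probabilities sum to 1.
function conditionOnPrefix(seqs: Seq[], prefix: string): Seq[] {
  const kept = seqs.filter((s) => s.tokens.join("").startsWith(prefix));
  const mass = kept.reduce((acc, s) => acc + s.p, 0);
  return kept.map((s) => ({ ...s, p: s.p / mass }));
}
```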

🏗️ Project Structure

src/
├── lib/
│   ├── prefixConditioning.ts    # 🎯 Main algorithm implementation
│   ├── tokenFilter.ts           # 🔍 Token filtering and constraint logic
│   ├── probabilitySampler.ts    # 📊 Probability sampling with normalization
│   └── mockLanguageModel.ts     # 🤖 Mock tokenizer and language model
└── app/
    └── page.tsx                 # 🖥️ Interactive demo interface

🚀 Getting Started

  1. Install dependencies:

    npm install
  2. Start the development server:

    npm run dev
  3. Open the challenge: Navigate to http://localhost:3000

📋 Interview Structure

Phase 1: System Design (30 minutes)

Design an efficient system architecture focusing on:

🏛️ Core Components

  • Tokenization Strategy: Map between character positions and token boundaries
  • Token Filtering Architecture: Efficiently identify valid candidate tokens
  • Probability Distribution Management: Handle constraint-based filtering effects
  • Optimization & Caching: Minimize language model API calls

🎯 Key Design Questions

  • How do you efficiently find tokens that could start with prefix P?
  • How do you track which token combinations still satisfy the constraint?
  • How do you maintain probability distributions while filtering?
  • What edge cases exist (empty prefix, no valid tokens, Unicode)?
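For the first two questions, a useful observation is that a token can extend a partially-consumed prefix in exactly two ways: its text starts with the remaining prefix, or the remaining prefix starts with its text. A sketch of that predicate (the function name is illustrative):

```typescript
// A candidate token is compatible with the remaining character prefix
// when either string is a prefix of the other.
function isCompatible(tokenText: string, remaining: string): boolean {
  return tokenText.startsWith(remaining) || remaining.startsWith(tokenText);
}
```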

Phase 2: Implementation (30 minutes)

Implement the core algorithm in the provided file structure:

📝 Implementation Tasks

  1. Core Algorithm (src/lib/prefixConditioning.ts)

    • Main character prefix conditioning function
    • Autoregressive sampling with character constraints
    • Integration with mock language model interface
  2. Token Filtering (src/lib/tokenFilter.ts)

    • Find tokens compatible with character prefix
    • Validate token sequences against prefix constraint
    • Efficient prefix matching algorithms
  3. Probability Sampling (src/lib/probabilitySampler.ts)

    • Sample from filtered probability distributions
    • Implement proper normalization after filtering
    • Handle edge cases (no valid tokens, zero probabilities)
  4. Mock Infrastructure (src/lib/mockLanguageModel.ts)

    • Simple tokenizer implementation
    • Mock language model with probability distributions
    • Test data for algorithm validation
  5. Interactive Demo (src/app/page.tsx)

    • Input field for character prefix
    • Real-time algorithm execution
    • Display sampled tokens and resulting text
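For task 3, sampling from the filtered distribution amounts to renormalizing the surviving probability mass and drawing via the inverse CDF. One possible sketch (parallel arrays, with injectable randomness for testability; the names are not from the repo):

```typescript
function sampleFromFiltered(
  candidates: string[],
  probs: number[],
  rand: number = Math.random()
): string {
  // Renormalize: the filtered probabilities rarely sum to 1 on their own.
  const total = probs.reduce((acc, p) => acc + p, 0);
  if (total === 0) throw new Error("no probability mass among candidates");
  let cumulative = 0;
  for (let i = 0; i < candidates.length; i++) {
    cumulative += probs[i] / total;
    if (rand < cumulative) return candidates[i];
  }
  return candidates[candidates.length - 1]; // floating-point guard
}
```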

🧮 Algorithm Reference

function sampleWithCharacterPrefix(prefix: string, maxTokens: number) {
  const result = [];
  // Portion of the prefix not yet covered by sampled tokens.
  let remaining = prefix;

  for (let position = 1; position <= maxTokens; position++) {
    // Candidates: tokens whose text starts with `remaining`,
    // or whose text is itself a prefix of `remaining`.
    const candidates =
      position === 1
        ? findFirstTokenCandidates(remaining)
        : findNextTokenCandidates(result, remaining);

    if (candidates.length === 0) break; // dead end: no token satisfies the constraint

    const probabilities = getFilteredProbabilities(candidates, result);
    const token = sampleFromDistribution(candidates, probabilities);
    result.push(token);

    const tokenText = getTokenRepresentation(token);
    if (tokenText.length >= remaining.length) {
      // Token covers the rest of the prefix: switch to unconstrained sampling.
      return [...result, ...sampleNormally(maxTokens - position)];
    }
    // Token consumed only part of the prefix; keep the uncovered tail.
    remaining = remaining.slice(tokenText.length);
  }

  return result;
}

🎯 Implementation Tips

Getting Started

  • Start with a simple tokenizer (space-separated words)
  • Implement exact string matching before optimization
  • Add comprehensive test cases for edge conditions
  • Consider time complexity - aim for O(V) per token where V is vocab size
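The "simple tokenizer" tip can be made concrete with a whitespace tokenizer that keeps each word's trailing whitespace attached, so the concatenated token texts reconstruct the input (a sketch, not the repo's tokenizer; assumes no leading whitespace):

```typescript
// Split into words, keeping trailing whitespace with each token,
// so tokens.join("") round-trips the input text.
function tokenize(text: string): string[] {
  return text.match(/\S+\s*/g) ?? [];
}
```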

Key Considerations

  • Handle the transition from constrained to unconstrained sampling
  • Ensure proper probability normalization after filtering
  • Consider Unicode characters and multi-byte sequences
  • Implement efficient caching for repeated prefix queries
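The caching consideration can start as simply as memoizing candidate lookups by the remaining prefix (a minimal sketch assuming a plain string-array vocabulary; names are illustrative):

```typescript
const candidateCache = new Map<string, string[]>();

// Return vocabulary tokens compatible with the remaining prefix,
// reusing a previous scan when the same remainder is queried again.
function candidatesFor(remaining: string, vocab: string[]): string[] {
  const cached = candidateCache.get(remaining);
  if (cached) return cached;
  const found = vocab.filter(
    (t) => t.startsWith(remaining) || remaining.startsWith(t)
  );
  candidateCache.set(remaining, found);
  return found;
}
```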

Testing Strategy

  • Create deterministic test cases with known outcomes
  • Test edge cases: empty prefix, no valid tokens, partial matches
  • Validate probability distributions sum to 1 after filtering
  • Test with various temperature and sampling parameters
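The distribution check above becomes a one-line assertion given a normalization helper (sketch; `normalize` and `sumsToOne` are illustrative, not from the repo):

```typescript
function normalize(probs: number[]): number[] {
  const total = probs.reduce((acc, p) => acc + p, 0);
  return probs.map((p) => p / total);
}

// A filtered-then-renormalized distribution should sum to 1 (up to epsilon).
function sumsToOne(probs: number[], eps: number = 1e-9): boolean {
  return Math.abs(probs.reduce((acc, p) => acc + p, 0) - 1) < eps;
}
```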


🏆 Evaluation Criteria

  • Correctness: Character prefix constraint satisfaction
  • Efficiency: Token filtering and probability sampling performance
  • Edge Cases: Proper handling of boundary conditions
  • Code Quality: Clear structure, testing, and documentation
  • Understanding: Demonstrated knowledge of language model sampling principles

🛠️ Development Commands

# Install dependencies
npm install

# Start development server
npm run dev

# Build for production
npm run build

# Run linter
npm run lint

# Run type checking
npm run type-check

📝 Notes

This is a template for conducting technical interviews. The implementation should focus on:

  1. Algorithmic thinking: How to efficiently solve the constraint satisfaction problem
  2. System design: Scalable architecture for real-world usage
  3. Code quality: Clean, maintainable, and well-documented code
  4. Testing: Comprehensive coverage of edge cases and scenarios

Good luck with the implementation! 🚀
