A technical interview template for implementing efficient character prefix conditioning algorithms for language model code completion.
This project simulates a real technical interview at Cursor, focusing on a core algorithmic challenge in language model-based code completion: character prefix conditioning.
When using language models for code completion, we need to produce completions that begin with exactly what the user has typed. However, language models operate on tokens, not characters. If the user's cursor doesn't lie on a token boundary, naively tokenizing the typed text yields token sequences the model would rarely produce on its own, and completion quality suffers.
Your task: Design and implement an algorithm that samples token sequences conditional on a character prefix, ensuring the concatenated token representations start with the given prefix.
Sample a sequence s = t₁, t₂, ..., tₙ from the distribution p(s), where:

p(s) = p(t₁, t₂, ..., tₙ) = ∏ₖ₌₁ⁿ p(tₖ | t₁, ..., tₖ₋₁)

- Constraint: P is a prefix of repr(t₁) + repr(t₂) + ... + repr(tₙ)
- Goal: Sample from q(s) = p(s | s starts with P)
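Concretely, the constraint pins down which tokens may appear while part of the prefix is still unmatched. A minimal sketch of that compatibility test (names are illustrative, not part of the template's files):

```typescript
/**
 * True if `tokenText` may appear next while `remaining` characters of the
 * user's prefix are still unmatched: the token either covers the rest of
 * the prefix or consumes an initial chunk of it.
 */
function isCompatible(tokenText: string, remaining: string): boolean {
  return tokenText.startsWith(remaining) || remaining.startsWith(tokenText);
}

// Hypothetical vocabulary, user has typed "functio":
console.log(isCompatible("function", "functio")); // true: covers the prefix
console.log(isCompatible("fun", "functio"));      // true: consumes "fun"
console.log(isCompatible("var", "functio"));      // false: contradicts it
```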
Project structure:

```
src/
├── lib/
│   ├── prefixConditioning.ts   # 🎯 Main algorithm implementation
│   ├── tokenFilter.ts          # 🔍 Token filtering and constraint logic
│   ├── probabilitySampler.ts   # 📊 Probability sampling with normalization
│   └── mockLanguageModel.ts    # 🤖 Mock tokenizer and language model
└── app/
    └── page.tsx                # 🖥️ Interactive demo interface
```
To get started:

1. Install dependencies:

   ```bash
   npm install
   ```

2. Start the development server:

   ```bash
   npm run dev
   ```

3. Open the challenge: navigate to http://localhost:3000
Design an efficient system architecture focusing on:
- Tokenization Strategy: Map between character positions and token boundaries
- Token Filtering Architecture: Efficiently identify valid candidate tokens
- Probability Distribution Management: Handle constraint-based filtering effects
- Optimization & Caching: Minimize language model API calls (a memoization sketch follows this list)
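On the caching point above, a minimal memoization sketch (hypothetical names throughout; `scoreContext` stands in for whatever next-token call the mock model exposes):

```typescript
declare function scoreContext(context: number[]): number[]; // assumed model call

// Constrained sampling can revisit the same token context many times, e.g.
// when retrying after a dead end, so memoize its next-token distribution.
const distributionCache = new Map<string, number[]>();

function cachedDistribution(context: number[]): number[] {
  const key = context.join(","); // token ids form a stable cache key
  let probs = distributionCache.get(key);
  if (probs === undefined) {
    probs = scoreContext(context);
    distributionCache.set(key, probs);
  }
  return probs;
}
```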
Key questions to consider:
- How do you efficiently find tokens that could start with prefix P? (a trie-based sketch follows this list)
- How do you track which token combinations still satisfy the constraint?
- How do you maintain probability distributions while filtering?
- What edge cases exist (empty prefix, no valid tokens, Unicode)?
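On the first question, one workable structure (a sketch assuming the vocabulary fits in memory; `TokenTrie` and its methods are illustrative names, not template APIs) is a trie over token strings. It collects every compatible token in time proportional to the prefix length plus the number of matches, rather than scanning all V entries:

```typescript
class TrieNode {
  children = new Map<string, TrieNode>();
  tokenIds: number[] = []; // tokens whose full text ends exactly at this node
}

class TokenTrie {
  private root = new TrieNode();

  insert(tokenText: string, tokenId: number): void {
    let node = this.root;
    for (const ch of tokenText) {
      let next = node.children.get(ch);
      if (next === undefined) {
        next = new TrieNode();
        node.children.set(ch, next);
      }
      node = next;
    }
    node.tokenIds.push(tokenId);
  }

  /** Tokens that are a prefix of `remaining`, plus tokens that extend it. */
  findCompatible(remaining: string): number[] {
    const out: number[] = [];
    let node = this.root;
    for (const ch of remaining) {
      out.push(...node.tokenIds); // text seen so far is a proper prefix of `remaining`
      const next = node.children.get(ch);
      if (next === undefined) return out; // no vocabulary entry continues this way
      node = next;
    }
    // Everything in this subtree starts with `remaining` (or equals it).
    const stack: TrieNode[] = [node];
    while (stack.length > 0) {
      const n = stack.pop()!;
      out.push(...n.tokenIds);
      for (const child of n.children.values()) stack.push(child);
    }
    return out;
  }
}
```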
Implement the core algorithm in the provided file structure:
- Core Algorithm (`src/lib/prefixConditioning.ts`)
  - Main character prefix conditioning function
  - Autoregressive sampling with character constraints
  - Integration with the mock language model interface
- Token Filtering (`src/lib/tokenFilter.ts`)
  - Find tokens compatible with the character prefix
  - Validate token sequences against the prefix constraint
  - Efficient prefix-matching algorithms
- Probability Sampling (`src/lib/probabilitySampler.ts`)
  - Sample from filtered probability distributions (a minimal sketch follows this list)
  - Implement proper normalization after filtering
  - Handle edge cases (no valid tokens, zero probabilities)
- Mock Infrastructure (`src/lib/mockLanguageModel.ts`)
  - Simple tokenizer implementation
  - Mock language model with probability distributions
  - Test data for algorithm validation
- Interactive Demo (`src/app/page.tsx`)
  - Input field for the character prefix
  - Real-time algorithm execution
  - Display of sampled tokens and the resulting text
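For the probability-sampling task, a minimal sketch of drawing from the surviving candidates (names are illustrative; this is one shape the `sampleFromDistribution` helper used in the loop below could take). Scaling the sampling threshold by the filtered total renormalizes implicitly, which is what makes the result a draw from q(s) = p(s | s starts with P) rather than from p(s):

```typescript
function sampleFromFiltered(
  candidates: number[], // token ids that satisfy the prefix constraint
  probs: number[],      // unnormalized p(tₖ | t₁..tₖ₋₁) for each candidate
): number {
  const total = probs.reduce((a, b) => a + b, 0);
  if (total === 0) {
    throw new Error("no probability mass on valid tokens"); // edge case from the task list
  }
  let r = Math.random() * total; // threshold scaled by total = implicit renormalization
  for (let i = 0; i < candidates.length; i++) {
    r -= probs[i];
    if (r <= 0) return candidates[i];
  }
  return candidates[candidates.length - 1]; // guard against floating-point rounding
}
```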
A sketch of the core sampling loop (the helper functions correspond to the modules above and are yours to implement):

```typescript
function sampleWithCharacterPrefix(prefix: string, maxTokens: number): number[] {
  const result: number[] = []; // sampled token ids
  let currentPrefix = prefix;  // characters of the prefix not yet covered

  for (let position = 1; position <= maxTokens; position++) {
    // Valid candidates either extend past the remaining prefix
    // or consume an initial chunk of it.
    const candidates =
      position === 1
        ? findFirstTokenCandidates(currentPrefix)
        : findNextTokenCandidates(result, currentPrefix);
    if (candidates.length === 0) break; // dead end: no token satisfies the constraint

    const probabilities = getFilteredProbabilities(candidates, result);
    const token = sampleFromDistribution(candidates, probabilities);
    result.push(token);

    const tokenText = getTokenRepresentation(token);
    if (tokenText.length >= currentPrefix.length) {
      // Prefix fully satisfied: switch to normal, unconstrained sampling.
      return [...result, ...sampleNormally(maxTokens - position)];
    }
    // Token consumed part of the prefix; shrink the remaining constraint.
    currentPrefix = currentPrefix.slice(tokenText.length);
  }
  return result;
}
```

Implementation tips:
- Start with a simple tokenizer (space-separated words; a sketch follows these tips)
- Implement exact string matching before optimization
- Add comprehensive test cases for edge conditions
- Consider time complexity: aim for O(V) per token, where V is the vocabulary size
- Handle the transition from constrained to unconstrained sampling
- Ensure proper probability normalization after filtering
- Consider Unicode characters and multi-byte sequences
- Implement efficient caching for repeated prefix queries
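The simple tokenizer the first tip suggests might look like the following (an illustration, not the required `mockLanguageModel.ts` interface): whitespace-delimited pieces, each mapped to an integer id, with the whitespace kept as tokens so that concatenating token representations reproduces the text exactly:

```typescript
function splitPieces(text: string): string[] {
  // The capture group keeps the whitespace separators as their own pieces.
  return text.split(/(\s+)/).filter((piece) => piece.length > 0);
}

function buildVocab(corpus: string): Map<string, number> {
  const vocab = new Map<string, number>();
  for (const piece of splitPieces(corpus)) {
    if (!vocab.has(piece)) vocab.set(piece, vocab.size);
  }
  return vocab;
}

function tokenize(text: string, vocab: Map<string, number>): number[] {
  return splitPieces(text).map((piece) => {
    const id = vocab.get(piece);
    if (id === undefined) throw new Error(`out-of-vocabulary piece: ${piece}`);
    return id;
  });
}
```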
Testing recommendations:
- Create deterministic test cases with known outcomes
- Test edge cases: empty prefix, no valid tokens, partial matches (one such invariant check is sketched after this list)
- Validate probability distributions sum to 1 after filtering
- Test with various temperature and sampling parameters
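As one concrete invariant check across these cases (assuming the `sampleWithCharacterPrefix` and `getTokenRepresentation` names from the loop sketched above): whatever tokens come back, their concatenated text and the requested prefix must agree on their common length, even when sampling dead-ends early:

```typescript
import { strict as assert } from "node:assert";

const prefix = "functio"; // example prefix that ends mid-token on purpose
for (let trial = 0; trial < 100; trial++) {
  const text = sampleWithCharacterPrefix(prefix, 8)
    .map(getTokenRepresentation)
    .join("");
  // Either the output covers the whole prefix, or (after an early dead end)
  // it is itself a prefix of it; anything else violates the constraint.
  assert.ok(
    text.startsWith(prefix) || prefix.startsWith(text),
    `prefix constraint violated: ${JSON.stringify(text)}`,
  );
}
```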
Evaluation criteria:
- Correctness: Character prefix constraint satisfaction
- Efficiency: Token filtering and probability sampling performance
- Edge Cases: Proper handling of boundary conditions
- Code Quality: Clear structure, testing, and documentation
- Understanding: Demonstrated knowledge of language model sampling principles
Development commands:

```bash
# Install dependencies
npm install

# Start development server
npm run dev

# Build for production
npm run build

# Run linter
npm run lint

# Run type checking
npm run type-check
```

This is a template for conducting technical interviews. The implementation should focus on:
- Algorithmic thinking: How to efficiently solve the constraint satisfaction problem
- System design: Scalable architecture for real-world usage
- Code quality: Clean, maintainable, and well-documented code
- Testing: Comprehensive coverage of edge cases and scenarios
Good luck with the implementation! 🚀