- Name: Tejaswini Kolluru
- Student ID: 700773943
- Course: CS5760 Natural Language Processing
This homework covers four fundamental NLP concepts:
- Regular Expressions - Pattern matching for text processing
- Tokenization - Breaking text into meaningful units
- Byte Pair Encoding (BPE) - Subword tokenization algorithm
- Edit Distance - Computing minimum transformation cost between strings
Patterns Implemented:
- ZIP Codes: `\b\d{5}(?:[-\s]\d{4})?\b` - matches US ZIP codes with an optional +4 extension
- Non-Capital Words: `\b[^A-Z\s][^\s]*(?:['-][^\s]+)*\b` - words that do not start with a capital letter
- Rich Numbers: `[+-]?(?:\d{1,3}(?:,\d{3})*|\d+)(?:\.\d+)?(?:[eE][+-]?\d+)?` - numbers with signs, thousands separators, decimals, scientific notation
- Email Variants: `\be[-\s–]?mail\b` - matches "email", "e-mail", "e mail" (case-insensitive)
- Go Interjections: `\bgo+\b[!.,?]?` - matches "go", "goo", "gooo" with optional trailing punctuation
- Question Lines: `.*\?[\s"')\]]*$` - lines ending with a question mark, optionally followed by closing quotes/brackets
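As a quick sanity check, a few of the patterns can be exercised directly with Python's `re` module (the sample strings below are illustrative, not the test data used in Q1.py):

```python
import re

# Three of the patterns above, compiled as in the table
ZIP = re.compile(r"\b\d{5}(?:[-\s]\d{4})?\b")
EMAIL = re.compile(r"\be[-\s–]?mail\b", re.IGNORECASE)
GO = re.compile(r"\bgo+\b[!.,?]?")

print(ZIP.findall("Send it to 90210 or 10001-4356."))        # ['90210', '10001-4356']
print(EMAIL.findall("E-mail me; send an email or e mail."))  # ['E-mail', 'email', 'e mail']
print(GO.findall("go! we gooo now."))                        # ['go!', 'gooo']
```

Note that `findall` returns whole matches here because the patterns contain only non-capturing groups.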
Usage: run `python Q1.py` to test all patterns with sample data.
Components:
- Naive Tokenization: Simple space-based splitting
- Manual Tokenization: Handles contractions, punctuation, possessives
- Tool Comparison: Uses NLTK and spaCy tokenizers
- MWE Analysis: Identifies multiword expressions in tech domain
Sample Text: Technology news paragraph about AI startups and NLP
Key Findings:
- Manual tokenization expands contractions ("wasn't" → "was not")
- Tools handle edge cases better than manual rules
- MWEs like "natural language processing" should be single tokens
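The contrast between naive and manual tokenization can be sketched in a few lines of plain Python (the contraction table and token regex here are illustrative; the actual rules in Q2.py may differ):

```python
import re

# Illustrative contraction table; the real manual rules may cover more cases
CONTRACTIONS = {"wasn't": "was not", "don't": "do not", "it's": "it is"}

def naive_tokenize(text):
    # Naive: split on whitespace only; punctuation stays glued to words
    return text.split()

def manual_tokenize(text):
    # Expand known contractions, then separate punctuation into its own tokens
    for short, full in CONTRACTIONS.items():
        text = re.sub(re.escape(short), full, text, flags=re.IGNORECASE)
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

print(naive_tokenize("The launch wasn't easy."))   # ["The", "launch", "wasn't", "easy."]
print(manual_tokenize("The launch wasn't easy."))  # ['The', 'launch', 'was', 'not', 'easy', '.']
```

The token regex keeps possessives like "John's" as single tokens; NLTK and spaCy make different choices on such edge cases, which is what the tool comparison highlights.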
Dependencies:
pip install nltk spacy
python -m spacy download en_core_web_sm

Three Parts:
- Corpus: "low low low low low lowest lowest newer newer newer newer newer newer wider wider wider new new"
- Performs first 3 BPE merges by hand
- Shows vocabulary growth step-by-step
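The hand computation can be checked mechanically. The sketch below (a minimal re-implementation, not the Q3.py code) counts adjacent symbol pairs over the corpus above and greedily merges the most frequent pair three times, using `_` as an end-of-word marker:

```python
from collections import Counter

corpus = ("low " * 5 + "lowest " * 2 + "newer " * 6 + "wider " * 3 + "new " * 2).split()
# Each word is a tuple of symbols terminated by an end-of-word marker
vocab = Counter(tuple(w) + ("_",) for w in corpus)

def pair_counts(vocab):
    # Frequency of each adjacent symbol pair, weighted by word frequency
    counts = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(vocab, pair):
    # Replace every occurrence of the pair with a single fused symbol
    merged = Counter()
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] += freq
    return merged

merges = []
for step in range(3):
    counts = pair_counts(vocab)
    best = max(counts, key=counts.get)  # ties broken by first occurrence
    merges.append(best)
    print(f"merge {step + 1}: {best} (count {counts[best]})")
    vocab = merge_pair(vocab, best)
```

On this corpus the first merge is `('e', 'r')` (count 9) and the second is `('er', '_')`, which creates the `er_` suffix unit; the third merge is a tie at count 8, so its outcome depends on the tie-breaking rule.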
- Complete BPE implementation with a BPELearner class
- Tests on the words "new", "newer", "lowest", "widest", "newestest"
- Demonstrates OOV handling and morpheme alignment
- Trains on NLP-related text with 30 merges
- Segments 5 test words including rare and derived forms
- Analyzes learned prefixes, suffixes, and stems
Key Insights:
- BPE solves OOV problems through subword composition
- Discovers morphological boundaries (e.g., "er_" suffix)
- Balances vocabulary size with representation power
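OOV handling falls out of the merge list: to segment an unseen word, start from single characters plus the end-of-word marker and replay the learned merges in order. A minimal sketch, with the merge list hard-coded to the first three merges from the toy corpus (Q3.py learns 30):

```python
# First three merges learned on the toy corpus (end-of-word marker "_")
MERGES = [("e", "r"), ("er", "_"), ("n", "e")]

def segment(word, merges):
    # Start from single characters, then replay merges in learned order
    symbols = list(word) + ["_"]
    for pair in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

print(segment("newer", MERGES))      # ['ne', 'w', 'er_']
print(segment("newestest", MERGES))  # ['ne', 'w', 'e', 's', 't', 'e', 's', 't', '_']
```

Even a word never seen in training ("newestest") gets a valid segmentation, since any leftover symbols are just single characters.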
Task: Compute minimum edit distance "Sunday" → "Saturday"
Two Models:
- Model A: Substitution=1, Insertion=1, Deletion=1
- Model B: Substitution=2, Insertion=1, Deletion=1
Implementation:
- Dynamic programming with DP matrix
- Backtracking for optimal alignment sequences
- Comparison of operation preferences between models
Applications:
- Spell checking favors Model A (equal operation costs)
- DNA alignment might prefer Model B (expensive substitutions)
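The DP recurrence behind both models can be sketched as follows (parameter names are illustrative; Q4.py also recovers optimal alignments via backtracking, which is omitted here):

```python
def edit_distance(src, tgt, sub=1, ins=1, dele=1):
    # dp[i][j] = minimum cost to transform src[:i] into tgt[:j]
    m, n = len(src), len(tgt)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * dele          # delete all of src[:i]
    for j in range(1, n + 1):
        dp[0][j] = j * ins           # insert all of tgt[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if src[i - 1] == tgt[j - 1] else sub
            dp[i][j] = min(dp[i - 1][j - 1] + cost,  # match / substitute
                           dp[i - 1][j] + dele,      # delete from src
                           dp[i][j - 1] + ins)       # insert from tgt
    return dp[m][n]

print(edit_distance("Sunday", "Saturday"))         # Model A: 3
print(edit_distance("Sunday", "Saturday", sub=2))  # Model B: 4
```

Under Model B a substitution (cost 2) is never cheaper than a delete plus an insert, so optimal alignments can avoid substitutions entirely, which is why the two models prefer different operation sequences.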
- Install Dependencies:
  pip install numpy nltk spacy
  python -m spacy download en_core_web_sm
- Run Individual Questions:
  python Q1.py   # Regular expressions
  python Q2.py   # Tokenization
  python Q3.py   # BPE
  python Q4.py   # Edit distance