- Name: Tejaswini Kolluru
- Student ID: 700773943
- Course: CS5760 Natural Language Processing
This homework covers four fundamental NLP concepts:
- Regular Expressions - Pattern matching for text processing
- Tokenization - Breaking text into meaningful units
- Byte Pair Encoding (BPE) - Subword tokenization algorithm
- Edit Distance - Computing minimum transformation cost between strings
Patterns Implemented:
- ZIP Codes: `\b\d{5}(?:[-\s]\d{4})?\b` - matches US ZIP codes with an optional +4 extension
- Non-Capital Words: `\b[^A-Z\s][^\s]*(?:['-][^\s]+)*\b` - words that do not start with a capital letter
- Rich Numbers: `[+-]?(?:\d{1,3}(?:,\d{3})*|\d+)(?:\.\d+)?(?:[eE][+-]?\d+)?` - numbers with signs, thousands separators, decimals, scientific notation
- Email Variants: `\be[-\s–]?mail\b` - matches "email", "e-mail", "e mail" (case-insensitive)
- Go Interjections: `\bgo+\b[!.,?]?` - matches "go", "goo", "gooo" with optional trailing punctuation
- Question Lines: `.*\?[\s"')\]]*$` - lines ending with a question mark, optionally followed by closing quotes/brackets
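As a quick sanity check, a few of the patterns can be exercised directly with Python's `re` module (the sample strings below are illustrative, not the test data used in Q1.py):

```python
import re

# Three of the patterns above, compiled as in the table
ZIP = re.compile(r"\b\d{5}(?:[-\s]\d{4})?\b")
EMAIL = re.compile(r"\be[-\s–]?mail\b", re.IGNORECASE)
GO = re.compile(r"\bgo+\b[!.,?]?")

print(ZIP.findall("Send it to 90210 or 10001-4356."))        # ['90210', '10001-4356']
print(EMAIL.findall("E-mail me; send an email or e mail."))  # ['E-mail', 'email', 'e mail']
print(GO.findall("go! we gooo now."))                        # ['go!', 'gooo']
```

Note that `findall` returns whole matches here because the patterns contain only non-capturing groups.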
Usage: run `python Q1.py` to test all patterns with sample data.
Components:
- Naive Tokenization: Simple space-based splitting
- Manual Tokenization: Handles contractions, punctuation, possessives
- Tool Comparison: Uses NLTK and spaCy tokenizers
- MWE Analysis: Identifies multiword expressions in tech domain
Sample Text: Technology news paragraph about AI startups and NLP
Key Findings:
- Manual tokenization expands contractions ("wasn't" → "was not")
- Tools handle edge cases better than manual rules
- MWEs like "natural language processing" should be single tokens
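The contrast between naive and manual tokenization can be sketched in a few lines of plain Python (the contraction table and token regex here are illustrative; the actual rules in Q2.py may differ):

```python
import re

# Illustrative contraction table; the real manual rules may cover more cases
CONTRACTIONS = {"wasn't": "was not", "don't": "do not", "it's": "it is"}

def naive_tokenize(text):
    # Naive: split on whitespace only; punctuation stays glued to words
    return text.split()

def manual_tokenize(text):
    # Expand known contractions, then separate punctuation into its own tokens
    for short, full in CONTRACTIONS.items():
        text = re.sub(re.escape(short), full, text, flags=re.IGNORECASE)
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

print(naive_tokenize("The launch wasn't easy."))   # ["The", "launch", "wasn't", "easy."]
print(manual_tokenize("The launch wasn't easy."))  # ['The', 'launch', 'was', 'not', 'easy', '.']
```

The token regex keeps possessives like "John's" as single tokens; NLTK and spaCy make different choices on such edge cases, which is what the tool comparison highlights.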
Dependencies:
pip install nltk spacy
python -m spacy download en_core_web_sm

Three Parts:
- Corpus: "low low low low low lowest lowest newer newer newer newer newer newer wider wider wider new new"
- Performs first 3 BPE merges by hand
- Shows vocabulary growth step-by-step
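The hand computation can be checked mechanically. The sketch below (a minimal re-implementation, not the Q3.py code) counts adjacent symbol pairs over the corpus above and greedily merges the most frequent pair three times, using `_` as an end-of-word marker:

```python
from collections import Counter

corpus = ("low " * 5 + "lowest " * 2 + "newer " * 6 + "wider " * 3 + "new " * 2).split()
# Each word is a tuple of symbols terminated by an end-of-word marker
vocab = Counter(tuple(w) + ("_",) for w in corpus)

def pair_counts(vocab):
    # Frequency of each adjacent symbol pair, weighted by word frequency
    counts = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(vocab, pair):
    # Replace every occurrence of the pair with a single fused symbol
    merged = Counter()
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] += freq
    return merged

merges = []
for step in range(3):
    counts = pair_counts(vocab)
    best = max(counts, key=counts.get)  # ties broken by first occurrence
    merges.append(best)
    print(f"merge {step + 1}: {best} (count {counts[best]})")
    vocab = merge_pair(vocab, best)
```

On this corpus the first merge is `('e', 'r')` (count 9) and the second is `('er', '_')`, which creates the `er_` suffix unit; the third merge is a tie at count 8, so its outcome depends on the tie-breaking rule.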
- Complete BPE implementation with a BPELearner class
- Tests on the words "new", "newer", "lowest", "widest", "newestest"
- Demonstrates OOV handling and morpheme alignment
- Trains on NLP-related text with 30 merges
- Segments 5 test words including rare and derived forms
- Analyzes learned prefixes, suffixes, and stems
Key Insights:
- BPE solves OOV problems through subword composition
- Discovers morphological boundaries (e.g., "er_" suffix)
- Balances vocabulary size with representation power
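OOV handling falls out of the merge list: to segment an unseen word, start from single characters plus the end-of-word marker and replay the learned merges in order. A minimal sketch, with the merge list hard-coded to the first three merges from the toy corpus (Q3.py learns 30):

```python
# First three merges learned on the toy corpus (end-of-word marker "_")
MERGES = [("e", "r"), ("er", "_"), ("n", "e")]

def segment(word, merges):
    # Start from single characters, then replay merges in learned order
    symbols = list(word) + ["_"]
    for pair in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

print(segment("newer", MERGES))      # ['ne', 'w', 'er_']
print(segment("newestest", MERGES))  # ['ne', 'w', 'e', 's', 't', 'e', 's', 't', '_']
```

Even a word never seen in training ("newestest") gets a valid segmentation, since any leftover symbols are just single characters.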
Task: Compute minimum edit distance "Sunday" → "Saturday"
Two Models:
- Model A: Substitution=1, Insertion=1, Deletion=1
- Model B: Substitution=2, Insertion=1, Deletion=1
Implementation:
- Dynamic programming with DP matrix
- Backtracking for optimal alignment sequences
- Comparison of operation preferences between models
Applications:
- Spell checking favors Model A (equal operation costs)
- DNA alignment might prefer Model B (expensive substitutions)
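The DP recurrence behind both models can be sketched as follows (parameter names are illustrative; Q4.py also recovers optimal alignments via backtracking, which is omitted here):

```python
def edit_distance(src, tgt, sub=1, ins=1, dele=1):
    # dp[i][j] = minimum cost to transform src[:i] into tgt[:j]
    m, n = len(src), len(tgt)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * dele          # delete all of src[:i]
    for j in range(1, n + 1):
        dp[0][j] = j * ins           # insert all of tgt[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if src[i - 1] == tgt[j - 1] else sub
            dp[i][j] = min(dp[i - 1][j - 1] + cost,  # match / substitute
                           dp[i - 1][j] + dele,      # delete from src
                           dp[i][j - 1] + ins)       # insert from tgt
    return dp[m][n]

print(edit_distance("Sunday", "Saturday"))         # Model A: 3
print(edit_distance("Sunday", "Saturday", sub=2))  # Model B: 4
```

Under Model B a substitution (cost 2) is never cheaper than a delete plus an insert, so optimal alignments can avoid substitutions entirely, which is why the two models prefer different operation sequences.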
- Install Dependencies:
  pip install numpy nltk spacy
  python -m spacy download en_core_web_sm
- Run Individual Questions:
  python Q1.py   # Regular expressions
  python Q2.py   # Tokenization
  python Q3.py   # BPE
  python Q4.py   # Edit distance