Data Scientist · AI Researcher · AI Engineer · Data Engineer
MS Data Science, GPA 3.87 · Montclair State University | AWS Certified Solutions Architect – Associate
Data Scientist and AI Researcher working across machine learning, data engineering, and NLP research.
I build production ML pipelines, LLM systems, and real-time data infrastructure, and I research how language models behave at scale.
2+ years of experience · MS Data Science · AWS Certified · NLP Researcher
Languages
ML & Data Science
AI / LLM & NLP
Data Engineering
Backend & Infrastructure
Visualization & Reporting
Montclair State University · Under Review · 2026
- Studying how fairness and bias shift across LLM generations using bootstrap resampling and Cohen's d across 7 bias benchmarks
- Built a reproducible evaluation pipeline with extensible infrastructure: new datasets and model providers plug in without rewriting core logic
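
The effect-size analysis mentioned above can be illustrated with a minimal sketch. It is not the project's actual pipeline: the function names, the percentile-bootstrap choice, and the toy scores are all assumptions made for illustration, using only the Python standard library.

```python
import random
import statistics


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    pooled_sd = (((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd


def bootstrap_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for Cohen's d."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping the group sizes fixed.
        resample_a = rng.choices(a, k=len(a))
        resample_b = rng.choices(b, k=len(b))
        try:
            estimates.append(cohens_d(resample_a, resample_b))
        except ZeroDivisionError:
            # Degenerate resample with zero pooled variance; skip it.
            continue
    estimates.sort()
    lower = estimates[int((alpha / 2) * len(estimates))]
    upper = estimates[int((1 - alpha / 2) * len(estimates)) - 1]
    return lower, upper


# Hypothetical bias scores for two model generations (illustrative only).
scores_old = [0.62, 0.58, 0.71, 0.66, 0.60, 0.69, 0.64, 0.67]
scores_new = [0.55, 0.52, 0.61, 0.57, 0.50, 0.59, 0.56, 0.58]

d = cohens_d(scores_old, scores_new)
low, high = bootstrap_ci(scores_old, scores_new)
print(f"Cohen's d = {d:.2f}, 95% bootstrap CI = [{low:.2f}, {high:.2f}]")
```

Reporting the interval alongside the point estimate shows how much of an apparent shift between model generations could be sampling noise.
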
NLP Lab, Montclair State University · In Progress · Targeting BabyLM Workshop 2026
- Trained GPT-2-style models from scratch on 100M tokens per language using PyTorch, Hugging Face, CUDA-enabled SLURM jobs, and distributed training on 4× NVIDIA A100 GPUs
- Analyzed how curriculum-ordered data affects validation loss, perplexity, and generation quality across morphologically different languages
- Custom BPE tokenizer · minimal-pair grammaticality evaluation · 10 seeds
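
Minimal-pair grammaticality evaluation, as listed above, is commonly scored as the fraction of sentence pairs where the model prefers the grammatical variant. The sketch below shows only that scoring logic. `minimal_pair_accuracy` and `toy_score` are hypothetical names, and the toy scorer stands in for what would in practice be a sentence log-probability from the trained language model.

```python
from typing import Callable, Iterable, Tuple


def minimal_pair_accuracy(
    pairs: Iterable[Tuple[str, str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of (grammatical, ungrammatical) pairs in which the
    grammatical sentence receives the strictly higher score."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(score(good) > score(bad) for good, bad in pairs)
    return correct / len(pairs)


def toy_score(sentence: str) -> float:
    # Hypothetical stand-in for a model's sentence log-probability:
    # it only penalises one specific agreement error.
    return -1.0 if "dogs barks" in sentence else 0.0


pairs = [
    ("The dogs bark.", "The dogs barks."),
    ("The cat sleeps.", "The cat sleep."),
]
accuracy = minimal_pair_accuracy(pairs, toy_score)
print(f"minimal-pair accuracy: {accuracy:.2f}")
```

Ties count as errors here, which is a conservative design choice: a model that cannot distinguish the two sentences gets no credit.
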
| Project | Stack | Highlight |
|---|---|---|
| FinAgent — Multi-Agent Finance Chatbot | LangGraph · LangChain · GPT-4o · Docker | Supervisor routes to 3 specialist agents via parallel fan-out · ~60% latency reduction · circuit-breaker fault tolerance |
| AI Research Assistant Platform | React · Fastify · ChromaDB · RAG · Docker | PDF ingestion · chunking · embeddings · semantic retrieval · ranked source-attributed answers |
| LLM Safety Gateway | FastAPI · Docker · SQLite | PII · jailbreak · toxicity · ALLOW/WARN/BLOCK verdicts · full audit logging · extensible detector core |
| RAG Review Intelligence | Pinecone · Hugging Face · Claude · FastAPI | 37,778 embeddings · 100K+ orders · semantic search · plain-English queries without SQL |
| Project | Stack | Highlight |
|---|---|---|
| Real-Time E-Commerce Pipeline | Kafka · PySpark Structured Streaming · AWS S3 · Docker · Streamlit | 86,400 events/hr · 3 Kafka topics · $214K+ revenue monitored in real time · 6 Dockerized services |
| Data Lakehouse — Medallion Architecture | PySpark · Delta Lake · Airflow · AWS S3 · Docker | Bronze → Silver → Gold · 100K+ orders across 9 tables · Airflow quality gates block bad data at promotion |
| Multi-Tenant SaaS Analytics Platform | Node.js · PostgreSQL · Redis · Docker | 847 req/s · p95 87ms · 100 concurrent users · Row-Level Security enforces tenant isolation |
| Project | Stack | Highlight |
|---|---|---|
| Walmart M5 Demand Forecasting | XGBoost · Walk-Forward CV · PySpark · scikit-learn | MAPE 0.43 → 0.13 · 70% forecast error reduction · 1M+ records · flagged 3 unpredictable SKUs |
| Customer Churn Prediction | XGBoost · SHAP · FastAPI · Docker · scikit-learn | Recall 55% → 77% · F1-threshold tuning · SHAP explainability · Dockerized API endpoint |
| E-Commerce A/B Test Analysis | SciPy · Bayesian Inference · Pandas · Matplotlib | Prevented $142K revenue loss · 288K sessions · frequentist and Bayesian methods both confirmed the result |
| Telugu BabyLM | PyTorch · Hugging Face · GPT-2 · Distributed Training | 17M Telugu tokens · 4× A100 GPUs · curriculum learning · custom BPE tokenizer |
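
For the demand-forecasting row above, the MAPE metric and walk-forward validation can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the helper names are hypothetical, the training window is assumed to expand, and zero actuals are skipped because MAPE is undefined for them.

```python
def mape(actual, forecast):
    """Mean absolute percentage error as a fraction (0.13 means 13%).
    Assumption: points with a zero actual are skipped, since MAPE is undefined there."""
    terms = [abs(a - f) / abs(a) for a, f in zip(actual, forecast) if a != 0]
    if not terms:
        raise ValueError("no non-zero actuals to score")
    return sum(terms) / len(terms)


def walk_forward_splits(n_samples, n_folds, test_size):
    """Yield (train_indices, test_indices) with an expanding training window,
    so every test block lies strictly after the data used for training."""
    for fold in range(n_folds):
        test_end = n_samples - (n_folds - fold - 1) * test_size
        test_start = test_end - test_size
        if test_start <= 0:
            raise ValueError("not enough samples for the requested folds")
        yield list(range(test_start)), list(range(test_start, test_end))


def relative_reduction(before, after):
    """Relative error reduction, e.g. going from MAPE 0.43 to MAPE 0.13."""
    return (before - after) / before


splits = list(walk_forward_splits(n_samples=10, n_folds=3, test_size=2))
for train_idx, test_idx in splits:
    print(f"train on 0..{train_idx[-1]}, test on {test_idx}")

reduction = relative_reduction(0.43, 0.13)
print(f"relative MAPE reduction: {reduction:.1%}")
```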
| Project | Stack | Highlight |
|---|---|---|
| Task Management Platform | Spring Boot · React · PostgreSQL · Docker | JWT · RBAC · 10K+ tasks · 40% coordination overhead reduction |
