pulipakav1/README.md

Rohit Pulipaka

Data Scientist  ·  AI Researcher  ·  AI Engineer  ·  Data Engineer

MS Data Science, GPA 3.87 · Montclair State University  |  AWS Certified Solutions Architect – Associate

 


Data Scientist and AI Researcher working across machine learning, data engineering, and NLP research.
I build production ML pipelines, LLM systems, and real-time data infrastructure, and I research how language models behave at scale.

2+ years experience  ·  MS Data Science  ·  AWS Certified  ·  NLP Researcher


🛠️ Stack

Languages

Python SQL Java R

ML & Data Science

PyTorch XGBoost scikit-learn SHAP SciPy

AI / LLM & NLP

LangGraph LangChain HuggingFace OpenAI Pinecone ChromaDB

Data Engineering

Kafka PySpark Airflow Delta Lake AWS S3 BigQuery

Backend & Infrastructure

FastAPI Docker PostgreSQL Redis Spring Boot

Visualization & Reporting

Power BI Streamlit Matplotlib


🔬 Research

LLM Fairness & Bias Research

Montclair State University · Under Review · 2026

  • Studying how fairness and bias shift across LLM generations using bootstrap resampling and Cohen's d across 7 bias benchmarks
  • Built a reproducible evaluation pipeline with extensible infrastructure — new datasets and model providers plug in without rewriting core logic
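
As a rough illustration of the statistics named above, the following sketch computes Cohen's d between two sets of scores and a percentile bootstrap confidence interval around it. It is not the project's pipeline: the function names, the pooled-standard-deviation form of Cohen's d, the percentile interval, and the sample scores are all assumptions made for this example.

```python
import random
import statistics


def cohens_d(a, b):
    """Cohen's d using a pooled standard deviation (one common definition)."""
    na, nb = len(a), len(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    pooled_sd = (((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd


def bootstrap_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for Cohen's d."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping the group sizes.
        ra = rng.choices(a, k=len(a))
        rb = rng.choices(b, k=len(b))
        try:
            estimates.append(cohens_d(ra, rb))
        except ZeroDivisionError:
            # Degenerate resample with zero pooled variance: skip it.
            continue
    estimates.sort()
    lo = estimates[int((alpha / 2) * len(estimates))]
    hi = estimates[int((1 - alpha / 2) * len(estimates)) - 1]
    return lo, hi


# Hypothetical per-prompt bias scores for two model versions.
older = [0.62, 0.58, 0.71, 0.66, 0.60, 0.69, 0.64, 0.67]
newer = [0.51, 0.55, 0.49, 0.58, 0.53, 0.50, 0.56, 0.52]
effect = cohens_d(older, newer)
interval = bootstrap_ci(older, newer)
```

Fixing the seed keeps the interval reproducible between runs, which matters when results are compared across benchmarks.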

Multilingual BabyLM: Low-Resource Language Modeling (English, Hindi, Telugu)

NLP Lab, Montclair State University · In Progress · Targeting BabyLM Workshop 2026

  • Trained GPT-2 style models from scratch on 100M tokens per language using PyTorch, Hugging Face, CUDA-enabled SLURM jobs, and distributed training on 4× NVIDIA A100 GPUs
  • Analyzed how curriculum-ordered data affects validation loss, perplexity, and generation quality across morphologically different languages
  • Custom BPE tokenizer · minimal-pair grammaticality evaluation · 10 seeds
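
The evaluation metrics mentioned above can be sketched in a few lines. This example assumes the validation loss is a mean per-token negative log-likelihood in nats, and that minimal-pair evaluation compares the model's total log-probability for a grammatical sentence against its ungrammatical counterpart. The function names and the numbers are hypothetical, not taken from the project.

```python
import math


def perplexity(mean_nll):
    """Perplexity from a mean per-token negative log-likelihood in nats."""
    return math.exp(mean_nll)


def minimal_pair_accuracy(pairs):
    """Fraction of pairs where the grammatical sentence scores higher.

    Each pair is (logprob_grammatical, logprob_ungrammatical).
    """
    correct = sum(1 for good, bad in pairs if good > bad)
    return correct / len(pairs)


# Hypothetical sentence log-probabilities for four minimal pairs.
pairs = [(-10.0, -12.0), (-8.0, -7.5), (-20.0, -25.0), (-5.0, -9.0)]
accuracy = minimal_pair_accuracy(pairs)  # 3 of the 4 pairs are ranked correctly
```

Averaging this accuracy over several seeds gives a sense of how much of a difference between curricula is noise.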

🚀 Projects

🤖 AI Engineering

| Project | Stack | Highlight |
| --- | --- | --- |
| FinAgent: Multi-Agent Finance Chatbot | LangGraph · LangChain · GPT-4o · Docker | Supervisor routes to 3 specialist agents via parallel fan-out · ~60% latency reduction · circuit-breaker fault tolerance |
| AI Research Assistant Platform | React · Fastify · ChromaDB · RAG · Docker | PDF ingestion · chunking · embeddings · semantic retrieval · ranked source-attributed answers |
| LLM Safety Gateway | FastAPI · Docker · SQLite | PII · jailbreak · toxicity · ALLOW/WARN/BLOCK verdicts · full audit logging · extensible detector core |
| RAG Review Intelligence | Pinecone · HuggingFace · Claude · FastAPI | 37,778 embeddings · 100K+ orders · semantic search · plain-English queries without SQL |
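
The parallel fan-out pattern described for FinAgent can be illustrated with plain `asyncio`. This sketch deliberately does not use LangGraph: the specialist names, the delays, and the `supervisor` function are hypothetical stand-ins. Dropping failed calls is only a simple form of fault tolerance, not a circuit breaker.

```python
import asyncio


async def specialist(name, query, delay):
    """Stand-in for a specialist agent; the sleep simulates an LLM or tool call."""
    await asyncio.sleep(delay)
    return f"{name}: answer to {query!r}"


async def supervisor(query):
    """Send the query to all specialists concurrently and collect the results."""
    tasks = [
        specialist("market", query, 0.05),
        specialist("portfolio", query, 0.05),
        specialist("news", query, 0.05),
    ]
    # return_exceptions=True keeps one failing specialist from cancelling the rest.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return [r for r in results if not isinstance(r, Exception)]


answers = asyncio.run(supervisor("AAPL outlook"))
```

Because the three calls overlap, total wall-clock time is close to the slowest specialist rather than the sum of all three. That is the general mechanism behind latency reductions from fan-out.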

🔧 Data Engineering

| Project | Stack | Highlight |
| --- | --- | --- |
| Real-Time E-Commerce Pipeline | Kafka · PySpark Structured Streaming · AWS S3 · Docker · Streamlit | 86,400 events/hr · 3 Kafka topics · $214K+ revenue monitored in real time · 6 Dockerized services |
| Data Lakehouse: Medallion Architecture | PySpark · Delta Lake · Airflow · AWS S3 · Docker | Bronze → Silver → Gold · 100K+ orders across 9 tables · Airflow quality gates block bad data at promotion |
| Multi-Tenant SaaS Analytics Platform | Node.js · PostgreSQL · Redis · Docker | 847 req/s · p95 87ms · 100 concurrent users · Row-Level Security enforces tenant isolation |
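
A quality gate of the kind mentioned for the lakehouse can be as simple as a check that must pass before a batch is promoted to the next layer. The sketch below is plain Python rather than Airflow or PySpark code. The `quality_gate` function, the null-rate rule, and the column names are assumptions chosen for illustration.

```python
def quality_gate(rows, required, max_null_rate=0.01):
    """Return (passed, report) for a batch of dict rows.

    The report maps each required column to its observed null rate.
    """
    if not rows:
        return False, {"reason": "empty batch"}
    report = {}
    for col in required:
        nulls = sum(1 for row in rows if row.get(col) is None)
        report[col] = nulls / len(rows)
    passed = all(rate <= max_null_rate for rate in report.values())
    return passed, report


# Hypothetical Silver-layer batch: one row is missing its price.
batch = [
    {"order_id": 1, "price": 19.99},
    {"order_id": 2, "price": None},
    {"order_id": 3, "price": 5.00},
    {"order_id": 4, "price": 42.10},
]
ok, report = quality_gate(batch, required=["order_id", "price"])
```

In an orchestrator, a failed gate would typically fail the task so that downstream promotion steps do not run.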

📈 ML & Data Science

| Project | Stack | Highlight |
| --- | --- | --- |
| Walmart M5 Demand Forecasting | XGBoost · Walk-Forward CV · PySpark · scikit-learn | MAPE 0.43 → 0.13 · 70% forecast error reduction · 1M+ records · flagged 3 unpredictable SKUs |
| Customer Churn Prediction | XGBoost · SHAP · FastAPI · Docker · scikit-learn | Recall 55% → 77% · F1-threshold tuning · SHAP explainability · Dockerized API endpoint |
| E-Commerce A/B Test Analysis | SciPy · Bayesian Inference · Pandas · Matplotlib | Prevented $142K revenue loss · 288K sessions · frequentist + Bayesian methods both confirmed result |
| Telugu BabyLM | PyTorch · HuggingFace · GPT-2 · Distributed Training | 17M Telugu tokens · 4× A100 GPUs · curriculum learning · custom BPE tokenizer |
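
For readers unfamiliar with the forecasting terms above, here is a minimal sketch of MAPE and of walk-forward splits with an expanding training window. The function names, the choice to skip zero actuals in MAPE, and the toy numbers are assumptions for this example. The project's own splitting scheme may differ.

```python
def mape(actual, forecast):
    """Mean absolute percentage error as a fraction; zero actuals are skipped."""
    terms = [abs(a - f) / abs(a) for a, f in zip(actual, forecast) if a != 0]
    return sum(terms) / len(terms)


def walk_forward_splits(n, initial_train, horizon):
    """Yield (train_indices, test_indices) with an expanding training window."""
    start = initial_train
    while start + horizon <= n:
        # Train on everything before `start`, test on the next `horizon` points.
        yield range(0, start), range(start, start + horizon)
        start += horizon


# Ten time steps, first four for the initial fit, three-step test windows.
splits = list(walk_forward_splits(n=10, initial_train=4, horizon=3))
error = mape([100, 200], [90, 220])  # both forecasts are off by 10%
```

Each test window lies strictly after its training window, so no future information leaks into the fit. As a check on the figures in the table, going from a MAPE of 0.43 to 0.13 is a relative reduction of (0.43 - 0.13) / 0.43 ≈ 0.70.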

💻 Software Engineering

| Project | Stack | Highlight |
| --- | --- | --- |
| Task Management Platform | Spring Boot · React · PostgreSQL · Docker | JWT · RBAC · 10K+ tasks · 40% coordination overhead reduction |

Pinned

  1. finance_agent (Public)

    Multi-agent financial analysis app built with LangGraph, GPT-4o, and Streamlit with live yfinance pricing and portfolio/news tools.

    Python

  2. full_stack (Public)

    Full-stack task platform — Spring Boot + React with JWT auth, RBAC, and paginated REST APIs managing 10K+ tasks.

    Java

  3. llm_risk_scorer (Public)

    Modular LLM safety scoring service — evaluates outputs for PII, toxicity, and jailbreaks. Returns ALLOW/WARN/BLOCK with audit logging.

    Python

  4. rag_data (Public)

    Semantic search over 37K+ customer review embeddings via ChromaDB + Claude. Query sentiment in plain English — no SQL required.

    Python

  5. rag_pipeline (Public)

    Production-ready RAG pipeline with ChromaDB, Claude, LLM-as-judge evaluation, FastAPI, and Streamlit.

    Python

  6. streaming_pipeline (Public)

    Real-time e-commerce pipeline — 86,400 events/hr across 3 Kafka topics, PySpark Structured Streaming, partitioned Parquet on AWS S3.

    Python