Skip to content

zench2302/genai-portfolio

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

81 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

🧠 Jia Jia – Applied Data Science Project Portfolio

Welcome! This is a collection of my recent projects in Data Science built during my MSc Data Science study at LSE. My focus has been on end-to-end ML&DL pielines, LLM-based recommendation systems, retrieval-augmented generation (RAG), and prompt engineering with open-source tools.

πŸ“š Table of Contents

βš™οΈ Core ML & DL Modeling (foundation skills)

🧠 Applied GenAI & Product Innovation (frontier applications)

πŸ§‘β€πŸŽ¨ Creative Tech


Description

This project, in collaboration with LSE Philanthropy and Global Engagement (PAGE), applies data-driven methods to analyze the drivers of alumni legacy donation. By combining statistical and machine learning models on real alumni records, we identify demographic, engagement, and giving factors that shape legacy pledges and provide actionable insights for PAGE’s marketing strategy.

Machine Learning Models

  • Gradient-based models: GBDT, XGBoost, LightGBM, CatBoost
  • Traditional models: logistic regression with/without L1, Decision Trees, Random Forest
  • Unsupervised learning: Factor Analysis, Cluster Analsis

Pipeline & Key Visuals

Pipeline

image

Key Visuals

1:ROC Curve of Fine-tuned Models Using Random Search


2.Top 10 XGBoost Feature Importances (Random Search)

Highlights

  • Consistent giving behavior and total donation amount in other categories are the strongest predictors of legacy donation.

  • Engagement activities such as networking event participation and alumni circle involvement significantly increase the likelihood of legacy pledges.

  • Gradient-boosting models (GBDT, XGBoost) provided the best predictive performance, confirming and extending insights from logistic regression and traditional tree models.

Recommendations

  • Apply predictive scoring to identify high-propensity alumni and focus outreach resources where impact is greatest.

  • Use model insights to design targeted marketing campaigns, highlighting engagement opportunities most likely to convert interested donors into legacy pledgers.

  • Leverage donor journey patterns to tailor messaging, ensuring communications align with alumni giving behaviors and engagement profiles.


Description

  • Developed a deep learning pipeline for robust emotion classification from speech across four benchmark datasets (SAVEE, RAVDESS, CREMA-D, TESS).
  • Combined feature extraction (HuBERT, spectrograms) with advanced sequence modeling to capture nuanced prosodic and spectral patterns.

Key Technologies

  • Deep learning models: CNN, BiLSTM, GRU, Attention, Multi-Head Attention
  • Pre-trained speech representations: HuBERT
  • Optimization: Data augmentation (noise, pitch, tempo), Label smoothing, AdamW

Highlights

  • Achieved 87.6% validation accuracy with BiLSTM + Attention model after augmentation and optimization.
  • Implemented joint fine-tuning of HuBERT + classifier to adapt large-scale speech representations for downstream tasks.
  • Demonstrated generalization across heterogeneous datasets, simulating real-world home environments.

Description

  • Traditional econometric nowcasting methods suffer from overfitting and miss complex real-time relationships.
  • Managing missing data, limited historical records, and high-dimensional features while maintaining interpretability.
  • Advanced data engineering with automated feature generation, robust selection algorithms, and optimised ML models for more reliable forecasts

Key Technologies

  • machine learning models: Random Forest, SVM, GBDT, XGBoost
  • deep learning models: CNN, LSTMm Transformer
  • feature engineering using NLP

Highlights

  • Forecasts have improved the accuracy for nowcasting UK GDP by 40%
  • Dynamic machine learning framework rapidly evaluates new economic signals to enhance predictions
  • Daily forecasts powered by real-time text analysis capture the latest market shifts and sentiment

Description

Developed a deep learning framework for classifying brain MRI scans into four categories: no tumour, glioma, meningioma, and pituitary. Conducted a comparative study of five state-of-the-art CNN architectures, leveraging transfer learning, fine-tuning, and data augmentation to improve diagnostic accuracy on limited medical imaging data.

Key Technologies

  • CNN architectures: EfficientNet, ResNet-50, ResNet-101, Inception V3, VGG16
  • Transfer learning with ImageNet pre-trained weights
  • Data preprocessing: normalization, resizing, batching with generators
  • Data augmentation: rotation, flipping, cropping, shifting
  • Fine-tuning with selective layer unfreezing and early stopping

Highlights

  • Achieved 95% validation accuracy with Inception V3 after fine-tuning
  • Showed that shallower ResNet-50 outperformed deeper ResNet-101 under limited data
  • Demonstrated practical application of pretrained CNNs in medical diagnosis
  • Identified dataset size and compute limitations as key challenges, guiding future research

Description

Developed an end-to-end recommendation system powered by LLMs for Amazon product reviews. The system embeds product metadata and review content using Flan-T5 and MiniLM, then computes similarity via Faiss for Top-5 recommendations.

Key Technologies

  • Nous-Hermes-2 (Mistral) for generating user profiles from review history
  • Flan-T5 for generating ad-style recommendation reasons
  • BGE (BAAI) and MiniLM embeddings for product and user vectorization
  • FAISS for approximate nearest neighbor (ANN) vector search and candidate retrieval
  • Prompt engineering for review summarization and recommendation reasoning

System and Recommendation Pipeline

Prompts Engineering

(1) User Profiling Prompt (Mistral)

You are a professional shopping assistant.

Analyze the following user reviews and summarize their preferences.

Reviews:
""" <user reviews> """

Return JSON with:
- "preferred_products"
- "liked_features"
- "dislikes"
- "potential_interests"
πŸ”§ Full Python Implementation (click to expand)
def generate_user_profile(user_reviews):
    prompt = f"""
You are a professional shopping assistant.

Analyze the following user reviews and summarize their preferences.

Reviews:
\"\"\"{user_reviews[:Config.MAX_REVIEW_LENGTH]}\"\"\"

Return JSON with:
- "preferred_products"
- "liked_features"
- "dislikes"
- "potential_interests"
"""
    inputs = profile_tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024).to(profile_model.device)
    outputs = profile_model.generate(
        **inputs,
        max_new_tokens=Config.MAX_NEW_TOKENS,
        temperature=Config.TEMPERATURE,
        top_p=Config.TOP_P,
        repetition_penalty=Config.REPETITION_PENALTY,
        pad_token_id=profile_tokenizer.eos_token_id
    )
    raw_output = profile_tokenizer.decode(outputs[0], skip_special_tokens=True)
    json_str = raw_output[raw_output.find("{"):raw_output.rfind("}")+1]
    return json.loads(json_str)

(2) Ad-Slogan Prompt summary (Flan-T5)

You are an expert e-commerce copywriter creating unique, playful ad slogans.

Product:
- Title: ...
- Description: ...
- Rating: ...

User:
- Likes: ...
- Dislikes: ...

Your task:
- Write ONE catchy slogan (≀12 words)
- Avoid repeating product name or brand
- Use playful, emotional, or surprising tone
πŸ”§ Full Python Implementation (click to expand)
def build_ad_prompt(product_info, user_profile):
    title = product_info.get('title', 'Unknown Product')
    description = " ".join(product_info.get('description', [])) if isinstance(product_info.get('description'), list) else product_info.get('description', '')
    details = product_info.get('details', '')
    avg_rating = product_info.get('average_rating', 0)
    preferred_products = ", ".join(user_profile.get('preferred_products', []))
    liked_features = ", ".join(user_profile.get('liked_features', []))
    dislikes = ", ".join(user_profile.get('dislikes', []))
    potential_interests = ", ".join(user_profile.get('potential_interests', []))

    return f"""
You are an expert e-commerce copywriter creating unique, playful ad slogans.

Product:
- Title: {title}
- Description: {description}
- Details: {details}
- Average rating: {avg_rating}

User:
- Preferred products: {preferred_products}
- Likes: {liked_features}
- Dislikes: {dislikes}
- Interests: {potential_interests}

Your task:
- Write ONE catchy slogan (≀12 words) that excites this user.
- Match the product type (nails, hair, skincare, lashes, tools, etc.).
- Highlight the user’s likes, avoid their dislikes.
- Use playful, emotional, or surprising language.
- Do NOT copy product name, specs, or brand.
- If irrelevant, return only: SKIP.

Output:
"""

πŸ’‘ Full prompt templates available in PROMPTS.md

Selected Outputs

Evaluation Snapshot

The following metrics reflect the system's performance on non-cold-start users only!

Metric Score (non-cold start)
Semantic Match (CosSim) 71.8%
Ad Diversity 82.6%
Avg. Product Rating 4.31

Description

Built a real-time system to extract dominant topics and sentiment streams from Twitch chat logs. Designed to detect mood shifts and trending discussion in live streams.

Key Technologies

  • BERTopic + UMAP + HDBSCAN for topic modeling
  • Vader + GPT-based sentiment refinement
  • WebSocket-based data ingestion
  • Tokenized message streams with temporal segmentation

Pipeline & Key Visuals

Pipeline

Highlights

  • Capable of handling 10K+ chat lines per minute
  • Visual clustering of evolving discussion topics
  • Used GPT-4 for refining topic labeling and summary

vibegoLogo

Description

An AI-assisted travel itinerary generator that combines OpenAI's GPT API with Google Maps data to produce city-specific, time-aware plans.

Key Technologies

  • OpenAI (GPT-4) for summarization and suggestion
  • Google Maps API for location + travel time data
  • Prompt chaining for adaptive personalization
  • Firebase for session handling

Highlights

  • Modular prompt design to adapt to user preferences
  • Route optimization based on timing and transport
  • Designed for deployment as a lightweight web app

Creative Tech

1. 🎧 Live Coding with Strudel: Exploring Code-Based Music Interaction

London Data Week 2025 – Algorave Workshop @ King’s Institute for AI

In this hands-on workshop, I explored the intersection of code, rhythm, and creative expression using Strudel β€” a browser-based live coding tool for generative music. Participants learned to construct musical patterns through time-based code snippets and were invited to perform their compositions in an open stage format.

While I didn’t create a full performance piece, I gained direct experience in:

  • Writing loop-based musical patterns using declarative syntax

  • Understanding how code structure translates to rhythm and timing

  • Experiencing the dynamics of collaborative, real-time generative systems

The session also prompted reflection on the human-computer interaction aspect of creative coding: how intuitive (or not) such tools are for newcomers, and how live-coded performances affect audience perception. This exploration connects back to my broader interest in generative AI and interactive systems design, where usability, expressiveness, and emotional impact must be balanced.

πŸ“« Contact

GitHub: github.com/zench2302
LinkedIn: linkedin.com/in/jia-jia-7a73359a Email:J.Jia9@lse.ac.uk

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors