show feed (recent posts and activities from other users) on a social network platform
-
Clarifying questions
- What is the primary business objective of the system? (increase user engagement)
- Do we show only posts or also activities from other users?
- What types of engagement are available? (like, click, share, comment, hide, etc)? Which ones are we optimizing for?
- Do we display ads as well?
- What types of data do the posts include? (text, image, video)?
- Are there specific user segments or contexts we should consider (e.g., user demographics)?
- Do we have negative feedback features (such as hide post, block, etc.)?
- What type of user-post interaction data do we have access to, and can we use it for training our models?
- Do we need continual training?
- How do we collect negative samples? (not clicked, negative feedback).
- How fast does the system need to be?
- What is the scale of the system?
- Is personalization needed? Yes
-
Use case(s) and business goal
- use case: show a user the most engaging (and unseen) posts and activities from their friends on a social network platform app (personalized to the user)
- business objective: Maximize user engagement (as a set of interactions)
-
Requirements:
- Latency: return refreshed newsfeed results within 200 ms after the user opens/refreshes the app
- Scalability: 5B total users, 2B DAU, each user refreshing the app about twice a day
-
Constraints:
- Privacy and compliance with data protection regulations.
-
Data: Sources and Availability:
- Data sources include user interaction logs, post content data, user profiles, and contextual information.
- Historical click and impression data for model training and evaluation.
-
Assumptions:
- Users' engagement behavior can be characterized by their explicit (e.g. like, click, share, comment, etc) or implicit interactions (e.g. dwell time)
-
ML Formulation:
- Objective:
- maximize the number of explicit reactions, implicit reactions, or both (weighted)
- implicit: more data; explicit: stronger signal but less data -> weighted score over interaction types: share > comment > like > click, etc. (see the scoring sketch below)
- I/O: input: user_id; output: ranked list of unseen posts sorted by engagement score (weighted sum)
- Category: ranking problem; can be solved as pointwise LTR with multi-label (multi-task) classification
- Metrics:
- Offline
- ROC AUC (trade-off b/w TPR and FPR)
- Online
- CTR
- Reactions rate (like rate, comment rate, etc)
- Time spent
- User satisfaction (survey)
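A minimal sketch of the weighted engagement score; the weights are hypothetical (stronger explicit signals weighted higher, skip penalized) and would be tuned against the business objective:

```python
# Hypothetical per-task weights: share > comment > like > click;
# implicit and negative signals are scaled accordingly.
ENGAGEMENT_WEIGHTS = {
    "click": 1.0,
    "like": 2.0,
    "comment": 5.0,
    "share": 10.0,
    "dwell_time": 0.1,  # implicit signal (seconds), scaled down
    "skip": -5.0,       # negative signal
}

def engagement_score(predictions: dict[str, float]) -> float:
    """Weighted sum of per-task model outputs for one (user, post) pair."""
    return sum(ENGAGEMENT_WEIGHTS[task] * p for task, p in predictions.items())

# Rank two candidate posts by score, highest first.
posts = {
    "post_a": {"click": 0.30, "like": 0.10, "comment": 0.02,
               "share": 0.01, "dwell_time": 8.0, "skip": 0.40},
    "post_b": {"click": 0.25, "like": 0.15, "comment": 0.05,
               "share": 0.02, "dwell_time": 12.0, "skip": 0.20},
}
ranked = sorted(posts, key=lambda pid: engagement_score(posts[pid]), reverse=True)
```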
- High level architecture
- We can use point-wise learning to rank (LTR) formulation
- Options for multi-label/task classification:
- Use N independent classifiers (expensive to train and maintain)
- Use a multi-task classifier
- learn multiple tasks simultaneously
- shared layers (learn similarities between tasks) -> transformed features
- task-specific layers: classification heads
- pros: single model; shared layers prevent redundancy; training data for each task can help the other tasks too (useful with limited data); see the model sketch after this list
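A minimal PyTorch sketch of the multi-task option; the input/hidden dimensions and layer counts are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MultiTaskEngagementModel(nn.Module):
    """Shared trunk + one classification head per task. Sizes are illustrative."""

    def __init__(self, input_dim: int = 512, hidden_dim: int = 256,
                 tasks=("click", "like", "share", "comment")):
        super().__init__()
        # Shared layers: transform raw features once, learning structure
        # common to all tasks.
        self.shared = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        # Task-specific heads: one binary classifier per engagement type.
        self.heads = nn.ModuleDict({t: nn.Linear(hidden_dim, 1) for t in tasks})

    def forward(self, x: torch.Tensor) -> dict[str, torch.Tensor]:
        h = self.shared(x)
        # One independent probability per task (labels are not mutually exclusive).
        return {t: torch.sigmoid(head(h)).squeeze(-1) for t, head in self.heads.items()}
```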
-
Data Sources
- Users
- Posts,
- User-post interaction
- User-user (friendship)
-
Labelling
- labels come from logged user-post interactions: positive/negative per task (e.g. liked vs. didn't like); see Model Training
-
Feature selection
- Posts:
- Text
- Image/videos
- No of reactions (likes, shares, replies, etc)
- Age (time since posting)
- Hashtags
- User:
- ID, username
- Demographics (Age, gender, location)
- Context (device, time of day, etc)
- Interaction history (e.g. user click rate, total clicks, likes, etc.)
- User-Post interaction:
- IDs (user, post), interaction type, time, location
- User-user(post author) affinities
- connection type
- reaction history (number of the author's posts the user has liked/commented on, etc.)
-
Feature representation / preparation
-
Text:
- use a pre-trained LM to get embeddings
- use BERT here (posts are usually short phrases; context-aware embeddings help) - see the sketch below
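A minimal sketch using the Hugging Face transformers API; the checkpoint choice and [CLS] pooling are assumptions (mean pooling is a common alternative):

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Embed post text with a pre-trained BERT; "bert-base-uncased" is an
# illustrative checkpoint choice.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed_text(texts: list[str]) -> torch.Tensor:
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # [CLS] token embedding: shape (batch, 768) for bert-base.
    return outputs.last_hidden_state[:, 0, :]

post_vectors = embed_text(["Great hike this weekend!", "New job, so excited"])
```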
-
Image / Video:
- preprocess
- use pre-trained models, e.g. SimCLR / CLIP, to convert them into feature vectors (sketch below)
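A sketch of image feature extraction with a pre-trained CLIP model via transformers; the checkpoint name is illustrative:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Convert a post image into a fixed-length feature vector.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def embed_image(path: str) -> torch.Tensor:
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        features = model.get_image_features(**inputs)  # shape (1, 512)
    return features.squeeze(0)
```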
-
Dense numerical features:
- Engagement feats (No of clicks, etc)
- use directly + scale the range
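A small sketch of range scaling for engagement counts; the log transform before standardization is an assumed choice for heavy-tailed counts:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Engagement counts span orders of magnitude; log1p compresses the tail,
# then standardization puts features on a comparable scale.
clicks = np.array([[3], [120], [0], [4500]], dtype=float)
scaled = StandardScaler().fit_transform(np.log1p(clicks))
```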
-
Discrete numerical:
- Age: bucketize into categories, then one-hot encode
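A sketch of the bucketize-then-one-hot step with scikit-learn; the bin count and strategy are assumptions:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Bucketize ages into 5 equal-width bins and one-hot encode the bin index.
ages = np.array([[17], [24], [35], [52], [68]], dtype=float)
encoder = KBinsDiscretizer(n_bins=5, encode="onehot-dense", strategy="uniform")
age_onehot = encoder.fit_transform(ages)  # shape (5, 5), one row per user
```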
-
Hashtags:
- tokenize, map tokens to IDs, simple vectorization (TF-IDF or word2vec); no context needed
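A sketch of the TF-IDF option for hashtags; the `#`-token pattern is an assumption:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hashtags carry no sentence context, so simple TF-IDF vectorization suffices.
post_hashtags = ["#travel #food", "#food #recipe #vegan", "#travel #hiking"]
vectorizer = TfidfVectorizer(token_pattern=r"#\w+")
hashtag_vectors = vectorizer.fit_transform(post_hashtags)  # sparse (3, n_tags)
```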
-
Model selection
- We choose a neural network (NN):
- unstructured data (text, img, video)
- embedding layers for categorical features
- can fine-tune the pre-trained models used for feature engineering
- multi-labels
- P(click), P(like), P(Share), P(comment)
- Two options:
- N NN classifiers
- Multi task NN (choose this)
- Shared layers
- Classification heads (click, like, share, comment)
- Passive users problem:
- all their predicted engagement probabilities will be small
- add two more heads
- Dwell time (seconds spent on post)
- P(skip) (skip = dwell time < t)
-
Model Training
- Loss function:
- L = sum of per-task losses: L = Σ_i L_i (see the sketch at the end of this section)
- for binary classification tasks: cross-entropy (CE)
- for the regression task (dwell time): MAE, MSE, or Huber loss
- Dataset
- user features, post features, interactions, labels
- labels: positive/negative for each task (liked, didn't like, etc.)
- for dwell time: a regression target
- Imbalanced dataset: downsample negative
- Model eval and HP tuning
- Iterations
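A sketch of the combined multi-task loss, assuming the model returns a dict of per-task outputs (sigmoid probabilities for the classification heads, an unbounded output for the dwell-time head); Huber could be swapped for MSE/MAE:

```python
import torch
import torch.nn.functional as F

CLASSIFICATION_TASKS = ("click", "like", "share", "comment", "skip")

def multitask_loss(outputs: dict[str, torch.Tensor],
                   labels: dict[str, torch.Tensor]) -> torch.Tensor:
    """Total loss L = sum of per-task losses L_i."""
    loss = torch.zeros(())
    # Binary cross-entropy on each classification head (outputs are probabilities).
    for task in CLASSIFICATION_TASKS:
        loss = loss + F.binary_cross_entropy(outputs[task], labels[task].float())
    # Huber loss on the dwell-time regression head.
    loss = loss + F.huber_loss(outputs["dwell_time"], labels["dwell_time"].float())
    return loss
```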
-
Data Prep pipeline
- static features -> batch feature compute (daily, weekly) -> feature store
- dynamic features: # of post clicks, etc. -> streaming pipeline
-
Prediction pipeline
- two-stage (funnel) architecture (see the sketch after this list)
- candidate generation / retrieval service
- rule based
- filter and fetch posts unseen by the user under certain criteria
- Ranking
- features -> model -> engagement prob. -> sort
- re-ranking: business logic, additional logic and filters (e.g. user interest category)
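A sketch of the funnel at serving time, reusing `engagement_score` from the ML formulation sketch; `fetch_unseen_posts`, `feature_lookup`, `model`, and `apply_business_rules` are hypothetical placeholders for the retrieval service, feature store, trained ranker, and re-ranking logic:

```python
def generate_feed(user_id: str, k: int = 50) -> list[str]:
    # Stage 1: rule-based candidate generation (hypothetical retrieval call).
    candidates = fetch_unseen_posts(user_id, limit=1000)

    # Stage 2: ranking — features -> model -> engagement probabilities -> sort.
    scores = {
        post_id: engagement_score(model.predict(feature_lookup(user_id, post_id)))
        for post_id in candidates
    }
    ranked = sorted(candidates, key=scores.get, reverse=True)

    # Re-ranking: business logic and extra filters (e.g. user interest category).
    return apply_business_rules(user_id, ranked)[:k]
```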
-
Continual learning pipeline
- fine-tune on new data, evaluate, and deploy if it improves metrics
- A/B Test
- Deployment and release
- Scaling (SW and ML systems)
- Monitoring
- Updates
- Viral posts / celebrity posts
- New users (cold start)
- Position bias
- Update frequency
- calibration:
- adjusting predicted probabilities to align them with actual click probabilities (see the recalibration sketch at the end of this section)
- data leakage:
- info from the test or eval dataset influences the training process
- target leakage, data contamination (from test to train set)
- catastrophic forgetting
- a model trained on new data loses its ability to perform well on previously learned tasks
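On calibration after negative downsampling (see Model Training): a sketch of the standard recalibration formula, where w is the fraction of negatives kept during training:

```python
# Because negatives were downsampled, predicted probabilities are inflated
# and must be mapped back before being read as actual click probabilities:
#     q = p / (p + (1 - p) / w)

def recalibrate(p: float, w: float) -> float:
    """p: model output on downsampled data; w: negative sampling rate in (0, 1]."""
    return p / (p + (1.0 - p) / w)

# Example: model predicts 0.5, but only 10% of negatives were kept -> ~0.09.
print(recalibrate(0.5, 0.1))
```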