MovieMate is a personalized movie recommendation system built on the MovieLens 100K dataset, which contains 100,000 ratings from 943 users across 1,682 movies collected between September 1997 and April 1998. The system's goal is to surface movies a user is likely to enjoy, ranked by predicted preference. Its output serves end users directly through a recommendation feed, meaning that both accuracy (how well predicted ratings match actual preferences) and ranking quality (whether the best movies appear at the top of the list) matter.
The system offers three recommendation strategies: collaborative filtering (CF) using SVD matrix factorization, content-based filtering (CB) using TF-IDF genre profiles, and a rule-based (RB) baseline that recommends globally popular movies. A diversifier component can rerank any recommendation list to balance relevance with genre variety. A continuous learning module monitors error distributions to detect when the underlying data has shifted enough to warrant retraining.
Three design constraints shape every decision in this analysis. First, users span a wide range of history depths, from completely new users with zero ratings to long-term users with dozens of interactions. The system must serve all of them reasonably well. Second, the system should optimize for ranking quality over raw rating prediction, since users interact with ordered lists rather than individual predicted scores. Third, retraining costs real compute, so the update schedule should be justified by actual performance evidence rather than a fixed calendar.
This case analysis is structured as a sequence of connected design decisions. Each task builds directly on the previous one, and together they support a unified deployment recommendation at the end.
How should the MovieLens dataset be partitioned to faithfully simulate the real-world deployment environment, including cold-start users and newly released movies?
The central challenge in partitioning a recommendation dataset is avoiding temporal leakage. A random split would allow ratings from April 1998 to appear in the training set while earlier ratings land in the test set. In deployment, the model only ever sees past behavior and must recommend for the future, so any split that breaks this constraint produces an artificially optimistic evaluation.
The partition strategy uses the timestamp field in the ratings data to create a time-ordered 60/20/20 split: the first 60% of ratings by timestamp form the training set, the next 20% the validation set, and the final 20% the test set. Each boundary corresponds to a calendar date, so the training set covers activity up through roughly late January 1998, the validation set covers February through early March 1998, and the test set covers the remainder through April 1998.
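The time-ordered split described above can be sketched in a few lines. This is a minimal illustration, not the project's actual loader: the `Rating` tuple and the `temporal_split` helper are hypothetical names, and the toy data stands in for the MovieLens ratings file.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    user_id: int
    item_id: int
    rating: float
    timestamp: int  # Unix seconds


def temporal_split(ratings, train_pct=60, val_pct=20):
    """Sort ratings by timestamp and cut at the 60% and 80% positions."""
    ordered = sorted(ratings, key=lambda r: r.timestamp)
    n = len(ordered)
    # Integer arithmetic avoids floating-point surprises at the boundaries.
    train_end = n * train_pct // 100
    val_end = n * (train_pct + val_pct) // 100
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


# Toy data: ten ratings whose timestamps are deliberately out of order.
timestamps = [50, 10, 90, 30, 70, 20, 100, 40, 80, 60]
toy = [Rating(i % 3, i, 4.0, ts) for i, ts in enumerate(timestamps)]
train, val, test = temporal_split(toy)
```

Because the cut is made by position in the sorted list, ratings that share a timestamp at a boundary can land on either side of it. Whether that matters depends on how many ties the real data contains.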
After creating the splits, a cold-start analysis identified users and items that appear in the validation or test set without any history in the training set. These cases represent the hardest evaluation scenarios and directly motivate the hybrid strategy explored in Task 3.
Table 1: Partition Characteristics
| Partition | Ratings | Users | Movies | Avg Rating | Date Range |
|---|---|---|---|---|---|
| Train (60%) | 60,000 | 943 | 1,622 | 3.53 | Sep 1997 to Jan 1998 |
| Validation (20%) | 20,000 | 920 | 1,447 | 3.52 | Feb 1998 to Mar 1998 |
| Test (20%) | 20,000 | 884 | 1,307 | 3.54 | Mar 1998 to Apr 1998 |
Table 1 shows that the validation and test partitions are equal in size and that all three partitions are consistent in average rating. The validation and test partitions contain fewer unique users and movies than the training partition, which is expected in part because each holds one third as many ratings as the training set. Crucially, the average rating stays between 3.52 and 3.54 across all three partitions, which gives no sign of a systematic shift in rating behavior over the seven-month window.
Figure 1 (analysis/task1_fig1_monthly_volume.png): Monthly rating volume from September 1997 through April 1998. Activity peaks in November 1997 and remains relatively steady through the evaluation window, suggesting the data does not have strong seasonal spikes that would bias the partition.
Figure 2 (analysis/task1_fig2_split_overview.png): Top panel shows daily rating activity colored by partition, with red dashed lines marking the two split boundaries. Bottom panel compares the rating distribution across all three partitions. The near-identical distributions confirm that the temporal split does not introduce systematic bias.
The cold-start analysis revealed that a small number of users and items appear in the validation and test sets without any training history. These cases are real and important: collaborative filtering has no learned representation for them, and content-based filtering can only fall back to item metadata. They are explicitly tested in Task 3.
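Identifying these cases amounts to a set difference between the IDs seen in training and the IDs seen in a later partition. The sketch below assumes ratings have been reduced to `(user_id, item_id)` pairs; `cold_start_ids` is a hypothetical helper name.

```python
def cold_start_ids(train_pairs, eval_pairs):
    """Return (users, items) seen in eval_pairs but never in train_pairs.

    Each pair is (user_id, item_id).
    """
    train_users = {u for u, _ in train_pairs}
    train_items = {i for _, i in train_pairs}
    eval_users = {u for u, _ in eval_pairs}
    eval_items = {i for _, i in eval_pairs}
    return eval_users - train_users, eval_items - train_items


train_pairs = [(1, 10), (1, 11), (2, 10)]
test_pairs = [(2, 11), (3, 10), (1, 12)]
cold_users, cold_items = cold_start_ids(train_pairs, test_pairs)
```

In the toy data, user 3 and item 12 never appear in training, so they are the cold-start cases.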
The 60/20/20 split is chosen to give the training set enough data for SVD to learn stable latent factors while preserving a meaningful evaluation window. A larger training set (for example 70/15/15) would benefit collaborative filtering but would shrink the cold-start population in validation, making Task 3's simulation less realistic. A smaller training set would do the opposite. The current split represents a reasonable middle ground for a seven-month dataset.
A time-ordered 60/20/20 split is the correct partitioning strategy for MovieMate. Random splits are not appropriate for recommendation system evaluation because they introduce temporal leakage and produce evaluations that do not reflect deployment reality. The chosen split preserves chronological ordering, exposes cold-start scenarios naturally, and maintains consistent rating distributions across all three partitions. These splits are reused in every subsequent task to ensure that all comparisons are made on the same held-out future data.
Which recommendation model should MovieMate deploy as its primary engine, and does performance differ meaningfully across user demographic groups?
Using the temporal partitions from Task 1, all three models were trained on the training set and evaluated on the test set. For collaborative filtering, SVD was trained directly on the temporal training data using the hyperparameters from the provided configuration (200 factors, 100 epochs, learning rate 0.01, regularization 0.1). For content-based filtering, TF-IDF genre profiles were built from training-set ratings only, ensuring no future information influenced user profiles. For rule-based, per-item average ratings were computed from the training set, with a global average fallback for unseen items.
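The rule-based baseline is simple enough to sketch in full. The function names below are illustrative rather than taken from the project code, and the input is assumed to be `(user_id, item_id, rating)` tuples from the training split only.

```python
from collections import defaultdict


def fit_item_means(train):
    """train: iterable of (user_id, item_id, rating) from the training split."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    total = 0.0
    n = 0
    for _, item, rating in train:
        sums[item] += rating
        counts[item] += 1
        total += rating
        n += 1
    item_means = {item: sums[item] / counts[item] for item in sums}
    global_mean = total / n
    return item_means, global_mean


def predict_rule_based(item, item_means, global_mean):
    """Per-item training average, or the global average for unseen items."""
    return item_means.get(item, global_mean)


train = [(1, 10, 4.0), (2, 10, 2.0), (1, 20, 5.0)]
item_means, global_mean = fit_item_means(train)
```

Note that the prediction takes no user argument at all, which is why this model's error is independent of how much history a user has.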
Three metrics were used for evaluation. RMSE measures raw prediction accuracy. Precision@10 measures the fraction of the top-10 recommended items that the user actually rated 4 or higher. NDCG@10 measures ranking quality by giving more credit to relevant items that appear near the top of the list. Ranking metrics were computed on a sample of 150 test users to keep runtime manageable.
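For reference, the three metrics can be written compactly. This is a sketch under stated assumptions: the NDCG gain is taken to be the raw rating with a log2 position discount, and precision divides by the number of items actually ranked when a user has fewer than k. The project's own metric code may define either detail differently.

```python
import math


def rmse(pairs):
    """pairs: iterable of (predicted, actual) ratings."""
    pairs = list(pairs)
    return math.sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))


def precision_at_k(ranked_ratings, k=10, threshold=4.0):
    """ranked_ratings: actual ratings ordered by predicted score, best first."""
    top = ranked_ratings[:k]
    return sum(r >= threshold for r in top) / len(top) if top else 0.0


def dcg(gains):
    # Position 0 gets discount log2(2) = 1, position 1 gets log2(3), and so on.
    return sum(g / math.log2(pos + 2) for pos, g in enumerate(gains))


def ndcg_at_k(ranked_ratings, k=10):
    """Assumed gain: the raw rating. Ideal order is ratings sorted descending."""
    ideal_dcg = dcg(sorted(ranked_ratings, reverse=True)[:k])
    return dcg(ranked_ratings[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A list that is already in descending order of actual rating scores an NDCG of 1, and any other ordering scores lower.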
For the segment analysis, test ratings were joined with the user demographic file (u.user), which contains age, gender, and occupation for each user. RMSE was computed separately for each demographic group using the same predict functions, with no additional retraining.
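Once predictions are joined to a demographic label, the per-segment RMSE is a grouped version of the overall calculation. The sketch below assumes the join has already happened and that each row carries `(segment_label, predicted, actual)`; `rmse_by_segment` is a hypothetical name.

```python
import math
from collections import defaultdict


def rmse_by_segment(rows):
    """rows: iterable of (segment_label, predicted, actual)."""
    sq_err = defaultdict(float)
    counts = defaultdict(int)
    for segment, predicted, actual in rows:
        sq_err[segment] += (predicted - actual) ** 2
        counts[segment] += 1
    return {s: math.sqrt(sq_err[s] / counts[s]) for s in sq_err}


# Hypothetical rows: test predictions already joined to u.user on user id.
rows = [
    ("F", 3.0, 4.0), ("F", 5.0, 4.0),
    ("M", 4.0, 4.0), ("M", 2.0, 4.0),
]
by_gender = rmse_by_segment(rows)
```

The same function serves age group and occupation by changing only the label placed in the first field.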
Table 2: Overall Model Performance on Temporal Test Set
| Model | RMSE | Precision@10 | NDCG@10 |
|---|---|---|---|
| Collaborative Filtering | 1.026 | 0.678 | 0.750 |
| Content-Based | 1.987 | 0.532 | 0.596 |
| Rule-Based | 1.044 | 0.671 | 0.738 |
Table 2 shows a more nuanced story than a simple ranking. Collaborative filtering achieves the best RMSE (1.026), but rule-based is remarkably close (1.044), a difference of less than 0.02, meaning both models predict ratings with nearly identical average accuracy. Where collaborative filtering actually pulls ahead is in ranking quality: Precision@10 of 0.678 and NDCG@10 of 0.750, compared to rule-based's 0.671 and 0.738. These gaps (0.007 and 0.012) are small and were measured on a 150-user sample, so they are best read as a consistent edge rather than a decisive one. CF is placing the right movies slightly higher in the list, not just predicting ratings more accurately.
Content-based is clearly the weakest model across every metric. Its RMSE of 1.987 is nearly double collaborative filtering's, and its Precision@10 (0.532) and NDCG@10 (0.596) lag well behind both alternatives. Genre similarity alone turns out to be a poor proxy for actual user preferences, particularly when users rate films across genres inconsistently.
Figure 3 (analysis/task2_fig4_overall_comparison.png): Bar charts comparing RMSE, Precision@10, and NDCG@10 for all three models. Collaborative filtering leads on all three metrics. The near-identical RMSE between CF and rule-based stands out, as does the large gap for content-based.
Table 3: RMSE by Age Group
| Age Group | CF RMSE | CB RMSE | RB RMSE | N Ratings |
|---|---|---|---|---|
| Under 25 | 1.147 | 2.346 | 1.172 | 4,040 |
| 25-35 | 1.008 | 1.788 | 1.016 | 6,404 |
| 35-50 | 0.979 | 1.710 | 0.998 | 6,154 |
| Over 50 | 0.992 | 2.144 | 1.017 | 3,402 |
Table 4: RMSE by Gender
| Gender | CF RMSE | CB RMSE | RB RMSE | N Ratings |
|---|---|---|---|---|
| Female | 1.085 | 2.156 | 1.102 | 5,698 |
| Male | 1.002 | 1.933 | 1.020 | 14,302 |
Table 5: RMSE by Occupation (Top 5 by User Count)
| Occupation | CF RMSE | CB RMSE | RB RMSE | N Ratings |
|---|---|---|---|---|
| administrator | 0.997 | 2.032 | 1.022 | 1,057 |
| educator | 0.916 | 1.850 | 0.920 | 1,733 |
| engineer | 0.856 | 1.660 | 0.869 | 897 |
| other | 1.052 | 1.824 | 1.068 | 2,149 |
| student | 1.083 | 2.329 | 1.110 | 3,043 |
Tables 3 through 5 show that collaborative filtering's advantage holds across all demographic segments, but the margin varies in interesting ways. Users under 25 are the hardest group to serve: CF RMSE climbs to 1.147, compared to 0.979 for the 35-50 bracket. This suggests younger users have more eclectic rating patterns that the model's latent factors struggle to capture consistently. Engineers are the easiest of the five listed occupations to serve (CF RMSE 0.856), possibly because their preferences are more genre-coherent. Students show the highest content-based RMSE of any occupation at 2.329, second only to the under-25 age group (2.346), reinforcing that CB is especially unreliable for younger, more varied audiences.
Gender differences are small but consistent: CF RMSE is 1.085 for female users versus 1.002 for male users. This likely reflects the dataset's demographic imbalance (670 male users versus 273 female), which gives SVD more training signal to work with for male users.
Figure 4 (analysis/task2_fig5_segment_comparison.png): Three-panel bar chart showing RMSE by age group, gender, and occupation for all three models. Content-based consistently towers over CF and rule-based. CF's advantage is visible but narrow across all segments.
The ranking metrics (Precision@10 and NDCG@10) were computed by ranking each user's test items against each other, not against all 1,682 movies. This is a practical approximation given the evaluation cost, but it means the absolute values of these metrics are higher than they would be under a full-catalog ranking evaluation. The relative ordering between models is still valid.
The content-based model here uses only genre tags as item features. A richer representation incorporating movie titles, release years, or cast information might substantially improve its performance and change how it compares to collaborative filtering.
Collaborative filtering should be the primary model for users with sufficient rating history. Its real advantage over rule-based is in ranking quality rather than raw prediction accuracy, which matters more in practice since users interact with ordered recommendation lists, not individual scores. Content-based is not competitive as a standalone model on any metric, but its ability to generate predictions from genre metadata alone gives it a possible role as a cold-start fallback, which is explored in Task 3.
How does each model perform when a user has little or no rating history, and what is the best strategy for serving new users?
The cold-start problem occurs when the system must recommend to a user who has not yet built up enough history for collaborative filtering to learn a meaningful representation. To simulate this, 60 users with at least 60 total ratings were drawn from the dataset. For each user and each history level N in {0, 5, 10, 20, 50}, only the first N of their chronological ratings were treated as "known" history. Predictions were then evaluated on the ratings immediately following those N known interactions.
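The history truncation can be sketched as a slice over one user's time-ordered ratings. The text does not say how many of the following ratings were evaluated, so the `eval_window` parameter below is an assumption, as is the `(timestamp, item_id, rating)` tuple layout.

```python
def split_history(user_ratings, n_known, eval_window=10):
    """user_ratings: list of (timestamp, item_id, rating) for one user."""
    ordered = sorted(user_ratings, key=lambda r: r[0])
    known = ordered[:n_known]
    # eval_window is a hypothetical choice; the source does not specify it.
    held_out = ordered[n_known:n_known + eval_window]
    return known, held_out


# Toy user with 60 ratings, one per time step.
history = [(t, 100 + t, 3.0 + (t % 3)) for t in range(60)]
levels = {n: split_history(history, n) for n in (0, 5, 10, 20, 50)}
```

Requiring at least 60 total ratings per user, as the simulation does, guarantees that even the N=50 level leaves ratings to evaluate on.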
For content-based filtering, the user profile was rebuilt from scratch using only the N known ratings, directly replicating the genre-weighted averaging logic from the CB class. For rule-based, predictions are the item's average rating from the training set and are constant regardless of user history. For collaborative filtering, predictions use the pre-trained SVD model directly, which is an important caveat discussed after the results below.
A hybrid strategy was also designed and evaluated: route users with zero ratings to rule-based, users with 1 to 19 ratings to content-based, and users with 20 or more ratings to collaborative filtering.
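The routing rule as originally proposed is a threshold function of the number of known ratings. A minimal sketch, with `route_model` and the returned labels as illustrative names:

```python
def route_model(n_known_ratings, cf_min_history=20):
    """Pick a model for a user based on how many ratings they have."""
    if n_known_ratings <= 0:
        return "rule_based"
    if n_known_ratings < cf_min_history:
        return "content_based"
    return "collaborative"


routes = {n: route_model(n) for n in (0, 5, 10, 19, 20, 50)}
```

Keeping the threshold as a parameter makes it straightforward to test alternative cut-offs against the results that follow.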
Table 6: RMSE by Model and Number of Known Ratings
| History | CF | CB | RB |
|---|---|---|---|
| N=0 | 0.801 | NaN | 1.014 |
| N=5 | 0.791 | 2.205 | 0.998 |
| N=10 | 0.809 | 2.094 | 1.007 |
| N=20 | 0.817 | 1.998 | 1.006 |
| N=50 | 0.746 | 1.966 | 0.967 |
Table 6 tells a clear story. Collaborative filtering is strong at every history level, hovering between 0.79 and 0.82 RMSE for N=0 through N=20, then improving to 0.746 at N=50. Rule-based sits steadily around 1.0 throughout, as expected since it never uses user history at all. Content-based is undefined (NaN) with no history, because a profile of all zeros produces no prediction. It then posts an RMSE of 2.205 at 5 ratings and only gradually improves to 1.966 at 50. Even with 50 ratings, content-based never comes close to matching CF or rule-based.
The hybrid strategy, rather than helping, actually backfires in the N=5 and N=10 range. By routing to content-based during those stages, the hybrid produces RMSE values of 2.20 and 2.09, which are worse than just using rule-based the whole time. The hybrid only catches up once CF takes over at N=20.
Figure 5 (analysis/task3_fig6_coldstart_curves.png): RMSE versus number of known ratings for all three individual models. CF is flat and consistently strong. RB is flat around 1.0. CB starts high and improves slowly but never catches up.
Figure 6 (analysis/task3_fig7_hybrid_comparison.png): The hybrid strategy line (in red) spikes sharply at N=5 and N=10 before recovering at N=20. This shows that routing to content-based during the early interaction phase harms performance.
The CF cold-start results here are not a true cold-start scenario. CF was trained on the full training set, so the users in this simulation already exist in the model's learned factor matrix. Their SVD representations were shaped by tens of ratings, not zero. A genuinely new user who arrives after training would not have a latent factor vector at all, and CF would fall back to global bias estimates that are much closer to rule-based territory. This is an important consideration for a live system where new users arrive continuously and cannot be included in a past training run.
The correct cold-start strategy for MovieMate is simpler than the originally proposed hybrid. Rule-based should serve users who are completely new to the system and not yet in the training data. Collaborative filtering should take over once the user's interactions have been included in a training run, since only then does the model hold a latent factor vector for them. Content-based should not be used as a standalone recommendation model at any history level, given its consistently high RMSE even with 50 known ratings. The data in Table 6 makes this case clearly: routing to content-based at N=5 or N=10 produces an RMSE more than twice that of rule-based, with no recovery until collaborative filtering takes over.
How does the diversifier's alpha parameter affect the balance between recommendation relevance and genre variety, and what configuration should MovieMate use?
A recommendation list that only contains thrillers may be technically accurate for a thriller fan, but it offers no discovery and may bore the user over time. The diversifier addresses this by reranking the initial CF recommendation list using a weighted combination of relevance and diversity:
final_score = alpha * relevance + (1 - alpha) * diversity
At alpha=1.0 the ranking is purely relevance-driven, identical to the original CF output. At alpha=0.0 the ranking is driven entirely by genre diversity. Values in between balance the two.
The diversifier uses intra-list diversity (ILD) and rank-based diversity as its diversity measures. ILD measures the average pairwise cosine dissimilarity between genre vectors of the recommended items. A higher ILD means the list spans more different genres.
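ILD as described here is the mean of one minus cosine similarity over all item pairs in the list. The sketch below assumes binary genre vectors and returns 0.0 for lists with fewer than two items; both choices are assumptions rather than confirmed details of the diversifier.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def intra_list_diversity(genre_vectors):
    """Average pairwise cosine dissimilarity across the recommended items."""
    pairs = list(combinations(genre_vectors, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Toy genre vectors over three genres: [action, comedy, drama].
action = [1, 0, 0]
comedy = [0, 1, 0]
action_comedy = [1, 1, 0]
```

Two items with no genre in common contribute a dissimilarity of 1, and two items with identical genre vectors contribute 0.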
To evaluate the tradeoff, alpha was swept from 0.0 to 1.0 in steps of 0.1 across 40 test users. For each alpha, the top-20 CF recommendations were reranked, and the resulting top-10 list was evaluated for relevance (NDCG@10 against actual test ratings) and diversity (ILD against the genre vector matrix).
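One way to apply the weighted score is a greedy reranker that picks items one at a time. The sketch below is an illustration of the formula, not the diversifier's actual implementation: it assumes relevance is already scaled to [0, 1], and it defines the diversity term as the mean cosine dissimilarity to items already chosen, which is an assumed stand-in for the project's ILD and rank-based measures.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def diversity_gain(vec, chosen_vecs):
    """Mean cosine dissimilarity to the items already chosen (0.0 if none)."""
    if not chosen_vecs:
        return 0.0
    return sum(1.0 - cosine(vec, c) for c in chosen_vecs) / len(chosen_vecs)


def rerank(candidates, alpha, k=10):
    """candidates: list of (item_id, relevance in [0, 1], genre_vector)."""
    remaining = list(candidates)
    chosen = []
    while remaining and len(chosen) < k:
        chosen_vecs = [c[2] for c in chosen]
        best = max(
            remaining,
            key=lambda c: alpha * c[1]
            + (1 - alpha) * diversity_gain(c[2], chosen_vecs),
        )
        chosen.append(best)
        remaining.remove(best)
    return [c[0] for c in chosen]


candidates = [
    ("A", 0.9, [1, 0]),
    ("B", 0.8, [1, 0]),
    ("C", 0.5, [0, 1]),
]
```

With these toy candidates, alpha=1.0 keeps the relevance order A, B, C, while alpha=0.5 promotes C to second place because it shares no genre with A.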
Table 7: Relevance and Diversity by Diversifier Strength
| Alpha | ILD | NDCG@10 |
|---|---|---|
| 0.0 | 0.6859 | 0.0959 |
| 0.1 | 0.6840 | 0.0959 |
| 0.2 | 0.6820 | 0.0959 |
| 0.3 | 0.6855 | 0.0959 |
| 0.4 | 0.6855 | 0.0959 |
| 0.5 | 0.6856 | 0.0959 |
| 0.6 | 0.6875 | 0.0959 |
| 0.7 | 0.6235 | 0.1054 |
| 0.8 | 0.6229 | 0.1080 |
| 0.9 | 0.5132 | 0.1117 |
| 1.0 | 0.5531 | 0.1296 |
Table 7 reveals that the relationship between alpha and the two metrics is not a smooth gradient. From alpha=0.0 through alpha=0.6, NDCG@10 is flat at 0.096 and ILD stays high around 0.686. In this range the diversifier is heavily reranking for variety and relevance sits at its lowest observed level, so lowering alpha further costs no additional relevance but also buys no additional diversity. The real shift happens between alpha=0.7 and alpha=1.0, where NDCG climbs from 0.105 to 0.130 while ILD drops from 0.624 to 0.553. This is the zone where the system is genuinely trading variety for accuracy.
There is also a non-monotonic dip at alpha=0.9, which produces the lowest ILD of any setting at 0.513, lower even than the pure-relevance setting at alpha=1.0 (0.553). This points to some interaction between the diversifier's scoring and the CF relevance scores at that specific weight that produces an unusual reranking. Alpha=0.9 is a poor choice despite its decent NDCG.
Figure 7 (analysis/task4_fig8_diversifier_tradeoff.png): Left panel shows both metrics across the full alpha range. Right panel shows the diversity-relevance tradeoff curve with each alpha annotated. The cluster of points from alpha=0.0 to 0.6 sits together at high ILD and low NDCG, while alpha=0.7 through 1.0 spread out toward higher NDCG and lower ILD.
The NDCG values here are noticeably lower in absolute terms than the values from Task 2. The two evaluations are not like-for-like: Task 2 ranked only each user's test items against each other, whereas Task 4 scores reranked CF recommendation lists against the user's test ratings, so recommended movies that the user did not rate in the test window likely earn no credit. The set of 40 evaluation users may also differ from the Task 2 sample. The comparison between alpha values is internally valid, but the absolute NDCG numbers should not be compared directly to Table 2.
Alpha=0.8 is the recommended deployment setting. It achieves an NDCG@10 of 0.108, about 83% of the 0.130 reached at alpha=1.0, while maintaining an ILD of 0.623 that is meaningfully higher than what pure relevance at alpha=1.0 produces (0.553). The flat NDCG region below alpha=0.7 also suggests that users can be offered an "explore" mode at alpha=0.5 or below, where NDCG holds at 0.096 no matter how far alpha is lowered, while a "more of what I love" mode could push toward alpha=1.0 for users who prefer familiar genres. The non-monotonic behavior at alpha=0.9 makes it the one value to avoid.
How should MovieMate decide when to retrain its models, and does a KS test-based trigger outperform a fixed monthly schedule?
A model trained on September through January data may not perform as well in April as it did in February, if user behavior or content availability has shifted over time. To test this, the evaluation window (January through April 1998) was divided into four monthly buckets. The initial CF model was evaluated on each monthly bucket without any retraining, producing per-month RMSE values that show whether performance changes over time.
Two update strategies were then compared against this no-retraining baseline. Monthly retraining incorporates each new month's data into the training pool and retrains before evaluating that month. KS-triggered retraining uses the ContinuousLearner class to compare the distribution of per-prediction squared errors in each month against the baseline distribution from the first evaluation month (January 1998). When the KS test returns a p-value below 0.05, indicating a statistically significant shift in the error distribution, the model retrains on all accumulated data.
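The trigger logic can be sketched without any dependencies. This is not the `ContinuousLearner` implementation: the functions below are hypothetical, and instead of computing a p-value they compare the KS statistic to the standard large-sample critical value, which is approximately equivalent to testing p < alpha. In practice `scipy.stats.ks_2samp` returns the statistic and p-value directly.

```python
import math


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past x so ties are handled together.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_critical_value(n_a, n_b, alpha=0.05):
    """Large-sample critical value for the two-sample KS statistic."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    return c_alpha * math.sqrt((n_a + n_b) / (n_a * n_b))


def should_retrain(baseline_sq_errors, month_sq_errors, alpha=0.05):
    d = ks_statistic(baseline_sq_errors, month_sq_errors)
    threshold = ks_critical_value(
        len(baseline_sq_errors), len(month_sq_errors), alpha
    )
    return d > threshold


# Toy squared-error samples: one identical to the baseline, one shifted.
baseline = [(i / 100) ** 2 for i in range(100)]
same = list(baseline)
shifted = [1.0 + e for e in baseline]
```

Comparing the baseline month to itself yields a statistic of zero, which matches the p-value of 1.0 that a baseline month would report.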
Table 8: RMSE Comparison Across Update Strategies
| Month | No Retraining | KS-Triggered | Monthly Retrain | KS p-value | KS Retrained |
|---|---|---|---|---|---|
| 1998-01 | 1.1727 | 1.1727 | 0.6667 | 1.0000 | No |
| 1998-02 | 1.0317 | 1.0317 | 0.6744 | 0.8435 | No |
| 1998-03 | 1.0177 | 1.0177 | 0.6775 | 0.9927 | No |
| 1998-04 | 1.0409 | 1.0409 | 0.6984 | 0.4366 | No |
Table 8 shows three findings worth examining carefully. First, the no-retraining RMSE actually improves over time rather than degrading, dropping from 1.1727 in January to 1.0177 in March before ticking slightly back up to 1.0409 in April. This means there is no evidence of classic drift-driven degradation over the four-month window. The model appears to have been slightly miscalibrated for the early evaluation months, with later months happening to align better with what it learned.
Second, the KS test never triggered a retrain. All p-values remained well above the 0.05 threshold throughout, with the closest call being 0.4366 in April. Since the error distributions across months were not statistically different enough to flag, KS-triggered retraining and no retraining produced identical results across all four months.
Third, and most importantly, monthly retraining tells a completely different story. By incorporating each month's new ratings before evaluating on it, the model achieves RMSE values between 0.667 and 0.698, which is 0.34 to 0.51 lower than no retraining. This improvement does not come from correcting drift. It comes from the model having access to more data overall, and because that data includes the ratings of the month being evaluated, part of the gain reflects in-sample fit rather than performance on unseen ratings.
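The size of the gap can be checked directly from the Table 8 values. The short script below only restates that arithmetic; the dictionary names are illustrative.

```python
# RMSE values copied from Table 8.
no_retrain = {"1998-01": 1.1727, "1998-02": 1.0317,
              "1998-03": 1.0177, "1998-04": 1.0409}
monthly = {"1998-01": 0.6667, "1998-02": 0.6744,
           "1998-03": 0.6775, "1998-04": 0.6984}

gaps = {}
for month, base in no_retrain.items():
    absolute = base - monthly[month]
    gaps[month] = (absolute, absolute / base)  # (RMSE points, fraction of base)

for month, (absolute, relative) in gaps.items():
    print(f"{month}: {absolute:.3f} lower ({relative:.1%})")
```

The January gap is the largest because the no-retraining RMSE is highest in that month, while the monthly-retrain RMSE barely changes across the window.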
Figure 8 (analysis/task5_fig10_drift_comparison.png): Top panel shows RMSE over four months for all three strategies. Monthly retraining runs 0.34 to 0.51 RMSE below the other two lines, which are identical throughout. Bottom panel shows KS p-values over time. All values remain far above the 0.05 threshold, confirming no drift was detected.
This experiment covers only four months of evaluation data, which limits the conclusions that can be drawn about long-term drift. Over a longer window or in a system with faster-moving content (new releases, trending topics, seasonal preferences), the KS test might detect meaningful distributional shifts and earn its place. The MovieLens dataset is relatively stable by nature: most of the movies are older releases, user preferences are fairly settled, and the seven-month collection window is not long enough to capture major behavioral shifts.
Monthly retraining is the recommended update policy for MovieMate. The performance gap between monthly retraining and no retraining is large (33% to 43% lower RMSE, depending on the month) and consistent across all four evaluation months. The KS test did not provide useful signal in this experiment. This is most likely because MovieLens is too small and too stable for the test to detect meaningful distribution shifts over a four-month window. For the current system as evaluated, a fixed monthly retraining schedule is simpler, more reliable, and clearly more effective than waiting for a statistical trigger that may never fire.
This case analysis followed a connected sequence of design decisions for the MovieMate recommendation system:
- A time-ordered 60/20/20 split was established as the correct evaluation framework. Random splits introduce temporal leakage that makes models look better than they actually are in deployment. The temporal split exposes cold-start users and items naturally and maintains consistent rating distributions across all three partitions.
- Collaborative filtering is the primary recommendation model. It achieves the best RMSE (1.026) and the highest ranking quality (Precision@10 = 0.678, NDCG@10 = 0.750). Its true advantage over rule-based is in ranking rather than raw accuracy, which is what matters since users see ordered lists. Content-based filtering performs poorly across all metrics and all demographic segments, with RMSE of 1.987 and Precision@10 of 0.532.
- The cold-start strategy is rule-based first, then collaborative filtering. Content-based filtering, despite being the intuitive choice for new users, actively harms performance in the early interaction phase (RMSE of 2.20 at N=5 versus 1.00 for rule-based). The revised hybrid routes cold users to rule-based and switches to collaborative filtering once their ratings have been included in a training run.
- The diversifier should be deployed at alpha=0.8. This setting retains about 83% of the NDCG@10 reached at alpha=1.0 (0.108 versus 0.130) while maintaining meaningfully higher genre diversity than pure-relevance ranking (ILD of 0.623 versus 0.553). Below alpha=0.7 NDCG is flat at about 0.096, so users who want discovery can be given aggressive diversification anywhere in that range at no further accuracy cost.
- Monthly retraining is the right update schedule. It consistently achieves RMSE around 0.67 to 0.70, which is 33% to 43% lower than no retraining. The KS test did not detect any drift over the four-month evaluation window and never triggered a retrain, likely because MovieLens is too stable a dataset for the test to find meaningful distributional shifts. In a production system with a faster-moving catalog, revisiting the KS approach would be worthwhile.
Across all five tasks, the most consistent finding is that collaborative filtering's performance scales with interaction history. It works well for established users, requires a rule-based fallback for cold users, and benefits significantly from regular retraining on accumulated data. Building MovieMate around CF as the core engine, with rule-based as a safety net and a diversifier as a user-facing feature, is the deployment architecture that best fits the evidence.