This project is a Content-Based Movie Recommendation System that uses movie metadata from the TMDB 5000 dataset to recommend movies similar to a given title. The recommendation engine is built using text preprocessing, stemming, and cosine similarity. Try it out here: https://cinemarecommendation.streamlit.app/
The dataset used for this project includes:
- `tmdb_5000_movies.csv`: Contains various details about movies, such as genres, keywords, and overview.
- `tmdb_5000_credits.csv`: Includes information about the cast and crew.

- Data Preprocessing:
  - Merged the `movies` and `credits` datasets on the `id` column to create a single DataFrame (`new_movies`).
  - Dropped unnecessary columns, keeping only the ones relevant for recommendations (`id`, `original_title`, `genres`, `keywords`, `cast`, `crew`, `overview`).
- Feature Engineering:
  - Combined the `overview`, `genres`, `keywords`, `cast`, and `crew` columns into a single `tags` column, which serves as the main input for the recommendation model.
  - Used the following steps to populate the `tags` column:
    - Extracted genres, keywords, and cast/crew names as lists.
    - Limited the cast to the top three actors and extracted only the director's name.
    - Removed spaces in multi-word names so that each name is treated as a single token rather than as separate words.
- Text Vectorization:
  - Transformed the `tags` column into a matrix of token counts using `CountVectorizer` with a maximum of 5000 features and English stop words removed.
  - Applied stemming using NLTK's `PorterStemmer` to reduce similar words to their root form.
- Cosine Similarity Calculation:
  - Calculated the pairwise cosine similarity between all movies from their vectorized `tags`, giving a similarity score for every pair of movies.
- Recommendation Function:
  - Built a `recommend()` function that, given a movie title, retrieves the top 5 most similar movies based on cosine similarity.
- Saving Model and Data:
  - Used the `pickle` library to save the processed movie details and similarity matrix for easy reuse.
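
The workflow above can be sketched end to end on a few made-up rows. This is a minimal, standard-library-only illustration, not the project's code: `collections.Counter` stands in for `CountVectorizer`, cosine similarity is computed by hand instead of with scikit-learn, and stemming and stop-word removal are left out. The sample movies, the helper names (`names`, `director`, `build_tags`, `cosine`, `recommend_sketch`), and the assumption that the list-valued columns are stored as JSON strings are all illustrative.

```python
import json
import math
from collections import Counter

# Made-up rows standing in for the merged DataFrame (assumption: list-valued
# columns are JSON strings of objects with a "name" field).
MOVIES = [
    {"original_title": "Space Quest",
     "overview": "A pilot explores a distant planet",
     "genres": '[{"name": "Science Fiction"}]',
     "keywords": '[{"name": "space travel"}]',
     "cast": '[{"name": "Ann Lee"}, {"name": "Bo Chan"}, {"name": "Cy Diaz"}, {"name": "Di Eng"}]',
     "crew": '[{"job": "Editor", "name": "Ed Fox"}, {"job": "Director", "name": "Gil Hart"}]'},
    {"original_title": "Star Voyage",
     "overview": "A crew travels to a distant star",
     "genres": '[{"name": "Science Fiction"}]',
     "keywords": '[{"name": "space travel"}]',
     "cast": '[{"name": "Bo Chan"}, {"name": "Hal Ito"}]',
     "crew": '[{"job": "Director", "name": "Gil Hart"}]'},
    {"original_title": "Bakery Tales",
     "overview": "Two friends open a bakery in Paris",
     "genres": '[{"name": "Comedy"}]',
     "keywords": '[{"name": "baking"}]',
     "cast": '[{"name": "Jo Kim"}]',
     "crew": '[{"job": "Director", "name": "Lu Mora"}]'},
]


def names(raw, limit=None):
    """Parse a JSON list and return up to `limit` names with spaces removed."""
    return [item["name"].replace(" ", "") for item in json.loads(raw)[:limit]]


def director(raw):
    """Return the director's name (spaces removed) from a JSON crew list."""
    return [m["name"].replace(" ", "") for m in json.loads(raw)
            if m.get("job") == "Director"]


def build_tags(movie):
    """Combine overview, genres, keywords, top-3 cast, and director into tokens."""
    words = movie["overview"].split()
    words += names(movie["genres"]) + names(movie["keywords"])
    words += names(movie["cast"], limit=3) + director(movie["crew"])
    return [w.lower() for w in words]


def cosine(a, b):
    """Cosine similarity between two token-count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Token counts per movie, then the full pairwise similarity matrix.
vectors = [Counter(build_tags(m)) for m in MOVIES]
similarity = [[cosine(a, b) for b in vectors] for a in vectors]


def recommend_sketch(title, top_n=5):
    """Return up to `top_n` other titles, most similar first."""
    titles = [m["original_title"] for m in MOVIES]
    index = titles.index(title)
    ranked = sorted(range(len(titles)),
                    key=lambda i: similarity[index][i], reverse=True)
    return [titles[i] for i in ranked if i != index][:top_n]


print(recommend_sketch("Space Quest"))
```

In this toy data, the second science-fiction title ranks ahead of the comedy because it shares genre, keyword, cast, and director tokens with the query movie. The fourth cast member and the non-director crew member never reach the tags, which mirrors the "top three actors plus director" rule described above.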
The project includes an example for generating recommendations. To use the recommendation function, load the `movie_list.pkl` and `similarity.pkl` files and call `recommend()` with a movie title:
import pickle
# Load files
movies_df = pickle.load(open('movie_list.pkl', 'rb'))
similarity = pickle.load(open('similarity.pkl', 'rb'))
# Example: Recommend movies similar to "Baby's Day Out"
recommend("Baby's Day Out")

- Install the required libraries: `pip install numpy pandas matplotlib seaborn scikit-learn nltk`
- Download the `tmdb_5000_movies.csv` and `tmdb_5000_credits.csv` datasets.
- Run the notebook or script provided to process the data, train the model, and generate recommendations.

- Python 3.x
- Libraries: `numpy`, `pandas`, `matplotlib`, `seaborn`, `scikit-learn`, `nltk`, `pickle` (`pickle` is part of the Python standard library and needs no installation)

Data used in this project is from The Movie Database (TMDB).