A content-based movie recommendation system that suggests movies similar to a user's selection. The engine uses Natural Language Processing (NLP) to analyze movie metadata (genres, cast, crew, and keywords) and serves recommendations via a Streamlit web application.
This project processes the TMDB 5000 Movie Dataset to create a recommendation algorithm. Instead of using user ratings, it focuses on the content of the movies themselves.
- Data Processing: Cleans and merges datasets, extracts key features, and creates a unified "tag" system for every movie.
- Machine Learning: Uses `CountVectorizer` to convert text tags into vectors and calculates cosine similarity to find the closest matches in a 5000-dimensional space.
- Web App: A user-friendly interface built with Streamlit that displays movie recommendations and fetches real-time posters from the TMDB API.
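To make the vectorization and similarity step concrete, here is a minimal sketch using only the Python standard library. It is not the project's code: `bag_of_words` and `cosine` are hypothetical stand-ins for what scikit-learn's `CountVectorizer` and `cosine_similarity` compute, and the three tag strings are hand-written toy examples.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count lowercase, whitespace-separated tokens (stand-in for CountVectorizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy tags, written by hand for illustration only.
tags = {
    "Movie A": "space alien war soldier",
    "Movie B": "space alien rescue pilot",
    "Movie C": "wedding comedy romance family",
}
vectors = {title: bag_of_words(text) for title, text in tags.items()}

sim_ab = cosine(vectors["Movie A"], vectors["Movie B"])  # two of four tokens shared
sim_ac = cosine(vectors["Movie A"], vectors["Movie C"])  # no tokens shared
print(sim_ab, sim_ac)
```

Movies that share more tag tokens score closer to 1, and movies with no tokens in common score 0, which is the property the recommender relies on.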
- Python 3.x
- Pandas & NumPy: Data manipulation and analysis.
- Scikit-learn: Used for `CountVectorizer` and `cosine_similarity`.
- NLTK: Used for `PorterStemmer`, which reduces words to their root form (e.g., "dancing" → "danc").
- Streamlit: Frontend framework for the web application.
- TMDB API: Used to fetch movie posters dynamically.
- `movie-recommender-system.ipynb`: Jupyter Notebook containing the data preprocessing pipeline, vectorization, and model generation.
- `app.py`: The main Streamlit application script.
- `tmdb_5000_movies.csv`: Metadata dataset (budget, overview, popularity, etc.).
- `tmdb_5000_credits.csv`: Credits dataset (cast, crew).
- `movie.pkl`: (Generated) Pickled dataframe containing movie titles and tags.
- `similarity.pkl`: (Generated) Pickled cosine similarity matrix.
- Merging: The movies and credits datasets are merged on the movie title.
- Feature Extraction:
- Genres & Keywords: Extracted from JSON format.
- Cast: Top 3 actors are extracted.
- Crew: The Director is isolated.
- Text Cleaning: Spaces are removed from names (e.g., "Sam Worthington" becomes "SamWorthington") to create unique vector tokens.
- Vectorization: A `tags` column is created by combining the overview, genres, keywords, cast, and crew. This text is stemmed and vectorized using a Bag-of-Words approach (5000 most frequent words).
- Model Export: The resulting dataframe and similarity matrix are exported as `.pkl` files for the app to use.
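The feature extraction and tag-building steps above can be sketched roughly as follows. This is an illustration under assumptions, not the notebook's code: the helper names (`names`, `director`, `collapse`, `build_tags`) are hypothetical, the sample row is hand-written rather than read from the CSV files, lowercasing the tags is an assumption, and stemming is left out to keep the sketch dependency-free.

```python
import json


def names(json_text, limit=None):
    """Return the "name" field of each entry in a JSON list, optionally truncated."""
    items = json.loads(json_text)
    picked = items if limit is None else items[:limit]
    return [item["name"] for item in picked]


def director(crew_json):
    """Return the names of crew members whose job is "Director"."""
    return [m["name"] for m in json.loads(crew_json) if m.get("job") == "Director"]


def collapse(name_list):
    """Remove spaces so "Sam Worthington" becomes the single token "SamWorthington"."""
    return [name.replace(" ", "") for name in name_list]


def build_tags(row):
    """Combine overview, genres, keywords, top 3 cast, and director into one string."""
    parts = (
        row["overview"].split()
        + collapse(names(row["genres"]))
        + collapse(names(row["keywords"]))
        + collapse(names(row["cast"], limit=3))
        + collapse(director(row["crew"]))
    )
    return " ".join(parts).lower()  # lowercasing is an assumption


# Hand-written sample row for illustration; not copied from the dataset.
row = {
    "overview": "A marine is sent to a distant moon",
    "genres": '[{"name": "Action"}, {"name": "Science Fiction"}]',
    "keywords": '[{"name": "future"}]',
    "cast": '[{"name": "Sam Worthington"}, {"name": "Zoe Saldana"}, '
            '{"name": "Sigourney Weaver"}, {"name": "Stephen Lang"}]',
    "crew": '[{"job": "Producer", "name": "Jon Landau"}, '
            '{"job": "Director", "name": "James Cameron"}]',
}
tags = build_tags(row)
print(tags)
```

Collapsing names into single tokens keeps "Sam Worthington" from sharing a token with every other "Sam" in the dataset, which would otherwise inflate similarity between unrelated movies.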
The app loads the pre-trained models and provides a dropdown menu for movie selection. When the "Recommend" button is clicked, the system:
- Finds the index of the selected movie.
- Retrieves the 5 most similar movies based on the cosine similarity matrix.
- Fetches poster URLs using the TMDB API.
- Displays the titles and posters in a 5-column grid.
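The lookup and ranking steps can be sketched as below. This is a minimal sketch, not the code in `app.py`: the function name `recommend`, its parameters, and the toy 4×4 similarity matrix are assumptions made for illustration, and `k` is set to 2 here only because the toy data is small (the app returns 5).

```python
def recommend(title, titles, similarity, k=5):
    """Return the k titles most similar to `title`, excluding the title itself."""
    index = titles.index(title)  # position of the selected movie
    ranked = sorted(
        enumerate(similarity[index]),  # (movie index, similarity score) pairs
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [titles[i] for i, _ in ranked if i != index][:k]


# Toy data for illustration only.
titles = ["A", "B", "C", "D"]
similarity = [
    [1.0, 0.2, 0.7, 0.4],
    [0.2, 1.0, 0.1, 0.3],
    [0.7, 0.1, 1.0, 0.5],
    [0.4, 0.3, 0.5, 1.0],
]
top = recommend("A", titles, similarity, k=2)
print(top)
```

The selected movie is filtered out by index rather than by slicing off the first result, so the output stays correct even if another movie ties with it at a similarity of 1.0.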
- Clone the repository: `git clone https://github.com/yourusername/movie-recommender-system.git`
- Install dependencies: `pip install streamlit pandas numpy scikit-learn nltk requests`
- Generate Models: Run the Jupyter Notebook to generate the necessary pickle files. Open `movie-recommender-system.ipynb` in Jupyter and run all cells. This will create `movie.pkl` and `similarity.pkl`.
- Run the App: `streamlit run app.py`
The app.py file contains an authorization bearer token for the TMDB API to fetch posters.
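As a rough illustration of how a bearer token is attached to a TMDB request, the sketch below builds (but does not send) a request for a movie's details and shows how a poster URL is typically assembled from the `poster_path` field of the response. It is not the code in `app.py`: the helper names and the `TMDB_BEARER_TOKEN` environment variable are hypothetical, and it uses the standard library's `urllib` instead of `requests` to stay dependency-free. Reading the token from the environment is one way to keep it out of version control.

```python
import os
import urllib.request

TMDB_MOVIE_URL = "https://api.themoviedb.org/3/movie/{movie_id}"
TMDB_IMAGE_BASE = "https://image.tmdb.org/t/p/w500"


def build_movie_request(movie_id, token):
    """Build an authorized request for a movie's details (not sent here)."""
    return urllib.request.Request(
        TMDB_MOVIE_URL.format(movie_id=movie_id),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/json",
        },
    )


def poster_url(poster_path):
    """Turn the poster_path value from a movie response into a full image URL."""
    return TMDB_IMAGE_BASE + poster_path


# TMDB_BEARER_TOKEN is a hypothetical variable name chosen for this sketch.
token = os.environ.get("TMDB_BEARER_TOKEN", "")
request = build_movie_request(19995, token)  # example movie id
print(request.full_url)
```

Sending the request (for example with `urllib.request.urlopen(request)`) requires a valid token and network access, so that step is omitted here.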