This project is a Content-Based Movie Recommendation System that uses movie metadata from the TMDB 5000 dataset to recommend movies similar to a given title. The recommendation engine is built using techniques such as text preprocessing, stemming, and cosine similarity. Try out here: https://cinemarecommendation.streamlit.app/
The dataset used for this project includes:
tmdb_5000_movies.csv
: Contains various details about movies like genres, keywords, overview, etc.tmdb_5000_credits.csv
: Includes information about the cast and crew.
-
Data Preprocessing:
- Merged the
movies
andcredits
datasets on theid
column to create a single DataFrame (new_movies
). - Dropped unnecessary columns, keeping only the ones relevant for recommendations (
id
,original_title
,genres
,keywords
,cast
,crew
,overview
).
- Merged the
-
Feature Engineering:
- Combined
overview
,genres
,keywords
,cast
, andcrew
columns into a singletags
column, which serves as the main input for the recommendation model. - Used the following steps to populate the
tags
column:- Extracted genres, keywords, and cast/crew names as lists.
- Limited the cast to the top three actors and extracted only the director’s name.
- Removed spaces in multi-word names to avoid confusion during training.
- Combined
-
Text Vectorization:
- Transformed the
tags
column into a matrix of token counts using theCountVectorizer
with a maximum of 5000 features and English stop words removed. - Applied stemming using NLTK's
PorterStemmer
to reduce similar words to their root form.
- Transformed the
-
Cosine Similarity Calculation:
- Calculated cosine similarity between all movies based on the vectorized
tags
, allowing us to measure the similarity between movies.
- Calculated cosine similarity between all movies based on the vectorized
-
Recommendation Function:
- Built a
recommend()
function that, given a movie title, retrieves the top 5 most similar movies based on cosine similarity.
- Built a
-
Saving Model and Data:
- Used the
pickle
library to save the processed movie details and similarity matrix for easy reuse.
- Used the
The project includes an example for generating recommendations. To use the recommendation function, load the movie_list.pkl
and similarity.pkl
files and call recommend()
with a movie title:
import pickle
# Load files
movies_df = pickle.load(open('movie_list.pkl', 'rb'))
similarity = pickle.load(open('similarity.pkl', 'rb'))
# Example: Recommend movies similar to "Baby's Day Out"
recommend("Baby's Day Out")
- Install the required libraries:
pip install numpy pandas matplotlib seaborn sklearn nltk
- Download
tmdb_5000_movies.csv
andtmdb_5000_credits.csv
datasets. - Run the notebook or script provided to process data, train the model, and generate recommendations.
- Python 3.x
- Libraries:
numpy
,pandas
,matplotlib
,seaborn
,scikit-learn
,nltk
,pickle
Data used in this project is from The Movie Database (TMDB).