- Sandhiya Sukumaran @sandhiyaaa
- Koh Zhi En @zex3
- Yap Shen Hwei @imaginaryBuddy
Our project is based on the IMDB 5000 dataset found on Kaggle.
- Introduction
- Problem Statement
- Motivation
- Steps:
- 1 : Looking at the Dataset
- Our Hypotheses
- 2 : Data Extraction & Cleaning
- 3 : Exploratory Data Analysis
- 4 : Machine Learning
- Interesting Questions
- The Big Conclusion
- Beyond Our Course
- Limitations and Discussion
- Workload Delegation
- Our Video
- References
Have you ever wondered why some movies are more successful than others? If you're a movie director, you've come to the right place! If you are not a movie director, of course, you can still read on to find out more!
Identify which features contribute to the success of a movie.
Give directors a better idea of how to maximize the success of their movie
variable | dtype | variable | dtype
---|---|---|---
color | object | actor_3_name | object
director_name | object | facenumber_in_poster | float64
num_critic_for_reviews | float64 | plot_keywords | object
duration | float64 | movie_imdb_link | object
director_facebook_likes | float64 | num_user_for_reviews | float64
actor_3_facebook_likes | float64 | language | object
actor_2_name | object | country | object
actor_1_facebook_likes | float64 | content_rating | object
gross | float64 | budget | float64
genres | object | title_year | float64
actor_1_name | object | actor_2_facebook_likes | float64
movie_title | object | imdb_score | float64
num_voted_users | int64 | aspect_ratio | float64
cast_total_facebook_likes | int64 | movie_facebook_likes | int64
There are a total of 28 variables.
- Duration will not affect IMDB scores
- Variables related to popularity will have a positive correlation with IMDB score
- Budget will affect IMDB score
We split our dataset in an 80:20 ratio, then further divided the 80% (our Train dataset) in another 80:20 ratio to obtain the Train:Validate sets for our Machine Learning models.
- Issue with `gross`
- Issue with `budget`
  - Different movies from different countries had different currencies for their budget.
  - Since the proportion of movies from countries other than the US was quite small, we decided to drop them.
  - We only used movies from the USA.
  - We needed to standardize the budget based on 2016 inflation rates in the US.
- Null Values
  - When the number of null values for a variable was small, we dropped those rows.
  - Otherwise, for numerical data, we replaced them with the median in scenarios such as Machine Learning (sketched further below).
  - For categorical data, we dropped the rows.
- Train : Validate : Test
  - We followed the Train : Validate : Test scheme (sketched further below).
  - Split Train:Test in an 80:20 ratio.
  - Used the Train set for our EDA.
  - Further split Train into Train:Validate in an 80:20 ratio for Machine Learning.
- Binning `imdb_score`
  - We wanted to observe the correlation not just in a numerical manner but also in a categorical manner.
  - Besides, since we couldn't really find any strong linear correlation (as you will read later on), we figured that it would be beneficial to split `imdb_score` into categories.
```python
# Bins to categorise the imdb_score ranges

# Multi-class bins: (0, 3] = "horrendous", (3, 5] = "ok", (5, 7] = "good", (7, 10] = "very good"
imdb_bins = [0, 3, 5, 7, 10]
imdb_labels = ["horrendous", "ok", "good", "very good"]

# Binary bins: scores up to 6.5 = "bad", above 6.5 = "good"
bins = (2, 6.5, 10)
```
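Putting the cleaning steps together, here is a minimal sketch of this section. It is an illustration rather than our exact notebook code: the filename, the imputed/dropped columns and the random seeds are assumptions, while `score_cat` is the binned column referred to in the EDA below.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the Kaggle IMDB 5000 export (filename assumed)
df = pd.read_csv("movie_metadata.csv")

# Keep only US movies, since budgets are later standardised to US CPI
df = df[df["country"] == "USA"]

# Null values: drop rows for categorical columns / columns with few nulls (illustrative choices),
# and fill numerical columns used for Machine Learning with the median
df = df.dropna(subset=["content_rating", "language"])
df["budget"] = df["budget"].fillna(df["budget"].median())

# Bin imdb_score into the categories defined above
df["score_cat"] = pd.cut(df["imdb_score"], bins=imdb_bins, labels=imdb_labels)
df["score_binary"] = pd.cut(df["imdb_score"], bins=bins, labels=["bad", "good"])

# Train : Validate : Test scheme: 80:20 Train:Test, then a further 80:20 Train:Validate
train, test = train_test_split(df, test_size=0.2, random_state=42)
train, validate = train_test_split(train, test_size=0.2, random_state=42)
```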
TLDR:
- we only used movies from the US, and standardised the budget based on 2016 inflation rates.
- used Train : Validate : Test scheme
- removed gross entirely due to inconsistency
- we binned the `imdb_score` into categories, and tried out different bins.
In this section, we will look at univariate and bivariate EDAs concerning the more significant/interesting variables.
We have chosen `imdb_score` as our main response variable, for simplicity. Initially, we wanted to use `gross`, but due to disparities, we decided not to.
These are the most frequently appearing directors:
director_name | count |
---|---|
Steven Spielberg | 22 |
Woody Allen | 18 |
Clint Eastwood | 17 |
Spike Lee | 15 |
Ridley Scott | 15 |
Martin Scorsese | 15 |
Steven Soderbergh | 12 |
Renny Harlin | 12 |
Robert Zemeckis | 12 |
It is interesting to note that Steven Spielberg is also one of the directors of the Top 20 performing movies.
- A large proportion of movies receive close to 0 `num_critic_for_reviews`.
- There is no significant linear correlation between `num_critic_for_reviews` and `imdb_score`.
- The table below shows the movies sorted by their `num_critic_for_reviews`; it does seem to show that `imdb_score` falls in a range of > 7.0 for these 20 movies.
- To be fair, there may be some indication of `imdb_score` based on `num_critic_for_reviews` (as shown in the table); perhaps because a large proportion of the data receives close to 0 reviews, we couldn't observe a linear correlation.
duration vs imdb_score
- We binned the `imdb_score` into categories to form `score_cat`.
- Based on the boxplot, there seems to be a slight correlation between `duration` and `score_cat`.
- The distribution was extremely right-skewed even after removing the outliers, which is not unexpected, since "success" depends on outliers.
- Due to the skew, we used a log transform to visualise the data.
- Distribution after the log transform:
- Bimodal distribution, suggesting that there may be two different "clusters".
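A minimal sketch of that visualisation, assuming `df` holds the cleaned data as before and that the column in question is `director_facebook_likes` (the variable discussed next); `log1p` is used so that rows with zero likes are still handled.

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Histogram of log(1 + director_facebook_likes); the raw values are heavily right-skewed
sns.histplot(np.log1p(df["director_facebook_likes"]), bins=50)
plt.xlabel("log(1 + director_facebook_likes)")
plt.show()
```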
director_facebook_likes vs imdb_score
Although it can't be confirmed that there is a correlation between them, the boxplots show that the median values of imdb_score do vary across the different categories.
However, we note that the "good" and "very good" categories had relatively larger numbers of outliers with larger `director_facebook_likes`; this could possibly suggest that there is some correlation if we split the data into subgroups to observe (recall the bimodal distribution above).
We had to split the strings into individual genres.
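In the IMDB 5000 dataset the genres are pipe-separated strings (e.g. "Action|Adventure|Sci-Fi"), so the split can be sketched as below; `genres_long` is our own (assumed) name, and `df` is the cleaned data from earlier.

```python
# One row per (movie, genre) pair
genres_long = (
    df[["movie_title", "genres", "imdb_score"]]
    .assign(genres=lambda d: d["genres"].str.split("|"))
    .explode("genres")
)

print(genres_long["genres"].value_counts().head())  # Drama comes out on top
```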
Observations
- Most common genre: Drama
- Is it because it is the most profitable?
- This formed our hypothesis: assuming that the movie industry follows demand and supply, there is high demand for Dramas, so this genre will be the most popular, with the highest ratings amongst the other genres.
genres vs mean imdb_scores
We calculated the mean imdb_scores for each genre.
The results:
It seems that Film-Noir has the highest imdb_score; however, this is inaccurate, as we later found out that there were only 5 Film-Noir movies contributing to this observation.
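A sketch of that calculation, building on `genres_long` from the split above; keeping a count column alongside the mean is what exposes the 5-movie Film-Noir caveat.

```python
# Mean imdb_score per genre, plus how many movies back each mean
genre_scores = (
    genres_long.groupby("genres")["imdb_score"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)

print(genre_scores.head(10))
```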
As noted here:
We decided to use only movies produced in the USA, so we could standardize the budget based on CPI (referenced: https://aarya1995.github.io/)
We performed web scraping using BeautifulSoup to obtain CPI data. Then, we updated the budget column of the whole dataset.
```python
from bs4 import BeautifulSoup
import requests
```
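The snippet below is only a rough sketch of this step: the URL is a placeholder, and the parsing assumes a simple year/CPI table, which may not match the page we actually scraped. The adjustment itself multiplies each budget by CPI(2016) / CPI(title_year).

```python
# Placeholder URL; the real CPI source and its page layout may differ
CPI_URL = "https://example.com/us-cpi-by-year"

soup = BeautifulSoup(requests.get(CPI_URL).text, "html.parser")

# Assume a simple table with one row per year: <td>year</td><td>CPI</td>
cpi_by_year = {}
for row in soup.find_all("tr"):
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if len(cells) >= 2 and cells[0].isdigit():
        cpi_by_year[int(cells[0])] = float(cells[1])

# Standardise every budget to 2016 dollars
df["budget"] = df.apply(
    lambda r: r["budget"] * cpi_by_year[2016] / cpi_by_year[int(r["title_year"])],
    axis=1,
)
```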
budget vs imdb_score
Initially, we couldn't really see any pattern with only 2 and 4 imdb_score bins, so we split into 5 bins and saw a clearer picture. It does seem like a higher budget can influence imdb_score. However, for the "horrendous" category, the budget spent on those movies seems to be higher.
This could mean that although budget does follow a certain trend as imdb_score increases, we ought to be careful with our budget, as there is still a risk of the movie turning out to be "horrendous".
```python
# The new bins (5 categories) we used for budget vs imdb_score
bins = [1, 3, 4, 6, 9, 10]
labels = ["horrendous", "very bad", "bad", "ok", "good"]
```
- Positively skewed: a large proportion had very few voted users.
- Not much linear correlation either, with a correlation of 0.470567.
A large proportion has an imdb_score of around 5-8.
The median imdb_score is 6.5, which is why we chose one of our bins to be [0, 6.6, 10] (i.e. 0-6.5 will be classified as "bad" and 6.6-10 as "good").
The heat map shows that some variables affecting imdb_score are:
- `num_critic_for_reviews`
- `duration`
- `num_voted_users`
- `num_user_for_reviews`
- `movie_facebook_likes`
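For reference, a bare-bones version of such a heat map (assuming `df` holds the cleaned Train data):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix over the numeric columns only
corr = df.select_dtypes(include="number").corr()

plt.figure(figsize=(12, 10))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.show()
```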
We explored several ML models; the best-performing one for our dataset turned out to be ..... Random Forest!
The models we used were:
- Linear Regression
- Logistic Regression
- K-Means
- Decision Tree
- Random Forest (Main)
- As expected, since our dataset is highly categorically inclined, both the bivariate and multivariate linear regressions had poor R² and MSE scores.
- Below are the scores of some of the bivariate LRs that we attempted.
- Multivariate LR
- Logistic Regression showed slightly better results; however, the accuracy scores were not that high either.
- This implies that there is no "clear cut" boundary between the data.
- Since there was some improvement, maybe a decision tree would show better results.
- We performed Multivariate and Multiclass Logistic Regression
- For multivariate logR, scaling improved the accuracy score from 0.51 to 0.66
- K-folds also improved:
  - multivariate accuracy scores from 0.51 to 0.63
  - multiclass accuracy scores from 0.65 to 0.66 (very slightly)
- Multiclass logRegression showed better scores than binomial logRegression.
- We also used other metrics, like F1 scores and precision, to evaluate our model.
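A sketch of the scaled logistic regression with k-fold cross-validation; `X_train` and `y_train` are assumed to hold the predictors and the binned imdb_score from the earlier split, and `StandardScaler` is our assumption for the scaler used.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scaling + logistic regression in one pipeline, evaluated with 5-fold cross-validation
log_reg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(log_reg, X_train, y_train, cv=5, scoring="accuracy")

print("mean CV accuracy:", scores.mean())
```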
K-means is an unsupervised machine learning model. We found that the optimal number of clusters is 3 (using the elbow method).
The 2-D grid, parallel coordinates plot and boxplot all show that `budget` is a huge determinant in the split between the clusters!
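A minimal sketch of the elbow method, with `X` assumed to be the scaled numeric features:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit K-means for k = 1..10 and plot the inertia; the "elbow" sits around k = 3
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

plt.plot(range(1, 11), inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia")
plt.show()
```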
Train vs Validation results: relatively good performance, with 0.7 - 0.83 accuracy. This further confirms that our dataset is highly categorically inclined. However, the train data had slightly better accuracy than the validation data, indicating that there may be slight overfitting issues.
Nevertheless, since the performance was good, we decided to use the decision tree (dectree) on our Test dataset. Below are the results.
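For reference, a rough sketch of this train/validate/test evaluation; the `X_*`/`y_*` variables are assumed from the earlier splits, and `max_depth=4` is an illustrative choice rather than our tuned value.

```python
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

dectree = DecisionTreeClassifier(max_depth=4, random_state=42)
dectree.fit(X_train, y_train)

# Accuracy on each split
for name, X_, y_ in [("train", X_train, y_train),
                     ("validate", X_validate, y_validate),
                     ("test", X_test, y_test)]:
    print(name, accuracy_score(y_, dectree.predict(X_)))
```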
Random Forest was the best! (Although, again, there may be slight overfitting for the same reasons as dectree, and it took quite long to run.)
Accuracies of :
- Train Data = 0.96
- Validation Data = 0.82
- Test Data = 0.99
Feature importance in a random forest shows how important each feature is in determining the decisions the trees make.
Below is the feature importance for determining `imdb_score`.
It turned out that `num_voted_users`, `duration`, `num_user_for_reviews`, `num_critic_for_reviews` and `budget` are the top 5 determinants.
It is interesting to note that the variables that indicate popularity are `num_voted_users`, `num_user_for_reviews`, and `num_critic_for_reviews`, and it is not unexpected for them to be determinants of success (imdb_score).
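A sketch of the random forest and its feature importances; `X_train`/`y_train` are assumed as before, and `n_estimators=100` is an illustrative setting.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# Rank the features by how much they reduce impurity across the forest
importances = pd.Series(forest.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(5))
```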
Unfortunately, as much as we wanted to see some correlation, our bivariate EDA tells us that there isn't any. See the boxplot below! However, the data does follow a somewhat normal distribution, with a median length of 13.
We looked into the Top 20 imdb_score movies and searched for their directors' personality types online. Here are the results!
index | director_name | personality type | index | director_name | personality type
---|---|---|---|---|---
0 | Frank Darabont | INFP | 10 | David Fincher | INTJ
1 | Francis Ford Coppola | INTJ | 11 | Christopher Nolan | INTJ
2 | John Stockwell | INFP | 12 | Peter Jackson | ENFP
3 | Christopher Nolan | INTJ | 13 | Irvin Kershner | INTP
4 | Francis Ford Coppola | INTJ | 14 | Mitchell Altieri | n/a
5 | Peter Jackson | ENFJ | 15 | Lana Wachowski | ENFP
6 | Sergio Leone | n/a | 16 | Cary Bell | n/a
7 | Steven Spielberg | ISFP | 17 | Fernando Meirelles | INFP
8 | Quentin Tarantino | ENTP | 18 | Milos Forman | INTP
9 | Robert Zemeckis | ENFP | 19 | Akira Kurosawa | INFJ
Observations: almost all of them (except for one, Steven Spielberg) have "N" in their personality type, which is the intuitive element.
Do you, as a movie director, have these personality traits too?
- Our outcomes show that decision tree and random forest are the most suitable machine learning models for our data set.
- This may be due to our dataset having skewed and imbalanced data. Our dataset also does not have very strong linear relationships.
- The duration and budget of the movie are among the top 5 features affecting imdb_score.
- The popularity of the director and cast plays a role in determining imdb_score.
- The top 3 genres affecting imdb_score are drama, comedy and action. This aligns with our bivariate EDA, as drama is one of the most represented genres affecting imdb_score.
So a movie director should pay close attention to the aforementioned factors.
Generally, based on our EDA and ML, movies with the following attributes will do better on the imdb rating score:
- Higher duration
- Higher budget
- More popular director and cast
- Movies with the genres of drama, comedy and/or action
- Standardising budget to 2016 inflation rates, as the latest movies only go up to 2016
- Web scraping
- Visualisations:
  - 3D scatter plot & word cloud
- Machine Learning:
  - K-modes & K-means
  - Logistic Regression
    - using Scaler() from sklearn
  - Random Forest
  - Feature Importance
  - Metrics
- Analysis of the directors' personalities may be biased, because they may have been classified into those personality types based on their careers. Therefore, it may not be an accurate representation. However, it is still interesting to note their personalities!
- Further analysis can be done on other variables that indicate success through the popularity of the movie, like `director_facebook_likes`, `num_critic_for_reviews`, `num_voted_users`.
- Our dataset is quite imbalanced and skewed, therefore a larger dataset may help.
- ML : KMeans, Decision Tree
- Presentation
- EDA
- Codes for EDA : EDA on last 9 Variables
- Data Visualisation
- ML : Random Forest, Linear Regression
- Presentation
- EDA
- Codes for EDA : EDA on mid 9 Variables
- Data Visualisation
- ML : Logistic Regression
- Presentation
- EDA
- Codes for EDA : EDA on first 9 Variables
- Github
- Answering Interesting Questions : Codes here
- https://aarya1995.github.io/
- https://www.kaggle.com/code/carolzhangdc/predict-imdb-score-with-data-mining-algorithms
- https://www.kaggle.com/code/niklasdonges/end-to-end-project-with-python/notebook
- https://medium.com/@kohlishivam5522/understanding-a-classification-report-for-your-machine-learning-model-88815e2ce397
- https://www.analyticsvidhya.com/blog/2020/03/one-hot-encoding-vs-label-encoding-using-scikit-learn/
- http://rstudio-pubs-static.s3.amazonaws.com/342210_7c8d57cfdd784cf58dc077d3eb7a2ca3.html#conclusion
- https://scikit-learn.org/stable/modules/impute.html
- https://www.datacamp.com/community/tutorials/wordcloud-python
- https://machinelearningmastery.com
- https://www.bespeaking.com/wp-content/uploads/2019/09/Movie-vocab.jpg