Modeling and Prediction for Movie Audience Ratings
Synopsis
One of the key issues facing film production companies is, will the production company make a profit from a movie. It is assumed that favorable audience reviews will in-turn lead to higher ticket sales or DVD sales, both items directly affect a movie’s profitability.
The analysis will look at what attributes lead to a higher average audience review score on the public website, Rotten Tomatoes
Spoiler Alert The analysis creates a model that is close, but isn’t 100% confident.
Part 1: Data
Data Set Source
Movie Information was provided from three websites, Internet Movie Database, referred to as IMDB, Rotten Tomatoes, and Box Office Mojo
IMDB and Rotten Tomatoes websites allow for general public to submit their opinion on a given movie. However there is no validation on if the person submitting the opinion actually saw the movie or the opinion is completely unsolicited. The reviews are not limited to one country, and the reviews are not limited by age. Review collection is not conducted in a scientific manner, and it amounts to a popularity opinion poll. The collection is limited to general population who have internet access, visited the aforementioned website(s), and are aware that they can submit a review for a specific movie.
With that said it is advisable that the results can not be generalized to the entire general movie-going population. The analysis will not try to establish a causal relationship between the variables as there was no random assignment for explanatory and response variables.
An archived version of movie informational data was used. Movie Information Source
Load packages
1 2 3 4 5 6 7 |
# load standard packages in R library(dplyr) library(ggplot2) library(plotly) library(BAS) library(MASS) |
Load data
1 2 3 |
# data provided for the rubric load("movies.Rdata") |
Data Dictionary
A brief description of the fields used in the analysis are listed in the following section.
field names | field types | field description | calculated field | |
---|---|---|---|---|
2 | feature_film | char | Is movie a feature film | Y |
3 | is_drama | char | Genre of the movie is Drama | Y |
4 | runtime | int | Runtime of movie in minutes | N |
5 | mpaa_rating_r | char | MPAA rating is R | Y |
6 | thtr_rel_year | int | Year the movie is released in theaters | N |
7 | oscar_season | char | Month movie is release in theaters in October, November, or December | Y |
8 | summer_season | char | Month movie is release in theaters in May, June, July, August | Y |
9 | imdb_rating | int | Rating on IMDB. Rating on IMDB on a scale of 1-10; 10 being highest. http://www.imdb.com | N |
10 | imdb_num_votes | int | Number of votes on IMDB | N |
11 | critics_score | int | Critics score on Rotten Tomatoes | N |
12 | best_pic_nom | char | Whether or not the movie was nominated for a best picture Oscar: yes, no | N |
13 | best_pic_win | char | Whether or not the movie won a best picture Oscar: yes, no | N |
14 | best_actor_win | char | Whether or not one of the main actors in the movie ever won an Oscar: yes, no | N |
15 | best_actress_win | char | Whether or not one of the main actresses in the movie ever won an Oscar: yes, no | N |
16 | best_dir_win | char | Whether or not the director of the movie ever won an Oscar: yes, no | N |
17 | top200_box | int | Whether or not the movie is in the Top 200 Box Office list on BoxOfficeMojo | N |
18 | audience_score | int | Audience score on Rotten Tomatoes | N |
Part 2: Data Processing
Create Calculated Columns
Create the following calculated columns:
+ feature_film: Use the variable title_type to assign yes or no if title_type == “Feature Film”
+ is_drama: use the variable genre, assign yes or no if genre == “Drama”
+ mpaa_rating_r: use the variable mpaa_rating, assign yes or no if mpaa_rating == “R”
+ oscar_season: use the variable thtr_rel_month, if value is in 10,11,12 assign yes, otherwise no.
+ summer_season: use the variable thtr_rel_month, if value is in 5,6,7,8 assign yes, otherwise no.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
# add feature_film variable movies <- movies %>% mutate(feature_film = as.factor(if_else(title_type == "Feature Film", "Yes" , "No"))) # add is_drama variable movies <- movies %>% mutate(mpaa_rating_r = as.factor(if_else(mpaa_rating == "R", "Yes" ,"No"))) # add is_drama variable movies <- movies %>% mutate(is_drama = as.factor(if_else(genre == "Drama", "Yes" ,"No"))) # add oscar_season variable movies <- movies %>% mutate(oscar_season = as.factor(if_else(between(thtr_rel_month, 10,12), "Yes" , "No"))) # add summer_season variable movies <- movies %>% mutate(summer_season = as.factor(if_else(between(thtr_rel_month, 5,8), "Yes" , "No"))) |
Subsetting the Data
- Select the columns used for the analysis
- runtime, thtr_rel_year ,imdb_rating ,imdb_num_votes ,critics_score ,top200_box
- best_pic_nom ,best_pic_win ,best_actor_win ,best_actress_win ,best_dir_win
- audience_score
- feature_film ,is_drama ,mpaa_rating_r ,oscar_season ,summer_season
- Remove any observations that have NA values
1 2 3 4 5 6 7 8 9 10 11 12 |
# 1. Select applicable columns for analysis movies_subset <- movies %>% dplyr::select(runtime, thtr_rel_year ,imdb_rating ,imdb_num_votes ,critics_score ,top200_box ,best_pic_nom ,best_pic_win ,best_actor_win ,best_actress_win ,best_dir_win ,audience_score ,feature_film ,is_drama ,mpaa_rating_r ,oscar_season ,summer_season ) #2. remove the obs with NA values movies_subset <- na.omit(movies_subset) |
Remove observations
- Extract observations for feature film’s only; TV Movies are ineligible for an Oscar, documentaries do not have actors, and top 200 variable would not be applicable.
1 2 3 4 5 |
#count of observations being removed non_feature_films <- sum(movies$title_type != "Feature Film" | movies$genre == "Documentary") # remove observations from the dataset movies_subset <- movies_subset %>% filter(movies_subset$feature_film == "Yes") |
Non feature films removed from the observations: 63
Part 3: Exploratory data analysis
Exploratory data analysis will look at the calculated fields that were created in the data processing step:
- feature_film
- is_drama
- mpaa_rating_r
- oscar_season
- summer_season
Additionally I added the other variables that have binomial values for comparisons.
Summary Statistics:
binominal variables
A summary tables of all variables that have a binomial value.
1 2 3 4 5 |
movies_subset %>% dplyr::select(top200_box,best_pic_nom,best_pic_win ,best_actor_win,best_actress_win,best_dir_win ,is_drama,mpaa_rating_r, oscar_season , summer_season) %>% summary() |
1 2 3 4 5 6 |
## top200_box best_pic_nom best_pic_win best_actor_win best_actress_win ## no :576 no :569 no :584 no :500 no :521 ## yes: 15 yes: 22 yes: 7 yes: 91 yes: 70 ## best_dir_win is_drama mpaa_rating_r oscar_season summer_season ## no :548 No :290 No :274 No :415 No :397 ## yes: 43 Yes:301 Yes:317 Yes:176 Yes:194 |
top200_box | count | avg_score | avg_score_delta |
---|---|---|---|
no | 576 | 60.09896 | -19.37 |
yes | 15 | 74.53333 | 24.02 |
best_pic_nom | count | avg_score | avg_score_delta |
---|---|---|---|
no | 569 | 59.50439 | -30.26 |
yes | 22 | 85.31818 | 43.38 |
best_pic_win | count | avg_score | avg_score_delta |
---|---|---|---|
no | 584 | 60.17466 | -28.97 |
yes | 7 | 84.71429 | 40.78 |
best_actor_win | count | avg_score | avg_score_delta |
---|---|---|---|
no | 500 | 60.06600 | -4.14 |
yes | 91 | 62.65934 | 4.32 |
best_actress_win | count | avg_score | avg_score_delta |
---|---|---|---|
no | 521 | 60.04223 | -5.62 |
yes | 70 | 63.61429 | 5.95 |
best_dir_win | count | avg_score | avg_score_delta |
---|---|---|---|
no | 548 | 59.75547 | -14.04 |
yes | 43 | 69.51163 | 16.33 |
is_drama | count | avg_score | avg_score_delta |
---|---|---|---|
No | 290 | 55.40345 | -15.21 |
Yes | 301 | 65.34219 | 17.94 |
mpaa_rating_r | count | avg_score | avg_score_delta |
---|---|---|---|
No | 274 | 59.32117 | -3.47 |
Yes | 317 | 61.45426 | 3.60 |
oscar_season | count | avg_score | avg_score_delta |
---|---|---|---|
No | 415 | 59.65783 | -4.35 |
Yes | 176 | 62.36932 | 4.55 |
summer_season | count | avg_score | avg_score_delta |
---|---|---|---|
No | 397 | 60.55416 | 0.45 |
Yes | 194 | 60.28351 | -0.45 |
The average score delta shows the percentage difference between the two groups for each of the variables. The results are showing that the variables, best_pic_nom(43.38%), best_pic_win(40.78%), best_dir_win(16.33%), and is_drama(17.94%) have the largest delta in the audience average score metric.
I would anticipate the model will use these variables in the prediction of the audience score.
The calculated fields created in the data processing step:
- is_drama – Yes value: 17.94% average score delta
- mpaa_rating_r – Yes value: 3.6% average score delta
- oscar_season – Yes value: 4.55% average score delta
- summer_season – Yes value: -0.45% average score delta
The variables, is_drama and oscar_season are anticipated to have more value to the model than mpaa_rating_r or summer_season.
Part 4: Modeling
Modeling Method
Bayesian Model Averaging
Assumption: significance level of 0.05
- Generate the model
- Review summary statistics
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
# generate the model movies_subset_model <- movies_subset %>% dplyr::select(runtime, thtr_rel_year ,imdb_rating ,imdb_num_votes ,critics_score ,top200_box ,best_pic_nom ,best_pic_win ,best_actor_win ,best_actress_win ,best_dir_win ,audience_score ,is_drama ,mpaa_rating_r ,oscar_season ,summer_season ) movies_model = bas.lm(audience_score ~., data = movies_subset_model, prior = "BIC" ,modelprior = uniform() ,method = "MCMC" ,MCMC.iterations = 50000 ) |
P(B != 0 | Y) | model 1 | model 2 | model 3 | model 4 | model 5 | |
---|---|---|---|---|---|---|
Intercept | 1.00000 | 1.0000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 |
runtime | 0.22832 | 0.0000 | 1.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
thtr_rel_year | 0.10710 | 0.0000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
imdb_rating | 0.99986 | 1.0000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 |
imdb_num_votes | 0.06308 | 0.0000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
critics_score | 0.83678 | 1.0000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 |
top200_boxyes | 0.04844 | 0.0000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
best_pic_nomyes | 0.11190 | 0.0000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
best_pic_winyes | 0.04154 | 0.0000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
best_actor_winyes | 0.14526 | 0.0000 | 0.0000000 | 0.0000000 | 1.0000000 | 0.0000000 |
best_actress_winyes | 0.12862 | 0.0000 | 0.0000000 | 0.0000000 | 0.0000000 | 1.0000000 |
best_dir_winyes | 0.06868 | 0.0000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
is_dramaYes | 0.04888 | 0.0000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
mpaa_rating_rYes | 0.19296 | 0.0000 | 0.0000000 | 1.0000000 | 0.0000000 | 0.0000000 |
oscar_seasonYes | 0.08030 | 0.0000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
summer_seasonYes | 0.08898 | 0.0000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
BF | NA | 1.0000 | 0.2940317 | 0.2275911 | 0.1986676 | 0.1526816 |
PostProbs | NA | 0.2017 | 0.0566000 | 0.0434000 | 0.0419000 | 0.0333000 |
R2 | NA | 0.7251 | 0.7269000 | 0.7267000 | 0.7265000 | 0.7263000 |
dim | NA | 3.0000 | 4.0000000 | 4.0000000 | 4.0000000 | 4.0000000 |
logmarg | NA | -3278.5884 | -3279.8124782 | -3280.0686152 | -3280.2045328 | -3280.4678110 |
1 2 3 |
# summary coefficent confint(coefficients(movies_model)) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
## 2.5% 97.5% beta ## Intercept 5.960184e+01 6.127487e+01 6.046531e+01 ## runtime -7.670914e-02 0.000000e+00 -1.227545e-02 ## thtr_rel_year -6.566192e-02 0.000000e+00 -5.982874e-03 ## imdb_rating 1.341579e+01 1.646231e+01 1.484804e+01 ## imdb_num_votes -5.147476e-08 2.055815e-06 2.716804e-07 ## critics_score 0.000000e+00 1.079631e-01 6.012133e-02 ## top200_boxyes 0.000000e+00 0.000000e+00 8.887782e-02 ## best_pic_nomyes -9.782284e-03 4.768328e+00 4.324454e-01 ## best_pic_winyes 0.000000e+00 0.000000e+00 -1.311505e-02 ## best_actor_winyes -2.716195e+00 0.000000e+00 -2.967139e-01 ## best_actress_winyes -2.646017e+00 0.000000e+00 -2.714538e-01 ## best_dir_winyes -1.609243e+00 1.183877e-01 -1.190107e-01 ## is_dramaYes 0.000000e+00 7.519898e-02 3.044855e-02 ## mpaa_rating_rYes -2.255332e+00 0.000000e+00 -3.116354e-01 ## oscar_seasonYes -1.257636e+00 2.031528e-03 -9.153999e-02 ## summer_seasonYes 0.000000e+00 1.314042e+00 1.059002e-01 ## attr(,"Probability") ## [1] 0.95 ## attr(,"class") ## [1] "confint.bas" |
In the summary table, Model 1 is using only 2 variables, imdb rating and critics_score. The posterior probability for the model is 0.1938.
After calculating the credible intervals for the regression coefficients the results are:
+95% probability audience_score value will increase by 1.35 to 1.64 for every point increase for the imdb_rating.
Model Performance Review
1 2 |
diagnostics(movies_model) |
The MCMC diagnostic plot shows if the Markov chain has converged; the quantities should be the same and appear on the line. There appears to be one exception, which isn’t a concern.
1 2 |
plot(movies_model) |
In the residuals there appears to be an issue. The residuals should be scattered randomly around the zero line. It seems that constant variability condition hasn’t been met. I am not surprised since it would be logical to assume that the imdb_rating which is obtained in a similar fashion as audience_score might be causing the issue.
The model probabilities plot appears normal.
Model complexity shows how increasing variables changes the Bayes factor. Models with greater than 2 variables appear to be able to predict outcomes.
On the inclusion probabilities, it is clearly only imdb_rating and critics_score that are about the 0.5 value.
Part 5: Prediction
For the prediction, I choose the 2016 movie Hidden Figures.
source:imdb website
The movie is story of a team of female African-American mathematicians who served a vital role in NASA during the early years of the U.S. space program.
Director: Theodore Melfi
Actors: Stars: Taraji P. Henson, Octavia Spencer, Janelle Monáe
imdb_rating: 7.8
genre: Drama
audience_score: 93
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 |
# data frame with relevant information new_movie <- data.frame(runtime =127 ,thtr_rel_year =2016 ,imdb_rating =7.8 ,imdb_num_votes =127666 ,critics_score =92 ,top200_box ="yes" ,best_pic_nom ="yes" ,best_pic_win ="no" ,best_actor_win ="no" ,best_actress_win ="yes" ,best_dir_win ="no" ,is_drama = "Yes" ,mpaa_rating_r = "No" ,oscar_season = "No" ,summer_season = "No" ,audience_score = 93 ) # calculate the prediciton movies_predict <- predict(movies_model, new_movie, se.fit = TRUE, esimator = "BMA" ,prediction=TRUE) movies_predict.fit <- round(movies_predict$fit[,1],1) prediction_error_percentage <- round((1- movies_predict.fit / 93)*100 , 2) movies_ci <- confint(movies_predict, estimator="BMA") |
The model predicted: 83.8
The actual results: 93
The error in the results was: 9.89%
With a confidence interval of 95% the lower bound the model presented was: 62.5703955
With a confidence interval of 95% the upper bound the model presented was: 103.7159054
The upper bound exceeds 100%; therefore we know there are some small issues.
Analysis is 95% confidant that the audience score is between 62.5703955 and 103.7159054
Part 6: Conclusion
The constructed model was designed to try and predict the audience_score for a movie based on 16 variables. The upper and lower bounds did contain the actual audience_score; a better model can most likely be constructed that can reduce the prediction error percentage.
One issue is the analysis didn’t meet the condition of constant variability for Bayesian modeling. One might surmise that the imdb_rating which is generated in a similar fashion as the audience_score is contributing to model issues. Additionally best picture nomination and best picture win variables may only apply after a significant portion of the audience reviews were tabulated.
Suggestion for further analysis would be to look at whether the protagonist was a male or female. Do audience members take the lead actor/actress into account when rating a movie?
Leave a Reply