Modeling and Prediction for Movie Audience Ratings


A key question for film production companies is whether a movie will turn a profit. This analysis assumes that favorable audience reviews lead to higher ticket and DVD sales, both of which directly affect a movie’s profitability.

The analysis looks at which attributes are associated with a higher average audience review score on the public website Rotten Tomatoes.

Spoiler alert: the analysis produces a model whose prediction is close to the actual score, but not exact.

Part 1: Data

Data Set Source

Movie information came from three websites: Internet Movie Database (referred to as IMDB), Rotten Tomatoes, and Box Office Mojo.

IMDB and Rotten Tomatoes allow the general public to submit opinions on a given movie. However, there is no validation that the person submitting an opinion actually saw the movie, and the opinions are entirely unsolicited. Reviews are not limited to one country or by age. Review collection is not conducted in a scientific manner; it amounts to a popularity opinion poll. The sample is limited to members of the general population who have internet access, visited the aforementioned website(s), and are aware that they can submit a review for a specific movie.

For these reasons, the results cannot be generalized to the entire movie-going population. The analysis also does not try to establish a causal relationship between the variables, as there was no random assignment.

An archived version of movie informational data was used. Movie Information Source

Load packages

Load data

Data Dictionary

Brief descriptions of the fields used in the analysis are listed in the following section.

Codebook source

| field name | field type | field description | calculated field |
|---|---|---|---|
| feature_film | char | Whether the movie is a feature film | Y |
| is_drama | char | Whether the genre of the movie is Drama | Y |
| runtime | int | Runtime of the movie in minutes | N |
| mpaa_rating_r | char | Whether the MPAA rating is R | Y |
| thtr_rel_year | int | Year the movie was released in theaters | N |
| oscar_season | char | Whether the movie was released in theaters in October, November, or December | Y |
| summer_season | char | Whether the movie was released in theaters in May, June, July, or August | Y |
| imdb_rating | num | Rating on IMDB on a scale of 1-10; 10 being highest | N |
| imdb_num_votes | int | Number of votes on IMDB | N |
| critics_score | int | Critics score on Rotten Tomatoes | N |
| best_pic_nom | char | Whether or not the movie was nominated for a best picture Oscar: yes, no | N |
| best_pic_win | char | Whether or not the movie won a best picture Oscar: yes, no | N |
| best_actor_win | char | Whether or not one of the main actors in the movie ever won an Oscar: yes, no | N |
| best_actress_win | char | Whether or not one of the main actresses in the movie ever won an Oscar: yes, no | N |
| best_dir_win | char | Whether or not the director of the movie ever won an Oscar: yes, no | N |
| top200_box | char | Whether or not the movie is in the Top 200 Box Office list on BoxOfficeMojo: yes, no | N |
| audience_score | int | Audience score on Rotten Tomatoes | N |

Part 2: Data Processing

Create Calculated Columns

Create the following calculated columns:
+ feature_film: use the variable title_type; assign yes if title_type == “Feature Film”, otherwise no.
+ is_drama: use the variable genre; assign yes if genre == “Drama”, otherwise no.
+ mpaa_rating_r: use the variable mpaa_rating; assign yes if mpaa_rating == “R”, otherwise no.
+ oscar_season: use the variable thtr_rel_month; assign yes if the value is 10, 11, or 12, otherwise no.
+ summer_season: use the variable thtr_rel_month; assign yes if the value is 5, 6, 7, or 8, otherwise no.
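
The rules above can be sketched in a few lines. The original code chunk is not shown in this write-up, so the following is only an illustrative Python version; the dict-per-movie layout, the function name add_calculated_columns, and the example values are assumptions, not the analysis code itself.

```python
# Hypothetical layout: each movie is a dict keyed by the source column names.
def add_calculated_columns(movie):
    """Return a copy of `movie` with the five yes/no calculated columns added."""
    def yes_no(flag):
        return "yes" if flag else "no"

    out = dict(movie)
    out["feature_film"] = yes_no(movie["title_type"] == "Feature Film")
    out["is_drama"] = yes_no(movie["genre"] == "Drama")
    out["mpaa_rating_r"] = yes_no(movie["mpaa_rating"] == "R")
    out["oscar_season"] = yes_no(movie["thtr_rel_month"] in (10, 11, 12))
    out["summer_season"] = yes_no(movie["thtr_rel_month"] in (5, 6, 7, 8))
    return out


# Made-up example record, for illustration only.
example = add_calculated_columns(
    {"title_type": "Feature Film", "genre": "Drama",
     "mpaa_rating": "PG", "thtr_rel_month": 12}
)
```

Each rule is a single comparison, which keeps the derived columns easy to audit against the list above.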

Subsetting the Data

  1. Select the columns used for the analysis
    • runtime, thtr_rel_year, imdb_rating, imdb_num_votes, critics_score, top200_box
    • best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win
    • audience_score
    • feature_film, is_drama, mpaa_rating_r, oscar_season, summer_season
  2. Remove any observations that have NA values

Remove observations

  1. Extract observations for feature films only; TV movies are ineligible for an Oscar, documentaries do not have actors, and the top 200 box office variable would not be applicable.

Non-feature films removed from the observations: 63
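
The subsetting and removal steps could look roughly like the sketch below. It is written in Python for illustration and is not the original chunk; representing NA as None, the function name subset_movies, and the list-of-dicts layout are all assumptions.

```python
# Columns kept for the analysis, as listed in the subsetting step.
ANALYSIS_COLUMNS = [
    "runtime", "thtr_rel_year", "imdb_rating", "imdb_num_votes",
    "critics_score", "top200_box",
    "best_pic_nom", "best_pic_win", "best_actor_win", "best_actress_win",
    "best_dir_win",
    "audience_score",
    "feature_film", "is_drama", "mpaa_rating_r", "oscar_season",
    "summer_season",
]


def subset_movies(movies):
    """Select the analysis columns, drop incomplete rows, keep feature films.

    Assumes missing values (NA) are represented as None.
    Returns the kept rows and the number of non-feature films removed.
    """
    selected = [{col: m.get(col) for col in ANALYSIS_COLUMNS} for m in movies]
    complete = [m for m in selected if all(v is not None for v in m.values())]
    feature_only = [m for m in complete if m["feature_film"] == "yes"]
    removed = len(complete) - len(feature_only)
    return feature_only, removed
```

Returning the removed count alongside the data makes it easy to report a figure like the one above.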

Part 3: Exploratory data analysis

Exploratory data analysis will look at the calculated fields that were created in the data processing step:

  • feature_film
  • is_drama
  • mpaa_rating_r
  • oscar_season
  • summer_season

Additionally, I included the other variables that have two-level (yes/no) values for comparison.

Summary Statistics:

Binary variables

Summary tables for all variables that take one of two values.

| variable | value | count | avg_score | avg_score_delta (%) |
|---|---|---|---|---|
| top200_box | no | 576 | 60.09896 | -19.37 |
| top200_box | yes | 15 | 74.53333 | 24.02 |
| best_pic_nom | no | 569 | 59.50439 | -30.26 |
| best_pic_nom | yes | 22 | 85.31818 | 43.38 |
| best_pic_win | no | 584 | 60.17466 | -28.97 |
| best_pic_win | yes | 7 | 84.71429 | 40.78 |
| best_actor_win | no | 500 | 60.06600 | -4.14 |
| best_actor_win | yes | 91 | 62.65934 | 4.32 |
| best_actress_win | no | 521 | 60.04223 | -5.62 |
| best_actress_win | yes | 70 | 63.61429 | 5.95 |
| best_dir_win | no | 548 | 59.75547 | -14.04 |
| best_dir_win | yes | 43 | 69.51163 | 16.33 |
| is_drama | No | 290 | 55.40345 | -15.21 |
| is_drama | Yes | 301 | 65.34219 | 17.94 |
| mpaa_rating_r | No | 274 | 59.32117 | -3.47 |
| mpaa_rating_r | Yes | 317 | 61.45426 | 3.60 |
| oscar_season | No | 415 | 59.65783 | -4.35 |
| oscar_season | Yes | 176 | 62.36932 | 4.55 |
| summer_season | No | 397 | 60.55416 | 0.45 |
| summer_season | Yes | 194 | 60.28351 | -0.45 |


The average score delta is the percentage difference between a group’s average audience score and the other group’s average for the same variable. The variables best_pic_nom (43.38%), best_pic_win (40.78%), top200_box (24.02%), is_drama (17.94%), and best_dir_win (16.33%) have the largest deltas in average audience score.
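
As a worked check, the delta appears to be each group’s average score relative to the other group’s average. That formula is inferred from the tabulated values rather than taken from the original code, and the function name avg_score_delta is only illustrative. A Python sketch using the top200_box averages:

```python
def avg_score_delta(group_mean, other_mean):
    """Percentage difference of one group's mean score relative to the other's."""
    return (group_mean - other_mean) / other_mean * 100


# Average audience scores for top200_box, taken from the summary table.
no_mean, yes_mean = 60.09896, 74.53333

yes_delta = round(avg_score_delta(yes_mean, no_mean), 2)
no_delta = round(avg_score_delta(no_mean, yes_mean), 2)
print(yes_delta, no_delta)  # 24.02 -19.37
```

Because each delta is taken relative to the other group, the yes and no deltas for a variable are not mirror images of each other.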

I would anticipate the model will use these variables in the prediction of the audience score.


The calculated fields created in the data processing step:

  • is_drama – Yes value: 17.94% average score delta
  • mpaa_rating_r – Yes value: 3.6% average score delta
  • oscar_season – Yes value: 4.55% average score delta
  • summer_season – Yes value: -0.45% average score delta

The variables is_drama and oscar_season are anticipated to add more value to the model than mpaa_rating_r or summer_season.


Part 4: Modeling

Modeling Method

Bayesian Model Averaging

Assumption: 95% credible intervals (corresponding to a significance level of 0.05)

  1. Generate the model
  2. Review summary statistics

| term | P(B != 0 \| Y) | model 1 | model 2 | model 3 | model 4 | model 5 |
|---|---|---|---|---|---|---|
| Intercept | 1.00000 | 1.0000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 |
| runtime | 0.22832 | 0.0000 | 1.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
| thtr_rel_year | 0.10710 | 0.0000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
| imdb_rating | 0.99986 | 1.0000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 |
| imdb_num_votes | 0.06308 | 0.0000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
| critics_score | 0.83678 | 1.0000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 |
| top200_boxyes | 0.04844 | 0.0000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
| best_pic_nomyes | 0.11190 | 0.0000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
| best_pic_winyes | 0.04154 | 0.0000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
| best_actor_winyes | 0.14526 | 0.0000 | 0.0000000 | 0.0000000 | 1.0000000 | 0.0000000 |
| best_actress_winyes | 0.12862 | 0.0000 | 0.0000000 | 0.0000000 | 0.0000000 | 1.0000000 |
| best_dir_winyes | 0.06868 | 0.0000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
| is_dramaYes | 0.04888 | 0.0000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
| mpaa_rating_rYes | 0.19296 | 0.0000 | 0.0000000 | 1.0000000 | 0.0000000 | 0.0000000 |
| oscar_seasonYes | 0.08030 | 0.0000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
| summer_seasonYes | 0.08898 | 0.0000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
| BF | NA | 1.0000 | 0.2940317 | 0.2275911 | 0.1986676 | 0.1526816 |
| PostProbs | NA | 0.2017 | 0.0566000 | 0.0434000 | 0.0419000 | 0.0333000 |
| R2 | NA | 0.7251 | 0.7269000 | 0.7267000 | 0.7265000 | 0.7263000 |
| dim | NA | 3.0000 | 4.0000000 | 4.0000000 | 4.0000000 | 4.0000000 |
| logmarg | NA | -3278.5884 | -3279.8124782 | -3280.0686152 | -3280.2045328 | -3280.4678110 |

In the summary table, Model 1 uses only 2 variables, imdb_rating and critics_score (its dim of 3 also counts the intercept). The posterior probability for the model is 0.2017.
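
The BF row of the summary can be cross-checked against the logmarg row: each Bayes factor is the exponential of the difference between that model’s log marginal likelihood and the top model’s. The Python sketch below uses the values printed in the summary table; the variable names are illustrative, and small differences come from the rounding of the printed logmarg values.

```python
import math

# Log marginal likelihoods as printed in the summary table.
logmarg = {
    "model 1": -3278.5884,
    "model 2": -3279.8124782,
    "model 3": -3280.0686152,
    "model 4": -3280.2045328,
    "model 5": -3280.4678110,
}

# Bayes factor of each model relative to the model with the highest logmarg.
best = max(logmarg.values())
bayes_factors = {name: math.exp(lm - best) for name, lm in logmarg.items()}

for name, bf in bayes_factors.items():
    print(name, round(bf, 4))
```

Model 2, for example, comes out at roughly 0.294, in line with the BF row.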

After calculating the credible intervals for the regression coefficients, the results are:
+ There is a 95% probability that audience_score increases by 1.35 to 1.64 for every one-point increase in imdb_rating.


Model Performance Review

Figure: MCMC diagnostic plot

The MCMC diagnostic plot shows whether the Markov chain has converged; the two sets of estimated probabilities should agree, so the points should fall on the line. There appears to be one exception, which isn’t a concern.

Figure: model diagnostic plots (residuals, model probabilities, model complexity, inclusion probabilities)

There appears to be an issue in the residuals. The residuals should be scattered randomly around the zero line, but it seems that the constant variability condition hasn’t been met. I am not surprised, since it is logical to assume that imdb_rating, which is obtained in a similar fashion to audience_score, might be causing the issue.

The model probabilities plot appears normal.

The model complexity plot shows how the number of variables in a model changes the Bayes factor. Models with more than 2 variables appear to be able to predict outcomes.

On the inclusion probabilities plot, only imdb_rating and critics_score are clearly above the 0.5 value.
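
The 0.5 cut-off can be applied directly to the marginal inclusion probabilities from the summary table. This is a Python sketch for illustration only; the dict and variable names are assumptions. Keeping every variable whose inclusion probability exceeds 0.5 is often called the median probability model, a label added here rather than one used in the original analysis.

```python
# Marginal posterior inclusion probabilities, P(B != 0 | Y), from the summary table.
inclusion_prob = {
    "runtime": 0.22832,
    "thtr_rel_year": 0.10710,
    "imdb_rating": 0.99986,
    "imdb_num_votes": 0.06308,
    "critics_score": 0.83678,
    "top200_boxyes": 0.04844,
    "best_pic_nomyes": 0.11190,
    "best_pic_winyes": 0.04154,
    "best_actor_winyes": 0.14526,
    "best_actress_winyes": 0.12862,
    "best_dir_winyes": 0.06868,
    "is_dramaYes": 0.04888,
    "mpaa_rating_rYes": 0.19296,
    "oscar_seasonYes": 0.08030,
    "summer_seasonYes": 0.08898,
}

# Keep only the variables whose inclusion probability exceeds 0.5.
selected = [name for name, p in inclusion_prob.items() if p > 0.5]
print(selected)  # ['imdb_rating', 'critics_score']
```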

Part 5: Prediction

For the prediction, I chose the 2016 movie Hidden Figures.

Source: IMDB website

The movie is the story of a team of female African-American mathematicians who served a vital role in NASA during the early years of the U.S. space program.

Director: Theodore Melfi

Actors: Taraji P. Henson, Octavia Spencer, Janelle Monáe

imdb_rating: 7.8

genre: Drama

audience_score: 93

The model predicted: 83.8

The actual results: 93

The error in the results was: 9.89%

With a 95% credible interval, the lower bound the model presented was: 62.5703955

With a 95% credible interval, the upper bound the model presented was: 103.7159054

The upper bound exceeds 100, the maximum possible audience score; therefore we know the model has some small issues.

The analysis gives a 95% probability that the audience score is between 62.5703955 and 103.7159054.
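
The reported error and the interval checks can be reproduced with simple arithmetic. The Python sketch below uses the rounded prediction shown above and assumes the error is measured relative to the actual score, which matches the reported figure; the variable names are illustrative.

```python
predicted, actual = 83.8, 93
lower, upper = 62.5703955, 103.7159054

# Prediction error as a percentage of the actual audience score.
error_pct = abs(actual - predicted) / actual * 100

# Does the interval contain the actual score, and does it exceed the 0-100 scale?
contains_actual = lower <= actual <= upper
exceeds_scale = upper > 100

print(round(error_pct, 2), contains_actual, exceeds_scale)  # 9.89 True True
```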

Part 6: Conclusion

The constructed model was designed to predict the audience_score for a movie based on 15 candidate variables. The interval between the lower and upper bounds did contain the actual audience_score, but a better model can most likely be constructed that reduces the prediction error percentage.

One issue is that the analysis didn’t meet the constant variability condition for Bayesian modeling. One might surmise that imdb_rating, which is generated in a similar fashion to audience_score, is contributing to the model issues. Additionally, the best picture nomination and best picture win variables may only apply after a significant portion of the audience reviews have been tabulated.

A suggestion for further analysis would be to look at whether the protagonist was male or female: do audience members take the lead actor or actress into account when rating a movie?