In this step I will examine the data for feature engineering opportunities, prepare the data for modeling, train multiple models, evaluate them to select the best one, and then test that model. Let’s get started.

Part 4 of 4

Steps for Creating the Model

  1. Song Summary
  2. Visualizations
  3. Prepare the Lyrics for Analysis
  4. Model Building

Step 1 – Song Summary

Data wrangling was completed in step 3, Preparing the Data.

The overall procedure was to extract the lyrics for six artists from the website, along with the artist name, album name, and the year the album was released. From the website I also extracted each song’s peak rank on the Billboard Hot 100 chart and the date that peak rank was reached. The Billboard Hot 100 will be used as the metric to determine the relative success of a song. While no single metric of a song’s success would satisfy all listeners, the chart is recognized by the music industry as a reliable proxy.

Load Library Files

Load the necessary R library files.
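The exact packages aren’t listed here; a plausible set, inferred from the steps that follow (text cleaning, plotting, and model building), would look like this:

```r
# Package names are assumptions based on the techniques used later in the post.
library(dplyr)         # data wrangling
library(ggplot2)       # visualizations
library(tm)            # text cleaning / corpus handling
library(SnowballC)     # snowball stop-word and stemming support
library(randomForest)  # the model ultimately selected
```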

Sample Record

Look at a sample record in the data frame.

Below we can see what one record from the data frame looks like.

Data Dictionary

  1. album_name: Name of the album
  2. album_year: Year the album was released
  3. song_title: Name of the song
  4. artist: Artist who created the song
  5. peek_rank: Highest rank the song reached on the Billboard Hot 100 chart
  6. peek_date: Date the song achieved its peak rank
  7. lyrics: The song lyrics
  8. album_decade: Decade the album was released
  9. charted: Whether the song charted
  10. NumberOne: Whether the song reached number one
  11. chart_group: Whether the song was top ten, 11–100, or not charted

Step 2 – Visualizations

Charted Songs by Artist

In our data sample there are 899 songs by 6 artists, with 83 top-10 songs (22 of them number one) and 230 other songs that appeared in the top 100.

Songs by Artist and Chart Group

Number 1 Songs by Artist

Lyric Details

Step 3 – Prepare the Lyrics for Analysis

In preparing the lyrics for analysis, I need to do the following.

  1. Remove any unique words, words in the source document that weren’t meant to be part of the lyrics
  2. Ensure everything is in lower case
  3. Remove any numbers or punctuation in the lyrics. The main reason is to remove apostrophes from contractions; however, it will also remove any commas or periods from the text.
  4. Remove stop words from the lyrics
  5. Strip any white space around the words.
  6. Lastly remove any word that isn’t at least three characters in length.

Stop words are common words that we remove before the text analysis. There is no universal list of stop words, and the choice of which stop words to keep and which to remove is subjective. In addition to the stop-word list in the snowball R package, I have included some other words I want to remove because I don’t think they add value to the analysis.
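The cleaning steps above can be sketched with the tm package; `clean_lyrics` and `extra_stops` are hypothetical names, and the custom stop words shown are only examples:

```r
library(tm)

# Example custom stop words added on top of the standard English list
extra_stops <- c("yeah", "ooh", "gonna")

clean_lyrics <- function(text) {
  corpus <- VCorpus(VectorSource(text))
  corpus <- tm_map(corpus, content_transformer(tolower))  # lower case
  corpus <- tm_map(corpus, removeNumbers)                 # drop digits
  corpus <- tm_map(corpus, removePunctuation)             # drop apostrophes, commas, periods
  corpus <- tm_map(corpus, removeWords,
                   c(stopwords("en"), extra_stops))       # drop stop words
  corpus <- tm_map(corpus, stripWhitespace)               # collapse extra spaces
  sapply(corpus, as.character)
}
```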

Word Frequency

One of the features of songs we want to explore is whether the number of words in a song impacts its performance. I will look at the total number of words in each song to help determine this.
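A minimal sketch of that word count, assuming the lyrics live in a data frame called `songs` with `lyrics` and `chart_group` columns:

```r
library(dplyr)
library(stringr)

# Count whitespace-separated words per song, then summarize by chart group
song_counts <- songs %>%
  mutate(word_count = str_count(lyrics, "\\S+")) %>%
  group_by(chart_group) %>%
  summarise(avg_words = mean(word_count), songs = n())
```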

Compare the chart groups side by side

Most Common Words Used in Lyrics


Step 4 – Model Building

Take a quick look at the summarized view of the artists: the number of songs and how many top 10 and top 100 songs each artist had.

Now look at how many unique words are in each chart group. A word needs to be used in at least three songs to count.


I am going to remove words that appear in more than one group.

Break the data frame into training and testing groups.

Use the custom function created earlier to convert all the text to lower case, remove any numbers, remove any punctuation, remove any stop words, and remove any extra white space in the text.

Apply the lyric_features function to add the additional columns for feature engineering.

And lastly, create the training and test data frames.
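A simple sketch of the split; the 75/25 proportion and the `songs` data frame name are assumptions:

```r
# Random 75/25 train/test split with a fixed seed for reproducibility
set.seed(123)
train_idx <- sample(nrow(songs), size = floor(0.75 * nrow(songs)))
train_df  <- songs[train_idx, ]
test_df   <- songs[-train_idx, ]
```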

Building the Model

Normalize the datasets

We need to normalize the datasets for each of the models, using a range of 0 to 1.
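Min–max scaling to the 0–1 range can be sketched as below; `train_df` is a hypothetical name carried over from the split above:

```r
# Rescale a numeric vector to the 0-1 range
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

# Apply to every numeric feature column
numeric_cols <- sapply(train_df, is.numeric)
train_df[numeric_cols] <- lapply(train_df[numeric_cols], normalize)
```

The same transformation would be applied to the test set so the two datasets stay on the same scale.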

Create the Classifiers

Using a variety of models, including Naive Bayes, LDA, KSVM, KNN, RPart, Random Forest, XGBoost, and NNET.

Run the benchmark for each of the models and find out which model performs best.
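One way to run such a benchmark is with the mlr package, sketched below under the assumption that `train_df` holds the normalized training data and `chart_group` is the target; the learner list mirrors the models named above:

```r
library(mlr)

# Define the classification task and the candidate learners
train_task <- makeClassifTask(data = train_df, target = "chart_group")
learners <- list(
  makeLearner("classif.naiveBayes"),
  makeLearner("classif.lda"),
  makeLearner("classif.ksvm"),
  makeLearner("classif.knn"),
  makeLearner("classif.rpart"),
  makeLearner("classif.randomForest"),
  makeLearner("classif.xgboost"),
  makeLearner("classif.nnet")
)

# 10-fold cross-validated benchmark, compared on accuracy
bmr <- benchmark(learners, train_task, cv10, measures = acc)
```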

It looks like Random Forest has the best results, so that is the model I will use on the testing set.

Plot Training Results

Here we can see a nice plot of how the different models performed. The only models that performed poorly were KNN and NNET; the rest of the models scored above 0.85.

Confusion Matrix

Now let’s look at the confusion matrix for our Random Forest model. The matrix shows us the correctly and incorrectly identified items.
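A minimal way to build such a matrix, assuming `rf_pred` holds the model’s predictions and `test_df` the held-out data:

```r
# Cross-tabulate predicted vs. actual chart groups
table(Predicted = rf_pred, Actual = test_df$chart_group)
```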

Feature Importance

Now let’s see which features contribute most to the model’s predictions. Top 10 word count and top 100 word count are the most significant, and, as one would imagine, the length of the song title has almost no significance.
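With the randomForest package, feature importance can be inspected directly; `rf_model` is a hypothetical name for the fitted model:

```r
library(randomForest)

importance(rf_model)   # numeric importance scores per feature
varImpPlot(rf_model)   # dot chart of the most influential features
```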

Testing the Model

Our final step is to test the model. Our accuracy is 0.925, which is quite acceptable performance.
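That final check can be sketched as below; `rf_model` and `test_df` are hypothetical names carried over from the earlier steps:

```r
# Predict chart groups on the held-out test set and compute overall accuracy
rf_pred <- predict(rf_model, newdata = test_df)
mean(rf_pred == test_df$chart_group)
```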

Enhancing the Model

Naturally, I am only looking at a small subset of all artists. The way to increase the accuracy of the model would be to increase the number of artists and bring in their lyrics.