The data created in the previous step needs a little bit of cleaning up before we can get into the model building. It is a common step that needs to be undertaken to ensure that the data can be loaded into models without any issues.

Part 3 of 4

Steps for Preparing the Data

  1. Clean the Rankings
  2. Match the Song Ranks

Step 1 – Clean the Rankings

The first step is to clean up the rankings files that I created in the previous step.

Step 1.01 Load the Packages

Reference the necessary packages. Proto, gsubfn, magrittr, dplyr, and lubridate.

Step 1.02 Load the Ranking Data into Data Frames

Will use the standard library to read the flat files, and then store the data into a data frame. All of the flat files have the same number of columns and the column order is identical for each file, no need to reorganize or make adjustments afterwards.

Additionally because I created the ranking files in the prior step, I know that there are no null values.

Step 1.03 Function to Replace White Space

The function will replace the line feeds, tabs or any white space in the peek_rank column.

Step 1.04 Replace the Data Frame

For each artist, take the data frame will have four columns.

  1. artist: is the name of the artist
  2. song_title: name of the song
  3. peek_rank: the highest position the song reached on the Billboard Hot 100 rankings
  4. peek_date: the date the song reached its highest rank on the Billboard Hot 100 rankings

Step 2 – Match the Song Ranks

Now that the rankings files have been loaded and stored into data frames, the next step is to join the ranking files with the lyric files.

Step 2.01 Load the necessary packages

Load the necessary packages magrittr, dplyr, and readr.

Step 2.02 Load the Lyrics into Data Frames

Load each of the lyric .csv files that I created from the previous step into data frames.

Step 2.03 Update the Data Frames Dropping the song_url Column

I will convert the data frame into a tibble, selecting all columns except the song_url column

Step 2.04 Merge Data Frames

Step 2.05 Drop Unnecessary Data Frames

Since I no longer need the ranking data frames I will drop them from the global environment.

Step 2.06 Fix the Artist Column

Step 2.07 Merge the Data Frames

Since all of the columns in all of the data frames are same I can use the

Step 2.08 Clean Up and Load Lyrics into Data Frame

Need to clean up the peek_rank column in the data frame. need to remove the hash symbol (#), and phrase ” on”. Updating the column values will allow the column to be converted to an integer.

Next I need to clean up some of the specific song titles. In the next step while creating the models, it won’t like ellipses (“…”) or double quotes for labeling in the model. The song title isn’t part of the model, just used as part of the unique identifier.

Next I created a custom function to get the lyrics from the flat files that I created in the previous step. I then insert the lyrics in to the data frame for each song.

Step 2.08 Add Feature Columns

Add four additional columns to the data frame

  1. album_decade: The decade the album was released
  2. charted: did the song appear on Billboard’s Hot 100 charts
  3. NumberOne: was the song have a peek ranking of number one on the Billboard Hot 100 chart
  4. chart_group: was the peek ranking of the song in the top 10, or top 100 on the Billboard Hot 100 chart

Step 2.09 Write the Results to a Tab Separated File

Lastly I will save the updated results into a tab separated file to be used in the next step, Model Prediction.