The data created in the previous step needs a little bit of cleaning up before we can get into the model building. It is a common step that needs to be undertaken to ensure that the data can be loaded into models without any issues.
Part 3 of 4
Steps for Preparing the Data
Step 1 – Clean the Rankings
The first step is to clean up the rankings files that I created in the previous step.
Step 1.01 Load the Packages
Reference the necessary packages. Proto, gsubfn, magrittr, dplyr, and lubridate.
1 2 3 4 5 6 7 8 9 10 |
# ---------------------------------------------------------------------------- # clean up the ranking files and merge with the lyrics files # ---------------------------------------------------------------------------- library(proto ,quietly = TRUE, warn.conflicts = FALSE) #used by gsubfn library(gsubfn ,quietly = TRUE, warn.conflicts = FALSE) library(magrittr ,quietly = TRUE, warn.conflicts = FALSE) library(dplyr ,quietly = TRUE, warn.conflicts = FALSE) library(lubridate ,quietly = TRUE, warn.conflicts = FALSE) |
Step 1.02 Load the Ranking Data into Data Frames
Will use the standard library to read the flat files, and then store the data into a data frame. All of the flat files have the same number of columns and the column order is identical for each file, no need to reorganize or make adjustments afterwards.
Additionally because I created the ranking files in the prior step, I know that there are no null values.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 |
# load the ranking data df_rankings_drake <- read.delim(paste0(getwd(), "/ranks/Drake.txt") , header = TRUE , stringsAsFactors = FALSE ) df_rankings_onedirection <- read.delim(paste0(getwd(), "/ranks/One-Direction.txt") , header = TRUE , stringsAsFactors = FALSE ) df_rankings_pink <- read.delim(paste0(getwd(), "/ranks/Pink.txt") , header = TRUE , stringsAsFactors = FALSE ) df_rankings_rihanna <- read.delim(paste0(getwd(), "/ranks/Rihanna.txt") , header = TRUE , stringsAsFactors = FALSE ) df_rankings_taylorswift <- read.delim(paste0(getwd(), "/ranks/Taylor-Swift.txt") , header = TRUE , stringsAsFactors = FALSE ) df_rankings_u2 <- read.delim(paste0(getwd(), "/ranks/U2.txt") , header = TRUE , stringsAsFactors = FALSE ) |
Step 1.03 Function to Replace White Space
The function will replace the line feeds, tabs or any white space in the peek_rank column.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
# ---------------------------------------------------------------------------- # update the data frames # ---------------------------------------------------------------------------- replace_empty <- function(textcontent){ # remove line feeds, tabs, and white space if("factor" %in% class(textcontent)){ textcontent <- as.character(textcontent) } replacement_list <- list("\n" = " ", "\t" = " ") textcontent <- ifelse(textcontent=="", "#1 on" , textcontent) textcontent <- gsubfn("." , replacement_list,textcontent) return(textcontent) } |
Step 1.04 Replace the Data Frame
For each artist, take the data frame will have four columns.
- artist: is the name of the artist
- song_title: name of the song
- peek_rank: the highest position the song reached on the Billboard Hot 100 rankings
- peek_date: the date the song reached its highest rank on the Billboard Hot 100 rankings
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 |
# -- Drake df_rankings_drake <- df_rankings_drake %>% transmute(artist = "Drake" ,song_title = song_title ,peek_rank = replace_empty(peek_rank) ,peek_date = mdy(peek_date) ) # -- one direction df_rankings_onedirection <- df_rankings_onedirection %>% transmute(artist = "OneDirection" ,song_title = song_title ,peek_rank = replace_empty(peek_rank) ,peek_date = mdy(peek_date) ) # -- Pink df_rankings_pink <- df_rankings_pink %>% transmute(artist = "Pink" ,song_title = song_title ,peek_rank = replace_empty(peek_rank) ,peek_date = mdy(peek_date) ) # -- Rihanna df_rankings_rihanna <- df_rankings_rihanna %>% transmute(artist = "Rihanna" ,song_title = song_title ,peek_rank = replace_empty(peek_rank) ,peek_date = mdy(peek_date) ) # -- Taylor Swift df_rankings_taylorswift <- df_rankings_taylorswift %>% transmute(artist = "TaylorSwift" ,song_title = song_title ,peek_rank = replace_empty(peek_rank) ,peek_date = mdy(peek_date) ) # -- U2 df_rankings_u2 <- df_rankings_u2 %>% transmute(artist = "U2" ,song_title = song_title ,peek_rank = replace_empty(peek_rank) ,peek_date = mdy(peek_date) ) |
Step 2 – Match the Song Ranks
Now that the rankings files have been loaded and stored into data frames, the next step is to join the ranking files with the lyric files.
Step 2.01 Load the necessary packages
Load the necessary packages magrittr, dplyr, and readr.
1 2 3 4 5 6 |
# ---------------------------------------------------------------------------- # take the ranking files and merge with the song lists # ---------------------------------------------------------------------------- library(magrittr ,quietly = TRUE, warn.conflicts = FALSE) library(dplyr ,quietly = TRUE, warn.conflicts = FALSE) library(readr ,quietly = TRUE, warn.conflicts = FALSE) |
Step 2.02 Load the Lyrics into Data Frames
Load each of the lyric .csv files that I created from the previous step into data frames.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 |
# load the songs lists df_songs_drake <- read.csv(paste0(getwd(), "/lyrics/album_song/Drake_lyrics.csv") , header = TRUE , stringsAsFactors = FALSE ) df_songs_onedirection <- read.csv(paste0(getwd(), "/lyrics/album_song/One-Direction_lyrics.csv") , header = TRUE , stringsAsFactors = FALSE ) df_songs_pink <- read.csv(paste0(getwd(), "/lyrics/album_song/pink_lyrics.csv") , header = TRUE , stringsAsFactors = FALSE ) df_songs_rihanna <- read.csv(paste0(getwd(), "/lyrics/album_song/Rihanna_lyrics.csv") , header = TRUE , stringsAsFactors = FALSE ) df_songs_taylorswift <- read.csv(paste0(getwd(), "/lyrics/album_song/Taylor-Swift_lyrics.csv") , header = TRUE , stringsAsFactors = FALSE ) df_songs_u2 <- read.csv(paste0(getwd(), "/lyrics/album_song/U2_lyrics.csv") , header = TRUE , stringsAsFactors = FALSE ) |
Step 2.03 Update the Data Frames Dropping the song_url Column
I will convert the data frame into a tibble, selecting all columns except the song_url column
1 2 3 4 5 6 7 8 9 |
# ---------------------------------------------------------------------------- # update the data frames # ---------------------------------------------------------------------------- df_songs_drake <- as_tibble(df_songs_drake %>% select(-c(song_url))) df_songs_onedirection <- as_tibble(df_songs_onedirection %>% select(-c(song_url))) df_songs_pink <- as_tibble(df_songs_pink %>% select(-c(song_url))) df_songs_rihanna <- as_tibble(df_songs_rihanna %>% select(-c(song_url))) df_songs_taylorswift <- as_tibble(df_songs_taylorswift %>% select(-c(song_url))) df_songs_u2 <- as_tibble(df_songs_u2 %>% select(-c(song_url))) |
Step 2.04 Merge Data Frames
1 2 3 4 5 6 7 8 9 |
# ---------------------------------------------------------------------------- # merge the data frames # ---------------------------------------------------------------------------- df_songs_drake <- df_songs_drake %>% left_join(df_rankings_drake , by = "song_title") df_songs_onedirection <- df_songs_onedirection %>% left_join(df_rankings_onedirection, by = "song_title") df_songs_pink <- df_songs_pink %>% left_join(df_rankings_pink , by = "song_title") df_songs_rihanna <- df_songs_rihanna %>% left_join(df_rankings_rihanna , by = "song_title") df_songs_taylorswift <- df_songs_taylorswift %>% left_join(df_rankings_taylorswift , by = "song_title") df_songs_u2 <- df_songs_u2 %>% left_join(df_rankings_u2 , by = "song_title") |
Step 2.05 Drop Unnecessary Data Frames
Since I no longer need the ranking data frames I will drop them from the global environment.
1 2 3 4 |
# ---------------------------------------------------------------------------- # drop the ranking data frames # ---------------------------------------------------------------------------- rm(list=ls(pattern = "df_rank")) |
Step 2.06 Fix the Artist Column
1 2 3 4 5 6 7 8 9 |
# ---------------------------------------------------------------------------- # populate the missing values # ---------------------------------------------------------------------------- df_songs_drake <- df_songs_drake %>% mutate(artist = "Drake") df_songs_onedirection <- df_songs_onedirection %>% mutate(artist = "One-Direction") df_songs_pink <- df_songs_pink %>% mutate(artist = "Pink") df_songs_rihanna <- df_songs_rihanna %>% mutate(artist = "Rihanna") df_songs_taylorswift <- df_songs_taylorswift %>% mutate(artist = "Taylor-Swift") df_songs_u2 <- df_songs_u2 %>% mutate(artist = "U2") |
Step 2.07 Merge the Data Frames
Since all of the columns in all of the data frames are same I can use the
1 2 3 4 5 6 7 8 9 10 11 12 13 |
# ---------------------------------------------------------------------------- # merge the data frames # ---------------------------------------------------------------------------- df_songs <- bind_rows(df_songs_drake ,df_songs_onedirection ,df_songs_pink ,df_songs_rihanna ,df_songs_taylorswift ,df_songs_u2) df_backup_songs <- df_songs rm(list=ls(pattern = "df_songs_")) |
Step 2.08 Clean Up and Load Lyrics into Data Frame
Need to clean up the peek_rank column in the data frame. need to remove the hash symbol (#), and phrase ” on”. Updating the column values will allow the column to be converted to an integer.
Next I need to clean up some of the specific song titles. In the next step while creating the models, it won’t like ellipses (“…”) or double quotes for labeling in the model. The song title isn’t part of the model, just used as part of the unique identifier.
Next I created a custom function to get the lyrics from the flat files that I created in the previous step. I then insert the lyrics in to the data frame for each song.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 |
# ---------------------------------------------------------------------------- # convert peek_rank to an integer # ---------------------------------------------------------------------------- df_songs <- df_songs %>% mutate(peek_rank = stringr::str_replace(peek_rank,"#","")) df_songs <- df_songs %>% mutate(peek_rank = stringr::str_replace(peek_rank," on","")) df_songs <- df_songs %>% mutate(peek_rank = as.integer(peek_rank)) # ---------------------------------------------------------------------------- # clean up an invalid song value # ---------------------------------------------------------------------------- df_songs <- df_songs %>% mutate(song_title = if_else(song_title=="...Ready For It?" , "Ready For It?" , song_title)) df_songs <- df_songs %>% mutate(song_title = if_else(song_title=='The Crystal Ballroom (12" Mix)' , "The Crystal Ballroom (12 Mix)" , song_title)) # ---------------------------------------------------------------------------- # get the lyric text from the files # ---------------------------------------------------------------------------- get_lyric_text <- function(base_dir, song_name){ lyric_text = read_file(paste0(getwd() , "/lyrics/" ,base_dir ,"_Lyrics/" , song_name ,".txt")) return(lyric_text) } # vectorize the function get_lyric_text_v <- Vectorize(get_lyric_text) # load the lyrics into the dataframe df_songs_lyrics <- df_songs %>% mutate( lyrics = get_lyric_text_v(artist, song_title) ) |
Step 2.08 Add Feature Columns
Add four additional columns to the data frame
- album_decade: The decade the album was released
- charted: did the song appear on Billboard’s Hot 100 charts
- NumberOne: was the song have a peek ranking of number one on the Billboard Hot 100 chart
- chart_group: was the peek ranking of the song in the top 10, or top 100 on the Billboard Hot 100 chart
1 2 3 4 5 6 7 8 9 10 11 12 |
# ---------------------------------------------------------------------------- # add additional columns to the data # album_decade: decade album was released # charted: did the album chart # numberone: was it a number one song # chart_group: was it a top ten, 11-100, or not-charted # ---------------------------------------------------------------------------- df_songs_lyrics <- df_songs_lyrics %>% mutate(album_decade = as.integer(substr(album_year,1,3)) * 10) df_songs_lyrics <- df_songs_lyrics %>% mutate(charted = ifelse(is.na(peek_rank), "Not Charted" , "Charted")) df_songs_lyrics <- df_songs_lyrics %>% mutate(NumberOne = ifelse(is.na(peek_rank), FALSE, ifelse(peek_rank == 1, TRUE , FALSE))) df_songs_lyrics <- df_songs_lyrics %>% mutate(chart_group = ifelse(is.na(peek_rank), "Not Charted" , ifelse(peek_rank <= 10, "Top 10" , "Top 100"))) |
Step 2.09 Write the Results to a Tab Separated File
Lastly I will save the updated results into a tab separated file to be used in the next step, Model Prediction.
1 2 3 4 5 6 7 8 9 10 11 |
# ---------------------------------------------------------------------------- # save the data frame for future use # ---------------------------------------------------------------------------- write_tsv(df_songs_lyrics ,file.path(paste0(getwd(), "/df_song_lyrics.txt")) , na = "NA" , append = FALSE #, col_names = !append , quote_escape = "double" ) |
Leave a Reply