{"id":387,"date":"2020-02-10T20:00:27","date_gmt":"2020-02-10T20:00:27","guid":{"rendered":"https:\/\/eipsoftware.com\/musings\/?p=387"},"modified":"2021-10-03T21:11:23","modified_gmt":"2021-10-03T21:11:23","slug":"lyrical-success-model-prediction","status":"publish","type":"post","link":"https:\/\/eipsoftware.com\/musings\/lyrical-success-model-prediction\/","title":{"rendered":"Lyrical Success &#8211; Model Prediction"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">In this step, I will look at the data to see if we can do any feature engineering. And then I will edit the data for the model, train multiple models, evaluate the best model and then test the model.  Let&#8217;s get started.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Part 4 of 4<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Steps for Creating the Model<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li><a href=\"#song-summary\" data-type=\"internal\" data-id=\"#song-summary\">Song Summary<\/a><\/li><li><a href=\"#visualizations\" data-type=\"internal\" data-id=\"#visualizations\">Visualizations<\/a><\/li><li><a href=\"#lyric-details\" data-type=\"internal\" data-id=\"#lyric-details\">Prepare the Lyrics for Analysis<\/a><\/li><li><a href=\"#model-building\" data-type=\"internal\" data-id=\"#model-building\">Model Building<\/a><\/li><\/ol>\n\n\n\n<!--more-->\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"song-summary\">Step 1 &#8211; Song Summary<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Data wrangling was completed in step 3, <a href=\"https:\/\/eipsoftware.com\/musings\/2020\/02\/10\/lyrical-success-preparing-the-data\/\">Preparing the Data<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The overall procedure was extracting the lyrics for six artists from the website&nbsp;<a href=\"https:\/\/www.azlyrics.com\">https:\/\/www.azlyrics.com<\/a>&nbsp;including artist name, album name, year the album was released. 
Using the website&nbsp;<a href=\"https:\/\/www.billboard.com\">https:\/\/www.billboard.com<\/a>&nbsp;I extracted each song&#8217;s peak rank on the Billboard Hot 100 chart and the date it reached that rank. The Billboard Hot 100 will be used as the metric to determine the relative success of a song. While no single metric can capture a song&#8217;s success in a way every listener would agree with, the chart is recognized by the music industry as a reliable proxy.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Load Library Files<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Load the necessary R library files.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:r decode:true \">library(tibble ,quietly = TRUE, warn.conflicts = FALSE)\nlibrary(magrittr ,quietly = TRUE, warn.conflicts = FALSE)\nlibrary(dplyr ,quietly = TRUE, warn.conflicts = FALSE)\nlibrary(ggplot2 ,quietly = TRUE, warn.conflicts = FALSE)\nlibrary(NLP ,quietly = TRUE, warn.conflicts = FALSE)  #used by tm\nlibrary(tm ,quietly = TRUE, warn.conflicts = FALSE)\nlibrary(knitr ,quietly = TRUE, warn.conflicts = FALSE)<\/pre><\/div>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Sample Record<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Look at a sample record in the data frame.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:r decode:true \">df_songs_lyrics &lt;- readr::read_tsv(file.path(paste0(getwd(), \"\/df_song_lyrics.txt\")))<\/pre><\/div>\n\n\n\n<pre class=\"wp-block-preformatted\">## Parsed with column specification:\n## cols(\n##   album_name = col_character(),\n##   album_year = col_double(),\n##   song_title = col_character(),\n##   artist = col_character(),\n##   peek_rank = col_double(),\n##   peek_date = col_date(format = \"\"),\n##   lyrics = col_character(),\n##   album_decade = col_double(),\n##   charted = col_character(),\n##   NumberOne = 
col_logical(),\n##   chart_group = col_character()\n## )<\/pre>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:r decode:true \"># ----------------------------------------------------------------------------\n# look at one of the values\n# ----------------------------------------------------------------------------\nglimpse(df_songs_lyrics[255,])<\/pre><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">Below we can see what one record from the data frame looks like.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">## Observations: 1\n## Variables: 11\n## $ album_name   &lt;chr&gt; \"Take Me Home\"\n## $ album_year   &lt;dbl&gt; 2012\n## $ song_title   &lt;chr&gt; \"Heart Attack\"\n## $ artist       &lt;chr&gt; \"One-Direction\"\n## $ peek_rank    &lt;dbl&gt; NA\n## $ peek_date    &lt;date&gt; NA\n## $ lyrics       &lt;chr&gt; \"\\nBaby, you got me sick,\\nI don't know what I did,\u2026\n## $ album_decade &lt;dbl&gt; 2010\n## $ charted      &lt;chr&gt; \"Not Charted\"\n## $ NumberOne    &lt;lgl&gt; FALSE\n## $ chart_group  &lt;chr&gt; \"Not Charted\"<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Data Dictionary<\/h3>\n\n\n\n<ol class=\"wp-block-list\"><li>album_name: Name of the album<\/li><li>album_year: Year the album was released<\/li><li>song_title: Name of the song<\/li><li>artist: Artist who created the song<\/li><li>peek_rank: Highest rank the song received on the Billboard Hot 100 chart<\/li><li>peek_date: The date the song achieved the peek_rank<\/li><li>lyrics: The song lyrics<\/li><li>album_decade: Decade the album was released<\/li><li>charted: Did the song chart<\/li><li>NumberOne: Was it a number one song<\/li><li>chart_group: Was the song top ten, 11-100, or not charted<\/li><\/ol>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-background has-dark-gray-background-color has-dark-gray-color\" id=\"visualizations\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Step 2 &#8211; Visualizations<\/h2>\n\n\n\n<h3 
class=\"wp-block-heading\">Charted Songs by Artist<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">In our data sample there are 899 songs, by 6 artists with 83 top 10 songs, 22 being number one songs, and 230 other songs that were in the top 100.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:r decode:true \">library(ggplot2 ,quietly = TRUE, warn.conflicts = FALSE)\n\ndf_songs_lyrics %&gt;%\n  group_by(artist, charted) %&gt;%\n  summarise(number_of_songs = n()) %&gt;%\n  ggplot() +\n  geom_bar(aes(x=artist\n               ,y=number_of_songs\n               ,fill = charted)\n          ,stat = \"identity\") +\n  labs(x=NULL, y=\"# of Songs\")+\n  ggtitle(\"Charted Songs by Artist\")<\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full is-style-default\"><img loading=\"lazy\" decoding=\"async\" width=\"700\" height=\"432\" src=\"https:\/\/eipsoftware.com\/musings\/wp-content\/uploads\/2021\/10\/01_charted_songs_by_artist.png\" alt=\"\" class=\"wp-image-389\" srcset=\"https:\/\/eipsoftware.com\/musings\/wp-content\/uploads\/2021\/10\/01_charted_songs_by_artist.png 700w, https:\/\/eipsoftware.com\/musings\/wp-content\/uploads\/2021\/10\/01_charted_songs_by_artist-300x185.png 300w\" sizes=\"auto, (max-width: 700px) 100vw, 700px\" \/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Songs by Artist and Chart Group<\/h3>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:r decode:true \">df_songs_lyrics %&gt;%\n  group_by(artist, chart_group) %&gt;%\n  filter(peek_rank &gt; 0) %&gt;%\n  summarise(number_of_songs = n()) %&gt;%\n  ggplot() +\n  geom_bar(aes(x=artist\n               ,y=number_of_songs\n               ,fill = chart_group)\n          ,stat = \"identity\") +\n  labs(x=NULL, y=\"# of Songs\") +\n  ggtitle(\"Songs by Artist and Chart Group\")<\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"700\" height=\"432\" 
src=\"https:\/\/eipsoftware.com\/musings\/wp-content\/uploads\/2021\/10\/02_songs_by_artists_and_chart_group.png\" alt=\"\" class=\"wp-image-395\" srcset=\"https:\/\/eipsoftware.com\/musings\/wp-content\/uploads\/2021\/10\/02_songs_by_artists_and_chart_group.png 700w, https:\/\/eipsoftware.com\/musings\/wp-content\/uploads\/2021\/10\/02_songs_by_artists_and_chart_group-300x185.png 300w\" sizes=\"auto, (max-width: 700px) 100vw, 700px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Number 1 Songs by Artist<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:r decode:true \">df_songs_lyrics %&gt;%\n  group_by(artist) %&gt;%\n  filter(peek_rank == 1) %&gt;%\n  summarise(number_of_songs = n()) %&gt;%\n  ggplot() +\n  geom_bar(aes(x=artist\n               ,y=number_of_songs\n               ,fill = artist)\n          ,stat = \"identity\") +\n  labs(x=NULL, y=\"# of Songs\") +\n  ggtitle(\"Number 1 Songs by Artist and Chart Group\")<\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"700\" height=\"432\" src=\"https:\/\/eipsoftware.com\/musings\/wp-content\/uploads\/2021\/10\/03_number_one_songs_by_artist.png\" alt=\"\" class=\"wp-image-397\" srcset=\"https:\/\/eipsoftware.com\/musings\/wp-content\/uploads\/2021\/10\/03_number_one_songs_by_artist.png 700w, https:\/\/eipsoftware.com\/musings\/wp-content\/uploads\/2021\/10\/03_number_one_songs_by_artist-300x185.png 300w\" sizes=\"auto, (max-width: 700px) 100vw, 700px\" \/><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-background has-dark-gray-background-color has-dark-gray-color\" id=\"lyric-details\"\/>\n\n\n\n<h1 class=\"wp-block-heading\">Lyric Details<\/h1>\n\n\n\n<h2 class=\"wp-block-heading\">Step 3 &#8211; Prepare the lyrics for analysis<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">In preparing the lyrics for analysis need to do the following.<\/p>\n\n\n\n<ol 
class=\"wp-block-list\"><li>Remove any unique words, words in the source document that weren\u2019t meant to be part of the lyrics <\/li><li>Ensure everything is in lower case <\/li><li>Remove any numbers or punctuation that was in the lyrics. The main reason is to remove apostrophes from contraction words, however it will also remove any commas or periods from the text. <\/li><li>Remove stop words from the lyrics <\/li><li>Strip any white space around the words. <\/li><li>Lastly remove any word that isn\u2019t at least three characters in length.<\/li><\/ol>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Stop words are common words that we will remove before the text analysis. There is no common universal list of stop words, and it is subjective as to which stop words to remain and remove. In addition to the list of stop words listed in the snowball R package I have included some other words I wanted to remove because I don\u2019t think they add value to the analysis.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:r decode:true \">remove_words &lt;- c(\"chorus\", \"repeat\" ,\"hey\" ,\"uh\" ,\"whoa\"\n                 )\n\nscrubLyrics &lt;- function(text_lyric){\n\n  # convert to lower case, remove numbers, punctuation, stopwords, whitespace\n  text_lyric &lt;- text_lyric %&gt;%\n                  tolower() %&gt;%\n                  removeNumbers() %&gt;%\n                  removePunctuation() %&gt;%\n                  removeWords(stopwords(\"en\")) %&gt;%\n                  stripWhitespace()\n\n  return(text_lyric)\n    \n}\n\n# copy into new dataframe\ndf_scrubbedLyrics &lt;- df_songs_lyrics\n\n# scrub the lyrics\ndf_scrubbedLyrics$lyrics &lt;- lapply(df_scrubbedLyrics$lyrics, scrubLyrics)\n\n# tokenize the lyrics\n# expand the data frame so one word per row\n# remove \ndf_scrubbedLyrics &lt;- df_scrubbedLyrics %&gt;%\n  tidytext::unnest_tokens(t_words , lyrics) %&gt;%\n  
filter(!t_words %in% remove_words) %&gt;%\n  filter(nchar(t_words) &gt;=3 )<\/pre><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Word Frequency<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">One of the features we want to explore is whether the number of words in a song impacts its performance. I will look at the total number of words in each song to help determine this.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:r decode:true \">df_all_words &lt;- df_songs_lyrics %&gt;%\n  tidytext::unnest_tokens(t_words , lyrics) %&gt;%\n  group_by(artist, song_title, chart_group) %&gt;%\n  summarise(word_count = n()) %&gt;%\n  arrange(desc(word_count))\n\n\ndf_all_words %&gt;%\n  ggplot() +\n  geom_histogram( aes(x=word_count, fill=chart_group)) +\n  labs(x=\"Words per Song\", y=\"# of Songs\") +\n  ggtitle(\"Songs by Artist and Chart Group\") +\n  theme(legend.title = element_blank())<\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"700\" height=\"432\" src=\"https:\/\/eipsoftware.com\/musings\/wp-content\/uploads\/2021\/10\/04_words_per_song_chart_group.png\" alt=\"\" class=\"wp-image-398\" srcset=\"https:\/\/eipsoftware.com\/musings\/wp-content\/uploads\/2021\/10\/04_words_per_song_chart_group.png 700w, https:\/\/eipsoftware.com\/musings\/wp-content\/uploads\/2021\/10\/04_words_per_song_chart_group-300x185.png 300w\" sizes=\"auto, (max-width: 700px) 100vw, 700px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Compare the chart groups side by side.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:r decode:true \">df_all_words %&gt;%\n  ggplot() +\n  geom_histogram( aes(x=word_count, fill=chart_group)) +\n  facet_wrap(~chart_group, ncol = 3) +\n  labs(x=\"Words per Song\", y=\"# of Songs\") +\n  ggtitle(\"Songs by Artist and Chart Group\") +\n  theme(legend.title = element_blank())<\/pre><\/div>\n\n\n\n<figure 
class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"700\" height=\"432\" src=\"https:\/\/eipsoftware.com\/musings\/wp-content\/uploads\/2021\/10\/05_words_per_song_chart_group.png\" alt=\"\" class=\"wp-image-400\" srcset=\"https:\/\/eipsoftware.com\/musings\/wp-content\/uploads\/2021\/10\/05_words_per_song_chart_group.png 700w, https:\/\/eipsoftware.com\/musings\/wp-content\/uploads\/2021\/10\/05_words_per_song_chart_group-300x185.png 300w\" sizes=\"auto, (max-width: 700px) 100vw, 700px\" \/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Most Common Words Used in Lyrics<\/h3>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:r decode:true \">df_scrubbedLyrics %&gt;%\n  distinct() %&gt;%\n  count(t_words, sort = TRUE) %&gt;%\n  top_n(10) %&gt;%\n  ungroup() %&gt;%\n  mutate(t_words = reorder(t_words, n)) %&gt;%\n  ggplot() +\n    geom_col(aes(t_words, n), fill = \"#E69F00\") + \n    coord_flip() +\n    labs(x=\"Songs per Word\", y=\"# of Songs\") +\n    ggtitle(\"Most Frequenty Used Word in Lyrics\")<\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"700\" height=\"432\" src=\"https:\/\/eipsoftware.com\/musings\/wp-content\/uploads\/2021\/10\/06_frequent_used_words.png\" alt=\"\" class=\"wp-image-401\" srcset=\"https:\/\/eipsoftware.com\/musings\/wp-content\/uploads\/2021\/10\/06_frequent_used_words.png 700w, https:\/\/eipsoftware.com\/musings\/wp-content\/uploads\/2021\/10\/06_frequent_used_words-300x185.png 300w\" sizes=\"auto, (max-width: 700px) 100vw, 700px\" \/><\/figure>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:r decode:true \">words_by_artist &lt;- df_scrubbedLyrics %&gt;%\n  distinct() %&gt;%\n  group_by(artist) %&gt;%\n  count(t_words, artist, sort = TRUE) %&gt;%\n  slice(seq_len(10)) %&gt;%\n  ungroup() %&gt;%\n  arrange(artist , n) %&gt;%\n  mutate(display_row = 
row_number())\n\nwords_by_artist %&gt;%  \n    ggplot() +\n    geom_col(aes(display_row, n, fill=artist)\n             ,show.legend = FALSE) + \n    coord_flip() +\n    facet_wrap(~artist, scales = \"free\") +\n    scale_x_continuous(labels = words_by_artist$t_words\n                       ,breaks = words_by_artist$display_row) +\n    labs(x=\"Songs per Word\", y=\"# of Songs\") +\n    ggtitle(\"Most Frequently Used Words in Lyrics by Artist\") <\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"672\" height=\"415\" src=\"https:\/\/eipsoftware.com\/musings\/wp-content\/uploads\/2021\/10\/07_frequent_used_words_by_artist.png\" alt=\"\" class=\"wp-image-402\" srcset=\"https:\/\/eipsoftware.com\/musings\/wp-content\/uploads\/2021\/10\/07_frequent_used_words_by_artist.png 672w, https:\/\/eipsoftware.com\/musings\/wp-content\/uploads\/2021\/10\/07_frequent_used_words_by_artist-300x185.png 300w\" sizes=\"auto, (max-width: 672px) 100vw, 672px\" \/><\/figure>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:r decode:true \">words_by_chart_group &lt;- df_scrubbedLyrics %&gt;%\n  distinct() %&gt;%\n  group_by(chart_group) %&gt;%\n  count(t_words, chart_group, sort = TRUE) %&gt;%\n  slice(seq_len(10)) %&gt;%\n  ungroup() %&gt;%\n  arrange(chart_group , n) %&gt;%\n  mutate(display_row = row_number())\n\nwords_by_chart_group %&gt;%  \n    ggplot() +\n    geom_col(aes(display_row, n, fill=chart_group)\n             ,show.legend = FALSE) + \n    coord_flip() +\n    facet_wrap(~chart_group, scales = \"free\") +\n    scale_x_continuous(labels = words_by_chart_group$t_words\n                       ,breaks = words_by_chart_group$display_row) +\n    labs(x=\"Songs per Word\", y=\"# of Songs\") +\n    ggtitle(\"Most Frequently Used Words in Lyrics by Chart Group\") <\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"672\" 
height=\"415\" src=\"https:\/\/eipsoftware.com\/musings\/wp-content\/uploads\/2021\/10\/07_frequent_used_words_by_chart_group.png\" alt=\"\" class=\"wp-image-403\" srcset=\"https:\/\/eipsoftware.com\/musings\/wp-content\/uploads\/2021\/10\/07_frequent_used_words_by_chart_group.png 672w, https:\/\/eipsoftware.com\/musings\/wp-content\/uploads\/2021\/10\/07_frequent_used_words_by_chart_group-300x185.png 300w\" sizes=\"auto, (max-width: 672px) 100vw, 672px\" \/><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-text-color has-background has-black-background-color has-black-color\"\/>\n\n\n\n<h1 class=\"wp-block-heading\" id=\"model-building\">Prediction<\/h1>\n\n\n\n<h2 class=\"wp-block-heading\">Step 4 Model Building <\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Take a quick look at the summarized view of the artists, number of songs and how many top 10 and top 100 songs the artist had<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:r decode:true \">library(tibble ,quietly = TRUE, warn.conflicts = FALSE)\nlibrary(magrittr ,quietly = TRUE, warn.conflicts = FALSE)\nlibrary(dplyr ,quietly = TRUE, warn.conflicts = FALSE)\nlibrary(ggplot2 ,quietly = TRUE, warn.conflicts = FALSE)<\/pre><\/div>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:r decode:true \">df_songs_lyrics %&gt;%\n  group_by(artist ,chart_group) %&gt;%\n  summarise(SongCount = n()) %&gt;%\n  reshape2::dcast(artist ~ chart_group, value.var = \"SongCount\") %&gt;%\n  `colnames&lt;-` (c(\"artist\",\"Not.Charted\",\"Top.10\",\"Top.100\")) %&gt;%\n  mutate(Total = Top.10 + Top.100 + Not.Charted)<\/pre><\/div>\n\n\n\n<pre class=\"wp-block-preformatted\">##          artist Not.Charted Top.10 Top.100 Total\n## 1         Drake         113     19     101   233\n## 2 One-Direction          61      6      21    88\n## 3          Pink          99     11      13   123\n## 4       Rihanna      
    86     21      16   123\n## 5  Taylor-Swift          44     21      59   124\n## 6            U2         183      5      20   208<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Now look at how many unique words are in each chart group. A word must appear at least three times in the group to be counted.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:r decode:true \">select_n_words &lt;- 5000\n    \ndf_top_words_per_group &lt;-  df_scrubbedLyrics %&gt;%\n      group_by(chart_group) %&gt;%\n      mutate(group_word_count = n()) %&gt;%\n      group_by(chart_group, t_words) %&gt;%\n      mutate(word_count = n()\n             ,word_percent = word_count \/ group_word_count) %&gt;%\n      select(t_words, chart_group, group_word_count, word_count, word_percent) %&gt;%\n      distinct() %&gt;%\n      filter(word_count &gt;= 3) %&gt;%\n      arrange(desc(word_percent)) %&gt;%\n      top_n(select_n_words)\n<\/pre><\/div>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">## Selecting by word_percent<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">I am going to remove words that appear in more than one group. 
<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:r decode:true \"># remove words that are in more than one group\ndf_top_words &lt;- df_top_words_per_group %&gt;%\n      ungroup() %&gt;%\n      group_by(t_words) %&gt;%\n      mutate(multi_group = n()) %&gt;%\n      filter(multi_group &lt; 2) %&gt;%\n      select(chart_group, common_word = t_words)\n\n# create lists of unique words by chart_group\nwords_not_charted &lt;- lapply(df_top_words[df_top_words$chart_group == \"Not Charted\",], as.character)\nwords_top_100 &lt;- lapply(df_top_words[df_top_words$chart_group == \"Top 100\",], as.character)\nwords_top_10 &lt;- lapply(df_top_words[df_top_words$chart_group == \"Top 10\",], as.character)<\/pre><\/div>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Break the data frame into training and testing groups.  <\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:r decode:true \"># test set sizes per chart group\nlibrary(purrr)\nlibrary(tidyr)\n\n\nset.seed(8020)\n\ntest_lyric &lt;- df_songs_lyrics %&gt;%\n    mutate(uid = seq(1,length(df_songs_lyrics$album_name))) %&gt;%\n    group_by(chart_group) %&gt;%\n    nest() %&gt;%\n    ungroup() %&gt;%\n    mutate(n = c(50,20,10)) %&gt;%\n    mutate(samp = map2(data, n, sample_n)) %&gt;%\n    select(-data) %&gt;%\n    unnest(samp)\n\ntrain_lyric &lt;- df_songs_lyrics %&gt;%\n    mutate(uid = seq(1,length(df_songs_lyrics$album_name))) \n\ntrain_lyric &lt;- anti_join(train_lyric, test_lyric, by='uid')<\/pre><\/div>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Use the custom function created earlier to convert all the text to lower case and remove numbers, punctuation, stop words, and extra white space.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Then the lyric_features function adds the additional columns for feature engineering. 
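<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">As a small illustration of the derived lexical features, here is a hypothetical four-word lyric worked through by hand (the variable names mirror the columns created in the lyric_features function; reptition is the spelling used for that column in the code).<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:r decode:true \"># hypothetical mini-lyric to show how the lexical features are derived\nt_words &lt;- c(\"love\", \"love\", \"you\", \"love\")\n\nword_frequency    &lt;- length(t_words)          # 4 total words\nlexical_diversity &lt;- length(unique(t_words))  # 2 distinct words\nlexical_density   &lt;- lexical_diversity \/ word_frequency  # 0.5\nreptition         &lt;- word_frequency \/ lexical_diversity  # 2<\/pre><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">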
<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Lastly, create the training and test data frames.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:r decode:true \"># put to lower case, remove punctuation, and stop words\ntrain_lyric$lyrics &lt;- lapply(train_lyric$lyrics, scrubLyrics)\ntest_lyric$lyrics &lt;- lapply(test_lyric$lyrics, scrubLyrics)\n\n# build into tidy versions of the dataframes\n# put into long data set\ntrain_lyric_scrubbed &lt;- train_lyric %&gt;%\n    select(-uid) %&gt;%\n    tidytext::unnest_tokens(t_words , lyrics)\n\ntest_lyric_scrubbed &lt;- test_lyric %&gt;%\n    select(-n,-uid) %&gt;%\n    tidytext::unnest_tokens(t_words , lyrics)\n\nlyric_features &lt;- function(lyric){\n    lf &lt;- lyric %&gt;%\n    group_by(song_title) %&gt;%\n    mutate(word_frequency = n()\n           , lexical_diversity = n_distinct(t_words)\n           , lexical_density = lexical_diversity \/ word_frequency\n           , reptition = word_frequency \/ lexical_diversity\n           , song_avg_word_length = mean(nchar(t_words))\n           , song_title_words = lengths(gregexpr(\"[A-z]\\\\W+\",song_title)) +1L\n           , song_title_length = nchar(song_title)\n           , large_word_count = sum(ifelse((nchar(t_words)&gt;7),1,0))\n           , small_word_count = sum(ifelse((nchar(t_words)&lt;3),1,0))\n           , top_10_word_count \n              = sum(ifelse(t_words %in% words_top_10$common_word,15,0))\n           , top_100_word_count \n              = sum(ifelse(t_words %in% words_top_100$common_word,5,0))\n           , uncharted_word_count \n              = sum(ifelse(t_words %in% words_not_charted$common_word,5,0))\n           ) %&gt;%\n      select(-t_words) %&gt;%\n      select(album_name             #1. chr\n             , song_title           #2. chr\n             , artist               #3. chr\n             , peek_date            #4. 
date\n             , charted              #5. chr\n             , NumberOne            #6. bool\n             , peek_rank            #7. num\n             , album_year           #8. num\n             , album_decade         #9. num\n             , word_frequency       #10. num\n             , lexical_diversity    #11. num\n             , lexical_density      #12. num\n             , reptition            #13. num\n             , song_avg_word_length #14. num\n             , song_title_words     #15. num\n             , song_title_length    #16. num\n             , large_word_count     #17. num\n             , small_word_count     #18. num\n             , top_10_word_count    #19. num\n             , top_100_word_count   #20. num\n             , uncharted_word_count #21. num\n             , chart_group          #22. factor 3 levels\n             ) %&gt;%\n              \n      distinct() %&gt;%\n      ungroup()\n    \n    lf$chart_group &lt;- as.factor(lf$chart_group)\n    return(lf)\n}\n\ntrain_data_fe &lt;- lyric_features(train_lyric_scrubbed)\ntest_data_fe &lt;- lyric_features(test_lyric_scrubbed)<\/pre><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Building the Model<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Normalize the datasets<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">We need to normalize the feature columns for the models; the standardize method centers and scales each column (the range argument is only used by the range method).<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:r decode:true \">library(mlr ,quietly = TRUE, warn.conflicts = FALSE)  # provides normalizeFeatures, makeLearner, benchmark\n\ncol_nm &lt;-c(\"word_frequency\",\"lexical_diversity\",\"reptition\"\n           ,\"song_avg_word_length\",\"song_title_words\",\"song_title_length\"\n           ,\"large_word_count\",\"small_word_count\",\"top_10_word_count\"\n           ,\"top_100_word_count\",\"uncharted_word_count\"\n           )\ntrain_data_nm &lt;- normalizeFeatures(train_data_fe\n                                   ,method = \"standardize\"\n                                   ,cols=col_nm\n                                   
,range=c(0,1)\n                                   ,on.constant = \"quiet\")\ntest_data_nm &lt;- normalizeFeatures(test_data_fe\n                                   ,method = \"standardize\"\n                                   ,cols=col_nm\n                                   ,range=c(0,1)\n                                   ,on.constant = \"quiet\")<\/pre><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">Create the Classifiers<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Using a variety of models, including Naive Bayes, LDA, KSVM, KNN, RPart, Random Forest, XGBoost, and NNET.  <\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:r decode:true \"># will use a variety of models to see if any of the models\n# perform better with lyrics \n\nmodels = list(\n      makeLearner(\"classif.naiveBayes\", id = \"Naive Bayes\")\n      , makeLearner(\"classif.lda\", id = \"LDA\")\n      , makeLearner(\"classif.ksvm\", id = \"SVM\")\n      , makeLearner(\"classif.knn\", id = \"KNN\")\n      , makeLearner(\"classif.rpart\", id = \"RPART\", predict.type = \"prob\")\n      , makeLearner(\"classif.randomForest\", id = \"Random Forest\", predict.type = \"prob\")\n      , makeLearner(\"classif.xgboost\", id = \"XG Boost\", predict.type = \"prob\")\n      , makeLearner(\"classif.nnet\", id = \"Neural Net\", predict.type = \"prob\")\n)<\/pre><\/div>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:r decode:true \"># use cross fold validation\ncfold &lt;- makeResampleDesc(\"CV\" ,iters = 10, stratify = TRUE)\n\n# make classifiers\nexclude_cols = c(1:7)\ntrain_clf &lt;- makeClassifTask(id=\"Lyrics\"\n                             , data = train_data_nm[-exclude_cols]\n                             , target = \"chart_group\"\n                             )\n\ntest_clf &lt;- makeClassifTask(id=\"Lyrics\"\n                             , data = test_data_nm[-exclude_cols]\n                             , target = 
\"chart_group\"\n                             )\n\nlyric_train_benchmark &lt;- benchmark(models\n                         ,tasks = train_clf\n                         ,resamplings = cfold\n                         ,measures = list(acc, timetrain) \n                         ,show.info = FALSE\n                         )\n\n<\/pre><\/div>\n\n\n\n<pre class=\"wp-block-preformatted\">## # weights:  57\n## initial  value 1047.511809 \n## final  value 621.130815 \n## converged\n## # weights:  57\n## initial  value 704.646831 \n## final  value 618.710946 \n## converged\n## # weights:  57\n## initial  value 774.548012 \n## final  value 621.555100 \n## converged\n## # weights:  57\n## initial  value 1126.475235 \n## final  value 621.130815 \n## converged\n## # weights:  57\n## initial  value 759.930246 \n## final  value 618.710946 \n## converged\n## # weights:  57\n## initial  value 875.447375 \n## final  value 621.555100 \n## converged\n## # weights:  57\n## initial  value 1810.948881 \n## final  value 621.130815 \n## converged\n## # weights:  57\n## initial  value 768.154519 \n## final  value 621.555100 \n## converged\n## # weights:  57\n## initial  value 695.485397 \n## final  value 619.133874 \n## converged\n## # weights:  57\n## initial  value 626.916621 \n## final  value 621.130815 \n## converged<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Run the benchmark for each of the models and find out which model preforms best.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:r decode:true \"># Run the benchmark\nlyric_train_benchmark<\/pre><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">It looks like that Random Forest has the best results. Ok. 
That is the model I will use on the testing set.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">##   task.id    learner.id acc.test.mean timetrain.test.mean\n## 1  Lyrics   Naive Bayes     0.8706484              0.0039\n## 2  Lyrics           LDA     0.8560566              0.0119\n## 3  Lyrics           SVM     0.8756448              0.0771\n## 4  Lyrics           KNN     0.7680821              0.0001\n## 5  Lyrics         RPART     0.9011682              0.0056\n## 6  Lyrics Random Forest     0.9109544              0.3485\n## 7  Lyrics      XG Boost     0.8999476              0.0038\n## 8  Lyrics    Neural Net     0.6544656              0.0055<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Plot Training Results<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Here we can see a plot of how the different models performed.  The only models that performed poorly were KNN and NNET; the rest of the models scored above 0.85. <\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:r decode:true \">plotBMRSummary(lyric_train_benchmark)<\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"700\" height=\"432\" src=\"https:\/\/eipsoftware.com\/musings\/wp-content\/uploads\/2021\/10\/10_train_results.png\" alt=\"\" class=\"wp-image-405\" srcset=\"https:\/\/eipsoftware.com\/musings\/wp-content\/uploads\/2021\/10\/10_train_results.png 700w, https:\/\/eipsoftware.com\/musings\/wp-content\/uploads\/2021\/10\/10_train_results-300x185.png 300w\" sizes=\"auto, (max-width: 700px) 100vw, 700px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Confusion Matrix<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Now let&#8217;s look at the confusion matrix for our Random Forest model. The matrix shows the correctly and incorrectly classified songs. 
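<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">As a sanity check, accuracy can be recovered from a confusion matrix by dividing the diagonal (the correct predictions) by the total. Using the values from the Random Forest matrix shown next (the result differs slightly from the benchmark&#8217;s 0.9109544, which averages the per-fold accuracies):<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:r decode:true \"># accuracy = correct predictions \/ total predictions\ncm &lt;- matrix(c(524,  0,  12\n              , 21, 50,   2\n              , 37,  1, 172)\n             ,nrow = 3 ,byrow = TRUE)\nsum(diag(cm)) \/ sum(cm)  # ~0.911<\/pre><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">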
<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:r decode:true \">predictions_train &lt;- getBMRPredictions(lyric_train_benchmark)\ncalculateConfusionMatrix(predictions_train$Lyrics$`Random Forest`)$result<\/pre><\/div>\n\n\n\n<pre class=\"wp-block-preformatted\">##              predicted\n## true          Not Charted Top 10 Top 100 -err.-\n##   Not Charted         524      0      12     12\n##   Top 10               21     50       2     23\n##   Top 100              37      1     172     38\n##   -err.-               58      1      14     73<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Feature Importance<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Now let&#8217;s see which features contribute most to the model&#8217;s predictions.  The top 10 word count and top 100 word count are the most significant, and, as one would imagine, the length of the song title has almost no significance.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:r decode:true \"># Compute feature importance with two filter methods\nfeature_importance &lt;- generateFilterValuesData(task = train_clf\n                                               ,method = c(\"FSelector_information.gain\", \"FSelector_chi.squared\")\n                                               )\nplotFilterValues(feature_importance, n.show = 20)<\/pre><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"700\" height=\"432\" src=\"https:\/\/eipsoftware.com\/musings\/wp-content\/uploads\/2021\/10\/09_feature_values.png\" alt=\"\" class=\"wp-image-406\" srcset=\"https:\/\/eipsoftware.com\/musings\/wp-content\/uploads\/2021\/10\/09_feature_values.png 700w, https:\/\/eipsoftware.com\/musings\/wp-content\/uploads\/2021\/10\/09_feature_values-300x185.png 300w\" sizes=\"auto, (max-width: 700px) 100vw, 700px\" \/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Testing the Model<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Our final 
step is to test the model on the held-out test set.  The accuracy is 0.925, which is quite acceptable.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:r decode:true \"># Train Random Forest on the training task, then predict on the test task\nrf_model &lt;- train(\"classif.randomForest\", train_clf)\nresult_rf &lt;- predict(rf_model, test_clf)\nperformance(result_rf, measures = acc)<\/pre><\/div>\n\n\n\n<pre class=\"wp-block-preformatted\">##   acc \n## 0.925<\/pre>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:r decode:true \">calculateConfusionMatrix(pred = result_rf)<\/pre><\/div>\n\n\n\n<pre class=\"wp-block-preformatted\">##              predicted\n## true          Not Charted Top 10 Top 100 -err.-\n##   Not Charted          50      0       0      0\n##   Top 10                4      6       0      4\n##   Top 100               2      0      18      2\n##   -err.-                6      0       0      6<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Enhancing the Model<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Naturally, I am only looking at a small subset of artists. One way to improve the model&#8217;s accuracy would be to bring in lyrics from a much larger pool of artists.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this step, I will look at the data to see if we can do any feature engineering. And then I will edit the data for the model, train multiple models, evaluate the best model and then test the model. Let&#8217;s get started. 
Part 4 of 4 Steps for Creating the Model Song Summary Visualizations [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_crdt_document":"","footnotes":""},"categories":[58,5,59,41],"tags":[],"series":[],"class_list":["post-387","post","type-post","status-publish","format-standard","hentry","category-datascience","category-r","category-songlyrics","category-visualization-r"],"_links":{"self":[{"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/posts\/387","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/comments?post=387"}],"version-history":[{"count":11,"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/posts\/387\/revisions"}],"predecessor-version":[{"id":417,"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/posts\/387\/revisions\/417"}],"wp:attachment":[{"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/media?parent=387"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/categories?post=387"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/tags?post=387"},{"taxonomy":"series","embeddable":true,"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/series?post=387"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}