To obtain the data I am going to use Beautiful Soup and a few other packages to scrape the content from the websites.

Part 2 of 4

Steps for Getting the Data

  1. Get the Songs Made by the Artist
  2. Extract the List of Songs by the Artist
  3. Scrape the Lyrics for Each Song by the Artist
  4. Extract the Lyrics from Each File
  5. Scrape Rankings by Artists
  6. Parse the Song Rankings Files

The first step is getting the list of the songs by the artist. I am using the website http://www.azlyrics.com to obtain the list of songs and the lyrics for the songs.

Step 1 – Get the Songs Made by the Artist

I am setting a few local variables: the name of the artist, the webpage I want to download, and where I want to save the file. In the example below I am copying the list of songs by Rihanna.

Step 1.01 Save the .html locally for processing

Next I use the requests package to go out to the website, copy all of the .html, and save the file locally.
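Below is a minimal sketch of what this step might look like; the artist name, URL, and file path are illustrative assumptions rather than the exact values used in the project.

```python
import requests

# Local variables for this run -- the artist, URL, and save path are assumptions
artist = "rihanna"
artist_page_url = "https://www.azlyrics.com/r/rihanna.html"  # page listing the artist's songs
save_path = artist + ".html"                                 # where the raw page is stored

# Fetch the artist page and save the raw .html locally for parsing in step 2
response = requests.get(artist_page_url)
response.raise_for_status()

with open(save_path, "w", encoding="utf-8") as f:
    f.write(response.text)
```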

Step 2 Extract the list of Songs by the Artist

We are going to use the .html file we saved in the prior step and extract a list of songs and the URL to the song’s lyrics. I will save the list into a .csv file to be used in the next step.

Again I am setting some local variables: the name of the artist, how to search for the artist in the .html file, and the file location of the .html that was saved in step 1.

Step 2.01 Open the File

Straightforward: I will open one of the .html files I created in step 1.

Step 2.02 Parse with Beautiful Soup

Now the Beautiful Soup package will parse the .html file.

First I will get the list of all the links in the .html file. Next I will look for the key search word, which is the name of the artist, and the value lyrics, since the website saves all of the song lyrics in a lyrics directory. If the conditions are true I store the name of the song and the URL in a dictionary.

And for debugging purposes I print out all of the song lyric URLs.
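A sketch of this parsing logic follows; it assumes the saved page's layout, where every lyrics link contains both the artist name and a lyrics directory in its href.

```python
from bs4 import BeautifulSoup

artist = "rihanna"                # the key search word (assumption)
file_location = artist + ".html"  # the .html file saved in step 1

with open(file_location, encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

# Walk every link on the page; keep the ones that point into the lyrics
# directory and mention the artist, mapping song title -> lyrics URL
songs = {}
for link in soup.find_all("a", href=True):
    href = link["href"]
    if "lyrics" in href and artist in href:
        songs[link.get_text(strip=True)] = href

# Debugging: print every song lyric URL that was found
for title, href in songs.items():
    print(title, href)
```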

Step 2.03 Save the URLs to a local file

Now that we have the dictionary with the song titles and the URLs of the songs’ lyrics, we can open a file and write the values to it, utilizing the csv package.
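A short sketch of the write-out, continuing from the songs dictionary built above; the file name is an assumed convention.

```python
import csv

csv_path = artist + ".csv"  # one csv per artist (assumed naming)

# Write one row per song: the title and the URL of its lyrics page
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for title, href in songs.items():
        writer.writerow([title, href])
```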

Step 3 Scrape the Lyrics for Each Song by the Artist

Now I will use the list of songs and the corresponding URLs that I saved to a csv file in the previous step. For each URL I will save the .html file locally so it can be parsed in the next step. I added a few variables to allow for restarting at a specific row in the csv file in case the requests timed out or were not saved correctly.

Step 3.01 Open the csv file and Retrieve the Song Name and URL of the Lyrics

Open the csv file saved in the prior step using the csv package. Conveniently I saved the csv file using the artist name. The loop will extract each song name and URL and put them into a dictionary.
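A sketch of reading the csv back in; it assumes each row holds exactly the song name and URL written in step 2.

```python
import csv

artist = "rihanna"
csv_path = artist + ".csv"  # the csv saved in step 2 (assumed naming)

# Read each row and store song name -> lyrics URL in a dictionary
song_urls = {}
with open(csv_path, newline="", encoding="utf-8") as f:
    for song_name, url in csv.reader(f):
        song_urls[song_name] = url
```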

Step 3.02 Retrieve the .html File and Save Locally

Now, with the song names and URL locations stored in the dictionary, I use the requests package to get each .html file. I will save the .html file with the song name in a separate directory for each artist.

The source website will block our small web scraping routine if we make too many requests at the same time. I use the time package to pause the loop between requests. It is also the reason I didn’t parallelize the requests to the source web server. Additionally, since I am saving the .html locally and the content of the lyrics is static in nature, I don’t need to repeatedly check whether the .html content has changed.
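A sketch of the download loop, continuing from the artist name and song_urls dictionary above; the directory layout, restart variable, and delay length are assumptions.

```python
import os
import time
import requests

starting_row = 0    # resume point if an earlier run timed out (assumption)
delay_seconds = 10  # pause between requests so the site doesn't block us

out_dir = os.path.join("lyrics_html", artist)  # one directory per artist
os.makedirs(out_dir, exist_ok=True)

for row, (song_name, url) in enumerate(song_urls.items()):
    if row < starting_row:
        continue  # skip songs that were already downloaded
    response = requests.get(url)
    out_path = os.path.join(out_dir, song_name + ".html")
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(response.text)
    time.sleep(delay_seconds)  # be polite: one request at a time, with a pause
```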

Step 4 Extract the Lyrics from Each File

Step 4.01 Open the .csv File from Prior Step

I will open the .csv file saved in step 3. After opening the file, I will read each row and store the song name and the URL in a dictionary.

Step 4.02 Extract the Lyrics from the .html File

This is the heart of the process: I open each .html file that was saved in step 3 and use the Beautiful Soup package to parse through all of the .html elements. The lyrics have a key phrase in the .html that I can search for; I use the variable search_phrase to find that element. The element indicates that the next element will start to contain the song lyrics.

I append the contents of each div element until I reach an element that isn’t a div. This particular pattern matching is unique to this particular website and will not necessarily apply to other websites.
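A sketch of the extraction; the search_phrase value and the file path are assumptions, and the sibling-walking pattern mirrors the description above rather than a guaranteed page layout.

```python
from bs4 import BeautifulSoup

search_phrase = "Usage of azlyrics.com content"  # assumed marker text near the lyrics
html_path = "lyrics_html/rihanna/Umbrella.html"  # hypothetical file saved in step 3

with open(html_path, encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

# Find the element whose text contains the search phrase, then walk its
# following siblings, collecting div contents until a non-div element appears
marker = soup.find(string=lambda s: s and search_phrase in s)

lyrics_parts = []
if marker is not None:
    for sibling in marker.parent.find_next_siblings():
        if sibling.name != "div":
            break
        lyrics_parts.append(sibling.get_text())

lyrics = "\n".join(lyrics_parts)
```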

The last bit of this section is a quick substitution. I wrote a custom function to help with the substitutions. When extracting the content of the div elements, the last div was superfluous and can be removed. The apostrophe was stored in a unique format that I substitute, and when I encounter an ellipsis I remove the text. Using a dictionary for text substitution allows for rapid substitution of a large number of pre-defined phrases that need to be stripped from the .html elements.

Additionally I created another custom function using the regex package to remove leading and trailing brackets from the text.
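The two helper functions might look something like the sketch below, continuing from the lyrics string above; the substitution table entries and function names are assumptions standing in for the real list.

```python
import re

# Assumed substitution table: encoded apostrophes are restored, ellipses dropped
substitutions = {
    "&#39;": "'",
    "...": "",
}

def substitute_phrases(text, table):
    """Apply every pre-defined substitution in the table to the text."""
    for old, new in table.items():
        text = text.replace(old, new)
    return text

def strip_brackets(text):
    """Remove leading and trailing square brackets from a line of text."""
    return re.sub(r"^\[|\]$", "", text.strip())

lyrics = substitute_phrases(lyrics, substitutions)
lyrics = "\n".join(strip_brackets(line) for line in lyrics.splitlines())
```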

Step 4.03 Save the File

Now that the lyrics have been extracted I can save the plain text of the lyrics into a local file.
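Saving is then just a couple of lines; the output path below is an assumption.

```python
import os

out_path = os.path.join("lyrics_text", "rihanna", "Umbrella.txt")  # assumed layout
os.makedirs(os.path.dirname(out_path), exist_ok=True)

# Write the cleaned lyrics as plain text
with open(out_path, "w", encoding="utf-8") as f:
    f.write(lyrics)
```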

Step 5 Scrape Rankings by Artists for Each Song

Next, I need to get the rankings for each song. I am using the Billboard Hot 100 to determine how well a song did commercially. The rank is in ascending order: the song that sold the most copies and, where applicable, had the most streams is ranked first. The ranking is unique and doesn’t align with how other sources may rank the commercial success of a song in prior periods. The rankings are issued weekly.

Using the requests package, I will get the .html pages from the source website. The rankings were organized in a multiple-page layout. I used the variables starting_row and ending_row to determine how many pages of rankings were available for the artist.

I output to the screen which ranking page I am currently fetching. After the page is fetched from the remote server, I save the .html file to a local file for parsing in the next step.

I use the time package to put in a delay to prevent being blocked from fetching the web page content.
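A sketch of the fetch loop; the base URL pattern is a hypothetical stand-in for the actual rankings site, and the starting_row/ending_row values are assumptions.

```python
import time
import requests

artist = "rihanna"
base_url = "https://example.com/charts/" + artist + "?page="  # hypothetical URL pattern
starting_row = 1    # first rankings page to fetch
ending_row = 5      # last rankings page available for the artist (assumption)
delay_seconds = 10  # pause between requests to avoid being blocked

for page in range(starting_row, ending_row + 1):
    print("Fetching rankings page", page)
    response = requests.get(base_url + str(page))
    with open(f"rankings_{artist}_{page}.html", "w", encoding="utf-8") as f:
        f.write(response.text)
    time.sleep(delay_seconds)
```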

Step 6 Parse the Song Rankings Files

Similar to how the lyrics were extracted from the .html files, I will follow the same process and extract the rankings for each song by the artist that appeared on the Billboard Hot 100. I use the glob package to build the list of files in the directory to process.

Step 6.01 Parse the Rankings Files for the Artist

For each rankings file in the directory, I use the Beautiful Soup package to search through the .html elements, looking for key phrases to determine what type of content is in each div. The search phrase for each type of content is unique. I am able to use the class attribute of the link or div element to determine the content.
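A sketch of the parsing pass; the class names used to pick out the song title and peak position are assumptions standing in for the site’s real markup.

```python
import glob
from bs4 import BeautifulSoup

artist = "rihanna"
ranking_files = glob.glob(f"rankings_{artist}_*.html")  # pages saved in step 5

rows = []
for path in ranking_files:
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")
    # The class attribute on each div/link identifies its content; these class
    # names are assumptions, not the site's actual markup
    for entry in soup.find_all("div", class_="chart-item"):
        title = entry.find("a", class_="song-title")
        peak = entry.find("div", class_="peak-position")
        if title and peak:
            rows.append({"song": title.get_text(strip=True),
                         "peak": peak.get_text(strip=True)})
```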

Step 6.02 Put the Song Details into a Data Frame

Using the pandas package, I write the extracted information to a data frame. I use the data frame to aid with exporting the information to a tab delimited file.

I remove any extraneous blank lines or lines with the phrase “Peaked at” from the data frame; those rows are not needed for processing.
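Continuing from the rows list above, a sketch of building and cleaning the data frame; the column names are assumptions.

```python
import pandas as pd

df = pd.DataFrame(rows, columns=["song", "peak"])

# Drop extraneous blank rows and rows that only carry the "Peaked at" phrase
df = df[df["song"].str.strip() != ""]
df = df[~df["song"].str.contains("Peaked at")]
```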

Step 6.03 Save to a Tab Delimited File

And for the last step, I will use the csv package to save the content locally to a tab delimited file.
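A final sketch of the export, continuing from the artist name and data frame above; the output file name is an assumption.

```python
import csv

with open(artist + "_rankings.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")  # tab delimited output
    writer.writerow(df.columns)
    writer.writerows(df.itertuples(index=False))
```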