To obtain the data, I am going to use Beautiful Soup and a few other packages to scrape the content from the websites.
Part 2 of 4
Steps for Getting the Data
- Get the Songs Made by the Artists
- Extract the List of Songs by the Artist
- Scrape the Lyrics for Each Song by the Artist
- Extract the Lyrics from Each File
- Scrape Rankings by Artist
- Parse the Song Rankings Files
The first step is getting the list of songs by the artist. I am using the website http://www.azlyrics.com to obtain the list of songs and the lyrics for each song.
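As a side note, the artist pages on azlyrics.com appear to follow a simple URL pattern: a directory named after the first letter of the artist's name, then the artist's name in lowercase with spaces and hyphens removed. Below is a minimal sketch of building that URL, assuming the pattern holds; the artist_page_url helper is mine and is not part of the scraping scripts that follow.

# hypothetical helper: build the azlyrics artist page URL from the artist name
# assumes the /<first letter>/<artistname>.html pattern seen in step 1
def artist_page_url(artist):
    name = artist.lower().replace(" ", "").replace("-", "")
    return "https://www.azlyrics.com/" + name[0] + "/" + name + ".html"

print(artist_page_url("Rihanna"))   # https://www.azlyrics.com/r/rihanna.html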
Step 1 – Get the Songs Made by the Artist
I am setting a few local variables: the name of the artist, the webpage I want to download, and where I want to save the file. In the example below I am copying the list of songs by Rihanna.
# ----------------------------------------------------------------------------
# step 1
# get the list of songs made by the artist
# ----------------------------------------------------------------------------

import requests

test_level = 1000

artist = "Rihanna"
get_page = "https://www.azlyrics.com/r/rihanna.html"

base_dir = "Documents/webscrap/lyrics/" + artist + "_html/"
html_filename = base_dir + artist + "_songs.html"
Step 1.01 Save the .html locally for processing
Next I use the requests package to go out to the website, copy all of the .html, and save the file locally.
# ----------------------------------------------------------------------------
# step 1.01
# go out to website, grab html page with song titles and save locally
# ----------------------------------------------------------------------------

# get the html page
song_page = requests.get(get_page)

# show the status of the page
print(F"Page: {get_page} Status: {song_page.status_code}")

with open(html_filename, 'w') as html_file:
    html_file.write(song_page.text)
    print(F"saved: {html_filename}")
Step 2 Extract the List of Songs by the Artist
We are going to use the .html file we saved in the prior step and extract a list of songs and the URL to the song’s lyrics. I will save the list into a .csv file to be used in the next step.
Again I am setting some local variables: the name of the artist, the search term for the artist in the html file, and the file location for the .html that was saved in step 1.
# ----------------------------------------------------------------------------
# step 2
# from the songs html file, parse the song titles and URLs to the lyrics
# ----------------------------------------------------------------------------

import csv
from bs4 import BeautifulSoup

artist = "Taylor-Swift"
searchword = "taylorswift"

base_dir = "/home/george/Documents/webscrap/lyrics/" + artist + "_html/"
song_list = base_dir + artist + "_songs.html"
Step 2.01 Open the File
Straightforward: I will open one of the .html files I created in step 1.
# ----------------------------------------------------------------------------
# step 2.01
# open the saved html file from the local disk
# ----------------------------------------------------------------------------

with open(song_list) as song_html_file:
    read_data = song_html_file.read()
Step 2.02 Parse with Beautiful Soup
Now the Beautiful Soup package will parse the .html file.
First I will get the list of all the links in the .html file. Next I will look for the key search word, which is the name of the artist, and the value lyrics, since the website stores all of the song lyrics in a lyrics directory. If both conditions are true, I store the name of the song and the URL in a dictionary.
And for debugging purposes I print out all of the song lyric URLs.
# ----------------------------------------------------------------------------
# step 2.02
# parse the file using BeautifulSoup
# ----------------------------------------------------------------------------

# create the beautiful soup object
song_page = BeautifulSoup(read_data, 'html.parser')

# find all of the URLs in the webpage
links = song_page.find_all("a")

# root page of lyrics
root = "https://www.azlyrics.com"

# dictionary of song links
song_links = {}

for link in links:
    # skip anchors that have no href attribute
    song_link = link.get("href")
    if song_link is None:
        continue
    # check if the search word and 'lyrics' are in the URL
    if searchword in song_link and 'lyrics' in song_link:
        # replace the relative URL with an absolute URL
        # and store it in the dictionary
        song_links[link.text] = root + song_link.replace("..", "")

# print out the URLs for debugging
for link_text in song_links:
    song_link = song_links[link_text]
    print(song_link)
Step 2.03 Save the URLs to a Local File
Now that we have the dictionary with each song title and the URL of the song's lyrics, we can open a file and write the values to it using the csv package.
# ----------------------------------------------------------------------------
# step 2.03
# save the URLs to a local file
# use CSV
# ----------------------------------------------------------------------------

file_name = base_dir + artist + "_lyrics.csv"

with open(file_name, mode='w') as csv_file:
    field_names = ['song_name', 'song_url']
    writer = csv.DictWriter(csv_file, fieldnames=field_names)

    # write header row in the csv file
    writer.writeheader()

    # write the output
    for link_text in song_links:
        writer.writerow({'song_name': link_text, 'song_url': song_links[link_text]})

# end
Step 3 Scrape the Lyrics for Each Song by the Artist
Now I will use the list of songs and the corresponding URLs that I saved to a csv file in the previous step. For each URL I will save the .html file locally so it can be parsed in the next step. I added a few variables to allow restarting at a specific row in the csv file in case the requests timed out or were not saved correctly.
# ----------------------------------------------------------------------------
# step 3
# use the list of song titles and URLs and save html locally
# ----------------------------------------------------------------------------

import random
import csv
import time
import requests

test_level = 1000
starting_row = 204
ending_row = 400

artist = "Drake"
base_dir = "/home/george/Documents/webscrap/lyrics/" + artist + "_html/"
Step 3.01 Open the csv file and Retrieve the Song Name and URL of the Lyrics
Open the csv file saved in the prior step using the csv package. Conveniently, I saved the csv file using the artist name. The loop will extract each song name and URL and put them into a dictionary.
# ----------------------------------------------------------------------------
# step 3.01
# open the saved .csv file from the local disk
# ----------------------------------------------------------------------------

csv_filename = base_dir + artist + "_lyrics.csv"

song_links = {}

with open(csv_filename, mode='r') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    for row in csv_reader:
        song_links[row["song_name"]] = row["song_url"]

print(F"rows in file: {len(song_links)}")
Step 3.02 Retrieve the .html File and Save Locally
Now with the song name and the URL stored in the dictionary, I use the requests package to get the .html file. I will save the .html file with the song name in a separate directory for each artist.
The source website will block our small web scraping routine if we make too many requests at the same time. I use the time package to pause the loop between requests; it is also the reason I didn't parallelize the requests to the source web server. Additionally, since I am saving the .html locally and the lyrics content is static in nature, I don't need to repeatedly check whether the .html content has changed.
# ----------------------------------------------------------------------------
# step 3.02
# go out to website, grab html page with lyrics, save locally
# ----------------------------------------------------------------------------

counter = 0

# loop through the song urls in the dictionary
for song_name in song_links:

    # check where we are in getting items from the dictionary
    if counter >= starting_row and counter <= ending_row:

        # get the html page
        lyric_page = requests.get(song_links[song_name])

        # if the page is successfully retrieved then save it
        if lyric_page.status_code == 200:
            with open(base_dir + song_name + '.html', 'w') as html_file:
                html_file.write(lyric_page.text)
                print(F"Retrieved item: {counter} - {song_name}")
        else:
            print(F"ERROR getting: {counter} - {song_links[song_name]}")

        # wait so we don't overload remote site
        time.sleep(random.randint(2, 5))

    counter += 1

# end of file
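Because the lyrics pages are static, one small improvement when restarting is to skip any song whose .html file has already been saved, rather than requesting it again. Below is a minimal sketch of that check; it reuses base_dir from this step, and the needs_download helper name is my own, not part of the original script.

import os

# hypothetical helper: only fetch pages that have not already been saved locally
def needs_download(song_name):
    return not os.path.exists(base_dir + song_name + '.html')

# possible usage inside the loop from step 3.02 (sketch):
# if needs_download(song_name):
#     lyric_page = requests.get(song_links[song_name])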
Step 4 Extract the Lyrics from Each File
# ----------------------------------------------------------------------------
# step 4
# use the html lyric files saved locally and extract the lyrics
# ----------------------------------------------------------------------------

import csv
import re
from bs4 import BeautifulSoup, Comment

test_level = 1000

artist = "U2"
base_dir = "/home/george/Documents/webscrap/lyrics/" + artist + "_html/"
dest_dir = "/home/george/Documents/webscrap/lyrics/" + artist + "_Lyrics/"

search_phrase = "third-party lyrics provider"

substitution_words = {"MxM banner": ""  # remove phrase at end of text
                      , "’": "'"        # convert utf-8 into apostrophe
                      , "…": ""         # remove "..."
                      }

def dictionary_replace(text, replacement_dictionary):
    for search_term, replacement_term in replacement_dictionary.items():
        text = text.replace(search_term, replacement_term)
    return text

regex_pattern = re.compile(r"\[(.*?)\]", re.IGNORECASE)

def regex_replace(text):
    text = regex_pattern.sub("", text)
    return text
Step 4.01 Open the .csv File from Prior Step
I will open the .csv file saved in step 3. After opening the file, I read each row and store the song name and the URL in a dictionary.
# ----------------------------------------------------------------------------
# step 4.01
# open the saved .csv file from the local disk
# ----------------------------------------------------------------------------

csv_filename = base_dir + artist + "_lyrics.csv"

song_links = {}

with open(csv_filename, mode='r') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    for row in csv_reader:
        song_links[row["song_name"]] = row["song_url"]

print(F"rows in file: {len(song_links)}")

counter = 0
Step 4.02 Extract the Lyrics from the .html File
This is the heart of the process: I open each .html file that was saved in step 3 and use the Beautiful Soup package to parse through all of the .html elements. The lyrics are preceded by a key phrase in an .html comment that I can search for; I use the variable search_phrase to find that comment, which indicates that the elements that follow contain the song lyrics.
I append the text content that follows the comment until I reach the next div element, which closes the lyrics block. This particular pattern matching is unique to this website; it will not always be applicable to other web sites.
The last bit of this section is a quick substitution pass, for which I wrote a custom function. When extracting the content of the div elements, the text of the last div was superfluous and can be removed. The apostrophe was stored as a curly unicode character that I convert back to a plain apostrophe, and I remove the ellipses when I encounter them. Using the dictionary method of text substitution allows rapid substitution of a large number of pre-defined phrases that need to be removed from the extracted text.
Additionally, I created another custom function using the re module to remove bracketed annotations, such as [Chorus], from the text.
# ----------------------------------------------------------------------------
# step 4.02
# loop through the html files in the dictionary
# ----------------------------------------------------------------------------

for song_name in song_links:

    counter += 1
    if counter >= test_level:
        break

    print(F"Opening item: {counter} - {song_name}")

    with open(base_dir + song_name + '.html') as current_file:
        read_html_file = current_file.read()

    # create the beautiful soup object
    lyric_page = BeautifulSoup(read_html_file, 'html.parser')

    # string to hold the lyric text
    lyric_text = ""

    for comment in lyric_page.findAll(text=lambda text: isinstance(text, Comment)):
        if search_phrase in comment:
            next_el = comment.next_element
            # search for the closing div
            while next_el.name != "div":
                if next_el.name is None:
                    # append the text to the lyric string
                    lyric_text = lyric_text + next_el.string
                next_el = next_el.next_element

    # clean up some scrub text in the lyrics file
    lyric_text = dictionary_replace(lyric_text, substitution_words)
    lyric_text = regex_replace(lyric_text)

    #print(lyric_text)
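To make the cleanup step concrete, here is a small made-up example of what the two helper functions from step 4 do to a line of text. The sample string is mine and is not taken from any scraped file.

# hypothetical sample line, not from the scraped data
sample = "[Chorus]\nIt’s only a test… line"

# dictionary_replace converts the curly apostrophe and drops the ellipsis
cleaned = dictionary_replace(sample, substitution_words)

# regex_replace strips the bracketed "[Chorus]" annotation
cleaned = regex_replace(cleaned)

print(cleaned)   # prints a blank line, then: It's only a test line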
Step 4.03 Save the File
Now that the lyrics have been extracted I can save the plain text of the lyrics into a local file.
    # ----------------------------------------------------------------------------
    # step 4.03
    # write the lyrics to a local text file (still inside the loop from step 4.02)
    # ----------------------------------------------------------------------------

    print(F"Saving file: {counter} - {song_name}.txt")

    with open(dest_dir + song_name + '.txt', 'w') as text_file:
        text_file.write(lyric_text)
Step 5 Scrape Rankings by Artist for Each Song
Next, I need to get the rankings of each song. I am using the Billboard Hot 100 to determine how well a song did commercially. The rank is in ascending order: the song that sold the most copies, and where applicable had the most streams, is ranked first. The ranking methodology is unique and doesn't necessarily align with how other sources may have ranked the commercial success of a song in prior periods. The rankings are issued weekly.
Using the requests package, I will get the .html pages from the source website. The rankings are organized across multiple pages. I use the variables starting_row and ending_row to set how many pages of rankings are available for the artist.
I output to the screen which ranking page I am currently fetching. After the page is fetched from the remote server, I save the .html file to a local file for parsing in the next step.
I use the time package to put in a delay to prevent being blocked from fetching the web page content.
# ----------------------------------------------------------------------------
# step 5
# go out to website, grab html page with rankings, save locally
# ----------------------------------------------------------------------------

import csv
import time
import requests

# test_level = 1000
starting_row = 1
ending_row = 7

artist = "Rihanna"
base_dir = "Documents/ranks/" + artist + "/"
base_html = "https://www.billboard.com/music/" + artist.lower() + "/chart-history/HSI"

counter = starting_row

# loop through the ranking pages
while counter >= starting_row and counter <= ending_row:

    # build the URL for the current ranking page
    if counter > 1:
        get_page = base_html + "/" + str(counter)
    else:
        get_page = base_html

    print(F"Fetching: {get_page}")

    # get the html page
    ranking_page = requests.get(get_page)

    # show the status of the page
    print(F"Page: {counter} Status: {ranking_page.status_code}")

    with open(base_dir + artist + '_' + str(counter) + '.html', 'w') as html_file:
        html_file.write(ranking_page.text)

    # wait so we don't overload remote site
    if counter <= ending_row:
        time.sleep(3)

    counter += 1
Step 6 Parse the Song Rankings Files
Similar to how the lyrics were extracted from the .html files, I will extract the rankings for each of the artist's songs that appeared on the Billboard Hot 100. I use the glob package to build the list of files in the directory to process.
# ----------------------------------------------------------------------------
# step 6
# use the html rank files saved locally and extract the rankings per song
# ----------------------------------------------------------------------------

import csv
import glob
from bs4 import BeautifulSoup
import pandas as pd

test_level = 1000

artist = "U2"
base_dir = "/home/george/Documents/webscrap/ranks/" + artist + "/"
dir_list = glob.glob("/home/george/Documents/webscrap/ranks/" + artist + "/*.html")
dest_dir = "/home/george/Documents/webscrap/ranks/" + artist + ".txt"
Step 6.01 Parse the Rankings Files for the Artist
For each rankings file in the directory, I use the Beautiful Soup package to search through the html elements, looking for key phrases to determine what type of content is in each div. The search phrase for each type of content is unique, so I am able to use the class attribute of the link or div element to identify the content.
# ----------------------------------------------------------------------------
# step 6.01
# loop through the html files in the directory
# ----------------------------------------------------------------------------

# dictionary of song ranks
song_ranks = {}

counter = 0

# unique div names
div_ranking = "artist-section--chart-history__title-list"
div_song_title = "artist-section--chart-history__title-list__title__text"
div_song_title_link = "artist-section--chart-history__title-list__title__text--title"
div_song_artist = "artist-section--chart-history__title-list__title__text--artist-name"
div_song_peak = "artist-section--chart-history__title-list__title__text--peak-rank"

# lists to hold values
songs = []
peek_rank = []
peek_date = []

for file_name in dir_list:

    counter += 1
    if counter >= test_level:
        break

    print(F"Opening item: {counter} - {file_name}")

    with open(file_name) as current_file:
        read_html_file = current_file.read()

    # create the beautiful soup object
    rank_page = BeautifulSoup(read_html_file, 'html.parser')

    # string to hold the ranking html
    ranking_text = rank_page.find("div", {"class": div_ranking})

    ranking_html = BeautifulSoup(read_html_file, "html.parser")

    for song_title in ranking_html.find_all("a", {"class": div_song_title_link}):
        songs.append(song_title.string)

    for song_rank in ranking_html.find_all("div", {"class": div_song_peak}):
        peek_rank.append(song_rank.next_element.string)
        peek_date.append(song_rank.find("a").string)
Step 6.02 Put the Song Details into a Data Frame
Using the pandas package, I write the extracted information to a data frame. I use the data frame to aid with exporting the information to a tab delimited file.
Any extraneous newlines and the phrase "Peaked at" are removed from the data frame, since they are not needed for processing.
# ----------------------------------------------------------------------------
# step 6.02
# put list into a dataframe
# ----------------------------------------------------------------------------

# merge the lists
song_details = list(zip(songs, peek_rank, peek_date))

# write to data frame
song_df = pd.DataFrame(song_details, columns=["song_title", "peek_rank", "peek_date"])

# remove new lines and extra phrases from the data frame
song_df = song_df.replace(r'\\n', '', regex=True)
song_df = song_df.replace(r'\n', '', regex=True)
song_df = song_df.replace(r'Peaked at ', '', regex=True)
Step 6.03 Save to a Tab Delimited File
And the last step uses the pandas to_csv method to save the content locally to a tab delimited file.
# ----------------------------------------------------------------------------
# step 6.03
# save the text to a local file
# use TSV
# ----------------------------------------------------------------------------

song_df.to_csv(dest_dir, sep="\t")
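As a quick sanity check, the tab delimited file can be read back into a data frame before moving on. A minimal sketch, assuming the same dest_dir path from step 6:

# read the tab delimited rankings back into a data frame
# (column 0 is the index that to_csv wrote out)
song_check_df = pd.read_csv(dest_dir, sep="\t", index_col=0)
print(song_check_df.head())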