{"id":330,"date":"2020-02-10T22:00:00","date_gmt":"2020-02-10T22:00:00","guid":{"rendered":"https:\/\/eipsoftware.com\/musings\/?p=330"},"modified":"2021-10-03T20:56:16","modified_gmt":"2021-10-03T20:56:16","slug":"lyrical-success-getting-the-data","status":"publish","type":"post","link":"https:\/\/eipsoftware.com\/musings\/lyrical-success-getting-the-data\/","title":{"rendered":"Lyrical Success &#8211; Getting the Data"},"content":{"rendered":"\n<p class=\"has-normal-font-size wp-block-paragraph\">To obtain the data I am going to use Beautiful Soup and a few other packages to scrape the content from the websites. <\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Part 2 of 4<\/h2>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Steps for Getting the Data<\/strong><\/h2>\n\n\n\n<ol class=\"wp-block-list\" style=\"font-size:22px\"><li><a href=\"#get-songs-by-artist\">Get the Songs Made by the Artist<\/a><\/li><li><a href=\"#extract-list-songs-by-artist\">Extract the List of Songs by the Artist<\/a><\/li><li><a href=\"#web-scrap-lyrics-each-song\">Scrape the Lyrics for Each Song by the Artist<\/a><\/li><li><a href=\"#extract-lyrics-from-html-file\">Extract the Lyrics from Each File<\/a><\/li><li><a href=\"#web-scarp-rankings-artist\">Scrape Rankings by Artist<\/a><\/li><li><a href=\"#parse-song-rankings\">Parse the Song Rankings Files<\/a><\/li><\/ol>\n\n\n\n<p class=\"has-normal-font-size wp-block-paragraph\">The first step is getting the list of songs by the artist.  
I am using the website <a href=\"http:\/\/www.azlyrics.com\">http:\/\/www.azlyrics.com<\/a> to obtain the list of songs and the lyrics for each song.<\/p>\n\n\n\n<!--more-->\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<p class=\"has-large-font-size wp-block-paragraph\" id=\"get-songs-by-artist\"><strong>Step 1 &#8211; Get the Songs Made by the Artist<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I am setting a few local variables: the name of the artist, the <a href=\"https:\/\/www.azlyrics.com\/r\/rihanna.html\">webpage<\/a> I want to download, and where I want to save the file. In the example below I am copying the list of songs by Rihanna.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"width:800 lang:python range:1-100 decode:true \"># ----------------------------------------------------------------------------\n# step 1 \n# get the list of songs made by the artist\n# ----------------------------------------------------------------------------\n\nimport requests\n\ntest_level = 1000\n\nartist = \"Rihanna\"\nget_page = \"https:\/\/www.azlyrics.com\/r\/rihanna.html\"\n\nbase_dir = \"Documents\/webscrap\/lyrics\/\" + artist + \"_html\/\"\nhtml_filename = base_dir + artist + \"_songs.html\"<\/pre><\/div>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<p class=\"wp-block-paragraph\" style=\"font-size:20px\"><strong>Step 1.01 Save the .html locally for processing<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Next I use the requests package to go out to the website, copy all of the .html, and save the file locally.  
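The path handling in these scripts assumes the per-artist directory already exists. A small helper can build the same paths with pathlib; this is a sketch, and the `build_paths` name and `create` flag are my own additions, not part of the original script.

```python
from pathlib import Path

def build_paths(artist, root="Documents/webscrap/lyrics", create=False):
    # hypothetical helper: build the per-artist directory and the
    # filename for the saved songs page used in step 1
    base_dir = Path(root) / f"{artist}_html"
    if create:
        # create the directory tree so the later open(..., 'w') cannot fail
        base_dir.mkdir(parents=True, exist_ok=True)
    return base_dir / f"{artist}_songs.html"
```

Calling `build_paths("Rihanna")` yields the same `Rihanna_html/Rihanna_songs.html` path the script builds by string concatenation, without worrying about trailing slashes.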
<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"width:800 lang:python decode:true \"># ----------------------------------------------------------------------------\n# step 1.01 \n# go out to website, grab html page with song titles and save locally\n# ----------------------------------------------------------------------------\n\n# get the html page\nsong_page = requests.get(get_page)\n# show the status of the page\nprint(F\"Page: {get_page}  Status: {song_page.status_code}\")\n\nwith open(html_filename, 'w') as html_file:\n    html_file.write(song_page.text)\n\nprint(F\"saved: {html_filename}\")<\/pre><\/div>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<p class=\"has-large-font-size wp-block-paragraph\" id=\"extract-list-songs-by-artist\"><strong>Step 2 Extract the List of Songs by the Artist<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We are going to use the .html file we saved in the prior step and extract a list of songs and the URL to each song&#8217;s lyrics. I will save the list into a .csv file to be used in the next step. 
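The .csv hand-off between this step and step 3 is a simple write-then-read round trip. A minimal in-memory sketch of that round trip (the song data here is hypothetical):

```python
import csv
import io

# hypothetical song -> URL mapping, in the shape step 2.02 produces
song_links = {"Umbrella": "https://www.azlyrics.com/lyrics/rihanna/umbrella.html"}

# write it out the way step 2.03 does
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["song_name", "song_url"])
writer.writeheader()
for name in song_links:
    writer.writerow({"song_name": name, "song_url": song_links[name]})

# read it back the way step 3.01 does
buffer.seek(0)
restored = {row["song_name"]: row["song_url"] for row in csv.DictReader(buffer)}
```

Because both sides use the same field names, the dictionary read back in step 3 is identical to the one written here.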
<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Again I am setting some local variables: the name of the artist, the search word for the artist in the html file, and the file location of the .html that was saved in step 1.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"width:800 lang:python decode:true \"># ----------------------------------------------------------------------------\n# step 2 \n# from the songs html file, parse the song titles and URLs to the lyrics\n# ----------------------------------------------------------------------------\n\nimport csv\nfrom bs4 import BeautifulSoup\n\n\nartist = \"Taylor-Swift\"\nsearchword = \"taylorswift\"\n\nbase_dir = \"\/home\/george\/Documents\/webscrap\/lyrics\/\" + artist + \"_html\/\"\nsong_list = base_dir + artist + \"_songs.html\" \n<\/pre><\/div>\n\n\n\n<p class=\"wp-block-paragraph\" style=\"font-size:20px\"><strong>Step 2.01 Open the File<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This part is straightforward: I open one of the .html files I created in step one.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"width:800 lang:python range:1-100 decode:true \"># ----------------------------------------------------------------------------\n# step 2.01 \n# open the saved html file from the local disk\n# ----------------------------------------------------------------------------\n\nwith open(song_list) as song_html_file:\n    read_data = song_html_file.read()<\/pre><\/div>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<p class=\"wp-block-paragraph\" style=\"font-size:20px\"><strong>Step 2.02 Parse with Beautiful Soup<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Now the Beautiful Soup package will parse the .html file. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">First I will get the list of all the links in the .html file. 
Next I will look for the key search word, which is the name of the artist, and the value 'lyrics', since the website saves all of the song lyrics in a lyrics directory.  If both conditions are true I store the name of the song and the URL in a dictionary.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">And for debugging purposes I print out all of the song lyric URLs.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"width:800 lang:python range:1-100 decode:true \"># ----------------------------------------------------------------------------\n# step 2.02 \n# parse the file using BeautifulSoup\n# ----------------------------------------------------------------------------\n\n# create the beautiful soup object\nsong_page = BeautifulSoup(read_data, 'html.parser')\n\n# find all of the URLS in the webpage\nlinks = song_page.find_all(\"a\")\n\n# root page of lyrics\nroot = \"https:\/\/www.azlyrics.com\"\n\n# dictionary of song links\nsong_links = {}\n\nfor link in links:\n    # extract the URL; skip anchors that have no href at all\n    song_link = link.get(\"href\")\n    # check if the search word and 'lyrics' are in the URL\n    if song_link and searchword in song_link and 'lyrics' in song_link:\n        # replace the relative URL with an absolute URL\n        # and store in the dictionary\n        song_links[link.text] = root + song_link.replace(\"..\", \"\")\n\n# print out the URLS\nfor link_text in song_links:\n    song_link = song_links[link_text]\n    print(song_link)<\/pre><\/div>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<p class=\"wp-block-paragraph\" style=\"font-size:20px\"><strong>Step 2.03 Save the URLS to a local file<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Now that we have the dictionary with the song titles and the URLs of where each song&#8217;s lyrics are, we can open a file and write the values to it, utilizing the csv package. 
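The string replace works because this site uses `..` for its relative links; `urllib.parse.urljoin` is a more general way to resolve them. A sketch with hypothetical hrefs, using the same two filtering conditions as step 2.02:

```python
from urllib.parse import urljoin

def absolute_song_links(hrefs, searchword,
                        base="https://www.azlyrics.com/r/rihanna.html"):
    # keep hrefs containing the artist search word and 'lyrics',
    # resolving relative paths against the page they came from
    return [urljoin(base, href)
            for href in hrefs
            if searchword in href and "lyrics" in href]

# hypothetical hrefs as they might appear on an artist page
hrefs = ["../lyrics/rihanna/umbrella.html", "../r/rihanna.html", "#top"]
links = absolute_song_links(hrefs, "rihanna")
```

Only the first href passes both checks, and `urljoin` resolves the `..` against the artist page's directory.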
<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"width:800 lang:python range:1-100 decode:true \"># ----------------------------------------------------------------------------\n# step 2.03 \n# save the URLS to a local file\n# use CSV\n# ----------------------------------------------------------------------------\nfile_name = base_dir + artist + \"_lyrics.csv\"\nwith open(file_name, mode='w') as csv_file:\n    field_names = ['song_name', 'song_url']\n    writer = csv.DictWriter(csv_file, fieldnames=field_names)\n    # write header row in the csv file\n    writer.writeheader()\n    # write the output\n    for link_text in song_links:\n        writer.writerow({'song_name': link_text, 'song_url': song_links[link_text]})\n\n\n# end<\/pre><\/div>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<p class=\"has-large-font-size wp-block-paragraph\" id=\"web-scrap-lyrics-each-song\"><strong>Step 3 Scrape the Lyrics for Each Song by the Artist<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Now I will use the list of songs and the corresponding URLs that I saved in a csv file in the previous step.  For each URL I will save the .html file locally to be parsed in the next step.  I added a few variables to allow for restarting at a specific row in the csv file in case the requests timed out, or were not saved correctly.  
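The restart logic boils down to a window check on the row counter. Pulled out as a pure function (my own refactoring, not in the original script), it can be exercised without touching the network:

```python
def rows_in_window(song_links, starting_row, ending_row):
    # return the (song_name, url) pairs whose zero-based position falls
    # inside the inclusive restart window, mirroring the counter check
    return [(name, url)
            for counter, (name, url) in enumerate(song_links.items())
            if starting_row <= counter <= ending_row]

# hypothetical data: restart at row 1 after a timeout
songs = {"Song A": "url1", "Song B": "url2", "Song C": "url3"}
window = rows_in_window(songs, 1, 2)
```

This relies on dictionaries preserving insertion order (Python 3.7+), which is also what makes the restart rows stable between runs.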
<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \"># ----------------------------------------------------------------------------\n# step 3\n# use the list of song titles and URLs and save html locally\n# ----------------------------------------------------------------------------\nimport random\nimport csv\nimport time\nimport requests\n\ntest_level = 1000\nstarting_row = 204\nending_row = 400\n\nartist = \"Drake\"\nbase_dir = \"\/home\/george\/Documents\/webscrap\/lyrics\/\" + artist + \"_html\/\"<\/pre><\/div>\n\n\n\n<p class=\"wp-block-paragraph\" style=\"font-size:20px\"><strong>Step 3.01 Open the csv file and Retrieve the Song Name and URL of the Lyrics<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Open the csv file saved in the prior step using the csv package. Conveniently, I saved the csv file using the artist name.  The loop will extract each song name and URL and put them into the dictionary.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \"># ----------------------------------------------------------------------------\n# step 3.01 \n# open the saved .csv file from the local disk\n# ----------------------------------------------------------------------------\ncsv_filename = base_dir + artist + \"_lyrics.csv\"\n\nsong_links = {}\n\nwith open(csv_filename, mode='r') as csv_file:\n    csv_reader = csv.DictReader(csv_file)\n    for row in csv_reader:\n        song_links[row[\"song_name\"]] = row[\"song_url\"]\n\nprint(F\"rows in file: {len(song_links)}\")<\/pre><\/div>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<p class=\"wp-block-paragraph\" style=\"font-size:20px\"><strong>Step 3.02 Retrieve the .html File and Save Locally<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Now with the song name and the URL location stored in the dictionary, I use the requests package 
to get the .html file.  I will save the .html file with the song name in a separate directory for each artist. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The source website will block our small web scraping routine if we make too many requests in a short period. I use the time package to pause the loop between requests. It is also the reason I didn&#8217;t parallelize the requests to the source web server.  Additionally, since I am saving the .html locally and the content of the lyrics is static in nature, I don&#8217;t need to repeatedly check if the .html content has changed.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \"># ----------------------------------------------------------------------------\n# step 3.02 \n# go out to website, grab html page with lyrics, save locally\n# ----------------------------------------------------------------------------\n\ncounter = 0\n\n# loop through the song urls in the dictionary\nfor song_name in song_links:\n    \n    # check where we are in getting items from the dictionary\n    if starting_row &lt;= counter &lt;= ending_row:\n        # get the html page\n        lyric_page = requests.get(song_links[song_name])\n    \n        # if the page is successfully retrieved then save it\n        if lyric_page.status_code == 200:\n            with open(base_dir + song_name + '.html', 'w') as html_file:\n                html_file.write(lyric_page.text)\n            print(F\"Retrieved item: {counter} - {song_name}\")\n        else:\n            print(F\"ERROR getting: {counter} - {song_links[song_name]}\")\n\n        # wait so we don't overload the remote site\n        time.sleep(random.randint(2, 5))\n\n    counter += 1\n\n# end of file\n<\/pre><\/div>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<p class=\"has-large-font-size wp-block-paragraph\" id=\"extract-lyrics-from-html-file\"><strong>Step 4 Extract 
the Lyrics from Each File<\/strong><\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \"># ----------------------------------------------------------------------------\n# step 4\n# use the html lyric files saved locally and extract the lyrics\n# ----------------------------------------------------------------------------\n\nimport csv\nimport re\nfrom bs4 import BeautifulSoup, Comment\n\ntest_level = 1000\nartist = \"U2\"\n\nbase_dir = \"\/home\/george\/Documents\/webscrap\/lyrics\/\" + artist + \"_html\/\"\ndest_dir = \"\/home\/george\/Documents\/webscrap\/lyrics\/\" + artist + \"_Lyrics\/\"\nsearch_phrase = \"third-party lyrics provider\"\n\nsubstitution_words = {\"MxM banner\": \"\"      # remove phrase at end of text\n                    ,\"\u00e2\u0080\u0099\": \"'\"             # convert utf-8 into apostrophe\n                    ,\"\u00e2\u0080\u00a6\": \"\"              # remove \"...\"\n                    }\n\ndef dictionary_replace(text, replacement_dictionary):\n    for search_term, replacement_term in replacement_dictionary.items():\n        text = text.replace(search_term, replacement_term)\n    return text\n\nregex_pattern = re.compile(r\"\\[(.*?)\\]\", re.IGNORECASE)\ndef regex_replace(text):\n    text = regex_pattern.sub(\"\", text)\n    return text<\/pre><\/div>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<p class=\"wp-block-paragraph\" style=\"font-size:20px\"><strong>Step 4.01 Open the .csv File from Prior Step<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I will open the .csv file saved in step 3.  
After opening the file, I will read each row and store the song name and the URL in a dictionary.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \"># ----------------------------------------------------------------------------\n# step 4.01 \n# open the saved .csv file from the local disk\n# ----------------------------------------------------------------------------\ncsv_filename = base_dir + artist + \"_lyrics.csv\"\n\nsong_links = {}\n\nwith open(csv_filename, mode='r') as csv_file:\n    csv_reader = csv.DictReader(csv_file)\n    for row in csv_reader:\n        song_links[row[\"song_name\"]] = row[\"song_url\"]\n\nprint(F\"rows in file: {len(song_links)}\")\n\ncounter = 0\n<\/pre><\/div>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<p class=\"wp-block-paragraph\" style=\"font-size:20px\"><strong>Step 4.02 Extract the Lyrics from the .html File<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The heart of the process: I open each .html file that was saved in step 3.  With the file open, I use the Beautiful Soup package to parse through all of the .html elements. The lyrics are preceded by a key phrase in an .html comment that I can search for.  I use the variable search_phrase to find that comment.  The comment indicates that the elements which follow contain the song lyrics.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I append the text elements until I reach a div element, which marks the end of the lyrics. This pattern matching is unique to this particular website and will not be applicable to all websites. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The last bit of this section is to do a quick substitution. I wrote a custom function to help with the substitution. When extracting the content of the div elements, the last div was superfluous and can be removed.  
The apostrophe was stored as a raw utf-8 byte sequence that I substitute with a plain apostrophe, and when encountering the ellipsis I remove the text.  Using a dictionary for text substitution allows for rapid substitution of a large number of pre-defined phrases that need to be stripped from the .html elements.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Additionally, I created another custom function using the re module to remove bracketed annotations, such as [Chorus], from the text.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \"># ----------------------------------------------------------------------------\n# step 4.02 \n# loop through the html files in the dictionary\n# ----------------------------------------------------------------------------\n\nfor song_name in song_links:\n    counter += 1\n    if counter &gt;= test_level:\n        break\n    print(F\"Opening item: {counter} - {song_name}\")\n\n    with open(base_dir + song_name + '.html') as current_file:\n        read_html_file = current_file.read()\n        \n        # create the beautiful soup object\n        lyric_page = BeautifulSoup(read_html_file, 'html.parser')\n        # string to hold the lyric text\n        lyric_text = \"\"\n        for comment in lyric_page.find_all(string=lambda text: isinstance(text, Comment)):\n            if search_phrase in comment:\n                next_el = comment.next_element\n                \n                # walk forward until the closing div\n                while next_el.name != \"div\":\n                    if next_el.name is None:\n                        # append the text to the lyric string\n                        lyric_text = lyric_text + next_el.string\n\n                    next_el = next_el.next_element\n\n    # clean up some scrub text in the lyrics file\n    lyric_text = dictionary_replace(lyric_text, substitution_words)\n    lyric_text = regex_replace(lyric_text)\n    #print(lyric_text)<\/pre><\/div>\n\n\n\n<p 
class=\"wp-block-paragraph\"><\/p>\n\n\n\n<p class=\"wp-block-paragraph\" style=\"font-size:20px\"><strong>Step 4.03 Save the File<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Now that the lyrics have been extracted, I can save the plain text of the lyrics into a local file. <\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">    # ----------------------------------------------------------------------------\n    # step 4.03 \n    # write the file to the local file\n    # ----------------------------------------------------------------------------\n    print(F\"Saving file: {counter} - {song_name}.txt\")\n    with open(dest_dir + song_name + '.txt', 'w') as text_file:\n        text_file.write(lyric_text)<\/pre><\/div>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<p class=\"has-large-font-size wp-block-paragraph\" id=\"web-scarp-rankings-artist\"><strong>Step 5 Scrape Rankings for Each Song by the Artist<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Next, I need to get the ranking of each song.  I am using the Billboard Hot 100 to determine how well a song did commercially.  The rank is in ascending order.  The song that sold the most copies and, where applicable, had the most streams will be ranked first.  The ranking is unique and doesn&#8217;t align with how other sources may rank the commercial success of a song in prior periods. The rankings are issued weekly. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Using the requests package, I get the .html page from the source website.  The rankings were organized in a multi-page layout.  I used the variables starting_row and ending_row to determine how many pages of rankings were available for the artist. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I output to the screen which ranking page I am currently fetching.  
After the page is fetched from the remote server, I save the .html file to a local file for parsing in the next step.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I use the time package to put in a delay to prevent being blocked from fetching the web page content.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \"># ----------------------------------------------------------------------------\n# step 5 \n# go out to website, grab html page with rankings, save locally\n# ----------------------------------------------------------------------------\n\nimport csv\nimport time\nimport requests\n\n# test_level = 1000\nstarting_row = 1\nending_row = 7\nartist = \"Rihanna\"\nbase_dir = \"Documents\/ranks\/\" + artist + \"\/\"\n\nbase_html = \"https:\/\/www.billboard.com\/music\/\" + artist.lower() + \"\/chart-history\/HSI\"\n\ncounter = starting_row\n# loop over the ranking pages in the requested window\nwhile starting_row &lt;= counter &lt;= ending_row:\n    # build the URL of the ranking page\n    if counter &gt; 1:\n        get_page = base_html + \"\/\" + str(counter)\n    else:\n        get_page = base_html\n    print(F\"Fetching: {get_page}\")\n\n    # get the html page\n    ranking_page = requests.get(get_page)\n    # show the status of the page\n    print(F\"Page: {counter}  Status: {ranking_page.status_code}\")\n\n    with open(base_dir + artist + '_' + str(counter) + '.html', 'w') as html_file:\n        html_file.write(ranking_page.text)\n    \n    # wait so we don't overload the remote site\n    if counter &lt;= ending_row:\n        time.sleep(3)\n    \n    counter += 1\n<\/pre><\/div>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<p class=\"has-large-font-size wp-block-paragraph\" id=\"parse-song-rankings\"><strong>Step 6 Parse the Song Rankings Files<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Just as the lyrics were extracted from the .html files, I will follow the same process 
and extract the rankings for each song that appeared on the Billboard Hot 100 for the artist.  I use the glob package to build the list of files in the directory to process.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"width-set:true lang:python decode:true \"># ----------------------------------------------------------------------------\n# step 6\n# use the html rank files saved locally and extract the rankings per song\n# ----------------------------------------------------------------------------\n\nimport csv\nimport glob\nfrom bs4 import BeautifulSoup\n\nimport pandas as pd\n\ntest_level = 1000\nartist = \"U2\"\n\nbase_dir = \"\/home\/george\/Documents\/webscrap\/ranks\/\" + artist + \"\/\"\ndir_list = glob.glob(base_dir + \"*.html\")\ndest_dir = \"\/home\/george\/Documents\/webscrap\/ranks\/\" + artist + \".txt\"<\/pre><\/div>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<p class=\"wp-block-paragraph\" style=\"font-size:20px\"><strong>Step 6.01 Parse the Rankings Files for the Artist<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For each rankings file in the directory, I use the Beautiful Soup package to search through the html elements, looking for key class names to determine what type of content is in each div.  The class name for each type of content is unique.  I use the class attribute of the link or div element to determine the content. 
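The class-based matching can be seen on a tiny stand-in snippet. The markup below is hypothetical (it only reuses the class names from the real pages), and it assumes beautifulsoup4 is installed, as in the rest of this series:

```python
from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed

# hypothetical fragment of a rankings page, reusing the real class names
snippet = """
<div class="artist-section--chart-history__title-list">
  <a class="artist-section--chart-history__title-list__title__text--title">Umbrella</a>
  <div class="artist-section--chart-history__title-list__title__text--peak-rank">1
    <a>6.14.2008</a>
  </div>
</div>
"""

page = BeautifulSoup(snippet, "html.parser")
# find_all with a class filter keeps only the elements tagged as song titles
titles = [a.string for a in page.find_all(
    "a", {"class": "artist-section--chart-history__title-list__title__text--title"})]
```

Because each content type carries its own class, the same `find_all` call with a different class string pulls out peak ranks or artist names instead.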
<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \"># ----------------------------------------------------------------------------\n# step 6.01\n# loop through the html files in the directory\n# ----------------------------------------------------------------------------\n# dictionary of song ranks\nsong_ranks ={}\ncounter = 0\n# unique div names\ndiv_ranking = \"artist-section--chart-history__title-list\"\ndiv_song_title = \"artist-section--chart-history__title-list__title__text\"\ndiv_song_title_link = \"artist-section--chart-history__title-list__title__text--title\"\ndiv_song_artist = \"artist-section--chart-history__title-list__title__text--artist-name\"\ndiv_song_peak = \"artist-section--chart-history__title-list__title__text--peak-rank\"\n\n# list to hold values\nsongs = []\npeek_rank = []\npeek_date = []\n\nfor file_name in dir_list:\n    counter +=1\n    if counter &gt;= test_level:\n        break\n    print(F\"Opening item: {counter} - {file_name}\")\n\n    with open(file_name) as current_file:\n        read_html_file = current_file.read()\n        \n        # create the beautiful soup object\n        rank_page = BeautifulSoup(read_html_file, 'html.parser')\n\n        # string to hold the ranking html\n        ranking_text = rank_page.find(\"div\", {\"class\" : div_ranking})\n        ranking_html = BeautifulSoup(read_html_file, \"html.parser\")\n\n        for song_title in ranking_html.find_all(\"a\", {\"class\" : div_song_title_link}):\n            songs.append(song_title.string)\n\n        for song_rank in ranking_html.find_all(\"div\", {\"class\" : div_song_peak }):\n            peek_rank.append(song_rank.next_element.string)\n            peek_date.append(song_rank.find(\"a\").string)\n\n<\/pre><\/div>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<p class=\"wp-block-paragraph\" style=\"font-size:20px\"><strong>Step 6.02 Put the Song Details 
into a Data Frame<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Using the pandas package, I write the extracted information to a data frame. I use the data frame to aid with exporting the information to a tab delimited file. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I remove any extraneous newlines and the phrase &#8220;Peaked at&#8221; from the data frame. They are not needed for processing. <\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \"># ----------------------------------------------------------------------------\n# step 6.02\n# put list into a dataframe\n# ----------------------------------------------------------------------------\n# merge the lists\nsong_details = list(zip(songs, peek_rank, peek_date))\n\n# write to data frame\nsong_df = pd.DataFrame(song_details, columns=[\"song_title\", \"peek_rank\", \"peek_date\"])\n\n# remove new lines and scrub text from the data frame\nsong_df = song_df.replace(r'\\\\n', '', regex=True) \nsong_df = song_df.replace(r'\\n', '', regex=True) \nsong_df = song_df.replace(r'Peaked at ', '', regex=True) <\/pre><\/div>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<p class=\"wp-block-paragraph\" style=\"font-size:20px\"><strong>Step 6.03 Save to a Tab Delimited File<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">And the last step uses the pandas to_csv method to save the content locally to a tab delimited file.  
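The tab delimited output itself does not require pandas; the standard csv module can produce the same layout. A sketch with one hypothetical row, keeping the same column names as the data frame above:

```python
import csv
import io

# hypothetical (song_title, peek_rank, peek_date) row from step 6.02
rows = [("Umbrella", "1", "6.14.2008")]

# csv.writer with a tab delimiter produces the same layout as to_csv(sep="\t")
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t")
writer.writerow(["song_title", "peek_rank", "peek_date"])
writer.writerows(rows)
tsv_text = buffer.getvalue()
```

Swapping `io.StringIO` for a real file handle writes the same content to disk.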
<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \"># ----------------------------------------------------------------------------\n# step 6.03 \n# save the text to a local file\n# use TSV\n# ----------------------------------------------------------------------------\n\nsong_df.to_csv(dest_dir, sep=\"\\t\")<\/pre><\/div>\n","protected":false},"excerpt":{"rendered":"<p>To obtain the data I am going to use Beautiful Soup and a few other packages to scrap the content from the websites. Part 2 of 4 Steps for Getting the Data Get the Songs Made by the Artists Extract the List of Songs by the Artist Scrap the Lyrics for Each Song by the [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_crdt_document":"","footnotes":""},"categories":[3,58,4,59],"tags":[28,56,57],"series":[],"class_list":["post-330","post","type-post","status-publish","format-standard","hentry","category-python","category-datascience","category-code","category-songlyrics","tag-python","tag-data-science","tag-web-scrapping"],"_links":{"self":[{"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/posts\/330","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/comments?post=330"}],"version-history":[{"count":22,"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/posts\/330\/revisions"}],"predecessor-version":[{"id":414,"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/posts\/330\/revisions\/414"}],"wp:attachment":[{"href":"https:\/\/eipsoftware.com\/musings\/wp
-json\/wp\/v2\/media?parent=330"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/categories?post=330"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/tags?post=330"},{"taxonomy":"series","embeddable":true,"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/series?post=330"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}