{"id":258,"date":"2018-01-18T18:14:02","date_gmt":"2018-01-18T18:14:02","guid":{"rendered":"http:\/\/eipsoftware.com\/musings\/?p=258"},"modified":"2018-02-20T18:20:58","modified_gmt":"2018-02-20T18:20:58","slug":"hadoop-spark-hello-world-word-count","status":"publish","type":"post","link":"https:\/\/eipsoftware.com\/musings\/hadoop-spark-hello-world-word-count\/","title":{"rendered":"Hadoop Spark &#8211; Hello World, Word Count"},"content":{"rendered":"<h4>Hadoop Spark &#8211; Word Count<\/h4>\n<p>One of the first things to do in most programming languages is to create a &#8220;Hello World!&#8221; program.\u00a0 The equivalent in Spark is to create a program that will read the contents of a file and count the number of occurrences of each word.<\/p>\n<p>Below I will show a basic example, so let&#8217;s start counting.<\/p>\n<p><!--more--><\/p>\n<p>I loaded the file into the Hadoop file system (HDFS).\u00a0 After loading the file, we will use Python to process it.<\/p>\n<pre class=\"lang:python decode:true\">lines = sc.textFile(\"hdfs:\/user\/cloudera\/words.txt\")\r\nlines.count()\r\n<\/pre>\n<p>This reads the contents of the words.txt file and stores them in an RDD called lines.\u00a0 Next we ask how many lines are in the file; the result is<\/p>\n<pre class=\"lang:python decode:true \">124456<\/pre>\n<p>Next we will split each line into words on logical word boundaries and store the result in an RDD called words.\u00a0 In this case I am using a space as the word boundary.<\/p>\n<pre class=\"lang:python decode:true \">words = lines.flatMap(lambda line : line.split(\" \"))\r\nwords.take(5)<\/pre>\n<p>Then we ask for the first five words.<\/p>\n<pre class=\"lang:python decode:true \">['This', 'is', 'the', '100th', 'Etext']<\/pre>\n<p>Next we will convert each word into a tuple, pairing the word with a value of 1.\u00a0 When we count the words later, Spark will sum the values in the second element of the tuples and show the results.<\/p>\n<pre class=\"lang:python decode:true \">tuples = words.map(lambda word : (word, 1))\r\ntuples.take(5)<\/pre>\n<p>And the results for the first five tuples are<\/p>\n<pre class=\"lang:python decode:true \">[('This', 1), ('is', 1), ('the', 1), ('100th', 1), ('Etext', 1)]<\/pre>\n<p>It seems to make perfect sense.\u00a0 We now have a set of tuples pairing each word with a value of 1. Now we can sum up the values by treating the tuples as key-value pairs.\u00a0 Every time two matching keys are found, Spark will merge them and add the values in the second element.<\/p>\n<pre class=\"lang:python decode:true \">counts = tuples.reduceByKey(lambda a, b: (a+b))\r\ncounts.take(5)<\/pre>\n<p>The results were stored in an RDD called counts.\u00a0 And the first five values are shown below.<\/p>\n<pre class=\"lang:python decode:true \">[('', 517065), ('Quince', 1), ('Corin,', 2), ('Just', 10), ('enrooted', 1)]<\/pre>\n<p>We can see there were 517065 empty strings (produced when consecutive spaces are split), which we can ignore; and there were 10 occurrences of the word &#8220;Just&#8221;, etc.<\/p>\n<p>Now we can save our results to HDFS.\u00a0 The coalesce(1) call merges the RDD into a single partition so the output is written as one file.<\/p>\n<pre class=\"lang:python decode:true \">counts.coalesce(1).saveAsTextFile(\"hdfs:\/user\/cloudera\/wordcount\/outputDir\")<\/pre>\n<p>And now we have our first Spark &#8220;Hello World!&#8221; program.<\/p>\n<p>I hope you found the above informative; let me know if you have any questions in the comments below.<\/p>\n<p><a href=\"mailto:michael.data@eipsoftware.com\">\u2014 michael.data@eipsoftware.com<\/a><\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Hadoop Spark &#8211; Word Count One of the first things to do in most programming languages is to create a &#8220;Hello World!&#8221; program.\u00a0 The equivalent in Spark is to create a program that will read the contents of a 
file and count the number of occurrences of each word. Below I will show a basic example, [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_crdt_document":"","footnotes":""},"categories":[3,4,51],"tags":[28,30,52,53],"series":[],"class_list":["post-258","post","type-post","status-publish","format-standard","hentry","category-python","category-code","category-spark","tag-python","tag-code","tag-spark","tag-hadoop"],"_links":{"self":[{"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/posts\/258","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/comments?post=258"}],"version-history":[{"count":2,"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/posts\/258\/revisions"}],"predecessor-version":[{"id":260,"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/posts\/258\/revisions\/260"}],"wp:attachment":[{"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/media?parent=258"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/categories?post=258"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/tags?post=258"},{"taxonomy":"series","embeddable":true,"href":"https:\/\/eipsoftware.com\/musings\/wp-json\/wp\/v2\/series?post=258"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}