Hadoop Spark – Word Count

One of the first things to do in most programming languages is to write a “Hello World!” program.  The equivalent in Spark is to create a program that reads the contents of a file and counts the number of occurrences of each word.

Below I will show a basic example, so let’s start counting.

I loaded the file into the Hadoop file system (HDFS).  With the file in place, we will use Python to process it.

First, read the contents of the words.txt file and store the result in a variable called lines.  Then we will ask how many lines are in the file.
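Here is a minimal sketch of this step, assuming the interactive pyspark shell, where the SparkContext is already available as sc; the HDFS path is illustrative.

```python
# Assumes the pyspark shell, where sc (the SparkContext) is predefined.
# The file was uploaded to HDFS beforehand with something like:
#   hadoop fs -put words.txt /user/hadoop/words.txt
# (the path here is illustrative).
lines = sc.textFile("hdfs:///user/hadoop/words.txt")

# Ask how many lines the file contains.
lines.count()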

Next we will split each line on logical word boundaries and store the result in a variable called words.  In this case I am using a space as the word boundary.
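A sketch of the split; flatMap is used rather than map so that the per-line lists are flattened into a single collection of words:

```python
# Split each line on spaces; flatMap flattens the per-line lists
# into one RDD of individual words.
words = lines.flatMap(lambda line: line.split(" "))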

And then ask for the first five words in the list.
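In the shell that is a one-liner; the exact output depends on the input file:

```python
# Return the first five words in the RDD.
words.take(5)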

Next we will convert each word into a tuple, assigning each word a value of 1.  When we count the words later, Spark will sum the values in the second part of each tuple and show the results.
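A sketch of that mapping; the intermediate variable name pairs is my own choice, not from the original post:

```python
# Map each word to a (word, 1) tuple so the 1s can be summed per word.
pairs = words.map(lambda word: (word, 1))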

And the first five tuples in the result look like this:
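```python
# Peek at the first five (word, 1) tuples.
pairs.take(5)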

It seems to make perfect sense.  We now have a collection of tuples, each pairing a word with a value of 1.  Now we can sum the values by treating the tuples as key-value pairs: every time two matching keys are found, Spark will merge them and add the values in the second element.
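That merge is exactly what reduceByKey does; a sketch:

```python
# reduceByKey groups tuples by key (the word) and applies the function
# to pairs of values, summing the 1s into a per-word total.
counts = pairs.reduceByKey(lambda a, b: a + b)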

The results are stored in a variable called counts, and the first five values are shown below.
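```python
# Show the first five (word, count) results.
counts.take(5)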

We can see there were 517065 non-printable characters, which we can ignore, and there were 10 occurrences of the word “Just”, etc.

Now we can save our results back to HDFS.
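Something like the following, where the output directory is illustrative and must not already exist:

```python
# Write the (word, count) pairs out as text files under the given
# HDFS directory (one part-xxxxx file per partition).
counts.saveAsTextFile("hdfs:///user/hadoop/word_counts")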

And now we have our first Spark “Hello World!” program.

I hope you found the above informative; let me know in the comments below if you have any questions.

— michael.data@eipsoftware.com