Hadoop Spark – Word Count
One of the first things to do in most programming languages is to create a “Hello World!” program. The equivalent in Spark is a program that reads the contents of a file and counts the number of occurrences of each word.
Below I will show a basic example, so let’s start counting.
I loaded the file into the Hadoop file system (e.g., with hdfs dfs -put words.txt /user/cloudera/), and we will use Python to process it.
lines = sc.textFile("hdfs:/user/cloudera/words.txt")
lines.count()
This reads the contents of the words.txt file into an RDD called lines. Next we ask how many lines are in the file; the result is
124456
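A side note: textFile is lazy, so Spark does not actually read the file until an action such as count() or take() runs. A quick check in the same shell session (just a sketch) shows that lines is only an RDD handle at this point:

print(type(lines))   # an RDD -- no data has been read from HDFS yet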
Next we will split each line into words and store the result in an RDD called words. In this case I am using a space as the word boundary.
words = lines.flatMap(lambda line : line.split(" "))
words.take(5)
And then we ask for the first five words.
['This', 'is', 'the', '100th', 'Etext']
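If you are wondering why flatMap rather than map: map would produce one list of words per line, while flatMap flattens those lists into a single RDD of individual words. A quick sketch of the difference, assuming the same shell session as above:

nested = lines.map(lambda line: line.split(" "))
nested.take(1)   # [['This', 'is', 'the', ...]] -- one list per line
words.take(5)    # ['This', 'is', 'the', '100th', 'Etext'] -- individual words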
Next we will convert each word into a tuple of (word, 1), assigning each word an initial count of 1. When we count the words later, Spark will sum the values in the second element of the tuples and show the results.
tuples = words.map(lambda word : (word, 1))
tuples.take(5)
And the first five tuples are
[('This', 1), ('is', 1), ('the', 1), ('100th', 1), ('Etext', 1)]
That makes sense: we now have tuples pairing each word with a value of 1. Now we can sum up the values by key using reduceByKey. Every time two tuples with matching keys are found, Spark merges them into one and adds the values in their second elements.
counts = tuples.reduceByKey(lambda a, b: (a+b))
counts.take(5)
The results are stored in an RDD called counts, and the first five values are shown below.
[('', 517065), ('Quince', 1), ('Corin,', 2), ('Just', 10), ('enrooted', 1)]
We can see there were 517065 empty strings, produced wherever consecutive spaces or blank lines were split; we can ignore those. There were 10 occurrences of the word “Just”, and so on.
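As an aside, if you wanted to clean up those empty tokens, and the attached punctuation that makes 'Corin,' and 'Corin' count as separate words, something along these lines should work; this is a sketch, not part of the original pipeline:

import string

# Strip leading/trailing punctuation, then drop the empty tokens.
cleaned = words.map(lambda w: w.strip(string.punctuation)).filter(lambda w: w != "")
pairs = cleaned.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
# Sort by count, descending, to see the most frequent words.
pairs.sortBy(lambda wc: wc[1], ascending=False).take(5)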
Now we can save our results to HDFS.
counts.coalesce(1).saveAsTextFile("hdfs:/user/cloudera/wordcount/outputDir")
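Note that saveAsTextFile writes a directory of part files, one per partition; coalesce(1) shrinks the RDD to a single partition so we get a single output file. You can sanity-check the output by reading the directory back in the same shell:

output = sc.textFile("hdfs:/user/cloudera/wordcount/outputDir")
output.take(5)   # each line is the string form of a (word, count) tuple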
And now we have our first Spark “Hello World!” program.
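For reference, here is the whole thing as a standalone script you could run with spark-submit instead of typing into the shell. This is a minimal sketch assuming the same HDFS paths as above (the file name wordcount.py is just an example):

from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

# Same pipeline as the shell session above.
lines = sc.textFile("hdfs:/user/cloudera/words.txt")
words = lines.flatMap(lambda line: line.split(" "))
tuples = words.map(lambda word: (word, 1))
counts = tuples.reduceByKey(lambda a, b: a + b)
counts.coalesce(1).saveAsTextFile("hdfs:/user/cloudera/wordcount/outputDir")

sc.stop()

You would then submit it with something like: spark-submit wordcount.py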
I hope you found the above informative; let me know if you have any questions in the comments below.
— michael.data@eipsoftware.com