Hadoop Spark – Word Count
One of the first things to do in most programming languages is to create a “Hello World!” program. The equivalent in Spark is a program that reads the contents of a file and counts the number of occurrences of each word.
Below I will show a basic example, so let’s start counting.
I loaded the file into the Hadoop file system (HDFS), for example with hdfs dfs -put words.txt /user/cloudera/. With the file in place, we will use Python to process it.
lines = sc.textFile("hdfs:/user/cloudera/words.txt")
lines.count()
The first line reads the contents of the words.txt file and stores it in a variable called lines. Next we ask how many lines are in the file; the result is
124456
Next we split each line on a logical word boundary and store the result in a variable called words. In this case I am using a space as the word boundary.
words = lines.flatMap(lambda line: line.split(" "))
words.take(5)
And then we ask for the first five words in the list.
['This', 'is', 'the', '100th', 'Etext']
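As an aside, flatMap (rather than map) is what gives us one flat list of words instead of one list per line. A quick sketch with made-up input shows the difference:

# map keeps one output element per input line (a list of words per line),
# while flatMap flattens those lists into a single collection of words.
sc.parallelize(["a b", "c d e"]).map(lambda l: l.split(" ")).collect()
# [['a', 'b'], ['c', 'd', 'e']]
sc.parallelize(["a b", "c d e"]).flatMap(lambda l: l.split(" ")).collect()
# ['a', 'b', 'c', 'd', 'e']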
Next we will convert each word into a tuple, pairing the word with a value of 1. When we count the words, Spark will sum the values in the second element of each tuple and show the results.
tuples = words.map(lambda word: (word, 1))
tuples.take(5)
And the results for the first five tuples are
[('This', 1), ('is', 1), ('the', 1), ('100th', 1), ('Etext', 1)]
This makes perfect sense: we now have a collection of tuples pairing each word with a value of 1. Now we can sum up the values by key. Every time two tuples with matching keys are found, Spark will merge them and add the values in the second element.
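To see that merging in isolation, here is a minimal sketch on a toy dataset (made-up values, not taken from words.txt):

# reduceByKey combines tuples that share a key by applying the
# given function to their values, pairwise.
pairs = sc.parallelize([('the', 1), ('cat', 1), ('the', 1)])
pairs.reduceByKey(lambda a, b: a + b).collect()
# [('the', 2), ('cat', 1)]  (output order may vary)

Applying the same idea to our word tuples: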
counts = tuples.reduceByKey(lambda a, b: a + b)
counts.take(5)
The results are stored in a variable called counts, and the first five values are shown below.
[('', 517065), ('Quince', 1), ('Corin,', 2), ('Just', 10), ('enrooted', 1)]
We can see there were 517,065 empty strings (split(" ") produces an empty token wherever it finds consecutive spaces or a blank line), which we can ignore; and there were 10 occurrences of the word “Just”, and so on.
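If we wanted to drop those empty tokens rather than just ignore them in the output, one option (a sketch on my part, not something from the original run) is to filter the words before building the tuples:

# Drop the empty strings before counting.
non_empty = words.filter(lambda word: word != "")
counts = non_empty.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)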
Now we can save our results back to HDFS. The coalesce(1) call merges the RDD into a single partition so the output is written as one file.
counts.coalesce(1).saveAsTextFile("hdfs:/user/cloudera/wordcount/outputDir")
And now we have our first Spark “Hello World!” program.
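Putting the steps together, the whole program fits in a few lines:

lines = sc.textFile("hdfs:/user/cloudera/words.txt")
words = lines.flatMap(lambda line: line.split(" "))
tuples = words.map(lambda word: (word, 1))
counts = tuples.reduceByKey(lambda a, b: a + b)
counts.coalesce(1).saveAsTextFile("hdfs:/user/cloudera/wordcount/outputDir")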
I hope you found the above informative; let me know if you have any questions in the comments below.
— michael.data@eipsoftware.com
			