Hadoop Spark – Word Count
One of the first things to do in most programming languages is to create a “Hello World!” program. The equivalent in Spark is a program that reads the contents of a file and counts the number of occurrences of each word.
Below I will show a basic example, so let’s start counting.
I loaded the file into the Hadoop file system (HDFS), for example with hdfs dfs -put words.txt /user/cloudera/. With the file in place, we will use Python to process it.
lines = sc.textFile("hdfs:/user/cloudera/words.txt")
lines.count()
The first line reads the contents of the words.txt file and stores it in a variable called lines. Next we ask how many lines are in the file; the result is
124456
Next we split each line on a logical word boundary and store the result in a variable called words. In this case I am using a space as the word boundary.
words = lines.flatMap(lambda line: line.split(" "))
words.take(5)
And then we ask for the first five words in the list.
['This', 'is', 'the', '100th', 'Etext']
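As an aside, flatMap (rather than map) is what gives us one flat list of words instead of one list per line. A quick sketch with made-up input shows the difference:

# map keeps one output element per input line (a list of words per line),
# while flatMap flattens those lists into a single collection of words.
sc.parallelize(["a b", "c d e"]).map(lambda l: l.split(" ")).collect()
# [['a', 'b'], ['c', 'd', 'e']]
sc.parallelize(["a b", "c d e"]).flatMap(lambda l: l.split(" ")).collect()
# ['a', 'b', 'c', 'd', 'e']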
Next we will convert each word into a tuple, pairing the word with a value of 1. When we count the words, Spark will sum the values in the second element of each tuple and show the results.
tuples = words.map(lambda word: (word, 1))
tuples.take(5)
And the results for the first five tuples are
[('This', 1), ('is', 1), ('the', 1), ('100th', 1), ('Etext', 1)]
This makes perfect sense: we now have a collection of tuples pairing each word with a value of 1. Now we can sum up the values by key. Every time two tuples with matching keys are found, Spark will merge them and add the values in the second element.
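To see that merging in isolation, here is a minimal sketch on a toy dataset (made-up values, not taken from words.txt):

# reduceByKey combines tuples that share a key by applying the
# given function to their values, pairwise.
pairs = sc.parallelize([('the', 1), ('cat', 1), ('the', 1)])
pairs.reduceByKey(lambda a, b: a + b).collect()
# [('the', 2), ('cat', 1)]  (output order may vary)

Applying the same idea to our word tuples: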
counts = tuples.reduceByKey(lambda a, b: a + b)
counts.take(5)
The results are stored in a variable called counts, and the first five values are shown below.
[('', 517065), ('Quince', 1), ('Corin,', 2), ('Just', 10), ('enrooted', 1)]
We can see there were 517,065 empty strings (split(" ") produces an empty token wherever it finds consecutive spaces or a blank line), which we can ignore; and there were 10 occurrences of the word “Just”, and so on.
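If we wanted to drop those empty tokens rather than just ignore them in the output, one option (a sketch on my part, not something from the original run) is to filter the words before building the tuples:

# Drop the empty strings before counting.
non_empty = words.filter(lambda word: word != "")
counts = non_empty.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)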
Now we can save our results back to HDFS. The coalesce(1) call merges the RDD into a single partition so the output is written as one file.
counts.coalesce(1).saveAsTextFile("hdfs:/user/cloudera/wordcount/outputDir")
And now we have our first Spark “Hello World!” program.
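Putting the steps together, the whole program fits in a few lines:

lines = sc.textFile("hdfs:/user/cloudera/words.txt")
words = lines.flatMap(lambda line: line.split(" "))
tuples = words.map(lambda word: (word, 1))
counts = tuples.reduceByKey(lambda a, b: a + b)
counts.coalesce(1).saveAsTextFile("hdfs:/user/cloudera/wordcount/outputDir")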
I hope you found the above informative; let me know if you have any questions in the comments below.
— michael.data@eipsoftware.com
			