Spark – Streaming Data, Capturing and Querying

Today we will look at how to capture streaming data and perform some simple queries as the data is streamed. We will use the regular expressions library and the PySpak library.  The streaming data comes from a weather station that transmits different weather at different intervals.  We will need to find the correct data out of the stream and output the results.

Let’s get started.

Environment Initialization

The streaming data comes from a weather station located in Southern California, USA.  The University of San Diego has a web portal to be able to read the data as it is passed from the weather stations to their computer center. We will connect to the web portal and read the raw values as they are repeated.

Some measurements are transmitted more frequently than others.  Wind speed and wind direction are transmitted more often, while temperature and rain are transmitted at longer intervals.

Example Weather Data

Description of some of the codes used in the weather data.

Today we will be looking at the Wind Direction average weather data as it is streamed from the weather station.

Parse the Weather Data

We will use a regular expression to extract the wind direction weather data from the other transmitted data.

As the weather data is read in, we will search for lines that Dm= and extract the direction of the wind.  The integer value refers to the compass direction for the wind direction.  For example 0 would be North, 90 for East, etc.

Next we will use the streaming object from the PySpark library to connect to the web portal and get the weather data.

Note: I removed the real web address because I don’t have permission to grant to other people access to the weather data.

As each line of weather data is received it will be stored in the variable lines.

We defined a custom function to read through the lines of data and print out the max and min values for the specified window of data.

Next we will open the connection to the streaming data and view the results.

As the streaming data comes in we can see how the function is looking at the last 10 values and determining the max and min value.

And lastly we need to make sure we close our streaming connection.

In this post we created custom functions to query streaming data, made a connection to streaming data source, and had the real time results show us the current max and min values for the specified window.

I hope you found the above informative, let me know if you have any questions in the comments below.