Where do you live? It may be a simple question, but when a company asks where someone lives, the answer isn’t always so easy. A person’s physical address and mailing address can be two completely separate things. And when companies want to send direct mail, they need to ensure the message is appropriate.

Today I am going to show a Python program I wrote that can process millions of addresses and verify whether each one is a valid mailing address that matches the physical address. I will also explain how the majority of online mapping services get it wrong.

One of the main challenges with physical addresses is that you can’t rely on online mapping services for 100% verification. Many of the top mapping services, e.g. Google Maps, will approximate an address if it isn’t in their database. An example of this is 1500 West Pine Street, Anywhere, AK. If Google Maps knows the street and the city, it will mark the location at the closest street number to the one requested.

The result may be close enough for driving directions or to give you an estimate of where the location is, but for the postal system, this doesn’t work. If the building does exist, that is OK; however, if it doesn’t exist, your mail will be returned and a marketing opportunity will have been lost.

The United States Postal Service does have an address verification tool. However, it doesn’t allow large datasets to be verified. Many of the other online API tools, paid or unpaid, are the same. I resolved the issue by splitting the source file into manageable blocks for requests. When the responses are received, I combine them and re-arrange them to match the order in which the requests were made.
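As a rough sketch of the split-and-reassemble idea (not the actual program code; the function names and the sample rows are made up for illustration), each block can carry its own index so the responses can be put back in request order no matter when they arrive:

```python
from typing import Iterator


def split_into_blocks(rows: list[dict], block_size: int) -> Iterator[tuple[int, list[dict]]]:
    """Yield (block_index, rows) pairs so each response can be traced to its request."""
    for index, start in enumerate(range(0, len(rows), block_size)):
        yield index, rows[start:start + block_size]


def reassemble(responses: dict[int, list[dict]]) -> list[dict]:
    """Concatenate per-block responses in the order the blocks were sent."""
    merged: list[dict] = []
    for index in sorted(responses):
        merged.extend(responses[index])
    return merged


# Hypothetical input: five addresses, verified two at a time.
addresses = [{"id": i, "street": f"{100 + i} Pine St"} for i in range(5)]
blocks = list(split_into_blocks(addresses, block_size=2))

# Simulate responses arriving out of order (last block first).
responses = {
    index: [dict(row, valid=True) for row in block]
    for index, block in reversed(blocks)
}

merged = reassemble(responses)
print([row["id"] for row in merged])
```

Keying the responses by block index keeps the ordering logic independent of how the requests are actually sent.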

The program uses the concurrent.futures package for multi-threaded operation, so that numerous requests can be sent and received at the same time.

Program Overview

The program uses the tkinter UI package. I won’t go in-depth on how the tkinter package works; that would be a separate posting all on its own. This post will be long enough as it is.

Program Files

File Name                   Description
LocusApiSearch.py           The main file; imports the packages and contains the main program loop
cls/apiHandler.py           Class to handle connecting to the API
cls/csvFileHandler.py       Creates interim .csv files and
cls/outputFile.py           Defines the definition, i.e. columns, of the output file
cls/pyUtilities.py          Helper function to merge interim .csv files into the final file
cls/runtimeParameters.py    Class to store user parameters from the UI
cls/ui.py                   Main class for handling the UI components
cls/uiConfig.py             Reads configuration options from the UI and saves them to a local file
cls/userAddresses.py        Definition of the input data file and methods to clean data

Packages

  • datetime
  • time
  • os
  • pathlib
  • numpy
  • pandas
  • json
  • requests
  • tkinter
  • concurrent.futures

Program Flow

Main

LocusApiSearch.py

The first code block starts the UI. I assigned the UI to root and assigned the application to app. Next I tell the UI to start the main loop.
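A minimal sketch of that startup pattern might look like the following. The App class here is a hypothetical placeholder for the real class in cls/ui.py, and the label text is invented; only the root / app / main loop structure comes from the description above.

```python
import tkinter as tk


class App(tk.Frame):
    """Hypothetical stand-in for the application class in cls/ui.py."""

    def __init__(self, master: tk.Tk) -> None:
        super().__init__(master)
        self.pack()
        tk.Label(self, text="Locus API Search").pack()


def main() -> None:
    root = tk.Tk()     # the top-level UI window
    app = App(root)    # the application attached to that window
    root.mainloop()    # hand control to the tkinter event loop
```

Calling main() opens the window and blocks until it is closed.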

UI

I am going to skip over the UI startup and drawing. There are multiple tutorials on how to use the tkinter UI package for Python. In summary, cls/ui.py and cls/uiConfig.py handle drawing the screen windows and controls in the UI.

Application Start

LocusApiSearch.py

start_job()

start_job(user_args: dict)

Pass in the dictionary that was generated from reading contents of the UI.

Next, use the runtimeParameters class to organize the parameters. Because we are using multi-threaded processing, we need a local copy of the parameters for the current batch of data being processed.
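One way to get a per-batch copy is an immutable dataclass that is cloned for each block. This is only a sketch under assumptions: the field names (input_file, block_size, block_number) and the from_user_args helper are hypothetical and not taken from the real runtimeParameters class.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RuntimeParameters:
    """Hypothetical subset of the settings read from the UI."""
    input_file: str
    block_size: int
    block_number: int = 0   # which block this copy belongs to

    @classmethod
    def from_user_args(cls, user_args: dict) -> "RuntimeParameters":
        """Build the parameters from the dictionary read from the UI."""
        return cls(
            input_file=user_args["input_file"],
            block_size=int(user_args["block_size"]),
        )


user_args = {"input_file": "addresses.csv", "block_size": "500"}
base = RuntimeParameters.from_user_args(user_args)

# Each worker gets its own immutable copy, so threads never share mutable state.
per_block = [replace(base, block_number=n) for n in range(3)]
print(per_block[2])
```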

The next step is to record in the log file that the job is starting, and then release the lock on the file.
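A simple way to keep several threads from writing to the log at once is a shared threading.Lock. The sketch below assumes that approach; the write_log helper, the log file name, and the message text are illustrative, not the program's actual code.

```python
import tempfile
import threading
from datetime import datetime
from pathlib import Path

log_lock = threading.Lock()


def write_log(log_path: Path, message: str) -> None:
    """Append one timestamped line; the lock is released when the block exits."""
    with log_lock:
        with log_path.open("a", encoding="utf-8") as handle:
            stamp = datetime.now().isoformat(timespec="seconds")
            handle.write(f"{stamp} {message}\n")


# Write to a throwaway directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "job.log"
    write_log(log_path, "Starting job")
    write_log(log_path, "Read input file")
    lines = log_path.read_text(encoding="utf-8").splitlines()

print(len(lines))
```

Using the lock as a context manager guarantees it is released even if the write raises an error.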

Use the csvFileHandler class to read the data file. Write a message to the UI to let the user know where we are in the processing. Update the args.set_file_length property with the file length.

Initiate the Thread Pool

Here we will start processing the blocks of data. Each block of data will be assigned its own thread. I capped the maximum number of threads at 8. The number of threads could be set in the UI, but I opted against it.

cf is the concurrent.futures package, and its ThreadPoolExecutor class is what manages the pool of threads.

call_locus_api is the definition of the function I want to execute, with all of the applicable parameters.

Use a for loop to process each data block. I do a quick check to see whether the cancel button was pushed, in which case the data block is not processed. Threads that are currently processing will finish.

fx_result = call_locus_api[worker]

Now send the request out to the remote API, using the function that was defined for call_locus_api. The process_csv_file function is explained below.

Use a try / except block to handle any errors raised while calling the API. If the API call succeeds, notify the user in the result messages window. If an error was raised, notify the user in the progress window.
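The thread pool steps above can be sketched roughly as follows. This is an illustration rather than the real program: process_block stands in for the actual API call, cancel_requested stands in for the cancel button, and the messages list stands in for the UI message windows.

```python
import concurrent.futures as cf
import threading

MAX_THREADS = 8
cancel_requested = threading.Event()   # hypothetical stand-in for the cancel button


def process_block(block_number: int, rows: list[str]) -> str:
    """Placeholder for the real API call; fails on an empty block."""
    if not rows:
        raise ValueError(f"block {block_number} is empty")
    return f"block {block_number}: {len(rows)} rows verified"


blocks = {0: ["a", "b"], 1: [], 2: ["c"]}
messages: list[str] = []

with cf.ThreadPoolExecutor(max_workers=MAX_THREADS) as executor:
    # Map each future back to the block it was created for.
    future_to_block = {
        executor.submit(process_block, number, rows): number
        for number, rows in blocks.items()
    }
    for future in cf.as_completed(future_to_block):
        if cancel_requested.is_set():
            # Drop blocks that have not started; running ones finish on their own.
            for pending in future_to_block:
                pending.cancel()
            break
        number = future_to_block[future]
        try:
            messages.append(future.result())
        except Exception as exc:
            messages.append(f"block {number} failed: {exc}")

print(sorted(messages))
```

future.result() re-raises any exception from the worker thread, which is why the try / except sits around that call and not around submit().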

LocusApiSearch.py

process_csv_file()

Pass the parameters needed to call the API. I hard-coded the maximum number of retry attempts to 4. My remote APIs are highly reliable; if a block fails four times, most likely there is an issue with the data rather than with the API.

Do a quick check that the user didn’t click the cancel button; if they didn’t, continue on. Print messages to the progress messages window.

If it is the first attempt with the data block, prep the data using the userAddresses class.

Take a subset of the data from the data frame.

Use the userAddresses class to select the appropriate columns from the input data set, cleanse the addresses, and create a JSON file to send to the API.
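To give a feel for this step, here is a small sketch of selecting columns, cleaning values, and serialising a block to JSON. The column names, the cleaning rules, and the payload shape are all assumptions for illustration; the real userAddresses class and the API's expected format may differ.

```python
import json

# Hypothetical column names; the real input file may differ.
ADDRESS_COLUMNS = ("street", "city", "state")


def clean_value(value: object) -> str:
    """Collapse repeated whitespace, trim, and upper-case a single field."""
    return " ".join(str(value).split()).upper()


def build_payload(rows: list[dict]) -> str:
    """Keep only the address columns and serialise the block as JSON."""
    cleaned = [
        {column: clean_value(row.get(column, "")) for column in ADDRESS_COLUMNS}
        for row in rows
    ]
    return json.dumps({"addresses": cleaned})


rows = [
    {"name": "Pat", "street": " 1500  west pine street ", "city": "Anywhere", "state": "ak"},
]
payload = build_payload(rows)
print(payload)
```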

Use the apiHandler class for each different API that the program will reference.

Send the request to the API.

Get the response from the API and parse the JSON file.

If the response is successful, break out of the retry loop and update the result message window. If the attempt failed, sleep for 1.5 seconds so as not to overload the remote API.

Even if successful, rate limit requests to the remote API by sleeping for 0.5 seconds.
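The retry and rate-limiting logic described in this section can be sketched like this. The function name send_with_retry and the injectable send and sleep callables are my own assumptions for the example; the limit of four attempts and the 1.5 and 0.5 second pauses come from the description above.

```python
import time
from typing import Callable, Optional

MAX_ATTEMPTS = 4
FAILURE_PAUSE = 1.5   # seconds to wait after a failed attempt
SUCCESS_PAUSE = 0.5   # seconds to wait after a successful attempt


def send_with_retry(
    send: Callable[[], dict],
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Call send() up to MAX_ATTEMPTS times; return the response or None."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            response = send()
        except Exception as exc:
            print(f"attempt {attempt} failed: {exc}")
            sleep(FAILURE_PAUSE)
            continue
        sleep(SUCCESS_PAUSE)   # rate limit even when the call succeeded
        return response
    return None


# Hypothetical sender that fails twice before succeeding.
calls = {"count": 0}


def flaky_send() -> dict:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return {"status": "ok"}


# Record the pauses instead of really sleeping, to keep the example fast.
pauses: list[float] = []
result = send_with_retry(flaky_send, sleep=pauses.append)
print(result, pauses)
```

Passing the sleep function in as a parameter makes the pauses easy to observe without slowing the example down.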