Reverse Geocoding Using PostGIS and Tiger Dataset
Turn latitude and longitude coordinates into human-readable addresses
As a data professional, you might at some point in your career come across geospatial data. What happens if this geospatial data contains longitude and latitude coordinates but you need addresses instead? You would have to convert the coordinates into human-readable addresses.
So what tool can you use that won’t also take too much time? You already guessed the answer…
PostGIS
In this article, I’ll take you through the steps for using PostGIS to reverse geocode your dataset, especially if it’s a large one. Mine contained about 4M+ rows, and it took approximately 47 minutes to convert all of its coordinates into columns of human-readable addresses.
Why Use PostGIS?
PostGIS has many use cases, but we’ll focus on reverse geocoding.
PostGIS makes use of spatial indexing. Spatial indexing is a technique for optimizing searches over spatial data by organizing the data so that far fewer comparisons are needed.
For example, where a naive scan might need 1,000,000 comparisons to reverse geocode a dataset, a spatial index could cut that down to around 20,000. Yes, that’s how efficient it can be.
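To make this concrete, here’s a minimal sketch of what that looks like in practice (the table and column names are hypothetical, just for illustration):
-- Build a spatial (GiST) index on a geometry column
CREATE INDEX trips_geom_idx ON trips USING GIST (geom);
-- Index-aware functions like ST_DWithin can now prune candidates using
-- bounding boxes instead of comparing every pair of rows
EXPLAIN
SELECT *
FROM trips
WHERE ST_DWithin(geom, ST_SetSRID(ST_MakePoint(-74.0, 40.7), 4326), 0.01);
We’ll build an index exactly like this later in the walkthrough.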
Also, PostGIS is built to work with very large datasets because it leverages the performance optimizations of the PostgreSQL database engine.
Lastly, PostGIS can take advantage of the parallel processing capabilities of modern multi-core CPUs. This is very important in speeding up the reverse geocoding process.
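For instance, PostgreSQL lets you control how many parallel workers a single query may use. A quick sketch (the value is illustrative, not a recommendation; tune it to your hardware):
-- Check how many parallel workers a query may currently use
SHOW max_parallel_workers_per_gather;
-- Allow up to 4 workers per query for this session
SET max_parallel_workers_per_gather = 4;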
Prerequisites
- Have a basic understanding of SQL
- Have PostgreSQL and PostGIS already installed on your computer
- Bonus: Follow this guide carefully
Now that you’re ready to kick off, let’s go!
Download your Tiger Dataset
Reverse geocoding with PostGIS requires joining your dataset against a reference dataset, so you’ll need a geographical reference dataset that covers your desired regions. We’ll be using the Tiger Dataset, a public-domain dataset provided by the United States Census Bureau.
When you visit the link, you’ll see hyperlinks named “FTP Archive by State” and “FTP Archive by Layer”.
Clicking “FTP Archive by State” would take you to a list of states in the United States of America while clicking “FTP Archive by Layer” would lead you to a list of geographical entities like roads, counties, states, etc.
Depending on your use case, you can choose whichever suits you best. I went with “FTP Archive by Layer” and clicked “UAC20/”, “STATE/”, and “COUNTY/” in the list. These links directed me to download the reference datasets for towns, states, and counties, respectively.
Each download is a folder containing shapefiles. These shapefiles are what we’re interested in, but they’re loaded into PostgreSQL differently from regular data files.
No worries! I’ll show you how to load these shapefiles into PostgreSQL.
Loading the Shapefiles into Pg Admin
I’m using Pg Admin as my management tool for PostgreSQL so this particular step might be different from yours if you’re using a different tool.
When downloading PostGIS, a shapefile importer/exporter was automatically installed too. Here’s a screenshot of how the interface looks when opened.
Before clicking “Add File” to import your downloaded shapefile, click “View Connection Details”. You’ll get this pop-up:
Add your database information and click “OK”.
Then click “Add File” and select the folder where you saved your download. Selecting a folder automatically picks up the shapefile inside it, leading to a screen similar to the image below.
Note that you can’t select multiple folders with this tool. One at a time.
Then click “Open”.
Follow the same process in adding the other two folders for the towns and states.
Then you’ll have something similar to the image below.
Don’t click “Import” yet. Click “Options” instead. You’ll get this pop-up:
Ensure that the last item, about generating simple geometries, is ticked. This imports each record as a single spatial object rather than as a MULTI geometry that bundles several objects into one column.
After ticking the box, click “OK”. And then click “Import”.
This will import your shapefile which will automatically reflect in your database in Pg Admin.
So yes, your reference datasets live in the same database that also holds your main dataset (the one you want to reverse geocode). Ensure that both datasets are actually in the same PostgreSQL database.
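By the way, if you prefer the command line to Pg Admin, the shp2pgsql utility that ships with PostGIS can load shapefiles too. A minimal sketch, assuming a database named mydb and the county shapefile from this walkthrough (TIGER shapefiles are published in NAD83, SRID 4269):
# Convert the shapefile to SQL and pipe it into PostgreSQL;
# -s sets the SRID and -I builds a spatial index during the load
shp2pgsql -s 4269 -I tl_rd22_us_county.shp public.tl_rd22_us_county | psql -d mydb
Either route gets the same tables into your database.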
It’s time to write code!
Getting into Reverse Geocoding
Here’s a glimpse at how my main dataset looks
If you take a look at the reference datasets we downloaded, they also have latitude and longitude columns, named “Intptlat” and “Intptlon” respectively (these hold the coordinates of each feature’s internal point).
To reverse geocode your main dataset, you’ll need to derive geometry columns in both your reference and main dataset. This will be the joining key for your two datasets.
Don’t worry, it’s not hard.
Derive Geometry Columns
To derive geometry columns for your main dataset:
-- Add a geometry column to the main dataset
ALTER TABLE uber2014 ADD COLUMN geom geometry(Point, 4326);
-- Populate it from the coordinate columns (note the order: longitude first)
UPDATE uber2014
SET geom = ST_SetSRID(ST_MakePoint(CAST("Longitude" AS double precision), CAST("Latitude" AS double precision)), 4326);
“uber2014” is the name of my dataset. “Longitude” and “Latitude” are the names of my columns. Ensure you pass the coordinates in this order: ST_MakePoint takes longitude first, then latitude. Don’t make the mistake of reversing them.
Notice how I cast the values to double precision while populating the “geom” column. ST_MakePoint expects double precision inputs, so coordinates stored as text (or a lower-precision type) must be cast explicitly; this preserves the full precision of the coordinates, which matters for the accuracy of our locations.
Do the same for your reference datasets:
ALTER TABLE tl_rd22_us_uac20 ADD COLUMN geom geometry(Point, 4326);
UPDATE tl_rd22_us_uac20
SET geom = ST_SetSRID(ST_MakePoint(CAST(intptlon20 AS double precision), CAST(intptlat20 AS double precision)), 4326);
Remember to replace the table and column names with the respective table and column names for each dataset.
Notice how both datasets have the number 4326. 4326 is the spatial reference system identifier (SRID) that represents the coordinate reference system of my spatial data. SRID 4326 is associated with representing positions on the earth's surface using longitudes and latitudes.
I used 4326 because it’s common and my dataset already contains longitudes and latitudes.
You could use a different SRID like 4322 or 3055. However, ensure that it suits your use case and that whatever SRID you use is consistent across your main and reference datasets. If it’s not consistent, your data will be represented in different coordinate systems, which can make your reverse geocoding join fail.
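And if your two datasets ever end up with different SRIDs, you don’t need to re-import anything: ST_Transform can reproject one to match the other. A quick sketch using my main table (the SRIDs are illustrative):
-- Reproject the geometry column to SRID 4326 in place
ALTER TABLE uber2014
ALTER COLUMN geom TYPE geometry(Point, 4326)
USING ST_Transform(geom, 4326);
-- Confirm that every row now carries the expected SRID
SELECT DISTINCT ST_SRID(geom) FROM uber2014;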
Merge your Datasets by the “geom” Column
-- Add a name column to the uber table
ALTER TABLE uber2014 ADD COLUMN "County" VARCHAR(50);
-- Create a temporary table to hold the reference data (counties)
CREATE TEMP TABLE temp_reference_data AS
SELECT name, geom
FROM tl_rd22_us_county;
-- Create a spatial index on the reference data
CREATE INDEX temp_reference_data_geom_idx ON temp_reference_data USING GIST(geom);
-- Update the name column in the uber table based on the nearest county
UPDATE uber2014
SET "County" = rd.name
FROM (
SELECT gd.geom, rd.name
FROM uber2014 gd
LEFT JOIN LATERAL (
SELECT name
FROM temp_reference_data rd
WHERE ST_DWithin(gd.geom::geography, rd.geom::geography, 50) -- maximum distance in meters
ORDER BY gd.geom <-> rd.geom ASC
LIMIT 1
) rd ON true
) rd
WHERE uber2014.geom = rd.geom;
-- Drop the temp tables
DROP TABLE temp_reference_data;
-- Select all columns from the updated uber table
SELECT *
FROM uber2014;
Okay, I understand that this might be a lot to take in, so let me break it down and show you what each query does.
-- Add a name column to the uber table
ALTER TABLE uber2014 ADD COLUMN "County" VARCHAR(50);
The above query added a new column called “County” to my main dataset.
-- Create a temporary table to hold the reference data (counties)
CREATE TEMP TABLE temp_reference_data AS
SELECT name, geom
FROM tl_rd22_us_county;
This query created a temporary table containing only the “name” and “geom” columns from my reference dataset. The “name” column contains the names of the counties. I felt it best to create a temporary table since I needed only these two columns from my reference dataset.
-- Create a spatial index on the reference data
CREATE INDEX temp_reference_data_geom_idx ON temp_reference_data USING GIST(geom);
This query creates a spatial index named “temp_reference_data_geom_idx” on the “geom” column of the “temp_reference_data” table using the GIST (Generalized Search Tree) indexing method.
This is very important for the speed of my query: instead of a full table scan of the reference data for every row in my main dataset, the index quickly narrows each search down to nearby candidates.
-- Update the name column in the uber table based on the nearest county
UPDATE uber2014
SET "County" = rd.name
FROM (
SELECT gd.geom, rd.name
FROM uber2014 gd
LEFT JOIN LATERAL (
SELECT name
FROM temp_reference_data rd
WHERE ST_DWithin(gd.geom::geography, rd.geom::geography, 50) -- maximum distance in meters
ORDER BY gd.geom <-> rd.geom ASC
LIMIT 1
) rd ON true
) rd
WHERE uber2014.geom = rd.geom;
For this one, take a deep breath.
Let’s start with the subquery after the FROM clause. The subquery left joins my main dataset to my reference dataset, selecting only the “geom” column from my main dataset and the “name” column (that is, the county names) from my reference dataset.
The lateral join uses ST_DWithin to keep only counties within 50 metres; the cast to geography makes the distance unit metres (plain geometry in SRID 4326 would measure in degrees). The “<->” operator orders the candidates by distance, and the “LIMIT 1” clause selects only the nearest county.
The “LIMIT 1” clause is important because the join could pull in more than one match; combined with the “<->” ordering, it keeps just the nearest county.
The outer UPDATE statement sets the “County” column in my main dataset to the name of the nearest county selected by the subquery. The WHERE clause matches each row of my main dataset to the subquery results on the “geom” column.
Notice that my maximum distance is 50 metres. It could vary depending on your use case: the larger the distance, the more matches you get, and vice versa.
If I used 5 metres, I would get some null values in my “County” column because the search radius is smaller. But if I increased it to 5,000 metres, I could get less accurate results.
So experiment with your maximum distance value to find the one that’s just right.
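A quick way to evaluate a threshold is to count how many rows were left unmatched after the update:
-- How many rows failed to match a county at the current threshold?
SELECT COUNT(*) AS unmatched
FROM uber2014
WHERE "County" IS NULL;
If the count is high, widen the radius and rerun the update; if accuracy matters more, tighten it.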
-- Drop the temp tables
DROP TABLE temp_reference_data;
-- Select all columns from the updated uber table
SELECT *
FROM uber2014;
I dropped the temporary table that I created from the reference dataset to free up memory. And the last query selects all the columns in my main dataset.
Do the Same for your Other Datasets
For the dataset that contains towns:
-- Add a name column to the uber table
ALTER TABLE uber2014 ADD COLUMN "Town" VARCHAR(50);
-- Create a temporary table to hold the reference data (towns)
CREATE TEMP TABLE temp_reference_data AS
SELECT namelsad20, geom
FROM tl_rd22_us_uac20;
-- Create a spatial index on the reference data
CREATE INDEX temp_reference_data_geom_idx ON temp_reference_data USING GIST(geom);
-- Update the name column in the uber table based on the nearest town
UPDATE uber2014
SET "Town" = rd.namelsad20
FROM (
SELECT gd.geom, rd.namelsad20
FROM uber2014 gd
LEFT JOIN LATERAL (
SELECT namelsad20
FROM temp_reference_data rd
WHERE ST_DWithin(gd.geom::geography, rd.geom::geography, 50) -- maximum distance in meters
ORDER BY gd.geom <-> rd.geom ASC
LIMIT 1
) rd ON true
) rd
WHERE uber2014.geom = rd.geom;
-- Drop the temp tables
DROP TABLE temp_reference_data;
-- Select all columns from the updated uber table
SELECT *
FROM uber2014;
For the dataset that contains the states:
-- Add a name column to the uber table
ALTER TABLE uber2014 ADD COLUMN "State" VARCHAR(50);
-- Create a temporary table to hold the reference data (states)
CREATE TEMP TABLE temp_reference_data AS
SELECT name, geom
FROM tl_rd22_us_state;
-- Create a spatial index on the reference data
CREATE INDEX temp_reference_data_geom_idx ON temp_reference_data USING GIST(geom);
-- Update the name column in the uber table based on the nearest state
UPDATE uber2014
SET "State" = rd.name
FROM (
SELECT gd.geom, rd.name
FROM uber2014 gd
LEFT JOIN LATERAL (
SELECT name
FROM temp_reference_data rd
WHERE ST_DWithin(gd.geom::geography, rd.geom::geography, 50) -- maximum distance in meters
ORDER BY gd.geom <-> rd.geom ASC
LIMIT 1
) rd ON true
) rd
WHERE uber2014.geom = rd.geom;
-- Drop the temp tables
DROP TABLE temp_reference_data;
-- Select all columns from the updated uber table
SELECT *
FROM uber2014;
Here are my results:
Conclusion
Great! You’ve come this far to understand how to reverse geocode your dataset by:
- Downloading your reference datasets (Tiger Dataset)
- Importing the shapefile reference datasets into Pg Admin
- Deriving geometry columns on both your main dataset and reference datasets
- Merging the different reference datasets to your main dataset and getting the county, town, and state names
I’m sure you learnt a lot from this, but I would love to know whether it really helped you on your reverse geocoding journey. So please let me know in the comments.
Like the Author? Connect with Nancy Amandi.
I write articles and speak on Data Storytelling and some important concepts in Data Analysis. I also write articles to guide readers on technical data concepts.