Location Resolution: Geocoding
Use geocoding services to preprocess location attributes in many entity resolution tasks.
Addresses are everywhere: home, work, bill-to, and ship-to addresses are attached to people, companies, factories, distribution hubs, orders, invoices, and more. Matching addresses is therefore part of many entity resolution tasks. Consider geocoding as a preprocessing step or as a partial resolver for entities with location attributes.
We have geocoded the four given search strings with geopandas.
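The geocoding cell itself is not reproduced here, so the following is only a minimal sketch of how such a call could look. It assumes the `geocode` helper from `geopandas.tools` with the Nominatim provider; the function name `geocode_addresses` and the `user_agent` value are hypothetical, and the lesson may have used a different provider.

```python
def geocode_addresses(search_strings, user_agent="entity-resolution-demo"):
    """Geocode free-text addresses and return a GeoDataFrame.

    Assumptions: the Nominatim provider and the user_agent value are
    illustrative choices, not necessarily what this lesson used.
    """
    # Imported here so the function can be defined even when geopandas
    # (and its geopy dependency) is not installed.
    from geopandas.tools import geocode

    # geocode() returns one row per input string with a point
    # "geometry" column and the "address" matched by the provider.
    return geocode(
        list(search_strings),
        provider="nominatim",
        user_agent=user_agent,
    )
```

Calling the function sends requests to an external service, so check the provider's usage policy before geocoding in bulk. This helper returns only a geometry and an address per row; richer fields, such as a rank dictionary, depend on the service and client used.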
Only the second result looks wrong, which is not surprising given the poor quality of that input. The df_geo['rank'] column points in the same direction. It consists of dictionaries, which we can expand into separate columns with pandas.
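As a sketch of that expansion, the snippet below flattens a column of dictionaries with `pd.json_normalize`. The rows and the dictionary keys (`confidence`, `match_type`) are made-up placeholders, since the actual content of `df_geo['rank']` depends on the geocoding service.

```python
import pandas as pd

# Hypothetical stand-in for the geocoding result; the real keys inside
# "rank" depend on the service that produced them.
df_geo = pd.DataFrame(
    {
        "query": ["address a", "address b", "address c"],
        "rank": [
            {"confidence": 0.95, "match_type": "full_match"},
            {"confidence": 0.30, "match_type": "inner_part"},
            {"confidence": 0.90, "match_type": "full_match"},
        ],
    }
)

# Turn each dictionary into its own set of columns.
expanded = pd.json_normalize(df_geo["rank"].tolist()).add_prefix("rank_")
expanded.index = df_geo.index  # keep rows aligned with the source frame

df_flat = pd.concat([df_geo.drop(columns="rank"), expanded], axis=1)

# Flag records the service was unsure about (threshold is an assumption).
low_confidence = df_flat[df_flat["rank_confidence"] < 0.5]
```

With the dictionaries flattened, low-confidence records can be filtered or routed to a fallback matcher like any other column.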
We drop the second record for the rest of this lesson. The other three share the same street address, yet the last one has slightly different geocoordinates.
It’s unfair to declare one right and the other wrong. Geocodes are point coordinates, whereas locations are areas—in other words, an infinite set of points. So, we should be cautious with exact matching on geocoordinates to resolve location records.
We can address this problem in many ways. First, we can use areas instead of points to represent locations.
“Kruppstr. 4” is the address of a large office building, so we still end up with different Plus Codes and H3 indexes at resolution level 8 for this same location. We can make the resolution coarser by dropping some of the last characters of the Plus Codes or by explicitly setting a lower level in H3. However, this can quickly result in an area covering more than we want.
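The trade-off can be illustrated without either library. The sketch below is not Plus Codes or H3; it snaps coordinates to a plain latitude/longitude grid as a simplified stand-in, and the two points are made-up values that sit on either side of a cell border.

```python
import math


def grid_cell(lat, lon, cells_per_degree):
    """Map a point to the index of the square grid cell that contains it."""
    return (
        math.floor(lat * cells_per_degree),
        math.floor(lon * cells_per_degree),
    )


# Two hypothetical geocodes for the same building, a few meters apart.
point_a = (51.23399, 6.78050)
point_b = (51.23401, 6.78050)

# Fine grid: the points straddle a cell border and get different cells.
fine_a = grid_cell(*point_a, cells_per_degree=1000)
fine_b = grid_cell(*point_b, cells_per_degree=1000)

# Coarse grid: both points share a cell, but each cell now covers a
# hundred times the area, so unrelated addresses share it too.
coarse_a = grid_cell(*point_a, cells_per_degree=100)
coarse_b = grid_cell(*point_b, cells_per_degree=100)

print(fine_a == fine_b, coarse_a == coarse_b)
```

However fine or coarse the grid, some pairs of nearby points will always fall on opposite sides of a border, which is the weakness of exact matching on cell identifiers.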
There is a more elegant solution to our problem. We don’t insist on exact matching of names and other strings in other resolution tasks. So, why should we make this mistake for locations?
The output of the last cell tells us that the two distinct geocodes are “27.84” meters apart. Deciding if this is close enough to claim a match is up to you, task by task.
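One common way to compute such a distance is the haversine formula, sketched below. The lesson's own cell is not shown and may use a different method; the function names and the 50-meter threshold are illustrative assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def is_same_location(point_a, point_b, max_distance_m=50.0):
    """Treat two (lat, lon) geocodes as a match if they are close enough.

    The default threshold is an arbitrary placeholder; choose it per task.
    """
    return haversine_m(*point_a, *point_b) <= max_distance_m
```

A threshold turns the distance into a fuzzy match rule, in the same spirit as a similarity cutoff for names.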
Does a geocoding service resolve all our location records?
Geocoding services work well on street addresses. However, no service covers all addresses, and not every location is a street address. Some services also handle PO boxes, but many more exotic examples exist, such as “reactor 2 of power plant XYZ.” No single service will rule them all.
A second similarity function for locations helps when the geocoding service responds with low confidence. We can compare the strings that make up a location description with edit distances, just like we are used to doing for names. We don’t need to choose one approach over the other. We can combine evidence from both to reach a final match or no-match conclusion.
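A minimal sketch of combining both kinds of evidence follows. It pairs a normalized Levenshtein similarity on the address strings with a distance-based similarity on the geocodes and weights them by the geocoder's confidence. The function names, the 100-meter scale, and the weighting scheme are assumptions made for illustration, not a prescribed method.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def string_similarity(a, b):
    """Edit distance scaled to [0, 1], where 1 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def distance_similarity(meters, scale_m=100.0):
    """Map a distance to [0, 1]; scale_m is a hypothetical cutoff."""
    return max(0.0, 1.0 - meters / scale_m)


def location_score(string_sim, geo_sim, geo_confidence):
    """Weight geocode evidence by the service's confidence (0 to 1).

    With low confidence, the string comparison dominates the score.
    """
    return geo_confidence * geo_sim + (1.0 - geo_confidence) * string_sim


score = location_score(
    string_similarity("kruppstr. 4", "kruppstrasse 4"),
    distance_similarity(27.84),
    geo_confidence=0.9,  # hypothetical value reported by the geocoder
)
```

The resulting score can then be compared against a task-specific threshold, or fed into a larger matching model alongside the similarities of other attributes.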
Key takeaway
We can use geocoding services to resolve location records. Their providers have spent decades refining their services, building massive address databases, and more. When choosing a provider, look beyond the quality of results and costs, and check the license as well. Consider one of the many providers that use the OSM ecosystem under the hood. Their license allows us to store and distribute results, which is what we want to do in a preprocessing step of our entity resolution pipeline.