Preprocess Locations Using Open Data

Learn to leverage open location data to fix location data quality issues.

Locations are part of many records. People have home and work addresses, companies have bill-to and ship-to addresses, and many more records have at least a country, postcode, or other geographic attribute. Locations can be powerful predictors in match vs. no-match models, or at least help us reduce the index, that is, the set of record pairs we compare. Why compare two customer records if the bill-to country is not the same?
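To make the index reduction concrete, here is a minimal sketch that only pairs records sharing the same bill-to country. The customer records, the `bill_to_country` field name, and the `candidate_pairs` helper are made up for illustration and are not part of any specific library.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical customer records, for illustration only.
customers = [
    {"id": 1, "name": "Acme GmbH", "bill_to_country": "DE"},
    {"id": 2, "name": "ACME Gmbh", "bill_to_country": "DE"},
    {"id": 3, "name": "Acme Inc", "bill_to_country": "US"},
    {"id": 4, "name": "Acme Inc.", "bill_to_country": "US"},
]


def candidate_pairs(records, key):
    """Return ID pairs for records that share the same value of `key`."""
    blocks = defaultdict(list)
    for record in records:
        blocks[record[key]].append(record["id"])
    pairs = []
    for ids in blocks.values():
        # Only records inside the same block are compared with each other.
        pairs.extend(combinations(ids, 2))
    return pairs


# Full index: every record compared with every other record.
all_pairs = list(combinations([c["id"] for c in customers], 2))
# Reduced index: only records with the same bill-to country.
blocked_pairs = candidate_pairs(customers, "bill_to_country")

print(len(all_pairs), len(blocked_pairs))
```

With these four records, the full index holds six pairs and the reduced index holds two. The reduction is only safe if the country values are clean: "DE" and "Germany" would land in different blocks and never be compared.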

First, we should ask ourselves: Is the quality of our location data trustworthy enough to rely on? Even something as simple as the country attribute can be messy when systems allow users to record it in a free-text field instead of a curated drop-down list.

Use pattern matching

If the country entry is a free-text field, expect some users to enter the ISO code and others the written-out name. Some will write it in English, and others in their native language. Some will use uppercase, and some will mix upper and lower case.

There are roughly 200 countries on Earth. So, let’s take care of this relatively short list and fix the variations by using pattern matching to map each one to a single representative string per country.
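One possible way to do this is sketched below with Python's built-in `re` module. The `COUNTRY_PATTERNS` table and the `normalize_country` helper are hypothetical names, and the three countries and their spelling variants are only a small sample, not a complete list.

```python
import re

# Hypothetical lookup: representative string -> pattern covering ISO codes
# and a few English and native-language spellings. Extend per country.
COUNTRY_PATTERNS = {
    "Germany": re.compile(r"de|deu|germany|deutschland", re.IGNORECASE),
    "France": re.compile(r"fr|fra|france|frankreich", re.IGNORECASE),
    "Spain": re.compile(r"es|esp|spain|espa[nñ]a", re.IGNORECASE),
}


def normalize_country(raw):
    """Return the representative country string, or None if nothing matches."""
    if raw is None:
        return None
    # Drop surrounding whitespace and trailing dots before matching.
    cleaned = raw.strip().rstrip(".")
    for name, pattern in COUNTRY_PATTERNS.items():
        # fullmatch avoids accidental hits such as "de" inside "Sweden".
        if pattern.fullmatch(cleaned):
            return name
    return None


for entry in [" de ", "DEUTSCHLAND", "España", "Fra.", "Atlantis"]:
    print(repr(entry), "->", normalize_country(entry))
```

Entries that match no pattern come back as `None`, so they can be reviewed manually instead of being silently assigned to the wrong country.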
