Effective Text Preprocessing

Learn how to preprocess string attributes with RecordLinkage in five lines of code.

Our goal is to resolve restaurants. Two records will be called duplicates if customer_name, street, city, and phone combined are similar enough. All these attributes are strings. We will apply a few cheap and effective preprocessing steps, increasing our matching quality by a large margin.

Semantic-preserving string manipulations

What do all transformations below have in common?

  • customer_name: Hyde Street Bistro > hyde street bistro

  • street: 70 w. 68th st. > 70 w 68th st > 70 west 68th street

  • city: L.A. > la > los angeles

  • phone: 212/362-2200 > 2123622200

They alter the text without altering the information content relevant to our matching task. In short, they preserve semantics in our context. Why manipulate the text at all if all versions are equivalent in meaning? Because the difference might not matter to humans, but it does matter to the algorithms we use to compute similarity.
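The four transformations listed above can be sketched in plain Python. This is a minimal illustration using only the standard library, not the RecordLinkage API; the function names and the two small abbreviation tables are hypothetical and would need to be extended for real data.

```python
import re

# Hypothetical lookup tables, covering only the examples in this lesson.
STREET_ABBREVIATIONS = {"w": "west", "st": "street"}
CITY_ABBREVIATIONS = {"la": "los angeles"}


def normalize_text(value: str) -> str:
    """Lowercase, drop punctuation, and collapse repeated whitespace."""
    value = value.lower()
    value = re.sub(r"[^a-z0-9\s]", "", value)  # keep letters, digits, spaces
    return re.sub(r"\s+", " ", value).strip()


def expand_tokens(value: str, table: dict) -> str:
    """Replace each whitespace-separated token found in the table."""
    return " ".join(table.get(token, token) for token in value.split())


def normalize_phone(value: str) -> str:
    """Keep digits only."""
    return re.sub(r"\D", "", value)


customer_name = normalize_text("Hyde Street Bistro")
street = expand_tokens(normalize_text("70 w. 68th st."), STREET_ABBREVIATIONS)
city = expand_tokens(normalize_text("L.A."), CITY_ABBREVIATIONS)
phone = normalize_phone("212/362-2200")

print(customer_name)  # hyde street bistro
print(street)         # 70 west 68th street
print(city)           # los angeles
print(phone)          # 2123622200
```

Expanding abbreviations through a lookup table is the riskiest step: a token such as "st" can also mean "saint", so a table like this should be built from the data at hand rather than applied blindly.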

For example, consider the Damerau-Levenshtein similarity function, an edit-based measure that counts insertions, deletions, substitutions, and transpositions. For two strings, it is defined as follows:
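One common way to turn the edit count into a similarity, assumed here for illustration, is 1 - d(a, b) / max(|a|, |b|), where d is the edit distance. The sketch below implements the restricted variant of the distance (often called optimal string alignment), in which adjacent characters can be swapped but not edited again afterwards. The function names are hypothetical, and the normalization may differ from the one a given library uses.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - osa_distance(a, b) / longest


# Two spellings of the same phone number differ by one substitution ...
raw = similarity("212/362-2200", "212-362-2200")
# ... but become identical once only the digits are kept.
clean = similarity("2123622200", "2123622200")
print(raw < clean)  # True
```

Under this assumed normalization, the raw phone strings score 1 - 1/12, while the digit-only versions score exactly 1.0, which is the effect the preprocessing steps are meant to produce.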
