Effective Text Preprocessing
Explore effective text preprocessing techniques in Python for entity resolution. Understand how semantic-preserving transformations, phonetic algorithms, and domain-specific cleaning boost similarity scoring and matching quality for duplicate detection.
Our goal is to resolve restaurants. Two records will be called duplicates if customer_name, street, city, and phone combined are similar enough. All these attributes are strings. We will apply a few cheap and effective preprocessing steps, increasing our matching quality by a large margin.
Semantic-preserving string manipulations
What do all transformations below have in common?
customer_name:Hyde Street Bistro>hyde street bistrostreet:70 w. 68th st.>70 w 68th st>70 west 68th streetcity:L.A.>la>los angelesphone:212/362-2200>2123622200
They alter the text without altering the information content relevant to our matching task. In short, they preserve semantics in our context. Why manipulate at all if all versions are equivalent in meaning? The answer is that it might not matter for humans, but it does for algorithms we use for ...