Effective Text Preprocessing

Learn how to preprocess string attributes with RecordLinkage in five lines of code.

Our goal is to resolve restaurant records. Two records are considered duplicates if their customer_name, street, city, and phone attributes, taken together, are similar enough. All of these attributes are strings. We will apply a few cheap but effective preprocessing steps that can improve matching quality by a large margin.

Semantic-preserving string manipulations

What do all transformations below have in common?

customer_name: Hyde Street Bistro > hyde street bistro
street: 70 w. 68th st. > 70 w 68th st > 70 west 68th street
city: L.A. > la > los angeles
phone: 212/362-2200 > 2123622200

They alter the text without altering the information content relevant to our matching task. In short, they preserve semantics in our context. Why manipulate at all if all versions are equivalent in meaning? The answer is that it might not matter for humans, but it does for algorithms we use for ...
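The four transformations listed above can be sketched with the standard library alone. This is a minimal illustration, not the lesson's RecordLinkage code: the function names and the two abbreviation tables are hypothetical, and `difflib.SequenceMatcher` is only a stand-in for whatever similarity measure is used later in the pipeline.

```python
import re
from difflib import SequenceMatcher

# Hypothetical lookup tables for illustration only. Real data needs larger,
# domain-specific dictionaries (for example, "st" can also mean "saint").
STREET_ABBREVIATIONS = {"w": "west", "e": "east", "st": "street", "ave": "avenue"}
CITY_ABBREVIATIONS = {"la": "los angeles"}


def normalize_text(value: str) -> str:
    """Lowercase, drop punctuation, and collapse repeated whitespace."""
    value = re.sub(r"[^a-z0-9 ]+", "", value.lower())
    return re.sub(r"\s+", " ", value).strip()


def normalize_street(value: str) -> str:
    """Clean the text, then expand known abbreviations token by token."""
    tokens = normalize_text(value).split()
    return " ".join(STREET_ABBREVIATIONS.get(token, token) for token in tokens)


def normalize_city(value: str) -> str:
    """Clean the text, then expand the whole value if it is a known abbreviation."""
    cleaned = normalize_text(value)
    return CITY_ABBREVIATIONS.get(cleaned, cleaned)


def normalize_phone(value: str) -> str:
    """Keep digits only."""
    return re.sub(r"\D", "", value)


print(normalize_text("Hyde Street Bistro"))  # hyde street bistro
print(normalize_street("70 w. 68th st."))    # 70 west 68th street
print(normalize_city("L.A."))                # los angeles
print(normalize_phone("212/362-2200"))       # 2123622200

# Two spellings of the same address: a character-based similarity score is
# below 1.0 on the raw strings and exactly 1.0 after normalization.
raw_a, raw_b = "70 w. 68th st.", "70 West 68th Street"
raw_score = SequenceMatcher(None, raw_a, raw_b).ratio()
clean_score = SequenceMatcher(
    None, normalize_street(raw_a), normalize_street(raw_b)
).ratio()
print(raw_score < clean_score, clean_score)  # True 1.0
```

A human reads both address spellings as the same place, but a character-based comparison only scores them as identical once the superficial differences are removed.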
