Removing and Replacing Tokens

Explore techniques to improve entity resolution by removing non-informative tokens and replacing inconsistent abbreviations. Understand how stopwords and token replacements impact text similarity, and learn to apply efficient preprocessing using regular expressions and open datasets for practical data cleaning.

A text consists of one or more words and other tokens. Some of those are more informative than others. Words can vary in spelling, grammar, language, and more. Let’s discuss which types of words should be removed and which should be replaced to improve the matching quality.

Remove tokens aka stopwords

Stopwords are text tokens that carry little or no information for the task at hand. In an entity resolution task, they can do more harm than good because they make unrelated records look more alike. Let’s take the restaurants open dataset and three of its records as an example (see the Glossary for attribution and references for open data).

Python
import pandas as pd

# See the course glossary in the appendix for attribution:
restaurants = pd.read_csv('solvers_kitchen/restaurants.csv')

# Print three example records, selected by their row positions
for i, row in restaurants.iloc[[20, 21, 272], :].iterrows():
    print('Record', i)
    print(row)
    print('---')
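To make the effect of a stopword on similarity concrete, here is a minimal sketch that does not use the dataset. The two names are made up, the one-word stopword list is an assumption for illustration, and token-level Jaccard similarity is only a stand-in for whichever similarity measure is actually used. The helpers `remove_stopwords` and `jaccard` are hypothetical names, not part of the course code.

```python
import re

# Assumed stopword list, for illustration only
STOPWORDS = ["restaurant"]

# One compiled pattern that matches any stopword as a whole word
STOPWORD_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in STOPWORDS) + r")\b",
    flags=re.IGNORECASE,
)


def remove_stopwords(text):
    """Delete stopwords, then collapse the whitespace left behind."""
    without = STOPWORD_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", without).strip()


def jaccard(a, b):
    """Share of distinct lowercase tokens that two strings have in common."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0  # nothing to compare on at least one side
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Made-up names of two different places
name_a = "Golden Dragon Restaurant"
name_b = "Blue Lagoon Restaurant"

before = jaccard(name_a, name_b)
after = jaccard(remove_stopwords(name_a), remove_stopwords(name_b))

print(f"similarity before removal: {before:.2f}")
print(f"similarity after removal:  {after:.2f}")
```

With these made-up names, the shared token alone produces a similarity of 0.20, which drops to 0.00 once it is removed. Compiling all stopwords into a single pattern means each string is scanned once, regardless of how long the stopword list grows.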

The word “restaurant” in the customer_name attribute has little information content because this is a dataset about restaurants, cafes, and similar places. A word ...