Removing and Replacing Tokens

Explore techniques to improve entity resolution by removing non-informative tokens and replacing inconsistent abbreviations. Understand how stopwords and token replacements impact text similarity, and learn to apply efficient preprocessing using regular expressions and open datasets for practical data cleaning.

We'll cover the following...

Remove tokens aka stopwords
Replace tokens
Replace tokens using open data
Key takeaway

A text consists of one or more words and other tokens. Some of those are more informative than others. Words can vary in spelling, grammar, language, and more. Let’s discuss which types of words should be removed and which should be replaced to improve the matching quality.

Remove tokens aka stopwords

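Stopwords are tokens that carry little or no information for telling one entity apart from another. In an entity resolution task they can do more harm than good, because two records that share only a stopword can look more similar than they really are. Let’s take the restaurants open dataset and three of its records as an example (see the Glossary for attribution and references for open data).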
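To make the effect concrete, here is a minimal sketch of stopword removal with a single compiled regular expression. The stopword list, the helper names `remove_stopwords` and `similarity`, and the three restaurant names are made up for illustration and are not taken from the restaurants dataset. The similarity score comes from Python's standard `difflib.SequenceMatcher`, which stands in here for whatever string metric is actually used.

```python
import re
from difflib import SequenceMatcher

# Hypothetical stopword list for restaurant names (an assumption for this sketch).
STOPWORDS = ["restaurant", "cafe", "the", "and"]

# One compiled pattern that matches any stopword as a whole word, ignoring case.
STOPWORD_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in STOPWORDS) + r")\b",
    flags=re.IGNORECASE,
)


def remove_stopwords(text: str) -> str:
    """Replace stopwords with spaces, collapse whitespace, and lowercase."""
    without_stopwords = STOPWORD_PATTERN.sub(" ", text)
    return " ".join(without_stopwords.split()).lower()


def similarity(left: str, right: str) -> float:
    """Return a similarity ratio between 0 and 1 based on matching subsequences."""
    return SequenceMatcher(None, left, right).ratio()


# Made-up names: the first two refer to the same place, the third does not.
name_a = "Golden Dragon Restaurant"
name_b = "The Golden Dragon"
name_c = "Blue Lagoon Restaurant"

for left, right in [(name_a, name_b), (name_a, name_c)]:
    raw_score = similarity(left.lower(), right.lower())
    clean_score = similarity(remove_stopwords(left), remove_stopwords(right))
    print(f"{left!r} vs {right!r}: raw={raw_score:.2f}, cleaned={clean_score:.2f}")
```

In this sketch, the shared token "Restaurant" pulls the two unrelated names closer together, while the extra "The" keeps the two matching names apart. After cleaning, the matching pair becomes identical and the unrelated pair scores lower. Compiling one alternation pattern up front avoids looping over the stopword list for every record.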
