Motivating Clustering

Explore how clustering transforms entity resolution from pairwise matching to collective classification. Understand graph-based methods to resolve conflicts, improve prediction accuracy, and handle dependencies between record matches for practical applications.

We'll cover the following...

Clusters
Stochastic dependence
Resolving conflicts
Choosing clustering algorithms
Key takeaway

A typical entity resolution pipeline starts by preprocessing each record individually, $\tilde{r}_1=C(r_1),\ldots,\tilde{r}_n=C(r_n)$. Next comes pairwise feature engineering, $s_{ij}=F(\tilde{r}_i,\tilde{r}_j)$, followed by pairwise matching, $c_{ij}=M(s_{ij})$, where $c_{ij}=1$ represents a match and $c_{ij}=0$ a non-match. Each pair is thus a binary classification problem.
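To make the notation concrete, here is a minimal sketch of the three stages in Python. The function names `preprocess`, `features`, and `match` stand in for $C$, $F$, and $M$. The string-similarity feature and the 0.8 threshold are illustrative assumptions, not choices made in this lesson.

```python
from difflib import SequenceMatcher
from itertools import combinations


def preprocess(record):
    """C: normalize case and whitespace of a single record."""
    return " ".join(record.lower().split())


def features(a, b):
    """F: one similarity score in [0, 1] for a pair of cleaned records."""
    return SequenceMatcher(None, a, b).ratio()


def match(score, threshold=0.8):
    """M: binary decision, 1 for a match and 0 otherwise (threshold is assumed)."""
    return 1 if score >= threshold else 0


# Toy records, invented for illustration only.
records = ["Jane  Doe", "jane doe", "John Smith"]
cleaned = [preprocess(r) for r in records]

# c_ij for every unordered pair i < j.
predictions = {
    (i, j): match(features(cleaned[i], cleaned[j]))
    for i, j in combinations(range(len(cleaned)), 2)
}
```

Each prediction here depends only on its own pair, which is exactly the limitation that collective methods address.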

Collective entity resolution goes beyond isolated pairs and draws on the combined evidence of any number of records. The goals are to improve classification accuracy and to resolve conflicts between pairwise predictions, such as A matching B and B matching C while A does not match C, that would otherwise make the output impractical.
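One kind of conflict is a violation of transitivity: two of the three pairs among records $i$, $j$, $k$ are predicted as matches, but the third is not. The sketch below counts such triples. The helper name `transitivity_conflicts` and the dictionary layout of the predictions are assumptions made for illustration.

```python
from itertools import combinations


def transitivity_conflicts(n, predictions):
    """Return triples (i, j, k) in which exactly two of the three pairs match.

    `predictions` maps an ordered pair (i, j) with i < j to 0 or 1;
    missing pairs are treated as non-matches (an assumption of this sketch).
    """
    def is_match(a, b):
        return predictions.get((min(a, b), max(a, b)), 0) == 1

    conflicts = []
    for i, j, k in combinations(range(n), 3):
        matched = is_match(i, j) + is_match(j, k) + is_match(i, k)
        if matched == 2:
            conflicts.append((i, j, k))
    return conflicts


# Hypothetical pairwise output: 0-1 and 1-2 match, but 0-2 does not.
predictions = {(0, 1): 1, (1, 2): 1, (0, 2): 0}
conflicts = transitivity_conflicts(3, predictions)
```

A pairwise matcher has no way to rule out such triples, because it never sees more than two records at a time.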

Clusters

Let’s reformulate our resolution task as a clustering problem on graphs. Starting from our pairwise predictions, we create a graph where nodes represent records $r_1,\ldots,r_n$ ...
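As a rough illustration of the graph view, the sketch below assumes that an edge connects two records whenever the pairwise prediction is a match, and that each connected component is read as one cluster. Both choices are assumptions made here. Connected components are only one simple option among many clustering algorithms.

```python
from collections import defaultdict


def connected_components(n, predictions):
    """Group records 0..n-1 into clusters linked by predicted matches.

    `predictions` maps a pair (i, j) to 0 or 1; only pairs with value 1
    become edges (an assumption of this sketch).
    """
    adjacency = defaultdict(set)
    for (i, j), c in predictions.items():
        if c == 1:
            adjacency[i].add(j)
            adjacency[j].add(i)

    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        # Depth-first traversal collects every record reachable from `start`.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        clusters.append(sorted(component))
    return clusters


# Hypothetical predictions for four records.
predictions = {(0, 1): 1, (1, 2): 1, (0, 2): 0, (0, 3): 0, (1, 3): 0, (2, 3): 0}
clusters = connected_components(4, predictions)
```

Note that records 0 and 2 end up in the same cluster even though their own pair was predicted as a non-match. The cluster assignment overrides that single pairwise decision, which is the sense in which clustering makes the output consistent.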
