SQL-First Entity Resolution with Splink

Explore how to implement SQL-first entity resolution using the Splink framework. Understand its integration with Python and SQL engines like DuckDB, leveraging the Fellegi-Sunter probabilistic model to resolve entities without manual labels. Learn the workflow from data preparation to model interpretation, enabling scalable deduplication on common analytics platforms.


Many companies are heavily invested in SQL-first analytics platforms. Common commercial examples are Snowflake, Databricks, Google BigQuery, and Amazon Athena. These engines are optimized for computationally expensive data transformation jobs authored in SQL. Wouldn’t it be great to utilize the same SQL engine for expensive entity resolution workloads?

Introducing Splink

Splink is another entity resolution framework. Learners following this course might ask how it differs from RecordLinkage. Two differences stand out:

In Splink, we author jobs in Python only. The framework translates these jobs into SQL and sends the SQL to a warehouse for the heavy lifting.
Splink is limited to the Fellegi-Sunter model family, which can be trained without manual labels. This means we have fewer modeling decisions to worry about, such as labeling pairs and choosing among classification algorithms. It also limits the complexity of what can be learned: the Fellegi-Sunter model is similar to a logistic regression, with no boosted trees and no deep learning.
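To make both points concrete, here is a minimal sketch of the idea, not Splink's actual API or the SQL it generates. Python renders a scoring query, a SQL engine executes it, and the score is a sum of per-field log-odds weights, much like the linear part of a logistic regression. The standard-library sqlite3 module stands in for a warehouse. The table, the helper names, the m and u probabilities, and the prior are all hypothetical values chosen for illustration. Splink estimates such parameters from the data.

```python
import math
import sqlite3

# Hypothetical parameters, chosen for illustration (not learned from data).
# m = P(field agrees | records match), u = P(field agrees | records do not match).
PARAMS = {
    "first_name": {"m": 0.9, "u": 0.01},
    "city": {"m": 0.8, "u": 0.1},
}
PRIOR_MATCH_PROBABILITY = 0.001  # assumed share of compared pairs that truly match


def match_weight(m, u, agrees):
    """Return the log2 Bayes factor for agreement or disagreement on one field."""
    return math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))


def build_scoring_sql(params):
    """Render one CASE expression per field and sum them into a match weight."""
    cases = []
    for col, p in params.items():
        w_agree = match_weight(p["m"], p["u"], True)
        w_disagree = match_weight(p["m"], p["u"], False)
        cases.append(
            f"CASE WHEN l.{col} = r.{col} THEN {w_agree!r} ELSE {w_disagree!r} END"
        )
    return (
        "SELECT l.id AS id_l, r.id AS id_r, "
        + " + ".join(cases)
        + " AS match_weight "
        "FROM people AS l JOIN people AS r ON l.id < r.id "
        "ORDER BY id_l, id_r"
    )


def to_probability(weight, prior):
    """Add the prior log2 odds to the summed weight, then convert odds to probability."""
    log2_odds = math.log2(prior / (1 - prior)) + weight
    odds = 2 ** log2_odds
    return odds / (1 + odds)


# The SQL engine does the pairwise work; Python only authors the query.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (id INTEGER, first_name TEXT, city TEXT)")
con.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [(1, "anna", "leeds"), (2, "anna", "leeds"), (3, "omar", "york")],
)

sql = build_scoring_sql(PARAMS)
scores = {
    (id_l, id_r): to_probability(weight, PRIOR_MATCH_PROBABILITY)
    for id_l, id_r, weight in con.execute(sql)
}
```

With these assumed numbers, records 1 and 2 agree on both fields yet score a match probability of only about 0.42, because the prior of 1 in 1,000 outweighs two agreements. No labeled pairs appear anywhere in the sketch. The model is fully described by the m, u, and prior parameters.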

Fortu ...
