SQL-First Entity Resolution with Splink
Explore how to implement SQL-first entity resolution using the Splink framework. Understand its integration with Python and SQL engines like DuckDB, leveraging the Fellegi-Sunter probabilistic model to resolve entities without manual labels. Learn the workflow from data preparation to model interpretation, enabling scalable deduplication on common analytics platforms.
Many companies are heavily invested in SQL-first analytics platforms. Common commercial examples are Snowflake, Databricks, Google BigQuery, and Amazon Athena. These engines are optimized for computationally expensive data transformation jobs authored in SQL. Wouldn’t it be great to utilize the same SQL engine for expensive entity resolution workloads?
Introducing Splink
Splink is another entity resolution framework. Learners following this course might ask how it differs from RecordLinkage. There are two key differences:
In Splink, we author jobs in Python only. The framework translates them into SQL and sends that SQL to the warehouse engine for the heavy lifting.
Splink is limited to the Fellegi-Sunter model family, which does not require manual labels for training. This means we need to worry less about modeling tasks such as labeling data and choosing among classification algorithms. It also limits the complexity of what can be learned: the Fellegi-Sunter model is similar to a logistic regression, with no boosted trees and no deep learning.
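To make the comparison with logistic regression concrete, here is a minimal sketch of Fellegi-Sunter scoring in plain Python. It does not use Splink's API. The function names and all m, u, and prior values are made up for illustration; in practice these parameters are estimated from the data. Each field comparison contributes an additive match weight, log2(m/u), where m is the probability of the observed comparison outcome among true matches and u is the same probability among non-matches.

```python
import math


def match_weight(m: float, u: float) -> float:
    """Log2 Bayes factor for one observed comparison outcome."""
    return math.log2(m / u)


def match_probability(prior: float, outcomes: list[tuple[float, float]]) -> float:
    """Combine a prior match probability with per-field (m, u) pairs.

    `prior` is the assumed probability that a random record pair is a match.
    `outcomes` holds one (m, u) pair per field, for the outcome observed.
    """
    # Start from the prior expressed as log2 odds.
    total_weight = math.log2(prior / (1 - prior))
    # Add one weight per field: the model is linear in log-odds space.
    total_weight += sum(match_weight(m, u) for m, u in outcomes)
    odds = 2 ** total_weight
    return odds / (1 + odds)


# Hypothetical record pair: first name agrees, date of birth agrees, city disagrees.
observed = [
    (0.90, 0.01),   # first name agrees: common among matches, rare otherwise
    (0.95, 0.001),  # date of birth agrees: strong evidence for a match
    (0.10, 0.90),   # city disagrees: evidence against a match
]
probability = match_probability(prior=0.001, outcomes=observed)
print(f"match probability: {probability:.3f}")
```

Because the weights simply add up in log-odds space, each field's contribution to the final score can be inspected on its own, which is what keeps this model family easy to interpret.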
Fortu ...