Local Stack with DuckDB and dbt

Get familiar with dbt and data transformation pipelines.

We'll cover the following...

Configuring our data stack
Getting started with DuckDB
Authoring transformations with dbt
Key takeaway

A typical corporate analytics environment is built around a central, often distributed, storage and query technology optimized for analytics. Every other component, such as ETL, business intelligence, and entity resolution, must integrate with that central store to keep the data stack efficient.

The distributed nature of technologies like Snowflake, Databricks, and BigQuery is abstracted away, so it feels like there is one place to store and query data. Let’s make this idea concrete by replicating this kind of stack with open-source tools on a single machine.

Configuring our data stack

The following image illustrates a technology stack that, in a corporate setting, could consist of several proprietary (and costly) components. Here, we will replicate the basic functionality with open-source tools and refer to the different components by their colors in the image.
