Markov Clustering
Explore Markov clustering to improve entity resolution by identifying dense node connections through random walks. Understand its steps, application on geographic data, and how it outperforms alternatives in precision despite requiring hyperparameter tuning.
Transitive clustering and MCC are popular in the community because they are simple and easy to interpret. However, they tend to underperform as cluster sizes grow. That is exactly the scenario where Markov clustering performs best.
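To make the random-walk idea concrete, here is a minimal pure-Python sketch of the generic Markov clustering loop: expansion (matrix powers that simulate longer walks) alternating with inflation (element-wise powers that strengthen dense connections). It is not the implementation used in this lesson. The function names, the default hyperparameters, and the small example graph are all assumptions chosen for illustration.

```python
def normalize_columns(matrix):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def markov_clustering(adjacency, expansion=2, inflation=2.0,
                      max_iter=100, tol=1e-8):
    """Cluster an undirected graph given as a 0/1 adjacency matrix
    with a zero diagonal. Defaults are illustrative assumptions."""
    n = len(adjacency)
    # Adding self-loops is a common convention that stabilizes the iteration.
    m = normalize_columns(
        [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    )
    for _ in range(max_iter):
        previous = m
        # Expansion: raise the matrix to the power `expansion`,
        # i.e. take several random-walk steps at once.
        for _step in range(expansion - 1):
            m = matmul(m, previous)
        # Inflation: element-wise power, then renormalize each column.
        # This boosts strong transitions and weakens faint ones.
        m = normalize_columns([[value ** inflation for value in row] for row in m])
        change = max(abs(m[i][j] - previous[i][j])
                     for i in range(n) for j in range(n))
        if change < tol:
            break
    # Rows with a positive diagonal act as attractors; the nonzero
    # entries of such a row are the members of its cluster.
    clusters = {
        tuple(j for j in range(n) if m[i][j] > 1e-6)
        for i in range(n)
        if m[i][i] > 1e-6
    }
    return sorted(clusters)


# Hypothetical example: two triangles joined by a single bridge edge (2-3).
bridged_triangles = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(markov_clustering(bridged_triangles))
```

The expansion and inflation parameters are the hyperparameters that need tuning: a larger inflation value generally yields smaller, tighter clusters.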
Resolving geographic settlements
We use the open geographic settlements dataset, where almost all clusters have a size of four. Below, we read the original data provided in JSON format and reshape the actual cluster assignments into a cross-reference table:
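The dataset's actual JSON layout is not reproduced here, so the following is only a sketch of the reshaping idea on made-up records. It assumes that each record carries an `id`, a `source`, and a true `cluster` label, and that a cross-reference table holds one row per cluster and one entry per source. All field names, values, and the `build_xref` helper are hypothetical.

```python
import json
from collections import defaultdict

# Hypothetical toy input; the real dataset uses its own field names.
raw_json = """
[
  {"id": "a-1", "source": "a", "cluster": "c1"},
  {"id": "b-7", "source": "b", "cluster": "c1"},
  {"id": "a-2", "source": "a", "cluster": "c2"},
  {"id": "b-9", "source": "b", "cluster": "c2"}
]
"""


def build_xref(records):
    """Map each cluster label to a {source: record id} dictionary."""
    table = defaultdict(dict)
    for record in records:
        table[record["cluster"]][record["source"]] = record["id"]
    return dict(table)


xref = build_xref(json.loads(raw_json))
for cluster, members in sorted(xref.items()):
    print(cluster, members)
```

A table in this shape makes it easy to look up, for any cluster, which record each source contributed.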
Each record consists of a name and geographic coordinates. We create one similarity feature for names using the Jaro-Winkler scores and another based on geodesic distances between coordinates. We use a rule-based model to predict matches and do a quick coherence ...