Distance Calculations in NLTK

Discover the alternative forms of distance in the NLTK package, such as Jaccard and Jaro-Winkler.

Packages for distance calculations

Distance measures are fundamental to natural language processing, so a number of packages aim to simplify these calculations. The first is NLTK, which includes distance calculations as part of its broader toolkit. Another popular package is FuzzyWuzzy, a silly-sounding package that specializes in string matching and distance calculations.

NLTK metrics

There are three main metrics we will cover.

Edit distance

To calculate the edit distance between two strings with Python's NLTK package, use the edit_distance() function from the nltk.metrics.distance module. The function is largely self-explanatory but takes a couple of optional parameters.
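As a minimal sketch of the basic call, assuming NLTK is installed, the snippet below compares two strings with the default settings. The example words and the variable name are illustrative choices, not part of the original lesson.

```python
from nltk.metrics.distance import edit_distance

# Default settings: insertions, deletions, and substitutions each cost 1.
# "kitten" -> "sitting" needs two substitutions (k->s, e->i) and one insertion (g).
basic_distance = edit_distance("kitten", "sitting")
print(basic_distance)  # 3
```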

The edit_distance() function takes an optional third argument, substitution_cost, which sets the cost of a substitution operation and defaults to 1.
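To illustrate the effect of substitution_cost, the following sketch runs the same comparison with the default cost and with a cost of 2. The word pair and variable names are assumptions made for the example. With a cost of 2, a substitution is no cheaper than a deletion followed by an insertion, so the total rises.

```python
from nltk.metrics.distance import edit_distance

# Default: each substitution costs 1 (2 substitutions + 1 insertion).
default_cost = edit_distance("kitten", "sitting")

# Each substitution now costs 2, so the same alignment costs 2 + 2 + 1.
doubled_cost = edit_distance("kitten", "sitting", substitution_cost=2)

print(default_cost)  # 3
print(doubled_cost)  # 5
```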

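We can also specify whether a transposition of two adjacent characters counts as a single edit (e.g., ba -> ab is 1 edit) by setting transpositions=True. By default, transpositions=False, so the same change costs 2 edits. This has some interesting advanced applications that we may explore later in this course.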

You can see some sample usage below!
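The transpositions flag can be sketched as follows, using the ba -> ab pair from the text. The variable names are illustrative, and the snippet assumes NLTK is installed.

```python
from nltk.metrics.distance import edit_distance

# Without transpositions, swapping two adjacent characters takes two edits.
without_transpositions = edit_distance("ba", "ab")

# With transpositions=True, the swap counts as a single edit.
with_transpositions = edit_distance("ba", "ab", transpositions=True)

print(without_transpositions)  # 2
print(with_transpositions)     # 1
```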
