Distance Calculations in NLTK
Discover the alternative forms of distance in the NLTK package, such as Jaccard and Jaro-Winkler.
Packages for distance calculations
Distance measures are fundamental to natural language processing, so a number of packages aim to simplify these calculations. The first is NLTK, which includes distance calculations as part of its broader toolkit. Another popular package is FuzzyWuzzy, a silly-sounding package that specializes in string matching and distance calculations.
NLTK metrics
There are three main metrics we will cover.
Edit distance
To calculate the edit distance between two strings with Python's NLTK package, use the edit_distance() function from the nltk.metrics.distance module. The function is largely self-explanatory but takes a couple of optional parameters.
The edit_distance() function takes an optional third argument, substitution_cost, which specifies the cost of a substitution operation and defaults to 1.
We can also specify whether a transposition counts as a single edit (e.g., ba -> ab is 1 edit) by setting transpositions=True. This has some interesting advanced applications that we may explore later in this course.
You can see some sample usage below!
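The following is a minimal sketch of the calls described above, assuming NLTK is installed (for example, with pip install nltk). The example strings and variable names are illustrative choices, not part of the original lesson.

```python
from nltk.metrics.distance import edit_distance

# Default behavior: insertions, deletions, and substitutions each cost 1.
# kitten -> sitten -> sittin -> sitting takes 3 edits.
default_dist = edit_distance("kitten", "sitting")

# Raising substitution_cost to 2 makes each substitution as expensive
# as a deletion followed by an insertion.
costly_sub_dist = edit_distance("kitten", "sitting", substitution_cost=2)

# Without transpositions, swapping two adjacent characters takes 2 edits.
no_transpose_dist = edit_distance("ba", "ab")

# With transpositions=True, the same swap counts as a single edit.
transpose_dist = edit_distance("ba", "ab", transpositions=True)

print(default_dist)       # 3
print(costly_sub_dist)    # 5
print(no_transpose_dist)  # 2
print(transpose_dist)     # 1
```

Note how the two optional parameters change the result for the same pair of strings: a higher substitution cost increases the distance, while allowing transpositions decreases it.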