Advanced Semantic Similarity Methods

In this section, we'll explore advanced semantic similarity methods for word, phrase, and sentence similarity. We've already learned how to calculate semantic similarity with spaCy's similarity method and obtained some scores. But what do these scores mean, and how are they calculated? Before we look at more advanced methods, we'll first learn how semantic similarity is calculated.
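As a quick refresher, the following is a minimal sketch of the similarity call we've been using. It assumes a vector-equipped English pipeline is installed; en_core_web_md and the example sentences are only illustrative choices:

import spacy

# Assumes a medium English model with word vectors is installed:
#   python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

doc1 = nlp("I like cats.")
doc2 = nlp("I like dogs.")

# similarity() returns a single float score for the two texts
print(doc1.similarity(doc2))

The score spaCy prints here is exactly what we'll unpack in this section: a number derived from a distance function applied to the texts' vectors.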

Understanding semantic similarity

When we collect text data (or any sort of data), we want to see how the examples are similar to, different from, or related to one another. We measure how similar two pieces of text are by calculating a similarity score. This is where the term semantic similarity comes into the picture: semantic similarity is a metric defined over texts, where the distance between two texts is based on their meaning.

In mathematics, a metric is basically a distance function. Every metric induces a topology on the space it's defined over. Word vectors live in a vector space, so we can calculate the distance between two word vectors and use it as a similarity score.

Now, we'll learn about two commonly used distance functions: Euclidean distance and cosine distance. Let's start with Euclidean distance.

Euclidean distance

The Euclidean distance between two points in a k-dimensional space is the length of the straight-line path between them, and it's calculated with the Pythagorean theorem. We compute it by summing the squares of the differences between corresponding coordinates and then taking the square root of this sum. The following diagram shows the Euclidean distance between two vectors, dog and cat:

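In formula form, the distance between vectors a and b is sqrt(sum_i (a_i - b_i)^2). To make this concrete, here is a minimal sketch that computes the Euclidean distance between two word vectors with NumPy; the model name en_core_web_md and the example words dog and cat are assumptions for illustration:

import numpy as np
import spacy

# Assumes a vector-equipped English model is installed (an assumption):
#   python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

dog = nlp("dog")[0].vector  # word vector for "dog"
cat = nlp("cat")[0].vector  # word vector for "cat"

# Euclidean distance: square the coordinate-wise differences,
# sum them, then take the square root of the sum
euclidean = np.sqrt(np.sum((dog - cat) ** 2))
print(euclidean)

The same quantity can be computed in one call with np.linalg.norm(dog - cat); the explicit version above simply mirrors the formula step by step.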