Advanced Semantic Similarity Methods

In this section, we'll explore advanced semantic similarity methods for word, phrase, and sentence similarity. We've already learned how to calculate semantic similarity with spaCy's similarity method and obtained some scores. But what do these scores mean, and how are they calculated? Before we look at more advanced methods, we'll first learn how semantic similarity is calculated.
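As a quick refresher, the following is a minimal sketch of the similarity call we've been using. It assumes a vector-equipped English pipeline is installed; en_core_web_md and the example sentences are only illustrative choices:

import spacy

# Assumes a medium English model with word vectors is installed:
#   python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

doc1 = nlp("I like cats.")
doc2 = nlp("I like dogs.")

# similarity() returns a single float score for the two texts
print(doc1.similarity(doc2))

The score spaCy prints here is exactly what we'll unpack in this section: a number derived from a distance function applied to the texts' vectors.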

Understanding semantic similarity

When we collect text data (or any sort of data), we want to see how the examples are similar to, different from, or related to one another. We measure how similar two pieces of text are by calculating a similarity score. This is where the term semantic similarity comes into the picture: semantic similarity is a metric defined over texts, where the distance between two texts is based on their meaning.

In mathematics, a metric is basically a distance function. Every metric induces a topology on the space it's defined over. Word vectors live in a vector space, so we can calculate the distance between two word vectors and use it as a similarity score.

Now, we'll learn about two commonly used distance functions: Euclidean distance and cosine distance. Let's start with Euclidean distance.

Euclidean distance

The Euclidean distance between two points in a k-dimensional space is the length of the straight-line path between them, and it's calculated with the Pythagorean theorem. We compute it by summing the squares of the differences between corresponding coordinates and then taking the square root of this sum. The following diagram shows the Euclidean distance between two vectors, dog and cat:

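In formula form, the distance between vectors a and b is sqrt(sum_i (a_i - b_i)^2). To make this concrete, here is a minimal sketch that computes the Euclidean distance between two word vectors with NumPy; the model name en_core_web_md and the example words dog and cat are assumptions for illustration:

import numpy as np
import spacy

# Assumes a vector-equipped English model is installed (an assumption):
#   python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

dog = nlp("dog")[0].vector  # word vector for "dog"
cat = nlp("cat")[0].vector  # word vector for "cat"

# Euclidean distance: square the coordinate-wise differences,
# sum them, then take the square root of the sum
euclidean = np.sqrt(np.sum((dog - cat) ** 2))
print(euclidean)

The same quantity can be computed in one call with np.linalg.norm(dog - cat); the explicit version above simply mirrors the formula step by step.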