Similarity
Explore how to quantify sentence similarity by converting text into dense vector embeddings using models like BERT and SentenceTransformers. Understand key similarity metrics such as cosine similarity and learn practical applications including semantic search, duplicate detection, document clustering, and question-answer matching using Hugging Face tools in Python.
Sentence similarity is the task of quantifying how similar two pieces of text are, based on their meaning rather than exact word matches.
This capability is foundational for search, plagiarism detection, clustering, question answering, recommendation, and many other NLP applications. In this lesson, you’ll learn what sentence embeddings are, how similarity is calculated, and how to implement semantic similarity and search using Hugging Face and SentenceTransformers.
What is sentence similarity?
At a high level, sentence similarity involves encoding sentences as vectors in a high-dimensional space, where sentences with similar meanings lie close together. Instead of comparing strings word for word, modern NLP uses semantic embeddings: numeric representations that capture meaning.
For example:
"The cat is on the mat."
"A cat sits on a rug."
Even though the words differ, their meanings are similar, so their embeddings should be close in vector space.
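To make this concrete, here is a minimal sketch that embeds both sentences and compares them. It assumes the sentence-transformers package and the all-MiniLM-L6-v2 checkpoint, a common lightweight choice; any SentenceTransformers model would work the same way:

```python
from sentence_transformers import SentenceTransformer, util

# Load a pretrained embedding model (assumption: all-MiniLM-L6-v2,
# a small, widely used SentenceTransformers checkpoint).
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["The cat is on the mat.", "A cat sits on a rug."]

# Encode each sentence into a dense vector.
embeddings = model.encode(sentences)

# Cosine similarity near 1 means the meanings are close,
# even though the surface words differ.
score = util.cos_sim(embeddings[0], embeddings[1])
print(f"Cosine similarity: {score.item():.3f}")
```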
Sentence embeddings
Sentence embeddings are dense vector representations of sentences produced by models like BERT, RoBERTa, or newer Sentence Transformers.

- Each sentence is converted into a fixed-length vector (e.g., 768 dimensions).
- Semantically similar sentences have embeddings that are close in this vector space.
- Distance or similarity between embeddings is measured using cosine similarity, dot product, or Euclidean distance.
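As a quick sanity check of the fixed-length property, you can inspect an embedding's shape directly (again assuming the all-MiniLM-L6-v2 model from the sketch above, which produces 384-dimensional vectors):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Every sentence maps to a vector of the same fixed length,
# no matter how long or short the input text is.
short = model.encode("Hi.")
long = model.encode("A much longer sentence about cats, mats, and rugs.")
print(short.shape, long.shape)  # (384,) (384,); BERT-base models give (768,)
```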
Below is a comparison of these three similarity metrics to help you understand when to use each:
| Metric | Range | When to Use |
| --- | --- | --- |
| Cosine similarity | -1 → 1 | Best when you want direction-only similarity (i.e., how similar or opposite the semantic meaning is), regardless of vector length |
| Dot product | -∞ → +∞ | Preferable when vector magnitude carries meaning, such as importance or frequency (e.g., in retrieval systems) |
| Euclidean distance | 0 → ∞ | Useful when you care about absolute distance and embedding magnitudes |
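To see how the three metrics behave on the same inputs, here is a small NumPy-only sketch comparing two vectors that point in the same direction but differ in length:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Direction only: vector lengths cancel out.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dot_product(a: np.ndarray, b: np.ndarray) -> float:
    # Sensitive to magnitude: longer vectors score higher.
    return float(np.dot(a, b))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Absolute distance: 0 only for identical vectors.
    return float(np.linalg.norm(a - b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the length

print(cosine_similarity(a, b))   # 1.0   (perfectly aligned)
print(dot_product(a, b))         # 28.0  (grows with magnitude)
print(euclidean_distance(a, b))  # ~3.74 (nonzero despite identical direction)
```

Note how cosine similarity reports a perfect match while the dot product and Euclidean distance both react to the difference in magnitude, which is exactly the trade-off the table describes.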
Cosine similarity
To compare vectors, we most commonly use cosine similarity. Cosine similarity measures the cosine of the angle between two vectors, ignoring their lengths. Formula:

$$\text{cosine\_similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$$
Values range from -1 to 1. For non-negative embedding spaces or typical ...