What is a vector database?

A vector is a numerical representation of data, a feature, or an attribute. Depending on the complexity of the data, a vector can have from tens to thousands of dimensions. We obtain a vector by applying a transformation or embedding function to raw data such as text, images, video, or audio. These embedding functions can be built from diverse techniques, including machine learning models, word embeddings, and feature extraction algorithms.
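As a toy illustration of a transformation function (not a learned embedding), the sketch below maps short texts to fixed-length count vectors over a small, hand-picked vocabulary; real embedding models learn dense vectors from data instead:

```python
# Toy transformation: map raw text to a fixed-length vector of word
# counts over a small, hand-picked vocabulary. Real embedding models
# (e.g., word2vec, Sentence Transformers) learn dense vectors instead.
VOCAB = ["cat", "dog", "sat", "mat", "ran"]

def text_to_vector(text):
    words = text.lower().split()
    return [words.count(term) for term in VOCAB]

print(text_to_vector("The cat sat on the mat"))  # [1, 0, 1, 1, 0]
print(text_to_vector("The dog ran"))             # [0, 1, 0, 0, 1]
```

Every text becomes a vector of the same length, so any two texts can be compared numerically.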

Transforming raw data into vectors

A vector database is a type of database designed to store and index vector embeddings. It handles vector data and provides vector-specific operations alongside traditional database operations. A key benefit of a vector database is its ability to quickly and precisely retrieve data by measuring the similarity or distance between vectors. Unlike conventional database search methods, a vector database lets us locate the most similar or relevant data by considering semantic or contextual meaning.

A visual of a vector database

In a vector database, similar vectors are placed near each other. We can therefore use a vector database to find content similar to a given item based on its features. For example, we can find:

  • similar images to a given image, considering their visual content and style.

  • related documents focusing on the topic and sentiment of a given document.

  • similar products based on ratings and features of a given product.

Finding data in a vector database

Before retrieving relevant data from a vector database, we need to convert our query into a query vector that represents the desired information. To do this, we use the same embedding model that produced the stored vector embeddings. Once we have the query vector, we perform a similarity search to find the vectors closest to it in the vector database. We can use any similarity or distance measure, e.g., cosine similarity, Euclidean distance, or Hamming distance. The output of the similarity search is a list of vectors sorted by similarity score, highest first, relative to the query vector. We can then use these vectors to access the raw data linked to each one.

Finding similar data in a vector database
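A brute-force version of this search can be sketched in a few lines of Python. The three-dimensional vectors and document names below are made up for illustration; a real database would hold many high-dimensional vectors and use an index instead of scanning every entry:

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = dot(u, v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings keyed by their raw data.
database = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.9, 0.2],
    "doc_c": [0.8, 0.2, 0.1],
}

query_vector = [1.0, 0.0, 0.0]

# Rank every stored vector by its similarity to the query, highest first.
results = sorted(database.items(),
                 key=lambda item: cosine_similarity(query_vector, item[1]),
                 reverse=True)
for name, vector in results:
    print(name, round(cosine_similarity(query_vector, vector), 3))
```

Here doc_a and doc_c point in nearly the same direction as the query, so they rank above doc_b.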

Vector database features

Vector databases provide many features that make them practical for different AI applications. Let’s discuss some of them:

  • Similarity search: A vector database excels at finding data points similar to a query vector, even when they are not exact matches. This makes it well suited to recommendation systems that surface related items.

  • High-dimensional indexing: A vector database can index vectors with many dimensions, capturing complex relationships within data. This can help, for example, in discovering genetic patterns and potential treatments for diseases.

  • Scalability: Vector databases are designed to handle huge amounts of vector data efficiently. This helps us manage rapidly growing data, such as that of a social media platform, while maintaining efficient search and recommendation functionality.

  • Metadata support: Vector databases can store additional information (metadata) alongside vectors for more nuanced searches. This can help us deliver more personalized recommendations to users.

  • Indexing methods: Vector databases use different indexing methods to store and access vectors, such as k-d trees, graph-based indexes, locality-sensitive hashing (LSH), inverted file indexes (IVF), and spatial hashing. These methods trade some accuracy or build time for much faster searches.
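As one concrete indexing method, here is a minimal sketch of locality-sensitive hashing with random hyperplanes (the dimension and plane count are arbitrary choices for illustration). Each vector is hashed to a bit string recording which side of each hyperplane it falls on; vectors whose signatures share many bits are likely to be close, so a search only needs to scan matching buckets:

```python
import random

# Toy LSH via random hyperplanes: bit i of a vector's signature
# records the sign of its dot product with the i-th hyperplane.
random.seed(42)
DIM, NUM_PLANES = 4, 8
planes = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PLANES)]

def lsh_signature(vector):
    dots = (sum(p_i * v_i for p_i, v_i in zip(plane, vector))
            for plane in planes)
    return "".join("1" if d >= 0 else "0" for d in dots)

v = [0.2, -0.5, 0.9, 0.1]
# Scaling a vector by a positive factor never flips the sign of its
# dot products, so it always lands in the same bucket.
assert lsh_signature(v) == lsh_signature([2 * x for x in v])
print(lsh_signature(v))
```

Production systems refine this idea with many hash tables and candidate re-ranking, but the bucketing principle is the same.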

Available vector databases

Several vector databases are available, each with its own strengths and weaknesses. Some popular options include:

  • MyScale

  • Pinecone

  • FAISS (Facebook AI Similarity Search)

  • Weaviate

  • Milvus

  • Chroma

Code example

Let’s implement a scenario using a Python library in the following playground:

(The playground reads its input from a sentences.txt data file containing a list of common English proverbs, one per line, such as "The customer is always right.", "Birds of a feather flock together.", and "Slow and steady wins the race.")

Finding similar vectors using cosine similarity

Code explanation

Let’s understand the above implementation:

  • Lines 1–3: We import the required libraries.

  • Line 6: We load the Sentence Transformers model paraphrase-MiniLM-L6-v2, a lightweight BERT-style model fine-tuned for paraphrase detection.

  • Lines 9–10: We open the sentences.txt file in read mode and use a context manager to ensure the file is properly closed after use. We read all lines from the file and store them in the sentences list.

  • Lines 13–14: We define the sentence_to_vector() function that takes a sentence as input, encodes it using the Sentence Transformers model, and returns its vector representation. Inside this function, we call the model.encode() to compute embeddings for the input sentence.

  • Line 17: We generate vector representations for all sentences in the sentences list using the sentence_to_vector() function and store them in the data NumPy array.

  • Line 20: We initialize an Annoy index with 384 dimensions to match the dimensionality of the sentence vectors produced by the paraphrase-MiniLM-L6-v2 model. We specify angular as the distance metric, which corresponds to cosine similarity between the vectors.

  • Lines 23–24: We use a for loop to iterate over each sentence vector in the data array. In the loop, we add each sentence vector to the Annoy index with its corresponding index i.

  • Line 27: We build the Annoy index with 10 trees for efficient nearest neighbor search.

  • Lines 30–31: We define a query sentence for similarity search and convert the query sentence to its vector representation using the sentence_to_vector() function.

  • Lines 33–35: We specify the number of nearest neighbors k to retrieve and perform a similarity search to find the k nearest neighbors to the query vector in the Annoy index.

  • Lines 38–42: Lastly, we use a for loop to print the nearest neighbors found for the query sentence, along with their indexes and cosine similarity scores. The cosine similarity score is computed as the dot product of the query vector and the neighbor vector, normalized by the product of their magnitudes.

Applications of vector databases

Vector databases can be utilized in various domains, such as natural language processing, computer vision, and recommendation systems. Let’s discuss some of their applications:

  • Retrieval-augmented generation (RAG) architecture: A vector database is used in a RAG architecture to retrieve relevant documents. This helps the LLM generate precise, factual responses and reduces hallucinations (in the context of a large language model, a hallucination is a generated response that is factually incorrect or unrelated to the given prompt).

  • Recommendation engines: A vector database can be used on an e-commerce website to find similar products based on the buying behavior of users. This helps enhance customer experience and engagement through vector similarity.

  • Similarity search: A vector database is used to find similar content in content retrieval applications across text, images, audio, and more.

  • Fraud detection: A vector database can be utilized by financial institutions to analyze financial transactions and identify fraudulent patterns. They can compare transaction data, such as amount, location, etc., to historical fraud vectors to flag suspicious activity.

  • Drug discovery: Vector databases can help in drug discovery by analyzing the properties of molecules. Researchers can prioritize promising leads by comparing potential drug candidates to known effective drugs.

Comparison with traditional databases

A traditional database organizes data in a tabular format and uses value-based indexing. When queried, it returns exact matches to the query. In contrast, a vector database stores data as embeddings and provides vector search, which retrieves results based on similarity metrics rather than exact matches. This specialization in vector embeddings is what sets it apart from traditional databases.

Additionally, a vector database outperforms traditional databases in various applications of AI and machine learning like similarity search, recommendations, and chatbot applications. It facilitates high-dimensional search, customized indexing, scalability, flexibility, and efficiency.

Copyright ©2024 Educative, Inc. All rights reserved