Yes, embeddings are not limited to text. Many frameworks support embeddings for diverse data types like images, audio, and video. These are particularly useful in multimedia search and recommendation systems.
How to generate text embeddings with OpenAI’s API in Python
Text embeddings transform how AI models understand and respond to human language, bridging the gap between AI and human communication. Before diving into generating text embeddings, it’s important to understand what they are.
What are text embeddings?
Text embeddings are numerical representations of text. They represent words or phrases as vectors in a high-dimensional space, positioned so that texts with similar meanings lie close together. These embeddings allow large language models (LLMs) to understand and process relationships between words in sentences. For example, the embedding vector of “felines say” will be more similar to the embedding vector of “meow” than to that of “roar.”
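To make the idea of "more similar vectors" concrete, here is a minimal sketch that compares hand-picked toy vectors with cosine similarity. The three-dimensional vectors and the cosine_similarity() helper are assumptions made up for illustration. Real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional "embeddings", hand-picked for illustration only.
toy_vectors = {
    "felines say": [0.9, 0.1, 0.2],
    "meow": [0.8, 0.2, 0.1],
    "roar": [0.1, 0.9, 0.3],
}

query = toy_vectors["felines say"]
sim_meow = cosine_similarity(query, toy_vectors["meow"])
sim_roar = cosine_similarity(query, toy_vectors["roar"])

# The vector chosen for "meow" points in nearly the same direction as the query.
print(f"meow: {sim_meow:.3f}, roar: {sim_roar:.3f}")
```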
Why do we convert text to embeddings? As computers interpret data as numbers, by turning text into embeddings, we provide them with a way to interpret and analyze complex human language. This capability enables them to handle complex language tasks such as clustering, classification, topic identification, etc., more effectively.
OpenAI embedding models
OpenAI offers a range of embedding models for different performance and cost needs. Let’s explore them one by one:
text-embedding-ada-002: This is an earlier model with an embedding size of 1536. It performs well for standard tasks. However, it offers less multilingual accuracy (31.4% on the MIRACL benchmark) than the newer models. MIRACL stands for Multilingual Information Retrieval Across a Continuum of Languages.
text-embedding-3-small: This newer embedding model, released in 2024, offers significant improvements in performance for multilingual tasks (44% on the MIRACL benchmark). It is a highly efficient and cost-effective model. Its default embedding size is 1536 dimensions, which can be reduced (for example, to 512) with the dimensions parameter.
text-embedding-3-large: This is the most powerful embedding model, achieving the highest performance for complex tasks (54.9% on the MIRACL benchmark). It offers a large embedding size of up to 3072 dimensions, which supports more detailed representations but comes at a higher cost.
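Embedding size is a trade-off between detail and storage cost. The text-embedding-3 models accept a dimensions parameter in embeddings.create() to return shorter vectors. When a full-size vector is already stored, a similar effect can be approximated locally by keeping the first values and rescaling to unit length. The sketch below shows only that local step. The shorten_embedding() helper is a hypothetical name, and the input is a random stand-in, not real model output.

```python
import numpy as np

def shorten_embedding(vector, dimensions):
    """Keep the first `dimensions` values and rescale the result to unit length."""
    truncated = np.asarray(vector, dtype=float)[:dimensions]
    norm = np.linalg.norm(truncated)
    if norm == 0:
        return truncated  # avoid dividing by zero for an all-zero vector
    return truncated / norm

# Random stand-in for a 1536-dimensional embedding (not real model output).
rng = np.random.default_rng(seed=0)
full = rng.normal(size=1536)
full = full / np.linalg.norm(full)

short = shorten_embedding(full, 256)
print(short.shape, round(float(np.linalg.norm(short)), 6))
```

Shorter vectors need less storage and compare faster, usually at some cost in accuracy, so the target size is worth checking against our own data.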
We’ll use text-embedding-3-small in this Answer due to its balance of performance, efficiency, and cost-effectiveness for multilingual applications. Now, let’s look at how to generate text embeddings using OpenAI’s API in Python.
Setting up the environment
Before we begin, we need to install the OpenAI Python library on our system. We can install it using the command:
pip install openai
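After that, we need an OpenAI API key to use the embedding models. The examples in this Answer read the key from the OPENAI_KEY environment variable, so we set that variable before running them. Now, we are all set to use the models for our tasks.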
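A missing key otherwise shows up as a bare KeyError, so it can help to check for it up front. The following is a small sketch of such a check. The read_api_key() helper is a hypothetical name, and OPENAI_KEY is simply the variable name used by the examples in this Answer.

```python
import os

def read_api_key(env=None, name="OPENAI_KEY"):
    """Return the API key stored under `name`, or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {name} environment variable before running the examples."
        )
    return key

# Demonstration with a plain dict standing in for os.environ.
print(read_api_key({"OPENAI_KEY": "sk-example"}))
```

In the examples, the result of read_api_key() could be passed as the api_key argument when creating the client.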
Generating text embeddings
Let’s begin by generating the embeddings for the sentence “Educative answers section is helpful.”
from openai import OpenAI
import os

# Initializing OpenAI object by providing an OpenAI key
client = OpenAI(
    api_key = os.environ["OPENAI_KEY"]
)

response = client.embeddings.create(
    input = "Educative answers section is helpful",
    model = "text-embedding-3-small"
)

print(response)
Code explanation
Lines 1–3: We import the OpenAI class from the openai library to access the embedding models and os for accessing environment variables.
Lines 5–7: We initialize the OpenAI object as a client using the OpenAI API key.
Line 9: We use the embeddings.create() function to generate embeddings.
Line 10: We provide the sentence “Educative answers section is helpful,” for which we need to generate embeddings, in the input parameter.
Line 11: We provide the model name in the model parameter.
Line 14: We print the response returned by the model, which contains the embeddings.
After executing the code, we can see that our model text-embedding-3-small returned a response object. It contains the embedding for the provided sentence as a list of floating-point numbers, along with the model name and token usage.
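Printing the whole response is useful for a first look, but most tasks need only the vectors. The sketch below pulls them out through the data, index, and embedding attributes of the response. The extract_vectors() helper is a hypothetical name, and a SimpleNamespace stand-in replaces the real response so the snippet runs without an API call.

```python
from types import SimpleNamespace

def extract_vectors(response):
    """Return the embedding vectors from a response, ordered by input index."""
    items = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in items]

# Stand-in object with the attributes used above; not a real API response.
fake_response = SimpleNamespace(
    data=[SimpleNamespace(index=0, embedding=[0.1, -0.2, 0.3])],
    model="text-embedding-3-small",
)

vectors = extract_vectors(fake_response)
print(len(vectors), len(vectors[0]))
```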
Finding similarity between texts
Embedding models are also used to find how semantically similar two texts are. In this example, we’ll use the dot product to find the similarity between the phrases “feline friends say” and “meow.”
from openai import OpenAI
import numpy as np
import os

# Initializing OpenAI object by providing an OpenAI key
client = OpenAI(
    api_key = os.environ["OPENAI_KEY"]
)

# Generating embeddings for text
response = client.embeddings.create(
    input = ["feline friends say", "meow"],
    model = "text-embedding-3-small"
)

# Extracting embedding of each text
embedding_a = response.data[0].embedding
embedding_b = response.data[1].embedding

# Finding similarity between embeddings using the dot product
similarity_score = np.dot(embedding_a, embedding_b)

print(similarity_score)
Code explanation
Lines 1–3: We import the required libraries.
Lines 6–8: We initialize the OpenAI object as a client using the OpenAI API key.
Lines 11–14: We use the embeddings.create() function to generate embeddings for the two phrases “feline friends say” and “meow” using the text-embedding-3-small model. The function returns a response that includes the embeddings for both phrases.
Lines 17–18: We extract the embeddings from the response object for each phrase in the embedding_a and embedding_b variables.
Line 21: We calculate the similarity score of the two phrases by taking the dot product of embedding_a and embedding_b.
Line 23: We print the similarity score, which tells us how close the two phrases are semantically.
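A plain dot product works as a similarity score here because OpenAI documents its embeddings as normalized to length 1, and for unit-length vectors the dot product equals the cosine similarity. The sketch below checks that relationship on small toy vectors. The vectors and the cosine() helper are assumptions for illustration, not model output.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: the dot product divided by both vector lengths."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors for illustration; real embeddings have many more dimensions.
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# Scale both vectors to unit length.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

raw_dot = float(np.dot(a, b))             # depends on the vector lengths
unit_dot = float(np.dot(a_unit, b_unit))  # equals the cosine similarity

print(raw_dot, unit_dot, cosine(a, b))
```

For vectors that are not normalized, the raw dot product also grows with vector length, so cosine similarity is the safer choice.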
Implementing semantic search
Let’s implement a simple semantic search system that compares embedding vectors to find the top-n most similar items in a dataset.
from openai import OpenAI
from sklearn.metrics.pairwise import cosine_similarity
import os

# Initializing OpenAI object by providing an OpenAI key
client = OpenAI(
    api_key = os.environ["OPENAI_KEY"]
)

# Defining get_embedding function which returns embedding of the given text
def get_embedding(text, model):
    return client.embeddings.create(input = [text], model=model).data[0].embedding

# Example dataset
dataset = ["sparrow", "carrot", "lion", "peas", "parrot"]

# Generating embedding of each element in dataset and storing it in dictionary
# in following pattern: "key: sparrow and value: embedding for sparrow"
dataset_embeddings = {word: get_embedding(word, model='text-embedding-3-small') for word in dataset}

# Defining search function which finds the top-n matches against the search query
def search(dataset_embeddings, query, n=3):
    # Generating embeddings of query
    query_embedding = get_embedding(query, model='text-embedding-3-small')

    # Finding similarity of query with every element of dataset and storing it in dictionary
    similarity = {word: cosine_similarity([embedding], [query_embedding])[0][0] for word, embedding in dataset_embeddings.items()}

    # Sorting the dictionary in descending order to get the top n results
    result = sorted(similarity.items(), key=lambda item: item[1], reverse=True)[:n]
    return result

results = search(dataset_embeddings, "cat", n=3)

# Printing top-n search results for the query
for word, similarity in results:
    print(f"{word}: {similarity}")
Code explanation
Lines 1–3: We import the required libraries.
Lines 6–8: We initialize the OpenAI object as a client using the OpenAI API key.
Lines 11–12: We define a get_embedding() function that finds embeddings of the given text using the specified model.
Lines 15–19: We create a sample dataset and generate embeddings for each item.
Lines 22–31: We define the search() function, which first finds the embedding of the query and then uses cosine similarity to find the similarity of the query with each dataset item. Finally, we sort the results in descending order and extract the top-n most similar items.
Line 33: We call the search() function to find the top-3 similar items for the query "cat".
Lines 36–37: We print the top 3 results for the query.
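The search above computes one similarity at a time, which is fine for five words. For larger datasets, the same ranking can be done with a single matrix product once the embeddings are stacked into an array. The sketch below shows only that ranking step. The top_n() helper is a hypothetical name, and the two-dimensional vectors are stand-ins for real embeddings.

```python
import numpy as np

def top_n(query_vector, labels, matrix, n=3):
    """Rank rows of `matrix` by cosine similarity to `query_vector`."""
    matrix = np.asarray(matrix, dtype=float)
    query_vector = np.asarray(query_vector, dtype=float)
    # Normalize rows and query so that a dot product equals cosine similarity.
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_unit = query_vector / np.linalg.norm(query_vector)
    scores = matrix_unit @ query_unit
    # Indices of the n highest scores, best match first.
    order = np.argsort(scores)[::-1][:n]
    return [(labels[i], float(scores[i])) for i in order]

# Hypothetical 2-dimensional vectors standing in for real embeddings.
labels = ["sparrow", "carrot", "lion"]
matrix = [[0.2, 0.9], [0.9, -0.3], [0.1, 1.0]]

results = top_n([0.0, 1.0], labels, matrix, n=2)
for label, score in results:
    print(f"{label}: {score:.3f}")
```

The dataset embeddings themselves can also be requested in one call, since embeddings.create() accepts a list of inputs, as in the similarity example earlier.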
Conclusion
Text embeddings support a wide range of NLP tasks. By capturing the meaning of text as vectors, they let us compare, search, and group text with just a few lines of Python code. Whether we aim to build a recommendation system, categorize documents, or visualize conceptual relationships, OpenAI’s embeddings are a valuable addition to our toolkit.
Frequently asked questions
Can we generate embeddings for media like images, audio, or video?
What is the range for the cosine similarity score?