How to generate text embeddings with OpenAI’s API in Python

Text embeddings transform how AI models understand and respond to human language, bridging the gap between AI and human communication. Before diving into generating text embeddings, it’s important to understand what they are.

What are text embeddings?

Text embeddings are numerical representations of text. They represent words or phrases as vectors in a high-dimensional space, positioned so that texts with similar meanings lie close together. These embeddings allow large language models (LLMs) to understand and process relationships between words in sentences. For example, the embedding vector of “felines say” will be more similar to the embedding vector of “meow” than to that of “roar.”

Why do we convert text to embeddings? Because computers operate on numbers, turning text into embeddings gives them a way to interpret and analyze human language. This enables them to handle language tasks such as clustering, classification, and topic identification more effectively.

OpenAI embedding models

OpenAI offers a range of embedding models for different performance and cost needs. Let’s explore them one by one:

text-embedding-ada-002: This is an earlier model with an embedding size of 1536. It performs well for standard tasks. However, it is less accurate on multilingual tasks than the newer models, scoring 31.4% on the MIRACL (Multilingual Information Retrieval Across a Continuum of Languages) benchmark.
text-embedding-3-small: This newer embedding model, released in 2024, offers significant improvements in performance for multilingual tasks (scores 44% on the MIRACL benchmark). It is a highly efficient and cost-effective model. Its default embedding size is 1536 dimensions, which can be reduced (for example, to 512) using the dimensions parameter.
text-embedding-3-large: This is the most powerful embedding model, achieving the highest performance for complex tasks (54.9% on MIRACL benchmark). It offers a large embedding size of up to 3072 dimensions, which supports more detailed representations but comes at a higher cost.
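
To make the size and cost trade-off concrete, here is a minimal sketch of shortening an embedding by hand: keep the first k values and rescale the result to unit length. The text-embedding-3 models can also return shorter vectors directly through the dimensions parameter. The helper name shorten_embedding and the three-value vector are made up for illustration; real embeddings have hundreds or thousands of values.

```python
import math


def shorten_embedding(vector, k):
    """Keep the first k values of `vector` and rescale the result to unit length."""
    truncated = vector[:k]
    norm = math.sqrt(sum(x * x for x in truncated))
    if norm == 0:
        return truncated  # an all-zero vector cannot be rescaled
    return [x / norm for x in truncated]


# Made-up stand-in for a real embedding
full = [3.0, 4.0, 12.0]
short = shorten_embedding(full, 2)

print(len(short))
print(short)
```

Rescaling matters because similarity measures such as the dot product assume vectors of comparable length. A truncated vector that is not rescaled would produce systematically smaller scores.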

We’ll use text-embedding-3-small in this Answer due to its balance of performance, efficiency, and cost-effectiveness for multilingual applications. Now, let’s look at how to generate text embeddings using OpenAI’s API in Python.

Setting up the environment

Before we begin, we need to install the OpenAI Python library on our system. We can install it using the following command:

pip install openai

Generating text embeddings
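
A minimal sketch of generating an embedding, under a few assumptions: the API key is stored in the OPENAI_KEY environment variable (the name used elsewhere in this Answer), and the request is wrapped in a hypothetical helper, generate_embedding(). Because of that wrapper, the line numbers cited in the explanation that follows will not line up with this sketch, although the steps are the same.

```python
import os


def generate_embedding(client, text, model="text-embedding-3-small"):
    """Return the embedding of `text` as a list of floats."""
    response = client.embeddings.create(input=text, model=model)
    return response.data[0].embedding


def main():
    # Imported inside main() so the helper above can be reused or tested
    # without the openai package installed
    from openai import OpenAI

    # Assumption: the API key lives in the OPENAI_KEY environment variable
    client = OpenAI(api_key=os.environ["OPENAI_KEY"])
    embedding = generate_embedding(client, "Educative answers section is helpful")
    print(embedding)
```

Calling main() with OPENAI_KEY set sends one request and prints the resulting vector.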

Code explanation

Lines 1–3: We import the OpenAI class from the openai library to access the embedding models and os for accessing environment variables.
Lines 5–7: We initialize the OpenAI object as a client using the OpenAI API key.
Line 9: We use the embeddings.create() function to generate embeddings.
Line 10: In the input parameter, we provide the sentence for which we need to generate embeddings: “Educative answers section is helpful.”
Line 11: We provide the model name in the model parameter.
Line 14: We print the embeddings generated by the model.

After executing the code, we can see the embedding that text-embedding-3-small generated for the provided sentence: a long list of floating-point numbers that encodes the sentence’s meaning.

Finding similarity between texts

Embedding models are also used to find the semantic similarity (similarity in meaning) between two phrases. This is done by finding the dot product or cosine similarity of embedding vectors. If two phrases are semantically similar, their embeddings will be closer in vector space, resulting in a higher similarity score.
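
The dot product and cosine similarity are interchangeable here for a reason: OpenAI’s documentation states that its embeddings are normalized to length 1, and for unit-length vectors the two measures give the same value. The sketch below checks this on made-up three-dimensional vectors. The helper names are illustrative, and plain Python is used so it runs without extra libraries.

```python
import math


def dot(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Dot product divided by the product of the vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(v):
    """Rescale v to unit length."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]


# Made-up vectors standing in for real embeddings
a = normalize([1.0, 2.0, 2.0])
b = normalize([2.0, 1.0, 2.0])

print(dot(a, b))     # dot product of the unit-length vectors
print(cosine(a, b))  # cosine similarity: the same value up to rounding
```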

In this example, we’ll use the dot product to find the similarity between the phrases “feline friends say” and “meow.”

from openai import OpenAI
import numpy as np
import os

# Initializing OpenAI object by providing an OpenAI key
client = OpenAI(
    api_key=os.environ["OPENAI_KEY"]
)

# Generating embeddings for text
response = client.embeddings.create(
    input=["feline friends say", "meow"],
    model="text-embedding-3-small"
)

# Extracting embedding of each text
embedding_a = response.data[0].embedding
embedding_b = response.data[1].embedding

# Finding similarity between embeddings using the dot product
similarity_score = np.dot(embedding_a, embedding_b)

print(similarity_score)

Code explanation

Lines 1–3: We import the required libraries.
Lines 6–8: We initialize the OpenAI object as a client using the OpenAI API key.
Lines 11–14: We use the embeddings.create() function to generate embeddings for the two phrases “feline friends say” and “meow” using the text-embedding-3-small model. The function returns a response that includes the embeddings for both phrases.
Lines 17–18: We extract the embeddings from the response object for each phrase in embedding_a and embedding_b variables.
Line 21: We calculate the similarity score of two phrases by taking the dot product of embedding_a and embedding_b.
Line 23: We print the similarity score, which tells us how close the two phrases are semantically.

Implementing semantic search

Let’s implement a simple semantic search system that compares embedding vectors to find the top-n most similar items in a dataset.

from openai import OpenAI
from sklearn.metrics.pairwise import cosine_similarity
import os

# Initializing OpenAI object by providing an OpenAI key
client = OpenAI(
    api_key=os.environ["OPENAI_KEY"]
)

# Defining the get_embedding function, which returns the embedding of the given text
def get_embedding(text, model):
    return client.embeddings.create(input=[text], model=model).data[0].embedding

# Example dataset
dataset = ["sparrow", "carrot", "lion", "peas", "parrot"]

# Generating the embedding of each element in the dataset and storing it in a dictionary
# in the following pattern: "key: sparrow and value: embedding for sparrow"
dataset_embeddings = {word: get_embedding(word, model='text-embedding-3-small') for word in dataset}

# Defining the search function, which finds the top-n matches for the search query
def search(dataset_embeddings, query, n=3):
    # Generating the embedding of the query
    query_embedding = get_embedding(query, model='text-embedding-3-small')

    # Finding the similarity of the query with every element of the dataset and storing it in a dictionary
    similarity = {word: cosine_similarity([embedding], [query_embedding])[0][0] for word, embedding in dataset_embeddings.items()}

    # Sorting the dictionary in descending order to get the top n results
    result = sorted(similarity.items(), key=lambda item: item[1], reverse=True)[:n]
    return result

results = search(dataset_embeddings, "cat", n=3)

# Printing top-n search results for the query
for word, similarity in results:
    print(f"{word}: {similarity}")

Code explanation

Lines 1–3: We import the required libraries.
Lines 6–8: We initialize the OpenAI object as a client using the OpenAI API key.
Lines 11–12: We define a get_embedding() function that finds embeddings of the given text using the specified model.
Lines 15–19: We define a sample dataset and create an embedding for each item.
Lines 22–31: We define the search() function, which first finds the embedding of the query and then uses cosine similarity to compare the query with each dataset item. Finally, we sort the results in descending order and extract the top-n most similar items.
Line 33: We call the search() function to find the top 3 most similar items for the query "cat".
Lines 36–37: We print the top 3 results for the query.
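
The search example sends one request per dataset item. Since embeddings.create() also accepts a list of inputs, as in the similarity example, the whole dataset can be embedded in a single request. The sketch below assumes a hypothetical helper, get_embeddings(), that takes an already created OpenAI client. It relies on the index field of each returned item to map vectors back to their texts.

```python
def get_embeddings(client, texts, model="text-embedding-3-small"):
    """Embed all `texts` with one API request and return a {text: vector} dict."""
    response = client.embeddings.create(input=texts, model=model)
    # Each returned item carries an `index` that points back into `texts`
    return {texts[item.index]: item.embedding for item in response.data}


# Usage, assuming `client` is the OpenAI client created earlier:
# dataset_embeddings = get_embeddings(client, ["sparrow", "carrot", "lion", "peas", "parrot"])
```

Because the result is a dictionary keyed by text, duplicate entries in the dataset collapse into one key, which is usually what we want for a lookup table like dataset_embeddings.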

Conclusion

Text embeddings support a wide range of NLP tasks. By representing the meaning of text as vectors, they make semantic comparison and search possible with just a few lines of Python code. Whether we aim to build a recommendation system, categorize documents, or visualize conceptual relationships, OpenAI’s embeddings are a valuable resource in our array of tools.

When working with large datasets, storing and searching through embeddings can become inefficient with basic storage solutions. Vector databases, which are specifically designed for embedding-based queries and large-scale similarity searches, are a more optimized option. Explore our course Vector Databases: From Embeddings to Applications to learn how to structure and store embeddings efficiently in vector databases. This course also covers embedding techniques for various data types, including audio and video, with hands-on implementation to build practical applications.

Frequently asked questions

Can we generate embeddings for media like images, audio, or video?

Yes, embeddings are not limited to text. Many frameworks support embeddings for diverse data types like images, audio, and video. These are particularly useful in multimedia search and recommendation systems.

What is the range for the cosine similarity score?

Cosine similarity ranges from -1 to 1, with 1 indicating vectors that point in the same direction, 0 indicating orthogonal (unrelated) vectors, and -1 indicating vectors that point in opposite directions.
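
A small worked example of the three landmark values, using made-up two-dimensional vectors and a hand-written cosine_similarity() helper (plain Python, no external libraries):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
opposite = cosine_similarity([1.0, 0.0], [-4.0, 0.0])

print(same_direction, orthogonal, opposite)  # prints: 1.0 0.0 -1.0
```

The second vector in each pair has a different length from the first, which shows that the score depends only on direction, not on magnitude.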
