How can ZSL techniques be used for cross-modal retrieval?

Zero-shot learning (ZSL) approaches can be used for cross-modal retrieval by retrieving data from one modality (such as text) based on queries from another modality (such as images) when no training examples are available for direct matching. The key concept is to bridge the semantic gap (in the context of ZSL, the disparity between how data is represented in a machine learning model and how humans understand and describe that data) between the modalities and perform retrieval using semantic embeddings.

Cross-modal retrieval

Cross-modal retrieval, also known as cross-modal search or cross-modal matching, is a field of study in computer science and information retrieval that deals with retrieving information of one type of data using a query of another type. Given a query in one modality (for example, text or an image), the aim is to locate relevant results in another modality (for example, text documents, photographs, videos, or audio recordings).


Retrieval combinations

In the context of cross-modal retrieval, there are various potential retrieval combinations, including:

Text-to-image retrieval

  • This involves taking a text query as input and then finding images that are relevant to that query.

  • This can be beneficial in applications such as image searches based on written descriptions (a minimal code sketch of this is shown after the retrieval combinations below).


Image-to-text retrieval

  • This involves taking an image as input and then finding captions or textual descriptions that describe the content of that image.

  • This is beneficial in applications such as image captioning and image recognition.


Audio-visual retrieval

  • This involves taking an audio clip as input and then finding videos or images that match that audio content.

  • This is beneficial in applications such as multimedia search and content-based retrieval.


Video-text retrieval

  • This involves taking a video clip as input and then finding textual documents that discuss or relate to the content of that video.

  • This is beneficial in applications such as video captioning and surveillance systems.

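Each of these combinations ultimately reduces to a nearest-neighbor search in a shared embedding space. The following minimal sketch illustrates text-to-image retrieval, assuming the candidate images and the text query have already been encoded into the same space; the arrays below are random placeholders standing in for real encoder outputs:

import numpy as np

# Hypothetical precomputed embeddings: 1,000 candidate images and one text query,
# both already mapped into the same 64-dimensional space
rng = np.random.default_rng(0)
image_embeddings = rng.random((1000, 64))
query_embedding = rng.random(64)

# L2-normalize so that a dot product equals cosine similarity
image_norm = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
query_norm = query_embedding / np.linalg.norm(query_embedding)

# Rank images by similarity to the text query and keep the five best matches
scores = image_norm @ query_norm
top_5 = np.argsort(scores)[::-1][:5]
print("Most relevant image indices:", top_5)

The same pattern applies to the other combinations; only the encoders that produce the embeddings change.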

ZSL techniques

ZSL methods can be categorized into the following taxonomy based on different characteristics and approaches:

  • Attribute-based methods (see the sketch after the note below)

  • Embedding-based methods

  • Knowledge graph-based methods

  • Generative methods

Note: Check out this Answer on ZSL for a more comprehensive exploration of each ZSL method.
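To make the first category more concrete, the sketch below shows the core idea of an attribute-based method: an attribute classifier (assumed to be trained on seen classes) predicts an attribute vector for a test sample, and the unseen class whose attribute signature is closest to that prediction is selected. All class names and attribute values here are made up for illustration:

import numpy as np

# Hypothetical attribute signatures for two unseen classes
# Attributes: [has_stripes, has_hooves, is_aquatic]
unseen_classes = ["zebra", "dolphin"]
class_attributes = np.array([
    [1.0, 1.0, 0.0],  # zebra
    [0.0, 0.0, 1.0],  # dolphin
])

# Attribute scores predicted for a test image by a classifier trained on seen classes
predicted_attributes = np.array([0.9, 0.8, 0.1])

# Assign the unseen class whose attribute signature is closest to the prediction
distances = np.linalg.norm(class_attributes - predicted_attributes, axis=1)
print("Predicted unseen class:", unseen_classes[int(np.argmin(distances))])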

Utilizing ZSL techniques for cross-modal retrieval

Here’s how ZSL approaches can be used for cross-modal retrieval:

  • Modality-specific embedding creation: For each modality, we can create modality-specific embeddings. These embeddings can be generated based on pretrained models, attribute classifiers, or other approaches.

  • Semantic space mapping: We can create a common semantic space in which the various modalities can be compared. In most cases, this space reflects a shared semantic representation of attributes or concepts. We can map the modality-specific embeddings to this shared semantic space using a ZSL approach, for example, with methods such as DeViSE, CMT, and generalized ZSL models (a simplified DeViSE-style ranking loss is sketched after this list).

  • Image-to-text and text-to-image mapping: In this shared semantic space, we can map image queries to text representations and vice versa.

    • Text-to-image mapping: This includes taking a text query as input, mapping it to the shared semantic space using text-specific embedding, and then finding the nearest image representations in that space.

    • Image-to-text mapping: We can follow a similar process but in the reverse direction.

  • Semantic similarity calculation: After mapping both the query and the data to the common semantic space, we can compute their similarity using any similarity measure, such as cosine similarity. The retrieval results are the most similar data examples in the other modality.
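To give a flavor of the semantic space mapping step, the sketch below implements a simplified DeViSE-style pairwise hinge ranking loss: a matching image-text pair should score higher than mismatched pairs by at least a margin. This is an illustrative approximation of the ranking objective used by such methods, not a faithful reproduction of any particular paper; the tensor shapes and margin value are assumptions:

import torch
import torch.nn as nn

def ranking_loss(image_emb, text_emb, margin=0.1):
    # image_emb, text_emb: (batch, dim); row i of each tensor forms a matching pair
    image_emb = nn.functional.normalize(image_emb, dim=1)
    text_emb = nn.functional.normalize(text_emb, dim=1)
    scores = image_emb @ text_emb.t()                   # (batch, batch) similarity matrix
    positives = scores.diag().unsqueeze(1)              # matching pairs sit on the diagonal
    loss = (margin + scores - positives).clamp(min=0)   # hinge on mismatched pairs
    loss.fill_diagonal_(0)                              # ignore the positive pairs themselves
    return loss.mean()

# Random embeddings standing in for mapped image and text features
img = torch.randn(8, 32)
txt = torch.randn(8, 32)
print(ranking_loss(img, txt))

Such a ranking loss can replace the MSE objective used in the simplified example below when paired (image, text) data is available.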

Question

What is meant by cosine similarity?

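In short, cosine similarity measures the cosine of the angle between two vectors, that is, their dot product divided by the product of their norms, so it captures direction rather than magnitude. A quick NumPy check with two made-up vectors:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

# Dot product divided by the product of the vector norms
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)  # 1.0, because the vectors point in the same direction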

Code example

The following code defines a simple ZSL model, trains it on random data, generates a random query, and finds the items in the dataset whose text embeddings are most similar to the query image embedding, based on cosine similarity in the shared space learned by the model:

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

# Example image and text features
image_features = np.random.rand(500, 128)
text_features = np.random.rand(500, 300)

# Basic ZSL model
class ZSLModel(nn.Module):
    def __init__(self):
        super(ZSLModel, self).__init__()
        self.image_mapping = nn.Sequential(
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU()
        )
        self.text_mapping = nn.Sequential(
            nn.Linear(300, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU()
        )

    def forward(self, image_input, text_input):
        image_embedding = self.image_mapping(image_input)
        text_embedding = self.text_mapping(text_input)
        return image_embedding, text_embedding

# Hyperparameters
epochs = 10
learning_rate = 0.001

# Instantiate model, loss function, and optimizer
model = ZSLModel()
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Data conversion
image_tensor = torch.from_numpy(image_features).float()
text_tensor = torch.from_numpy(text_features).float()

# Model training
for epoch in range(epochs):
    optimizer.zero_grad()
    image_output, text_output = model(image_tensor, text_tensor)
    loss = criterion(image_output, text_output)
    loss.backward()
    optimizer.step()

# Example inference for retrieval
query_image = np.random.rand(1, 128).astype(np.float32)
query_text = np.random.rand(1, 300).astype(np.float32)

# Conversion
query_image_tensor = torch.from_numpy(query_image)
query_text_tensor = torch.from_numpy(query_text)

# Embed the query and the dataset in the shared space
with torch.no_grad():
    query_image_embedding, query_text_embedding = model(query_image_tensor, query_text_tensor)
    _, text_embeddings = model(image_tensor, text_tensor)

# Cosine similarity between the query image and every dataset text embedding
similarity_scores = nn.functional.cosine_similarity(query_image_embedding, text_embeddings, dim=1)

# Get and print top similar indices
top_similar_indices = torch.argsort(similarity_scores, descending=True)[:10]
print("Top 10 similar items indices:", top_similar_indices)

Note: Implementing a complete ZSL method for cross-modal retrieval involves intricate steps that can’t be fully encapsulated in a code snippet due to the complexity of neural network architectures, loss functions, and data requirements. However, we’ve provided a simplified example using PyTorch to illustrate this concept.

Code explanation

Here’s the breakdown of the code above:

  • Lines 1–4: We import the necessary libraries.

  • Lines 7–8: These generate random example data for image_features and text_features, simulating a dataset with 500 samples each, where image_features have 128 dimensions and text_features have 300 dimensions.

  • Lines 11–25: These define a model class ZSLModel inheriting from nn.Module. The model contains two separate mapping networks for image and text data within its __init__ method, each with linear layers and ReLU activation functions.

  • Lines 27–30: The forward method specifies how inputs pass through the model.

  • Lines 33–34: Here we configure the training parameters: epochs and learning_rate.

  • Lines 37–39: These initialize the ZSL model, the loss function (MSELoss), and the optimizer (Adam).

  • Lines 42–43: Here we convert the example NumPy array data into PyTorch tensors for training.

  • Lines 46–51: These execute a training loop over the specified number of epochs, using the Adam optimizer to minimize the mean squared error (MSE) loss between the image_output and text_output embeddings.

  • Lines 54–55: These generate random query data to find similar items in the dataset.

  • Lines 58–59: Here we convert the query data into PyTorch tensors.

  • Lines 62–64: These pass the query data and the dataset through the trained model (inside torch.no_grad()) to obtain the query image embedding, the query text embedding, and the embeddings of all dataset text items in the shared space.

  • Line 67: This calculates the cosine similarity between query_image_embedding and the dataset text embeddings (text_embeddings).

  • Lines 70–71: These retrieve the indices of the top 10 most similar items based on cosine similarity and print them.

  • Expected output: The code prints the indices of the top 10 text items in the dataset that are most similar to the randomly generated query image, based on cosine similarity in the shared embedding space learned by the model. Because the data is randomly generated, the specific indices will vary from run to run.
