Customer Service Assistant—Multimodal RAG Interface
Here is a recap of the customer service assistant we have been building:
Suppose you work at a shoe retail company that sells several brands of shoes. All the brand names and shoe details are available on the store’s website, but users have to find them by navigating the site manually. As a result, the company receives thousands of requests about shoe styles, details, prices, availability, and more. Handling all these queries manually is difficult because scrolling through the entire list of shoe data is time-consuming. Your goal is to automate this task.
You decide to implement a RAG system using Gemini to get accurate results from your own company data. RAG requires a knowledge base to provide relevant and accurate information. The data on the company website can’t be used directly as a knowledge base because it is scattered across pages, hard to navigate, and partly dynamic. The data needs to be structured before it can be used for retrieval. To do this, you scrape the website and store the data in a PDF file. The PDF contains all the details of each shoe, including the shoe name, style, release date, style code, original retail price, store location, and description, along with an image of the shoe.
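To make the target structure concrete, here is a minimal sketch of one knowledge-base record holding the fields listed above. The class name `ShoeRecord`, the field names, the `to_text` helper, and the sample values are all hypothetical and chosen only for illustration; the actual layout of the store PDF may differ.

```python
from dataclasses import dataclass


@dataclass
class ShoeRecord:
    """One shoe entry as it might appear in the PDF (hypothetical schema)."""
    name: str
    style: str
    release_date: str      # kept as text, since the PDF date format is unknown
    style_code: str
    retail_price: float    # original retail price
    store_location: str
    description: str
    image_path: str        # where the scraped shoe image is stored

    def to_text(self) -> str:
        """Flatten the textual fields into one string for text retrieval."""
        return (
            f"{self.name} | style: {self.style} | released: {self.release_date} | "
            f"code: {self.style_code} | price: ${self.retail_price:.2f} | "
            f"store: {self.store_location} | {self.description}"
        )


# Made-up sample values, not taken from the store data.
record = ShoeRecord(
    name="Example Runner",
    style="Running",
    release_date="2023-01-15",
    style_code="EX-001",
    retail_price=120.0,
    store_location="Downtown",
    description="Lightweight running shoe with a mesh upper.",
    image_path="images/ex-001.png",
)
print(record.to_text())
```

Keeping the image path next to the text fields means a text match and an image match can both be traced back to the same shoe.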
The RAG system will use this PDF, which contains all the relevant details of the shoes available in the store, and Gemini will generate responses to user queries. The LangChain toolkit will help structure the process.
Now, if a customer asks a question, the RAG system will extract the relevant information from the store PDF and generate precise answers to customer queries with the help of LangChain and Gemini, ensuring that customers get answers promptly.
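The retrieve-then-generate flow described above can be sketched in plain Python. This is a toy illustration only: word overlap stands in for embedding similarity, the chunks are made up, and the helper names `tokenize`, `retrieve`, and `build_prompt` are hypothetical. In the real assistant, LangChain would handle retrieval and the final prompt would be sent to Gemini.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks sharing the most words with the query.

    Word overlap is a toy stand-in for embedding similarity.
    """
    query_tokens = tokenize(query)
    ranked = sorted(
        chunks, key=lambda c: len(query_tokens & tokenize(c)), reverse=True
    )
    return ranked[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Combine the retrieved context and the question into one prompt string."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


# Made-up chunks standing in for text extracted from the store PDF.
chunks = [
    "Example Runner: running shoe, original retail price $120, Downtown store.",
    "Example Court: tennis shoe, original retail price $95, Mall store.",
]
question = "What is the price of the Example Runner running shoe?"
context = retrieve(question, chunks)
prompt = build_prompt(question, context)
print(prompt)  # in the real system, this prompt would be sent to Gemini
```

Grounding the prompt in retrieved chunks is what lets the model answer from store data rather than from its general training.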
We have implemented the following tasks in our course:
RAG with text prompts: Implemented a text-to-text retrieval system using LangChain and Gemini.
RAG with image prompts: Implemented an image-to-text retrieval system using LangChain and Gemini.
We’ll now implement multimodal retrieval, that is, RAG with both text and image prompts, for our customer service assistant and expose it through a Streamlit interface.
Implement a RAG system to retrieve and generate relevant textual information from text and image inputs.
Input data: Text and image
Generated response: Text
Combining both input types gives the retriever more signal than text or an image alone, which can lead to more accurate responses.
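One way to see why two input types can help is to score each catalog item against both the text query and the image query, then combine the scores. The sketch below uses a weighted average of cosine similarities. That fusion rule, the `fused_score` helper, and the hand-made two-dimensional vectors are assumptions for illustration, not necessarily how this project combines the modalities.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fused_score(text_q, image_q, item, text_weight=0.5):
    """Weighted average of text and image similarity for one catalog item."""
    text_sim = cosine(text_q, item["text_vec"])
    image_sim = cosine(image_q, item["image_vec"])
    return text_weight * text_sim + (1 - text_weight) * image_sim


# Hand-made 2-D vectors; real embeddings would come from the embedding models.
catalog = [
    {"name": "Example Runner", "text_vec": [1.0, 0.0], "image_vec": [1.0, 0.0]},
    {"name": "Example Court", "text_vec": [1.0, 0.0], "image_vec": [0.0, 1.0]},
]

# The text query matches both shoes equally; the image breaks the tie.
text_query = [1.0, 0.0]
image_query = [0.0, 1.0]

best = max(catalog, key=lambda item: fused_score(text_query, image_query, item))
print(best["name"])
```

Here the text alone cannot separate the two shoes, so the image similarity decides which item is retrieved.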
Here is the list of components that we’ll use for this project:
Streamlit for building the interface
LangChain as a toolkit for building the RAG chains
gemini-1.5-pro-latest for response generation
MobileNetV3-Small model for image embedding
chromadb for storing the data
By the end of this project, we’ll be able to build a system that takes both text and image inputs and uses them to improve the retrieval and generation of textual responses. This project will advance our understanding of multimodal AI applications built with the Gemini model and the RAG technique, and prepare us to integrate various data types for improved AI-driven responses.
Let’s begin!