Adding Image-to-Text Capabilities with Gemini
Explore how to integrate Google's Gemini for image-to-text capabilities in Python-based chatbots. Learn to configure the API, handle multimodal inputs with Gradio, and combine Gemini with LLMs like Llama to create interactive, multimodal chatbot experiences that process both text and images effectively.
Gemini is a popular family of multimodal models built by Google. It can take input across many data modalities, such as text, images, charts, PDFs, video, and audio. For our use case, we are particularly interested in Gemini’s image-processing capabilities. A simple example would be generating HTML code from a screenshot of a web page. This will greatly enhance our educational chatbot’s capabilities. Let’s begin!
Google AI Studio is a web-based tool designed to prototype and experiment with the Gemini AI models. The AI Studio can be a great place to get started with Gemini, but most importantly, the Studio also allows us to generate an API key that can be used to access Gemini using code.
Creating a Gemini API key
Let’s quickly walk through the API key creation process. Head over to the AI Studio and log in. Then, follow the slides below:
Now that the API key is created, we can go ahead and start using Gemini. For Python, we will also need to install the google-generativeai library. This can be done with the code below:
pip install google-generativeai
Once again, the library has already been set up for the widgets in this course. Installations are not needed.
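Before sending any requests, it is worth confirming that the key you created is actually visible to Python. A small sketch (the helper name `key_status` is ours, not part of the course code) that checks the `GEMINI_API_KEY` environment variable:

```python
import os

def key_status() -> str:
    """Report whether the Gemini API key is visible to this process."""
    key = os.environ.get("GEMINI_API_KEY")
    if key:
        # Avoid printing the key itself; its length is enough for a sanity check.
        return f"Key found ({len(key)} characters)."
    return "GEMINI_API_KEY is not set; export it before creating the client."

print(key_status())
```

If the key is missing, export it in your shell (for example, `export GEMINI_API_KEY=...` on Linux/macOS) before running the client code below.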
The AI Studio also provides a “Get code” button that can be used to get the Python code to send a request to the model. We have copied the code from the AI Studio into the widget below.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

# Create the model
generation_config = {
    "temperature": 1,
    "top_p": 0.95,
    "top_k": 64,
    "max_output_tokens": 8192,
    "response_mime_type": "text/plain",
}

model = genai.GenerativeModel(
    model_name="gemini-2.5-pro",
    generation_config=generation_config,
)

chat_session = model.start_chat(
    history=[
    ]
)

response = chat_session.send_message("Hello!")
print(response.text)

Let’s review the code:
- Lines 1–2: We import the `os` module and the `google.generativeai` library, which we use to interact with Google’s generative AI API.
- Line 4: We configure the generative AI client using the API key stored in the environment variable `GEMINI_API_KEY`. This grants access to the generative AI models.
- Lines 7–13: We define a dictionary named `generation_config` that specifies optional parameters for generating the response. These parameters control aspects like:
  - Temperature: Controls randomness (1 is a balanced setting).
  - Top P: Samples from the most likely tokens (0.95 means a high-probability focus).
  - Top K: Considers the top K most likely next tokens (64 provides some diversity).
  - Max output tokens: Limits the length of the generated text (8192 sets a maximum of 8192 tokens, i.e., words or sub-words).
  - Response MIME type: Sets the output format (`text/plain` indicates plain text).
- Lines 15–18: We create a `GenerativeModel` object named `model` by specifying the model name `gemini-2.5-pro` and the generation configuration we defined earlier.
- Lines 20–23: We initiate a chat session with the model using the `start_chat`...
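The code above only sends text, but the same model can also accept images, which is what this lesson is building toward. A minimal sketch of the image-to-HTML use case mentioned earlier (the file name `page.png` and the helper names are our own illustrative choices; `generate_content` accepts a mixed list of text and `PIL` images):

```python
import os

def build_prompt() -> str:
    """The text instruction sent alongside the screenshot."""
    return (
        "Generate clean, semantic HTML that reproduces the layout "
        "shown in this screenshot of a web page."
    )

def image_to_html(image_path: str) -> str:
    """Send an image plus a prompt to Gemini and return the model's reply."""
    # Imported here so the prompt helper above can be used without the SDK installed.
    import google.generativeai as genai
    from PIL import Image

    genai.configure(api_key=os.environ["GEMINI_API_KEY"])
    model = genai.GenerativeModel("gemini-2.5-pro")
    # generate_content takes a list mixing text and PIL.Image objects.
    response = model.generate_content([build_prompt(), Image.open(image_path)])
    return response.text

if __name__ == "__main__" and os.environ.get("GEMINI_API_KEY"):
    print(image_to_html("page.png"))
```

The guard at the bottom keeps the script from failing when no API key is configured; in the course widgets, the key is already set for you.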