Multimodal explainability refers to the ability of a system or model to explain its decisions or predictions when it processes information from multiple modalities or sources. In this context, a modality refers to any type of information, such as text, images, audio, video, or sensor data. Multimodal explainability becomes necessary when artificial intelligence (AI) systems or models are designed to make decisions or predictions from input data that is represented in several ways or comes from multiple sources. For example, in spoken-language tasks such as voice assistants, multimodal explainability can explain how the system combines textual and spoken inputs to produce responses or perform actions.
Note: If you’re interested in learning more about multimodal learning or modality, please refer to this What is multimodal deep learning? Answer.
The goal of multimodal explainability is to enhance transparency, accountability, and trust in AI systems, especially when they make complex decisions that involve multiple types of input data. It allows users, stakeholders, or regulatory authorities to understand why a system reached a particular decision, which is important for diagnosing errors, identifying biases, and ensuring ethical and responsible AI deployment.
Multimodal explainability provides meaningful explanations for decisions made by AI systems that process data from multiple modalities or sources. An outline of how multimodal explainability works is provided below:
Data integration: The first step toward multimodal explainability is merging data from different modalities. These modalities can include text, images, audio, video, sensor data, structured data, and any other type of information. The AI system compiles and analyzes this data with the goal of arriving at a decision or prediction (a minimal fusion sketch is shown after this outline).
A decision by the AI model: The AI model receives the integrated multimodal data as input and produces a decision or prediction. The model can be any predictive model, such as a neural network or another machine learning method. The decision could be a classification label, a recommendation, or a course of action.
Explanation generation:
Feature attribution: A popular approach to multimodal explainability is calculating the contribution of each modality or feature to the final decision. The impact of each modality can be estimated using a variety of strategies, including gradient-based methods and feature importance scores (see the ablation sketch after this outline).
Attention mechanisms: Attention mechanisms can indicate which portions of the input data the model concentrated on while generating the decision, and they can reveal the relative significance of the different modalities (see the softmax sketch after this outline).
Local explanations: Explanations are sometimes produced locally, meaning they apply to a particular instance or decision. Local explanation techniques aim to shed light on why the AI model made its decision for a specific input (see the perturbation sketch after this outline).
Visualization: Explanations are frequently presented in a format that is comprehensible to humans. To help users understand how the various modalities influenced the decision, this could involve interactive tools, textual descriptions, or visualizations such as heat maps, saliency maps, and text summaries (see the heat map sketch after this outline).
User interaction: Multimodal explainability should be user-centric. Users, such as domain experts and end users, can interact with the explanations to gain insights, confirm decisions, or spot potential problems. User feedback can be very helpful for improving both the explanations and the underlying AI model.
Evaluation and validation: Multimodal explainability approaches are often evaluated against three criteria: utility, interpretability, and accuracy. Users can assess how well the explanations match their expectations and domain knowledge, and the accuracy of feature attribution or attention mechanisms can be measured objectively (see the deletion-test sketch after this outline).
Improvement iteration: Multimodal explainability is an iterative process. As user feedback is gathered and the AI models are refined, the explanations can be adjusted to better fulfill their intended purpose. This iterative feedback loop improves the overall performance of the explainability system.
Ethical considerations: Ethical issues such as fairness and privacy should be taken into account at every stage to make sure the explanations don't reinforce biases or compromise private data across modalities.
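To make the steps above more concrete, the short Python sketches below illustrate a few of them. For the data integration step, assuming each modality has already been converted into a numeric feature vector (the vectors below are made-up placeholders rather than outputs of real encoders), early fusion can be as simple as concatenating the per-modality features:

import numpy as np

# Made-up feature vectors standing in for real modality encoders
text_features = np.array([0.2, 0.8, 0.1])    # e.g., scores from a text encoder
image_features = np.array([0.6, 0.4])        # e.g., brightness and contrast
sensor_features = np.array([0.9])            # e.g., a normalized sensor reading

# Early fusion: concatenate all modality features into a single vector
fused_features = np.concatenate([text_features, image_features, sensor_features])
print(fused_features)  # [0.2 0.8 0.1 0.6 0.4 0.9]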
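For feature attribution, one simple (and deliberately rough) strategy is ablation: neutralize one modality at a time and measure how much the model's output changes. The model_fn below is a hypothetical stand-in for any fused model:

import numpy as np

def model_fn(text_feats, image_feats):
    # Hypothetical fused model: a fixed weighted average, for illustration only
    return 0.7 * np.mean(text_feats) + 0.3 * np.mean(image_feats)

text_feats = np.array([0.9, 0.8])
image_feats = np.array([0.2, 0.4])
baseline = model_fn(text_feats, image_feats)

# Ablate each modality (replace it with zeros) and record the change in output
text_contribution = baseline - model_fn(np.zeros_like(text_feats), image_feats)
image_contribution = baseline - model_fn(text_feats, np.zeros_like(image_feats))

print("Text contribution:", round(text_contribution, 3))   # larger value => bigger impact
print("Image contribution:", round(image_contribution, 3))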
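For attention, a simplified sketch (not the attention mechanism of any particular architecture) applies a softmax to per-modality relevance scores; the resulting weights indicate how strongly each modality influences the fused representation. The scores below are invented for illustration:

import numpy as np

# Invented relevance scores that an attention scorer might assign to each modality
modality_scores = {"text": 2.0, "image": 0.5, "audio": 1.0}

scores = np.array(list(modality_scores.values()))
weights = np.exp(scores) / np.sum(np.exp(scores))  # softmax over modalities

for modality, weight in zip(modality_scores, weights):
    print(f"{modality}: attention weight = {weight:.2f}")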
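For local explanations, a minimal perturbation-based sketch removes one word at a time from a single text input and reports which removal flips a toy keyword classifier's prediction for that specific instance:

# Toy keyword-based text classifier (the same idea as the example later in this answer)
text_classifier = lambda text: 1 if "positive" in text else 0

instance = "This is a positive review."
original_prediction = text_classifier(instance)

# Remove one word at a time and check whether the prediction changes
words = instance.split()
for i, word in enumerate(words):
    perturbed = " ".join(words[:i] + words[i + 1:])
    if text_classifier(perturbed) != original_prediction:
        print(f"Removing '{word}' flips the prediction -> locally important")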
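For visualization, the sketch below (assuming matplotlib is installed) renders invented pixel-level saliency scores for a small image as a heat map; real saliency scores would come from an attribution method:

import numpy as np
import matplotlib.pyplot as plt

# Invented saliency scores for a 5x5 image, for illustration only
saliency = np.array([[0.0, 0.0, 0.0, 1.0, 1.0],
                     [0.0, 0.0, 0.0, 1.0, 1.0],
                     [0.0, 0.0, 0.0, 0.0, 0.0],
                     [0.0, 0.0, 0.0, 0.0, 0.0],
                     [0.0, 0.0, 0.0, 0.0, 0.0]])

plt.imshow(saliency, cmap="hot")      # bright cells mark influential pixels
plt.colorbar(label="Saliency")
plt.title("Toy saliency heat map")
plt.show()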
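For evaluation, one common objective check on feature attributions is a deletion-style test: remove the highest-attributed features first and confirm that the model's output drops quickly. The toy model and attribution scores below are made up for illustration:

import numpy as np

def toy_model(features):
    # Hypothetical model output: a fixed weighted sum, for illustration only
    weights = np.array([0.1, 0.6, 0.2, 0.1])
    return float(np.dot(weights, features))

features = np.array([1.0, 1.0, 1.0, 1.0])
attributions = np.array([0.1, 0.6, 0.2, 0.1])  # attribution scores being evaluated

# Delete features in order of decreasing attribution and track the model output
order = np.argsort(-attributions)
remaining = features.copy()
for idx in order:
    remaining[idx] = 0.0
    print(f"Removed feature {idx}, model output = {toy_model(remaining):.2f}")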
Putting the pieces together in a single toy example, the code below defines two classifiers: one for text and one for images. It checks whether the text contains the word "positive" and whether the image is bright or dark, and then prints the corresponding explanations:
import numpy as np

# Example text data and classifier
text_data = "This is a positive review."
text_classifier = lambda text: 1 if "positive" in text else 0

# Example image data and classifier
image_data = np.array([[0, 0, 0, 255, 255],
                       [0, 0, 0, 255, 255],
                       [0, 0, 0, 0, 0],
                       [0, 0, 0, 0, 0],
                       [0, 0, 0, 0, 0]])
image_classifier = lambda image: 1 if np.mean(image) > 100 else 0

# Function to provide explanations
def explain_multimodal(text_data, image_data):
    # Text explanation
    text_explanation = "Contains the word 'positive'" if text_classifier(text_data) else "Does not contain the word 'positive'"
    # Image explanation
    image_explanation = "Bright image" if image_classifier(image_data) else "Dark image"
    return {
        "text_explanation": text_explanation,
        "image_explanation": image_explanation,
    }

# Get multimodal explanation
explanation = explain_multimodal(text_data, image_data)

# Print the explanations
print("Text Explanation:", explanation["text_explanation"])
print("Image Explanation:", explanation["image_explanation"])
A line-by-line explanation of the above code is provided here:
Line 4: The text_data variable contains the text "This is a positive review."
Line 5: The text_classifier lambda function classifies the text as positive (1) if it contains the word "positive" and negative (0) otherwise.
Lines 8–12: The image_data variable contains a simple grayscale image represented as a 2D numpy array.
Line 13: The image_classifier lambda function classifies the image:
If the mean pixel intensity is greater than 100, it classifies the image as bright (returns 1).
Otherwise, it classifies the image as dark (returns 0).
Lines 16–20: The explain_multimodal function takes both the text and image data as input.
For the text modality, it checks whether the text contains the word "positive" using the text_classifier and generates an explanation based on the result.
For the image modality, it classifies the image based on its mean pixel intensity using the image_classifier and generates an explanation based on the result.
Lines 21–24: The function returns a dictionary with two explanations: one for the text modality ("text_explanation") and one for the image modality ("image_explanation").
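When this code is run, the text contains the word "positive" and the image's mean pixel intensity is 40.8 (below 100), so the output is:

Text Explanation: Contains the word 'positive'
Image Explanation: Dark image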
Multimodal explainability also comes with some challenges:
It might be difficult to explain how the various modalities contribute to a decision because each modality may have unique characteristics and significance.
It may be necessary to carefully analyze how changes in one modality impact the decision because several modalities may be interdependent.
When several modalities are combined, the resulting high-dimensional data can make it difficult to find relevant features and interactions.
For nonexperts, the explanations offered should be clear and practical.