Gesture recognizer in deep learning
Deep learning has paved the way for numerous revolutionary techniques in the field of computer vision. With the possibility of making better models each day, it has become possible for us to apply such models in the domains of classification, recognition, and prediction. This is immensely useful in quite a lot of real-world applications that we'll see later on. Gesture recognition in images is one such example and precisely what we'll be targeting in this answer!
Gesture recognition
The process of recognizing what particular position our hand is in and what gesture it may indicate is known as gesture recognition. We can submit unlabelled images through gesture recognition applications, and a trained model can then predict what gesture the picture depicts.
MediaPipe and deep learning
MediaPipe is an open-source framework that provides various deep learning models that are trained to handle tasks like image classification, face and hand landmark detection, language detection, and more.
Gesture recognizer model
The model we will use for our application is a computer vision model from the MediaPipe framework. We can download the gesture_recognizer.task file from its official documentation. This task file serves as the trained model for our application, and we can use it directly to recognize gestures in new images.
Note: You can download the gesture_recognizer.task model from https://ai.google.dev/edge/mediapipe/solutions/vision/gesture_recognizer and reference it in your code.
Hand landmarks
A crucial first step in gesture recognition is detecting whether a hand exists in the image at all and, if so, identifying its coordinates. Hand landmarks are specific points on the hand used for tracking hand gestures.
We can then extract the hand landmarks, such as fingertips and palm center, to analyze and interpret various hand gestures accurately.
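To build intuition for how landmark coordinates can be interpreted, here is a minimal, self-contained sketch. The landmark indices follow MediaPipe's 21-point hand model (e.g., 8 is the index fingertip, 6 is the index PIP joint), but the coordinate values below are made up for illustration:

```python
# Indices follow MediaPipe's 21-point hand model:
# each finger's (tip, PIP joint) index pair, e.g. 8 = index fingertip, 6 = index PIP.
FINGERS = {"index": (8, 6), "middle": (12, 10), "ring": (16, 14), "pinky": (20, 18)}

def extended_fingers(landmarks):
    """Return fingers whose tip lies above its PIP joint.

    Landmarks are normalized (x, y) pairs; y grows downward in image
    coordinates, so a smaller y means "higher" in the image.
    """
    return [name for name, (tip, pip) in FINGERS.items()
            if landmarks[tip][1] < landmarks[pip][1]]

# Hypothetical landmarks for a hand with only the index finger raised.
sample = {8: (0.50, 0.20), 6: (0.50, 0.40),    # index: tip above PIP (extended)
          12: (0.55, 0.45), 10: (0.55, 0.35),  # middle: tip below PIP (curled)
          16: (0.60, 0.47), 14: (0.60, 0.37),  # ring: curled
          20: (0.65, 0.50), 18: (0.65, 0.40)}  # pinky: curled

print(extended_fingers(sample))  # ['index']
```

Real gesture models learn far richer patterns than this tip-above-joint heuristic, but the principle is the same: gestures are inferred from the relative positions of landmarks.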
Code walkthrough
```python
import cv2
import mediapipe as mp
from mediapipe.framework.formats import landmark_pb2
from mediapipe.tasks import python
from mediapipe.tasks.python import vision
```
The first step is to import the necessary libraries for our code.
- `cv2` is OpenCV's library, mainly useful for image processing tasks.
- `mediapipe` offers the particular gesture recognition model we require.
```python
img_file = "path/image.png"
img_to_process = cv2.imread(img_file)
```
`img_file` refers to the path of the image we will predict the gestures for. We read the image and store it in the `img_to_process` variable using OpenCV's `imread` method.
```python
hands = mp.solutions.hands.Hands(min_detection_confidence=0.5, min_tracking_confidence=0.5)
rgb_format_img = cv2.cvtColor(img_to_process, cv2.COLOR_BGR2RGB)
results = hands.process(rgb_format_img)
```
MediaPipe offers a solution that recognizes hands within an image and generates the respective landmarks, i.e., coordinates of various points within the hand. We save an instance of this solution in `hands` and specify a confidence level of at least 50% for detection and tracking. Since OpenCV loads images in BGR format while MediaPipe processes images in RGB format, we first make the necessary conversion using the `cv2.cvtColor` method. The `results` variable stores the final landmarks when the `hands` solution is applied to `rgb_format_img`.
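For a 3-channel image, the BGR-to-RGB conversion amounts to reversing the channel axis. A quick NumPy sketch makes this concrete without needing OpenCV (the tiny 1x2 "image" below is made up for illustration):

```python
import numpy as np

# A 1x2 BGR "image": first pixel pure blue (B=255), second pure red (R=255).
bgr = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)

# Reversing the last axis swaps the B and R channels, which is the
# effect of cv2.cvtColor(img, cv2.COLOR_BGR2RGB) on 3-channel images.
rgb = bgr[..., ::-1]

print(rgb[0, 0])  # [  0   0 255] -> in RGB order, this pixel is blue
```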
```python
hand_landmarks_list = []
if results.multi_hand_landmarks:
    for hand_landmarks in results.multi_hand_landmarks:
        hand_landmarks_protocol = landmark_pb2.NormalizedLandmarkList()
        hand_landmarks_protocol.landmark.extend([
            landmark_pb2.NormalizedLandmark(x=landmark.x, y=landmark.y, z=landmark.z)
            for landmark in hand_landmarks.landmark
        ])
        hand_landmarks_list.append(hand_landmarks_protocol)
```
The detected hand landmarks are represented as coordinates of various points within the hand. We extract and store these landmarks in `hand_landmarks_list` as a list of `NormalizedLandmarkList` objects from the `landmark_pb2` module.
```python
mp_drawing_styles = mp.solutions.drawing_styles
mp_drawing = mp.solutions.drawing_utils
mp_hands = mp.solutions.hands
```
We define objects from the MediaPipe `mp` module for drawing and styling the landmarks on the image.
```python
if hand_landmarks_list:
    copied_image = img_to_process.copy()
    for landmark in hand_landmarks_list:
        mp_drawing.draw_landmarks(
            copied_image,
            landmark,
            mp_hands.HAND_CONNECTIONS,
            mp_drawing_styles.get_default_hand_landmarks_style(),
            mp_drawing_styles.get_default_hand_connections_style()
        )
    base_options = python.BaseOptions(model_asset_path='gesture_recognizer.task')
    options = vision.GestureRecognizerOptions(base_options=base_options)
    recognizer = vision.GestureRecognizer.create_from_options(options)
    image = mp.Image.create_from_file(img_file)
    recognition_result = recognizer.recognize(image)
    top_gesture = recognition_result.gestures[0][0]
    gesture_prediction = f"{top_gesture.category_name} ({top_gesture.score:.2f})"
    cv2.putText(copied_image, gesture_prediction, (10, copied_image.shape[0] - 20), cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 0, 255), 2)
    cv2.imshow("Guess the gesture!", copied_image)
    cv2.waitKey(0)
    cv2.destroyAllWindows()
else:
    print("No hands were detected!")
```
If the `results` variable contains hand landmarks, we proceed with visualizing the landmarks on the input image:

- We create a copy of the input image, `copied_image`, to draw the landmarks on. We then use the `mp_drawing.draw_landmarks` function to draw the hand landmarks using hand connections and landmark styles.
- We specify the path to our model and the required options using `python.BaseOptions` and `vision.GestureRecognizerOptions`. Our model is referenced through "gesture_recognizer.task".
- We initialize a gesture recognition model using the `vision.GestureRecognizer` class and recognize the gesture based on the hand landmarks.
- The recognized gesture is stored in `gesture_prediction` and then drawn onto the image using `cv2.putText`, showing the gesture category name and its corresponding score.
- Finally, we display `copied_image` with the recognized gesture using `cv2.imshow`. The window remains open until a key is pressed.

If no hands are detected in the image, we print "No hands were detected!".
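For clarity, `recognition_result.gestures` is a list with one entry per detected hand, and each entry is a list of candidate categories sorted by descending score. The indexing and formatting used above can be sketched with plain tuples standing in for MediaPipe's result objects (the category names and scores below are made up):

```python
# Shape mirrors recognition_result.gestures: one inner list per detected hand,
# with candidates sorted by descending score.
gestures = [[("Thumb_Up", 0.93), ("Open_Palm", 0.04)]]

top_name, top_score = gestures[0][0]      # first hand, highest-scoring candidate
label = f"{top_name} ({top_score:.2f})"   # same format the walkthrough draws with cv2.putText
print(label)  # Thumb_Up (0.93)
```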
Executable code
Yay, we've completed our code walkthrough and can now see the code in action. You can edit the code window below and click "Run" to see the results.
```python
import cv2
import mediapipe as mp
from mediapipe.framework.formats import landmark_pb2
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

img_file = "image2.png"
img_to_process = cv2.imread(img_file)

hands = mp.solutions.hands.Hands(min_detection_confidence=0.5, min_tracking_confidence=0.5)
rgb_format_img = cv2.cvtColor(img_to_process, cv2.COLOR_BGR2RGB)
results = hands.process(rgb_format_img)

hand_landmarks_list = []
if results.multi_hand_landmarks:
    for hand_landmarks in results.multi_hand_landmarks:
        hand_landmarks_protocol = landmark_pb2.NormalizedLandmarkList()
        hand_landmarks_protocol.landmark.extend([
            landmark_pb2.NormalizedLandmark(x=landmark.x, y=landmark.y, z=landmark.z)
            for landmark in hand_landmarks.landmark
        ])
        hand_landmarks_list.append(hand_landmarks_protocol)

mp_drawing_styles = mp.solutions.drawing_styles
mp_drawing = mp.solutions.drawing_utils
mp_hands = mp.solutions.hands

if hand_landmarks_list:
    copied_image = img_to_process.copy()
    for landmark in hand_landmarks_list:
        mp_drawing.draw_landmarks(
            copied_image,
            landmark,
            mp_hands.HAND_CONNECTIONS,
            mp_drawing_styles.get_default_hand_landmarks_style(),
            mp_drawing_styles.get_default_hand_connections_style()
        )
    base_options = python.BaseOptions(model_asset_path='gesture_recognizer.task')
    options = vision.GestureRecognizerOptions(base_options=base_options)
    recognizer = vision.GestureRecognizer.create_from_options(options)
    image = mp.Image.create_from_file(img_file)
    recognition_result = recognizer.recognize(image)
    top_gesture = recognition_result.gestures[0][0]
    gesture_prediction = f"{top_gesture.category_name} ({top_gesture.score:.2f})"
    cv2.putText(copied_image, gesture_prediction, (10, copied_image.shape[0] - 20), cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 0, 255), 2)
    cv2.imshow("Guess the gesture!", copied_image)
    cv2.waitKey(0)
    cv2.destroyAllWindows()
else:
    print("No hands were detected!")
```

Gesture recognition demonstration
Real-life applications
A wonderful aspect of such technologies is that they are crucial to many revolutionary domains in real life. Let's see how gesture recognition is important around us!
| Use cases | Explanation |
| --- | --- |
| Human-computer interaction | Enables users to interact with computers, mobiles, or devices using hand gestures; used for gesture-based navigation and performing tasks |
| Gaming | Enhances gaming experiences by allowing players to control characters and actions; popular in motion-controlled games |
| Virtual reality | Enables users to interact with virtual environments using hand gestures; provides a natural and intuitive way to pick up objects, manipulate virtual elements, and navigate |
| Sign language interpretation | Converts sign language gestures into text or speech, aiding communication |
| Augmented reality | Allows users to interact with digital content overlaid on the real world using hand gestures |
| Assistive technology | Helps individuals with physical disabilities control devices |
What does the score in the model represent? The score is the model's confidence, a value between 0 and 1, that the detected hand matches the predicted gesture category.