Computer vision involves extracting information from visual data and allows us to perform complex tasks such as classification, prediction, recognition, and much more! In this Answer, we'll look at how to detect text in media using Tesseract, a classic application of optical character recognition.
Optical character recognition (OCR) is a technology that enables machines to interpret images of text and convert them into machine-readable formats, unlocking the information held in printed or handwritten text.
Simply put, the goal of OCR is to take characters as humans perceive them and convert them into machine-encoded text.
The concept of optical character recognition is used in text detection, where we aim to identify and recognize the text found within an image or a video. We will look into its implementation and applications shortly.
In this Answer, we will perform text detection using a Python library named Tesseract. Tesseract is an open-source OCR engine developed by Google that allows the conversion of text in media to machine-encoded text and is known to be efficient and accurate.
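Note that `pytesseract` is only a wrapper: the Tesseract engine itself must be installed separately (for example, via `apt install tesseract-ocr` on Debian/Ubuntu or `brew install tesseract` on macOS; these package names are assumptions for those platforms). A quick sketch for checking that the engine is available:

```python
import shutil

# pytesseract shells out to the `tesseract` binary, so it must be on your PATH.
# If it lives elsewhere, you can point pytesseract at it explicitly via
# pytesseract.pytesseract.tesseract_cmd = "/path/to/tesseract" (hypothetical path).
path = shutil.which("tesseract")
if path:
    print(f"Tesseract engine found at: {path}")
else:
    print("Tesseract engine not found; install it before using pytesseract.")
```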
We'll learn how to load an image, detect its text, and visualize it.
```python
from pytesseract import *
import cv2
```
Let's start by importing the necessary modules.
- `cv2` is used for image processing.
- `pytesseract` is used for text detection.
```python
def process_image(image_path):
    img = cv2.imread(image_path)
    rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    save_text = image_to_data(rgb, output_type=Output.DICT)
```
The `process_image` function takes an image path as input and performs text detection on the image. It reads the image using `cv2.imread`, converts it to RGB format using `cv2.cvtColor`, and then uses `image_to_data` from `pytesseract` to extract the text data as a dictionary. The dictionary format lets us handle multiple words, each with its own position and confidence.
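To make the dictionary layout concrete, here is a small hand-made stand-in for what `image_to_data` returns with `Output.DICT` (the values are illustrative, not real Tesseract output): parallel lists, where index `i` describes the `i`-th detected text block.

```python
# Illustrative stand-in for image_to_data(..., output_type=Output.DICT):
# parallel lists, one entry per detected block (all values are made up).
save_text = {
    "left":   [10, 120],
    "top":    [40, 40],
    "width":  [100, 80],
    "height": [20, 20],
    "text":   ["Hello", "world"],
    "conf":   ["96", "88"],  # confidences may arrive as strings, hence int(...)
}

for i in range(len(save_text["text"])):
    x, y = save_text["left"][i], save_text["top"][i]
    w, h = save_text["width"][i], save_text["height"][i]
    print(f'"{save_text["text"][i]}" at ({x}, {y}) size {w}x{h}, conf {int(save_text["conf"][i])}')
```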
```python
    for i in range(0, len(save_text["text"])):
        x = save_text["left"][i]
        y = save_text["top"][i]
        w = save_text["width"][i]
        h = save_text["height"][i]
        text = save_text["text"][i]
        confidence_level = int(save_text["conf"][i])
```
We then create a loop to iterate over each detected text block in the image.
Next, we extract the bounding box coordinates, i.e., (x, y), and dimensions, i.e., (w, h), from the `save_text` dictionary. Along with that, we get the detected text and its confidence level for each text block.
```python
        if confidence_level > 75:
            cv2.rectangle(img, (x, y), (x + w, y + h), (0, 0, 0), 2)
            (text_width, text_height), _ = cv2.getTextSize(text, cv2.FONT_HERSHEY_SIMPLEX, 0.6, 1)
            cv2.rectangle(img, (x, y - text_height - 5), (x + text_width, y), (255, 255, 255), -1)
            cv2.putText(img, text, (x, y - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 0, 0), 1)
```
Using the `if` statement, the code filters out low-confidence text. If `confidence_level` is greater than 75 (a threshold you can adjust), we draw a rectangle around the detected text using `cv2.rectangle` and write the text above it with `cv2.putText`, in black on a white background.
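The thresholding step can be sketched on its own with illustrative values. As an aside, Tesseract typically reports a confidence of -1 for non-text rows (such as block or line separators), which a positive threshold also discards:

```python
# Confidence filtering sketch; the detections below are made-up examples.
detections = [("Hello", 96), ("w0rld", 60), ("", -1)]
threshold = 75  # raise for fewer false positives, lower to keep more text

kept = [text for text, conf in detections if conf > threshold]
print(kept)  # → ['Hello']
```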
```python
    return img
```
Finally, we return the image with the detected text and bounding boxes drawn on it.
```python
if __name__ == "__main__":
    input_image_path = 'text7.png'
    processed_image = process_image(input_image_path)
    cv2.imshow("Image", processed_image)
    cv2.waitKey(0)
```
Finally, we create our `main` block. The `process_image` function is called with the image path of our choice, and the processed image is displayed using `cv2.imshow`. The window displaying the image is kept open until any key is pressed, i.e., `cv2.waitKey(0)`.
Putting all the code together now, we can detect texts in images effectively.
```python
from pytesseract import *
import cv2

def process_image(image_path):
    img = cv2.imread(image_path)
    rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    save_text = image_to_data(rgb, output_type=Output.DICT)

    for i in range(0, len(save_text["text"])):
        x = save_text["left"][i]
        y = save_text["top"][i]
        w = save_text["width"][i]
        h = save_text["height"][i]
        text = save_text["text"][i]
        confidence_level = int(save_text["conf"][i])

        if confidence_level > 75:
            cv2.rectangle(img, (x, y), (x + w, y + h), (0, 0, 0), 2)
            (text_width, text_height), _ = cv2.getTextSize(text, cv2.FONT_HERSHEY_SIMPLEX, 0.6, 1)
            cv2.rectangle(img, (x, y - text_height - 5), (x + text_width, y), (255, 255, 255), -1)
            cv2.putText(img, text, (x, y - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 0, 0), 1)

    return img

if __name__ == "__main__":
    input_image_path = 'sample_img.png'
    processed_image = process_image(input_image_path)
    cv2.imshow("Image", processed_image)
    cv2.waitKey(0)
```
Let's take a look at the output of the above code below. We can see how a box is drawn around the text, and the detected text is written above it.
If you want to copy the detected text, you can also print it to the terminal.
```python
from pytesseract import *
import cv2

def process_image(image_path):
    img = cv2.imread(image_path)
    rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    save_text = image_to_data(rgb, output_type=Output.DICT)

    for i in range(0, len(save_text["text"])):
        x = save_text["left"][i]
        y = save_text["top"][i]
        w = save_text["width"][i]
        h = save_text["height"][i]
        text = save_text["text"][i]
        confidence_level = int(save_text["conf"][i])

        if confidence_level > 75:
            cv2.rectangle(img, (x, y), (x + w, y + h), (0, 0, 0), 2)
            (text_width, text_height), _ = cv2.getTextSize(text, cv2.FONT_HERSHEY_SIMPLEX, 0.6, 1)
            cv2.rectangle(img, (x, y - text_height - 5), (x + text_width, y), (255, 255, 255), -1)
            cv2.putText(img, text, (x, y - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 0, 0), 1)
            print(f"Confidence: {confidence_level}")
            print(f"Text: {text}\n")

    return img

if __name__ == "__main__":
    input_image_path = 'sample_img.png'
    processed_image = process_image(input_image_path)
    cv2.imshow("Image", processed_image)
    cv2.waitKey(0)
```
This is how the text is shown on the terminal, along with the confidence levels.
Using the same logic, we can even detect text in videos. This is achieved by breaking the video down frame by frame and applying Tesseract detection to each frame. Due to motion between frames, this may be less accurate than detecting text in still images.
```python
from pytesseract import *
import cv2

def process_image(image):
    rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    save_text = image_to_data(rgb, output_type=Output.DICT)

    for i in range(0, len(save_text["text"])):
        x = save_text["left"][i]
        y = save_text["top"][i]
        w = save_text["width"][i]
        h = save_text["height"][i]
        text = save_text["text"][i]
        confidence_level = int(save_text["conf"][i])

        if confidence_level > 75:
            cv2.rectangle(image, (x, y), (x + w, y + h), (0, 0, 0), 2)
            (text_width, text_height), _ = cv2.getTextSize(text, cv2.FONT_HERSHEY_SIMPLEX, 0.6, 1)
            cv2.rectangle(image, (x, y - text_height - 5), (x + text_width, y), (255, 255, 255), -1)
            cv2.putText(image, text, (x, y - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 0, 0), 1)

    return image

def process_video(video_path):
    video = cv2.VideoCapture(video_path)
    while video.isOpened():
        ret, frame = video.read()
        if not ret:
            break
        processed_frame = process_image(frame)
        cv2.imshow('Video', processed_frame)
        if cv2.waitKey(1) == 27:
            break
    video.release()
    cv2.destroyAllWindows()

if __name__ == "__main__":
    input_video_url = 'https://player.vimeo.com/external/581763177.sd.mp4?s=7c0e1dbf0a173ca1c9c3ac37a05c2498f905ad11&profile_id=165&oauth2_token_id=57447761'
    process_video(input_video_url)
```
Let's see how the text is detected frame by frame for our video. You can replace the URL and try it out on your videos!
Test your knowledge of text detection!
What does the `image_to_data` function do in Tesseract?