
Introduction to Perceptron

Explore the perceptron's function as a binary classifier within supervised learning models. Understand how perceptrons process inputs, calculate weighted sums, and classify data, while examining their historical significance and why more advanced methods like multilayer perceptrons are necessary. This lesson helps you grasp the basics behind perceptron architecture and prepares you for deeper study in neural networks.

In the previous chapters, we designed a complex program with a lot of detail. It’s time to take a step back and enjoy the big picture.

In the first part of this course, we built a supervised learning system based on a specific architecture called the perceptron. By the end of this chapter, we’ll know what a perceptron looks like and what it can do. Moreover, we’ll also explore what it cannot do, and why we must move forward to more sophisticated algorithms such as neural networks.

We’ll also learn about the history of the perceptron. It won’t be a boring history lesson, but rather, this will explain a clash of ideas that impacted much of what we know about computers.

What is a perceptron?

To understand what a perceptron is, let’s look back at the binary classifier we discussed in the Getting Real chapter. That program sorted MNIST characters into two classes: “5” or “not a 5.” The following picture shows one way to understand it:

This diagram tracks an MNIST image through the system. The process begins with the input variables, from x_1 to x_n. In the case of MNIST, the input variables are the 784 pixels of an image. To those, we add a bias x_0, with a constant value of 1. We also color it a darker shade of gray to make it stand apart from the other input variables.

The next step, the yellow square, is the weighted sum of the input variables. It’s implemented as a multiplication of matrices, so it is marked with the “dot” sign.

The weighted sum flows through one last function: the light blue square. In general, this is called the activation function, and it can be different for different learning systems. In our system, we use a sigmoid. The output of the sigmoid is the predicted label ŷ, ranging from 0 to 1.

During training, the system compares the prediction ŷ with the ground truth to calculate the next step of gradient descent. During classification, it snaps the value of ŷ to one of its extremes, either 1 or 0, meaning “5” or “not a 5,” respectively.
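Putting those steps together, here is a minimal sketch of the pipeline in NumPy. The function names and the toy input values are made up for illustration; only the structure (weighted sum, sigmoid, snapping) comes from the description above:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(X, w):
    # Weighted sum of the inputs (a matrix multiplication), then sigmoid.
    return sigmoid(np.matmul(X, w))

def classify(X, w):
    # Snap the prediction to 1 ("a 5") or 0 ("not a 5").
    return np.round(forward(X, w))

# One toy example: bias x_0 = 1, followed by two input variables.
x = np.array([[1.0, 0.5, -0.5]])
w = np.array([[0.1], [2.0], [1.0]])
print(forward(x, w))   # a confidence between 0 and 1
print(classify(x, w))  # snapped to 0 or 1
```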

The architecture explained above is the perceptron, the original supervised learning system.

Note: Since the original perceptron training algorithm does not use gradient descent, the vanilla perceptron does not need an activation function with a smooth gradient such as the sigmoid. Instead, it can get away with a simple step function that snaps the value of the weighted sum to either 1 or 0, depending on whether it’s positive or negative.
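As a sketch, that step activation could look like this (the function name is made up, and sending a weighted sum of exactly zero to 0 is a convention that varies between formulations):

```python
import numpy as np

def step(z):
    # Snap positive weighted sums to 1 and the rest to 0 --
    # no smooth gradient, unlike the sigmoid.
    return (z > 0).astype(float)

z = np.array([0.6, -0.3, 0.0])
print(step(z))  # [1. 0. 0.]
```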

Perceptron assembly basics

The perceptron is a great building block for more complex systems. In fact, we have been assembling multiple perceptrons since the very beginning. Let’s unpack that idea.

During training, our system reads all the examples together, rather than reading one example at a time. In a way, that operation is like stacking multiple perceptrons, sending one example to each perceptron, and then collecting all the outputs into a matrix:
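Here is a small sketch of that idea, with toy shapes rather than MNIST’s. A single matrix multiplication computes the weighted sum for every example at once, like one stacked perceptron per row:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Four toy examples, each with a bias column plus two input variables.
X = np.array([[1.0,  0.2,  0.7],
              [1.0, -1.0,  0.5],
              [1.0,  0.0,  0.0],
              [1.0,  2.0, -1.0]])
w = np.array([[0.1], [1.0], [-0.5]])

# One matrix multiplication plays the role of four stacked perceptrons:
# one prediction per example.
y_hat = sigmoid(np.matmul(X, w))
print(y_hat.shape)  # (4, 1)
```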

In The Final Challenge, we also assembled perceptrons in a different way. A perceptron is a binary classifier: it classifies things as either 0 or 1. To classify ten digits, we used ten matrix columns, each dedicated to classifying one digit against all the others. Conceptually, that’s like using ten perceptrons in parallel, as shown here:

Each parallelized perceptron classifies one class, from 0 to 9. During classification, we pick the class that outputs the most confident prediction.
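A sketch of the parallel arrangement, with random weights standing in for trained ones (785 is an assumption based on MNIST: 784 pixels plus the bias):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(42)
W = rng.standard_normal((785, 10))  # one weight column per digit, 0 to 9
X = rng.standard_normal((3, 785))   # three stand-in examples

y_hat = sigmoid(np.matmul(X, W))    # (3, 10): ten confidences per example
labels = np.argmax(y_hat, axis=1)   # pick the most confident class
print(labels)
```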

So we can stack perceptrons, and we can parallelize perceptrons. In both cases, we did it with matrix operations, which is easier and faster than running the same classifier over and over: once per example, and then once per class.

One more way to combine perceptrons is to serialize them, using the output of one perceptron as input to the next. The result is called a multilayer perceptron. We have not used multilayer perceptrons yet, but we’ll use them in the next lessons. For now, let’s keep this idea at the back of our minds.
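As a preview, serializing two layers of perceptrons could be sketched like this. The shapes are made up, and the bias terms between layers are omitted for brevity; the point is only that the first layer’s outputs become the second layer’s inputs:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward_two_layers(X, w1, w2):
    # The output of the first perceptron layer feeds the second.
    hidden = sigmoid(np.matmul(X, w1))
    return sigmoid(np.matmul(hidden, w2))

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))   # five examples, four inputs each
w1 = rng.standard_normal((4, 3))  # first layer of perceptrons
w2 = rng.standard_normal((3, 1))  # second layer: one final perceptron
print(forward_two_layers(X, w1, w2).shape)  # (5, 1)
```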