Introduction to Perceptron
Explore the perceptron's function as a binary classifier within supervised learning models. Understand how perceptrons process inputs, calculate weighted sums, and classify data, while examining their historical significance and why more advanced methods like multilayer perceptrons are necessary. This lesson helps you grasp the basics behind perceptron architecture and prepares you for deeper study in neural networks.
In the previous chapters, we designed a complex program with a lot of detail. It’s time to take a step back and enjoy the big picture.
In the first part of this course, we built a supervised learning system based on a specific architecture called the perceptron. By the end of this chapter, we’ll know what a perceptron looks like and what it can do. Moreover, we’ll also explore what it cannot do, and why we must move forward to more sophisticated algorithms such as neural networks.
We’ll also learn about the history of the perceptron. It won’t be a boring history lesson, but rather, this will explain a clash of ideas that impacted much of what we know about computers.
What is a perceptron?
To understand what a perceptron is, let’s look back at the binary classifier we discussed in Getting Real. That program sorted MNIST characters into two classes: a specific digit, or “not” that digit. The following picture shows one way to understand it:
This diagram tracks an MNIST image through the system. The process begins with the input variables. In the case of MNIST, the input variables are the pixels of an image. To those, we add a bias, with a constant value of 1. In the diagram, we color the bias with a darker shade of gray to make it stand apart from the other input variables.
The next step, the yellow square, is the weighted sum of the input variables. It’s implemented as a multiplication of matrices, so it is marked with the “dot” sign.
The weighted sum flows through one last function—the light blue square. In general, this is called the activation function, and it can be different for different learning systems. In our system, we use a sigmoid. The output of the sigmoid is the predicted label, ranging from 0 to 1.
During training, the system compares the prediction with the ground truth to calculate the next step of gradient descent. During classification, it snaps the prediction to one of its extremes, either 1 or 0, meaning “the digit” or “not the digit,” respectively.
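The forward pass described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the course’s actual code: the names `forward`, `classify`, `x`, and `w` are made up, and the weights and inputs are arbitrary toy values.

```python
import numpy as np

def sigmoid(z):
    # Squashes the weighted sum into the range (0, 1)
    return 1 / (1 + np.exp(-z))

def forward(x, w):
    # The weighted sum of the inputs, passed through the sigmoid
    return sigmoid(np.dot(x, w))

def classify(x, w):
    # During classification, snap the prediction to 0 or 1
    return np.round(forward(x, w))

# One toy example: a bias of 1, followed by two "pixel" values
x = np.array([1.0, 0.2, 0.7])
w = np.array([-0.5, 1.0, 2.0])

print(forward(x, w))    # a value between 0 and 1
print(classify(x, w))   # either 0.0 or 1.0
```

Note how the bias is handled: it is simply prepended to the inputs as a constant 1, so its weight is learned like any other weight.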
The architecture explained above is the perceptron, the original supervised learning system.
Note: Since the vanilla perceptron—the original, single-layer perceptron—does not use gradient descent, it does not need an activation function with a smooth gradient such as the sigmoid. Instead, it can get away with a simple step function that snaps the value of the weighted sum to either 0 or 1, depending on whether it’s positive or negative.
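The step function mentioned in the note can be sketched like this (the name `step` is illustrative; in this sketch, a weighted sum of exactly zero maps to 1, an arbitrary tie-breaking choice):

```python
import numpy as np

def step(z):
    # Snap the weighted sum to 1 if non-negative, 0 otherwise.
    # Unlike the sigmoid, this function has no useful gradient.
    return np.where(z >= 0, 1, 0)

print(step(np.array([-2.0, 0.0, 3.5])))  # [0 1 1]
```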
Perceptron assembly basics
The perceptron is a great building block for more complex systems. In fact, we have been assembling multiple perceptrons since the very beginning. Let’s understand this concept.
During training, our system reads all the examples together, rather than reading one example at a time. In a way, that operation is like stacking multiple perceptrons, sending one example to each perceptron, and then collecting all the outputs into a matrix:
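This “stacking” is exactly what a single matrix multiplication gives us. Here is a sketch with made-up shapes and values: each row of `X` is one example (bias first), and one `X @ w` computes every example’s prediction at once.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Three examples stacked as rows; the first column is the bias
X = np.array([[1.0, 0.2, 0.7],   # example 1
              [1.0, 0.9, 0.1],   # example 2
              [1.0, 0.4, 0.4]])  # example 3
w = np.array([-0.5, 1.0, 2.0])   # one shared set of weights

# One matrix multiplication = one "perceptron" per row of X
y_hat = sigmoid(X @ w)
print(y_hat.shape)  # (3,) — one prediction per example
```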
In The Final Challenge, we also assembled perceptrons in a different way. A perceptron is a binary classifier: it classifies things as either 0 or 1. We used ten matrix columns to classify ten digits, each dedicated to classifying one digit against all the others. Conceptually, that’s like using ten perceptrons in parallel, as shown here:
Each parallelized perceptron classifies one class, from 0 to 9. During classification, we pick the class that outputs the most confident prediction.
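Parallelizing works the same way, except the weights become a matrix with one column per class. The sketch below uses random weights and inputs purely for illustration; picking the most confident prediction is an `argmax` over the class scores.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.random((5, 785))          # 5 examples, 784 pixels + bias
W = rng.random((785, 10)) - 0.5   # one weight column per class 0..9

scores = sigmoid(X @ W)           # shape (5, 10): one score per class
labels = np.argmax(scores, axis=1)  # most confident class per example
print(labels.shape)               # (5,)
```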
So we stack perceptrons and we parallelize perceptrons. In both cases we did it with matrix operations, which is easier and faster than running the same classifier multiple times: once per example when stacking, and once per class when parallelizing.
One more way to combine perceptrons is to serialize them, using the output of one perceptron as input to the next. The result is called a multilayer perceptron. We have not used multilayer perceptrons yet, but we’ll use them in the next lessons. For now, let’s keep this idea at the back of our minds.
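To make the serialization idea concrete, here is a minimal sketch of that wiring, with made-up layer sizes and random weights: the sigmoid outputs of one layer of perceptrons become the inputs of the next. (This is only the forward pass; training a multilayer perceptron comes later.)

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.random(4)        # one input example (4 illustrative features)
W1 = rng.random((4, 3))  # first layer: 4 inputs -> 3 perceptrons
W2 = rng.random((3, 1))  # second layer: 3 inputs -> 1 perceptron

hidden = sigmoid(x @ W1)      # outputs of the first layer...
y_hat = sigmoid(hidden @ W2)  # ...become inputs to the next
print(y_hat)                  # a single value between 0 and 1
```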