Detour: The Perceptron Convergence Theorem
An Introduction to how Neurons can Learn
Introduction
In the last post, we started exploring neural nets, and in particular the architecture and function of Multi Layer Perceptrons (MLPs). I had said that the next post would be about how MLPs “learn” or get “trained”. However, we’re going to take a detour and jump back a step to consider a single neuron. Doing so will let us build our intuitions around more complicated learning later, while exploring a beautifully elegant piece of math, the Perceptron Convergence Theorem (PCT).
PCT states that a perceptron is guaranteed to find a solution classifying data into two separable parts in a finite number of step, if at least one solution exists. Working through and understanding the PCT will let us work through some linear algebra basics applicable to much more advanced machine learning concepts later. I will summarize the PCT based on my understanding from reading chapter 2 of Anil Ananthaswamy’s excellent Why Machines Learn. This is effectively a summary of the most technical parts of that chapter. We’ll cover:
The class of problem for which PCT applies
How a perceptron can learn and solve binary linear classification problems
A step-by-step proof for the PCT.
Let’s go!
Terminology and Problem Classes
Note that in the remainder of this post, we’re using the word “perceptron” slightly differently than how it was introduced in the last post. In the last post, “perceptron” was used in the modern context, where it means a layer of neurons. But the original use of the word and its use in PCT predates the concept of layers (single or multiple): instead, it refers to the ability of a single neuron to classify data and learn from its mistakes. So in this post, we’re using the term “perceptron” and “neuron” interchangeably. Since “Perceptron Convergence Theorem” literally has the word “perceptron” in the name, it’s best that we stick to using it here, even if the word means something different in most modern contexts. This means that the object of study for this post is not a network in the sense we learned about last time, but rather a single neuron. This neuron can take in arbitrary input and has only two possible outputs: 1 or -1.
The PCT applies for problems that are linearly separable in 2 parts. For a 2 dimensional problem, (i.e. where our input data has two properties), this means that we can draw a line through a plane that separates the data completely in two sections (hence, “linearly separable”). For a 3 dimensional problem where our input data has three properties, we can draw a plane through 3D space separating the data. And for an N dimensional problem containing input data with N properties, we can draw a hyperplane of N-1 size through the data.
We want our perceptron to take in some input, and tell us wether that input is One Class of Thing or Another Class of Thing. Each input is defined by a list of properties. We name these values x1, x2, x3, x4 … xn for n properties. Each input is therefore able to be arranged in a vector (since a vector is really a fancy name for a “list”, X (we’re using capital X to denote a vector here).
As we explored in the last post, the perceptron’s ability to classify each data point is fully determined by its weights and its bias. In particular, we want the weighted sum of all the vector’s components (and the single bias) to meaningfully tell us wether the thing represented by the vector is in One Class or The Other. Each value of X gets a weight for a total of n weights so an input with four properties would get w1, w2, w3 and w4. These too can be considered a vector, W.
We take the dot product of W with X. To get the dot product, we transpose W and then sum the product of each respective value from both vectors. The reason we need to transpose W is because we cannot take the dot product of two column vectors. More generally if one matrix is of dimensions m x n, then the other must have the dimension n x m. Normally this would be written as WTX with the ‘T’ in a superscript. In this post, we’re aiming to make our syntax as clear as possible at the potential expense of technical accuracy, so we’ll dispense with the T and write W.X, like so:
W.X = x1w1 + x2w2 + x3w3 + x4w4 + … xnwn
In the book, Ananthaswamy shows that one elegant way of treating the bias is to consider it a normal weight corresponding to an extra value of X, x0 - which always equals 1. Thus we have:
W.X = x0w0 + x1w1 + x2w2 + x3w3 + x4w4 + … xnwn , where x0 = 1
Next, let’s agree on some additional terms that we’ll use in the proof. We’ll label these L1 through L8 so that we can refer to them later.
L1: W - The weight vector we’ve discussed above. Initially, all values of W are set to 0.
L2: W* - This is an idealized weight vector that would solve the classification problem in each case when transposed with X. We don’t know the values of W*: it is what we are trying to get W to converge to through learning. Note that there are many possible values of W* (indeed, there can be infinitely many ways of separating our data correctly). PCT doesn’t tell us which version of W* we will arrive at, only that we will arrive at one of them. Since W* is a vector, it forms a hyperplane. The hyperplane that is orthogonal to W* is what will separate our data neatly into Cs and Ds.
L3: X - An input vector, described above.
L4: y -This is the output of the perceptron, either -1 or 1. These values represent C and D respectively. y is defined as follows:
y = {-1 if W.T <= 0, 1 if W.T > 0}
L5: γ - This is the greek letter “gamma”. γ is the distance between the linearly separating hyperplane orthogonal to W*, and the closest value of X for all inputs in the training data.
L6: Wnew - The value of W on any current update cycle
L7: Wold - The immediately previous value of W on any current update cycle
L8: M - The number of updates the algorithm takes for W to converge with W*. This is the value we are trying to prove is finite.
Proving the Perceptron Convergence Theorem
Our goal is to prove that there is a procedure which will update W such that it will always converge to some solution, W* in a finite number of steps.
We’ll describe that procedure next, and then we will prove that it works.
First, we normalize our input based on the magnitude of each vector.
We’re doing this because it makes the proof easier, but it isn’t strictly necessary.
To scale each input vector consistently, we’ll find the data point with the highest magnitude, and divide it by said magnitude. This makes it a unit vector of length
1(i.e, the magnitude ofW*)Then, we divide every other vector by the magnitude found in step 1b. This means that every input vector will be of some length between
0and1.From our terms, recall that
L5definesγas the vector closest to the plane orthogonal toW*. Therefore,0 < γ <= 1
Now, let’s define the procedure for updating
W. Observe that if an inputXis classified incorrectly, then the equation we defined inL4will be wrong! And since the only terms of the equation areXandW,Wmust be wrong!Here’s our equation again:
y = {-1 if W.T <= 0, 1 if W.T > 0}Notice that if we multiply
ybyW.T, we will get a positive number if and only if the input was sorted correctlyIf
yis-1andW.Tis greater than0, theny(W.T)will be negativeif y is
1andW.Tis lesser than 0, theny(W.T)will also be negative
This negative value signals that
Whas some (or all) incorrect weights.Then we update W as follows: take the old value of W, and add the value of yX to it:
Wnew = Wold + yXif
yis positive, then the linearly separating hyperplane is too restrictive:Xneeds to be added toWif
yis negative, that means the hyperplane is too inclusive:Xneeds to be subtracted fromW
Next, notice that as
Wgets closer toW*, their dot productW.W*will increase. This is because the closer two vectors are to pointing in the same direction, the larger they will be.W.W*will be largest whenW = W*. This is in an important sense what the dot product measures: the degree to which two vectors point in the same direction.But, getting
Wto approachW*isn’t the only way their dot product can increase. Additionally,W.W*increases ifWgrows in magnitude due to step 2dSo we want a way of ensuring that
W.W*grows because of a change inW’s direction, not a change in its magnitudeWe can check for the rate of growth of
W’s magnitude by checkingW.W. SinceWalways points in the same direction as itself, any change inW.Wmust be because ofW’s magnitude!Therefore, we need
W.W*to grow at a faster rate thanW.W. We want the lower bound ofW.W*’s growth rate to be greater than the upper bound ofW.W’s
We need to understand exactly how
W.W*andW.Wgrow after each update. Let’s start withW.W*.Start with
Wnew.W*Wnew.W* = Wold + yX.W*[substitute the definition ofWnewfrom step 2d]= Wold.W* + yX.W*[expansion]For any two vectors of the same dimension
AandB,the dot productsA.B = B.ATherefore
yX.W* = yW*.X[apply the second operand in step 4c to the general form in step 4d]W*is by definition sorts correctly, soyW*.X > 0[apply step 2b to step 4e]Recall,
W*.Xis the length of the distance betweenXand the separating hyperplaneSo
yW*.Xis greater than0, but also greater than or equal toγ, sinceγis defined as the shortest distance for all inputs.This means that
Wnew.W*must be greater than or equal toWold.W* + γWe’ve now set a lower bound on the growth of
W.W*. It grows by at least the value ofγwith every update!
Now that we’ve set a lower bound on the growth of
W.W*, let’s turn our attention to W.W.Start with
Wnew.WnewWnew.Wnew = (Wold + yX).(Wold + yX)[substitution from step 2d]= (Wold + yX).Wold + (Wold + yX).yX[expansion]= Wold.Wold + yX.Wold + yWold.X + y^2X.X[expansion again]X.Wold = Wold.X[Same motion as step 4e]Therefore
Wnew.Wnew = Wold.Wold + 2yWold.X + y^2X.X
We now have two new terms,
2yWold.Xandy^2X.X. The next set of steps helps us understand the role they play in the equation.2yWold.X <= 0[becauseyWold.X <= 0for any update, by definition]y^2 = 1[becausey = 1 | -1]X.X <= 1[because the magnitude ofXis less than or equal to0due to the normalization in step 1]Applying steps 6a through 6c to the equation in step 5f, we have the inequality
Wnew.Wnew <=Wold.Wold + X.XPut another way, the difference between
Wnew.WnewandWold.Woldmust be less than or equal to1We’ve now set an upper bound on the growth of
W.W: it grows by at most1with every update!
We’ve established the minimum and maximum bounds of growth on
W.W*andW.Wrespectively. Now we can finally prove that the number of steps forWto converge toW*is necessarily finite.W.W*andW.Weach start at0[by definition]W.W*grows by at leastγwith every update [step 4j]Therefore after
Mupdates,W.W*grows by at leastMγ.I.e.
W.W* >= MγFlipping this around,
Mγ <= W.W*Mγ <= ||W||W*||cosθ[i.e. the trigonometric definition of the dot product]So
Mγ <= ||W||[becausecosθis less than or equal to1, and||W*||is defined as1]
Notice that we’ve reduced the relationship between
MandW.W*to a relationship betweenMandW!W.W <= M[step 6e saysW.Wgrows by at most1, butMiterates by exactly1with each update by definition]||W|| = √(W.W)[by definition]Mγ <= √(W.W)[apply steps 7g with 8b]Mγ <= √M[substation with step 8a](M^2)(γ^2) <= M[square both sides]Mγ^2 <= 1[divide both sides byM]M <= 1/γ^2[divide both sides byγ^2]
The number of updates M must be less than or equal to 1/γ^2. This means that M must be finite!

