Detour: The Perceptron Convergence Theorem

An Introduction to how Neurons can Learn

Dec 17, 2025

Introduction

In the last post, we started exploring neural nets, and in particular the architecture and function of Multi Layer Perceptrons (MLPs). I had said that the next post would be about how MLPs “learn” or get “trained”. However, we’re going to take a detour and jump back a step to consider a single neuron. Doing so will let us build our intuitions around more complicated learning later, while exploring a beautifully elegant piece of math, the Perceptron Convergence Theorem (PCT).

PCT states that a perceptron is guaranteed to find a solution classifying data into two separable parts in a finite number of step, if at least one solution exists. Working through and understanding the PCT will let us work through some linear algebra basics applicable to much more advanced machine learning concepts later. I will summarize the PCT based on my understanding from reading chapter 2 of Anil Ananthaswamy’s excellent Why Machines Learn. This is effectively a summary of the most technical parts of that chapter. We’ll cover:

The class of problem for which PCT applies
How a perceptron can learn and solve binary linear classification problems
A step-by-step proof for the PCT.

Let’s go!

Terminology and Problem Classes

Note that in the remainder of this post, we’re using the word “perceptron” slightly differently than how it was introduced in the last post. In the last post, “perceptron” was used in the modern context, where it means a layer of neurons. But the original use of the word and its use in PCT predates the concept of layers (single or multiple): instead, it refers to the ability of a single neuron to classify data and learn from its mistakes. So in this post, we’re using the term “perceptron” and “neuron” interchangeably. Since “Perceptron Convergence Theorem” literally has the word “perceptron” in the name, it’s best that we stick to using it here, even if the word means something different in most modern contexts. This means that the object of study for this post is not a network in the sense we learned about last time, but rather a single neuron. This neuron can take in arbitrary input and has only two possible outputs: 1 or -1.

The PCT applies for problems that are linearly separable in 2 parts. For a 2 dimensional problem, (i.e. where our input data has two properties), this means that we can draw a line through a plane that separates the data completely in two sections (hence, “linearly separable”). For a 3 dimensional problem where our input data has three properties, we can draw a plane through 3D space separating the data. And for an N dimensional problem containing input data with N properties, we can draw a hyperplane of N-1 size through the data.

We want our perceptron to take in some input, and tell us wether that input is One Class of Thing or Another Class of Thing. Each input is defined by a list of properties. We name these values x1, x2, x3, x4 … xn for n properties. Each input is therefore able to be arranged in a vector (since a vector is really a fancy name for a “list”, X (we’re using capital X to denote a vector here).

As we explored in the last post, the perceptron’s ability to classify each data point is fully determined by its weights and its bias. In particular, we want the weighted sum of all the vector’s components (and the single bias) to meaningfully tell us wether the thing represented by the vector is in One Class or The Other. Each value of X gets a weight for a total of n weights so an input with four properties would get w1, w2, w3 and w4. These too can be considered a vector, W.

We take the dot product of W with X. To get the dot product, we transpose W and then sum the product of each respective value from both vectors. The reason we need to transpose W is because we cannot take the dot product of two column vectors. More generally if one matrix is of dimensions m x n, then the other must have the dimension n x m. Normally this would be written as WTX with the ‘T’ in a superscript. In this post, we’re aiming to make our syntax as clear as possible at the potential expense of technical accuracy, so we’ll dispense with the T and write W.X, like so:

W.X = x1w1 + x2w2 + x3w3 + x4w4 + … xnwn

In the book, Ananthaswamy shows that one elegant way of treating the bias is to consider it a normal weight corresponding to an extra value of X, x0 - which always equals 1. Thus we have:

W.X = x0w0 + x1w1 + x2w2 + x3w3 + x4w4 + … xnwn , where x0 = 1

Next, let’s agree on some additional terms that we’ll use in the proof. We’ll label these L1 through L8 so that we can refer to them later.

L1: W - The weight vector we’ve discussed above. Initially, all values of W are set to 0.

L2: W* - This is an idealized weight vector that would solve the classification problem in each case when transposed with X. We don’t know the values of W*: it is what we are trying to get W to converge to through learning. Note that there are many possible values of W* (indeed, there can be infinitely many ways of separating our data correctly). PCT doesn’t tell us which version of W* we will arrive at, only that we will arrive at one of them. Since W* is a vector, it forms a hyperplane. The hyperplane that is orthogonal to W* is what will separate our data neatly into Cs and Ds.

L3: X - An input vector, described above.

L4: y -This is the output of the perceptron, either -1 or 1. These values represent C and D respectively. y is defined as follows:

y = {-1 if W.T <= 0, 1 if W.T > 0}

L5: γ - This is the greek letter “gamma”. γ is the distance between the linearly separating hyperplane orthogonal to W*, and the closest value of X for all inputs in the training data.

L6: Wnew - The value of W on any current update cycle

L7: Wold - The immediately previous value of W on any current update cycle

L8: M - The number of updates the algorithm takes for W to converge with W*. This is the value we are trying to prove is finite.

Proving the Perceptron Convergence Theorem

Our goal is to prove that there is a procedure which will update W such that it will always converge to some solution, W* in a finite number of steps.

We’ll describe that procedure next, and then we will prove that it works.

First, we normalize our input based on the magnitude of each vector.
1. We’re doing this because it makes the proof easier, but it isn’t strictly necessary.
2. To scale each input vector consistently, we’ll find the data point with the highest magnitude, and divide it by said magnitude. This makes it a unit vector of length 1 (i.e, the magnitude of W*)
3. Then, we divide every other vector by the magnitude found in step 1b. This means that every input vector will be of some length between 0 and 1.
4. From our terms, recall that L5 defines γ as the vector closest to the plane orthogonal to W* . Therefore, 0 < γ <= 1
Now, let’s define the procedure for updating W. Observe that if an input X is classified incorrectly, then the equation we defined in L4 will be wrong! And since the only terms of the equation are X and W, W must be wrong!
1. Here’s our equation again: y = {-1 if W.T <= 0, 1 if W.T > 0}
2. Notice that if we multiply y by W.T, we will get a positive number if and only if the input was sorted correctly
  1. If y is -1 and W.T is greater than 0, then y(W.T) will be negative
  2. if y is 1 and W.T is lesser than 0, then y(W.T) will also be negative
3. This negative value signals that W has some (or all) incorrect weights.
4. Then we update W as follows: take the old value of W, and add the value of yX to it: Wnew = Wold + yX
  1. if y is positive, then the linearly separating hyperplane is too restrictive: X needs to be added to W
  2. if y is negative, that means the hyperplane is too inclusive: X needs to be subtracted from W
Next, notice that as W gets closer to W*, their dot product W.W* will increase. This is because the closer two vectors are to pointing in the same direction, the larger they will be. W.W* will be largest when W = W*. This is in an important sense what the dot product measures: the degree to which two vectors point in the same direction.
1. But, getting W to approach W* isn’t the only way their dot product can increase. Additionally, W.W* increases if W grows in magnitude due to step 2d
2. So we want a way of ensuring that W.W* grows because of a change in W’s direction, not a change in its magnitude
3. We can check for the rate of growth of W’s magnitude by checking W.W . Since W always points in the same direction as itself, any change in W.W must be because of W’s magnitude!
4. Therefore, we need W.W* to grow at a faster rate than W.W . We want the lower bound of W.W*’s growth rate to be greater than the upper bound of W.W’s
We need to understand exactly how W.W* and W.W grow after each update. Let’s start with W.W*.
1. Start with Wnew.W*
2. Wnew.W* = Wold + yX.W* [substitute the definition of Wnew from step 2d]
3. = Wold.W* + yX.W* [expansion]
4. For any two vectors of the same dimension A and B, the dot products A.B = B.A
5. Therefore yX.W* = yW*.X [apply the second operand in step 4c to the general form in step 4d]
6. W* is by definition sorts correctly, so yW*.X > 0 [apply step 2b to step 4e]
7. Recall, W*.X is the length of the distance between X and the separating hyperplane
8. So yW*.X is greater than 0, but also greater than or equal to γ, since γ is defined as the shortest distance for all inputs.
9. This means that Wnew.W* must be greater than or equal to Wold.W* + γ
10. We’ve now set a lower bound on the growth of W.W*. It grows by at least the value of γ with every update!
Now that we’ve set a lower bound on the growth of W.W*, let’s turn our attention to W.W.
1. Start with Wnew.Wnew
2. Wnew.Wnew = (Wold + yX).(Wold + yX) [substitution from step 2d]
3. = (Wold + yX).Wold + (Wold + yX).yX [expansion]
4. = Wold.Wold + yX.Wold + yWold.X + y^2X.X [expansion again]
5. X.Wold = Wold.X [Same motion as step 4e]
6. Therefore Wnew.Wnew = Wold.Wold + 2yWold.X + y^2X.X
We now have two new terms, 2yWold.X and y^2X.X . The next set of steps helps us understand the role they play in the equation.
1. 2yWold.X <= 0 [because yWold.X <= 0 for any update, by definition]
2. y^2 = 1 [because y = 1 | -1]
3. X.X <= 1 [because the magnitude of X is less than or equal to 0 due to the normalization in step 1]
4. Applying steps 6a through 6c to the equation in step 5f, we have the inequality Wnew.Wnew <= Wold.Wold + X.X
5. Put another way, the difference between Wnew.Wnew and Wold.Wold must be less than or equal to 1
6. We’ve now set an upper bound on the growth of W.W: it grows by at most 1 with every update!
We’ve established the minimum and maximum bounds of growth on W.W* and W.W respectively. Now we can finally prove that the number of steps for W to converge to W* is necessarily finite.
1. W.W* and W.W each start at 0 [by definition]
2. W.W* grows by at least γ with every update [step 4j]
3. Therefore after M updates, W.W* grows by at least Mγ.
4. I.e. W.W* >= Mγ
5. Flipping this around, Mγ <= W.W*
6. Mγ <= ||W||W*||cosθ [i.e. the trigonometric definition of the dot product]
7. So Mγ <= ||W|| [because cosθ is less than or equal to 1, and ||W*|| is defined as 1]
Notice that we’ve reduced the relationship between M and W.W* to a relationship between M and W!
1. W.W <= M [step 6e says W.W grows by at most 1, but M iterates by exactly 1 with each update by definition]
2. ||W|| = √(W.W) [by definition]
3. Mγ <= √(W.W) [apply steps 7g with 8b]
4. Mγ <= √M [substation with step 8a]
5. (M^2)(γ^2) <= M [square both sides]
6. Mγ^2 <= 1 [divide both sides by M]
7. M <= 1/γ^2 [divide both sides by γ^2]

The number of updates M must be less than or equal to 1/γ^2. This means that M must be finite!

Harbinger Security

Discussion about this post

Ready for more?