Applying the learning algorithm to a single-layer perceptron

Explore the math of applying the learning algorithm to a single-layer perceptron

Daniel Patrick

Sep 09, 2024

Gradio app for the visualization of the learning algorithm for a single-layer perceptron

This article is part of a series:

The math behind a single-layer perceptron by representing the OR operator
Applying and visualizing the learning algorithm of a single-layer perceptron by representing an OR operator

This is the perceptron algorithm we used in the first article with a static bias b:

{\displaystyle f(\mathbf {x} )=h(\mathbf {w} \cdot \mathbf {x} +b)}

To apply the learning algorithm, we need to treat the bias b as an additional weight.

In this way we include the bias b in the learning algorithm and we don’t have to assign a value manually.

Since w ⋅ x + b = (w, b) ⋅ (x, 1) we can treat the bias b just as an additional weight.

So instead of only two weights and two inputs, we add an additional weight that represents b and an additional input that is always 1.

Now, we have the following equation:

{\displaystyle f(\mathbf {x} )=h(\mathbf {w} \cdot \mathbf {x})}

Where:

\mathbf{w} = \begin{bmatrix} b \\ w_1 \\ w_2 \end{bmatrix}

x is now:

\displaystyle \mathbf{x} = \begin{bmatrix} 1 \\ x_1 \\ x_2 \end{bmatrix}

The first input x0 will always be 1, so w0 is now effectively the bias b.
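To make the identity w ⋅ x + b = (w, b) ⋅ (x, 1) concrete, here is a minimal sketch in plain Python. The numeric values and the helper name `dot` are assumptions chosen for illustration; they do not come from the article.

```python
def dot(u, v):
    """Dot product of two equally long sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

# Illustrative values only; they are not taken from the article.
w = [0.5, -0.25]   # weights w1, w2
b = 0.75           # bias
x = [1.0, 0.0]     # inputs x1, x2

# Original form: weighted sum plus a separate bias term.
with_bias = dot(w, x) + b

# Augmented form: prepend b to the weights and a constant 1 to the inputs.
w_aug = [b] + w
x_aug = [1.0] + x
augmented = dot(w_aug, x_aug)

print(with_bias, augmented)  # both expressions give the same value
```

Because the bias now lives inside the weight vector, any rule that updates the weights updates the bias as well.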

The training data

In order to train our single-layer perceptron we need to understand the structure of our training data.

D is the training set of s samples:

\displaystyle D = \{(\mathbf{x}_1, d_1), \dots, (\mathbf{x}_s, d_s) \}

Where:

\mathbf{x}_j \text{ is the } n\text{-dimensional input vector}

\displaystyle d_{j} \text{ is the desired output}

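To access the value of the i-th feature of the j-th training input vector:

\displaystyle x_{j,i} \text{ is the value of the } i\text{-th feature of the } j\text{-th training input vector}

The first feature (index 0) of the j-th training input vector is always 1:

\displaystyle x_{j,0} = 1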

Again, because of that, w0 effectively is the bias that we use instead of the constant bias b.

The weights w are indexed by i, where the i-th weight belongs to the i-th input:

\displaystyle w_{i} \text{ is the } i\text{-th value in the weight vector}

To show the time-dependence we will use:

w_{i}(t) \text{ is the weight } i \text{ at time } t.

Knowing the structure of the training data D, we can now assign the inputs x and the desired outputs d that represent the OR operator:

\displaystyle \mathbf{x}_1 = [0, 0] \text{ and } d_1 = 0

\displaystyle \mathbf{x}_2 = [0, 1] \text{ and } d_2 = 1

\displaystyle \mathbf{x}_3 = [1, 0] \text{ and } d_3 = 1

\displaystyle \mathbf{x}_4 = [1, 1] \text{ and } d_4 = 1

So our training data D is:

D = \{ (\mathbf{x}_1, d_1), (\mathbf{x}_2, d_2), (\mathbf{x}_3, d_3), (\mathbf{x}_4, d_4) \}

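and, with the constant first feature written out:

\displaystyle D = \{ ([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1) \}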

I will initialize the weights with the following values (they could also be initialized with random values):

\mathbf{w} = \left[ \begin{array}{r} 1 \\ -1 \\ -1 \end{array} \right]
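As a sketch of how this setup could look in code, the training set can be stored as a list of (inputs, desired output) pairs, with the constant 1 prepended to each input vector. The variable names `or_samples`, `D` and `w` are my own choices for illustration, not necessarily those used in the Gradio app.

```python
# The four OR samples as (inputs, desired output) pairs.
or_samples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]

# Prepend the constant first feature 1 so that w[0] can act as the bias.
D = [([1] + x, d) for x, d in or_samples]

# Initial weights [w_0, w_1, w_2] as chosen in the text.
w = [1, -1, -1]

for x, d in D:
    print(x, d)
```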

Calculating the error

The learning algorithm has two steps that are repeated either for a predetermined number of iterations or until the error γ falls below a user-specified threshold. The error γ is calculated with this formula:

{\displaystyle \gamma={\frac {1}{s}}\sum _{j=1}^{s}|d_{j}-y_{j}(t)|}

Let’s say we have the following inputs and outputs where y is the output of the perceptron at the given time t:

\displaystyle y_1(t) = 0 \text{ for } \mathbf{x}_1 = [0, 0]

\displaystyle y_2(t) = 1 \text{ for } \mathbf{x}_2 = [0, 1]

\displaystyle y_3(t) = 1 \text{ for } \mathbf{x}_3 = [1, 0]

\displaystyle y_4(t) = 0 \text{ for } \mathbf{x}_4 = [1, 1]

Obviously, the fourth output is wrong and the others are correct.

When we plug the values into the error function, we get the following value for γ:

\gamma = \frac{1}{4} \left( |0 - 0| + |1 - 1| + |1 - 1| + |1 - 0| \right) = 0.25

The error γ (gamma) is the fraction of samples that were predicted incorrectly, so it can be read as a percentage.

In this case it is one quarter of the sum of the absolute differences between the desired outputs d and the predicted outputs y.

That means we have an error rate of 25%. In our case we want γ to be 0. So we would need to repeat the learning algorithm until γ = 0.
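The error formula translates almost directly into code. The following is a minimal sketch; the function name `error_rate` is an assumption for illustration.

```python
def error_rate(desired, predicted):
    """Mean absolute difference between desired and predicted outputs."""
    return sum(abs(d - y) for d, y in zip(desired, predicted)) / len(desired)

d = [0, 1, 1, 1]  # desired OR outputs
y = [0, 1, 1, 0]  # outputs from the example above; only the fourth is wrong

print(error_rate(d, y))  # 0.25
```

Since every output is either 0 or 1, each absolute difference is 0 for a correct prediction and 1 for a wrong one, which is why the mean equals the share of misclassified samples.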

Steps of the learning algorithm

For each example in the training set D we need to perform the following two steps:
1. Calculate the output:
  ${\begin{aligned}y_{j}(t)&=h[\mathbf {w} (t)\cdot \mathbf {x} _{j}]\\&=h[w_{0}(t)x_{j,0}+w_{1}(t)x_{j,1}+w_{2}(t)x_{j,2}+\dotsb +w_{n}(t)x_{j,n}]\end{aligned}}$
2. Update the weights:
  $w_{i}(t+1)=w_{i}(t)\;{{+}}\;r\cdot (d_{j}-y_{j}(t))x_{j,i}$
Calculate the iteration error:
$\gamma = {\frac {1}{s}}\sum _{j=1}^{s}|d_{j}-y_{j}(t)|$
When the error γ > 0 go to step 1.

Keep in mind that 0 is the error threshold for our use case. A different threshold can be specified by the user or by the requirements of the application.
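The steps above can be sketched as a small training loop. This is an illustration under my own assumptions, not the code of the Gradio app: the names `heaviside`, `train` and `max_epochs` are hypothetical, and the learning rate is passed as a `Fraction` so that sums that land exactly on 0 are not disturbed by floating-point rounding.

```python
from fractions import Fraction

def heaviside(z):
    """Step activation: 1 for z >= 0, otherwise 0."""
    return 1 if z >= 0 else 0

def train(samples, weights, r, max_epochs=1000):
    """Repeat the two steps per sample until the epoch error is 0.

    Returns the final weights and the list of per-epoch errors.
    """
    w = list(weights)
    history = []
    for _ in range(max_epochs):
        abs_errors = 0
        for x, d in samples:
            # Step 1: calculate the output with the current weights.
            y = heaviside(sum(wi * xi for wi, xi in zip(w, x)))
            # Step 2: update every weight; a correct output leaves w unchanged.
            w = [wi + r * (d - y) * xi for wi, xi in zip(w, x)]
            abs_errors += abs(d - y)
        # Epoch error: share of samples that were misclassified.
        gamma = Fraction(abs_errors, len(samples))
        history.append(gamma)
        if gamma == 0:
            break
    return w, history

# OR training set with the constant first feature 1.
D = [([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]
w, history = train(D, [1, -1, -1], Fraction(1, 10))
print([float(wi) for wi in w], len(history))
```

Note that the error of an epoch is collected from the outputs produced while the weights are still changing, which matches how the error is counted in this article.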

So, knowing all of this we can finally start to do the math for the first epoch.

The calculation

Let’s summarize what we have.

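These are the features and desired outputs of the OR operator:

\displaystyle \mathbf{x}_1 = [0, 0],\ d_1 = 0 \qquad \mathbf{x}_2 = [0, 1],\ d_2 = 1 \qquad \mathbf{x}_3 = [1, 0],\ d_3 = 1 \qquad \mathbf{x}_4 = [1, 1],\ d_4 = 1

We also know that the first feature of our training vectors is always 1:

\displaystyle x_{j,0} = 1

So this is our training set D:

\displaystyle D = \{ ([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1) \}

Our weights w:

\mathbf{w} = \left[ \begin{array}{r} 1 \\ -1 \\ -1 \end{array} \right]

The learning rate r, which I will set to 0.1:

\displaystyle r = 0.1

And we start with time t = 0:

\displaystyle t = 0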

We will also use the Heaviside step function as the activation function:

\displaystyle h(x) := \begin{cases} 1, & \text{if } x \geq 0 \\ 0, & \text{if } x < 0 \end{cases}
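In code, the step function is a one-liner. The sketch below uses the hypothetical name `heaviside`. The important detail is the boundary case: an input of exactly 0 yields 1.

```python
def heaviside(z):
    """Return 1 for z >= 0 and 0 otherwise, matching h(x) above."""
    return 1 if z >= 0 else 0

print([heaviside(z) for z in (-0.1, 0, 1)])  # [0, 1, 1]
```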

Now we have all the information gathered that is required.

We can start to calculate:

First sample of the training set:
1. Calculate the output for the first sample:
  ${\begin{aligned} y_{1}(t) &= h\left[1 \cdot 1 + (-1) \cdot 0 + (-1) \cdot 0\right] \\ &= h[1] = 1 \end{aligned}}$
2. Update the weights, since the output is not the desired output:
  ${\begin{aligned} w_{0}(t+1) &= 1 + 0.1 \cdot (0 - 1) \cdot 1 = 1 - 0.1 = 0.9 \\ w_{1}(t+1) &= -1 + 0.1 \cdot (0 - 1) \cdot 0 = -1 \\ w_{2}(t+1) &= -1 + 0.1 \cdot (0 - 1) \cdot 0 = -1 \end{aligned}}$
  Updated weights are w = [0.9, -1, -1].
Second sample:
1. Calculate output:
  ${\begin{aligned} y_{2}(t) &= h\left(0.9 \cdot 1 + (-1) \cdot 0 + (-1) \cdot 1\right) \\ &= h(-0.1) = 0 \end{aligned}}$
2. Update the weights, since the output is not the desired output:
  ${\begin{aligned} w_{0}(t+1) &= 0.9 + 0.1 \cdot (1 - 0) \cdot 1 = 0.9 + 0.1 = 1.0 \\ w_{1}(t+1) &= -1 + 0.1 \cdot (1 - 0) \cdot 0 = -1 \\ w_{2}(t+1) &= -1 + 0.1 \cdot (1 - 0) \cdot 1 = -1 + 0.1 = -0.9 \end{aligned}}$
  Updated weights are w = [1.0, -1, -0.9].
Third sample:
1. Calculate output:
  ${\begin{aligned} y_{3}(t) &= h\left(1.0 \cdot 1 + (-1) \cdot 1 + (-0.9) \cdot 0\right) \\ &= h\left(1.0 - 1 + 0\right) \\ &= h(0) = 1 \end{aligned}}$
2. Nothing changes, the output is the desired output.
  
  Therefore we don’t need to update the weights. We could apply the update rule anyway, but since the difference between the desired and the actual output is 0, it would leave the weights unchanged.
Fourth sample:
1. Calculate output:
  ${\begin{aligned} y_{4}(t) &= h\left(1.0 \cdot 1 + (-1) \cdot 1 + (-0.9) \cdot 1\right) \\ &= h\left(1.0 - 1 - 0.9\right) \\ &= h(-0.9) = 0 \end{aligned}}$
2. Update the weights, since the output is not the desired output:
  ${\begin{aligned} w_{0}(t+1) &= 1.0 + 0.1 \cdot (1 - 0) \cdot 1 = 1.0 + 0.1 = 1.1 \\ w_{1}(t+1) &= -1 + 0.1 \cdot (1 - 0) \cdot 1 = -1 + 0.1 = -0.9 \\ w_{2}(t+1) &= -0.9 + 0.1 \cdot (1 - 0) \cdot 1 = -0.9 + 0.1 = -0.8 \end{aligned}}$
  Updated weights are w = [1.1, -0.9, -0.8].
Calculate the error:
${\begin{aligned} \gamma &= \frac{1}{4} \left(|0 - 1| + |1 - 0| + |1 - 1| + |1 - 0|\right) \\ &= \frac{1}{4} \times 3 \\ &= 0.75 \end{aligned}}$
The error is 75%, because the first, second, and fourth samples were misclassified, and the first epoch is calculated. These steps have to be repeated until the error is 0%.
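The hand calculation of the first epoch can be checked with a short script. This is a sketch under my own assumptions: it uses `Fraction` instead of floats so that the third sample hits h(0) exactly, and the names `heaviside`, `outputs` and `gamma` are hypothetical. It prints the output and the weights after each sample, followed by the epoch error counted from those outputs.

```python
from fractions import Fraction

def heaviside(z):
    """Step activation: 1 for z >= 0, otherwise 0."""
    return 1 if z >= 0 else 0

# OR training set with the constant first feature 1.
D = [([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]
w = [Fraction(1), Fraction(-1), Fraction(-1)]  # initial weights
r = Fraction(1, 10)                            # learning rate 0.1

outputs = []
for x, d in D:
    # Step 1: output for this sample with the current weights.
    y = heaviside(sum(wi * xi for wi, xi in zip(w, x)))
    # Step 2: weight update (no change when y equals d).
    w = [wi + r * (d - y) * xi for wi, xi in zip(w, x)]
    outputs.append(y)
    print(x, "y =", y, "w =", [float(wi) for wi in w])

# Epoch error: mean absolute difference between desired and actual outputs.
gamma = Fraction(sum(abs(d - y) for (_, d), y in zip(D, outputs)), len(D))
print("error:", float(gamma))
```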

Usually these steps are carried out by a computer, and that is exactly what we will do for the next epochs.

I built a Gradio application with Python that does the work and also visualizes some metrics for us.

Visualization

Here is the visualization of the final decision boundary, the error rate over epochs, and the weight history.

Graphs: Decision border. Error rate over epochs. Weight change over epochs.

Try the Gradio App

Try the app here: Gradio App for the visualization of the learning algorithm of a single-layer perceptron

A screenshot of the app where you can see the weight change over the epochs, the error rate over the epochs, and the final decision border of the perceptron:
