Neural network
with learning by backward error propagation
Colin Fahey

A biological neural network
1. Software
2. Introduction
This document describes how to implement an artificial neural network that is capable of being trained to recognize patterns.
This document describes a model of a neural network that learns by an algorithm that uses "backward error propagation".
This document includes basic demonstrations of learning by "backward error propagation". This document has a link to computer code. The computer code includes the demonstrations. The computer code can be used to create complex neural networks. However, the computer code is only for demonstration purposes. An alternative implementation could reduce memory usage and could increase speed.
3. Alternative to learning by backward error propagation
This document describes a model of a neural network which learns by an algorithm named "backward error propagation". This algorithm can require a very long time to learn various lessons. Also, this algorithm can randomly fail to learn various lessons due to the random initial status of the neural network before training.
Learning by "associating active inputs" is an important alternative to learning by "backward error propagation". Learning by associating active inputs simply associates inputs that are simultaneously active. Such learning can be fast and reliable. However, for many practical purposes, there is no obvious way to use a neural network that learns by association, whereas there is an obvious way to use a network that learns by backward error propagation.
Some biological neural networks are known to learn by association of active inputs. Backward error propagation has not been observed in any biological neural network.
This document describes interesting uses for a neural network that learns by backward error propagation. However, learning by association is a very important alternative algorithm for learning. Designing a neural network that learns by association to solve a particular problem might be more difficult than designing an alternative neural network that learns by backward error propagation, but biological systems learn by association, and the learning ability of biological systems is evident.
4. Biological neuron
4.1 Neuron cell

A biological neuron ("multipolar" type, ~4 µm cell body)
A neuron is a type of cell that has the ability to receive and transmit nerve signals.
Neurons are the basis of nervous systems, which are found in animals such as mammals, birds, fish, and insects.
Brains with memory and logic, and simple reflex systems, are both based on arrangements of neurons.
Neurons are also used to convey signals over long distances in a creature's body, such as from sensors to the brain, or from the brain to muscles.
The behavior of a biological neuron is very complex, but the following simplified description captures the basic principle:
The neuron accumulates signals received from other neurons, and if the total signal accumulation exceeds a threshold, the neuron transmits its own signals to other neurons.
4.2 Neuron parts

Parts of a biological neuron
| Soma | The cell body of a neuron |
| Dendrites | Fibers with chemical receptors (inputs) that extend from the cell body of a neuron. A neuron typically has many dendrites, and dendrites often have many branches. |
| Axon | A fiber with chemical emitters (outputs) at its endpoint that extends from the cell body of the neuron. A neuron has a single axon, and the axon usually has very few branches. |
| Synapse | A configuration such that the axon of one neuron and the dendrites of another neuron are separated by a very small gap. In such a configuration, chemicals emitted by an axon of a neuron cross the synapse and are received by the dendrites of the other neuron. This is how neurons influence other neurons. |
4.3 Neuron firing
A neuron accumulates chemical signals from its dendrites, and if the total chemical accumulation exceeds a threshold within a period of time, the neuron "fires", sending its own signal through its axon.
Some neurons are capable of firing pulses on the order of 100 Hz.
The signals passing through neurons involve accumulations of sodium (Na+), potassium (K+), and chloride (Cl-) ions, and a resulting electrochemical potential (i.e., voltage).
The resting voltage (-70 mV) and firing voltage (+30 mV) can be measured or even influenced by conventional electrical circuitry.
The following is a voltage recording of a rat neuron firing at a rate of roughly 100 Hz when a single whisker is touched and held out of its resting position:

A rat neuron firing (100 Hz) due to holding a whisker.
Although the stimulus is constant, the neuron signal is a rapid series of pulses.
4.4 Neural network
The human brain has approximately 10^11 (100 billion) neurons.
Each neuron in the cerebellum receives input from as many as 10^4 (10000) synapses.
Although the axon and dendrites of a neuron often extend only a few micrometers away from the cell body, some axons are on the order of a meter in length.
A brain has neurons with relatively short axons grouped in areas or clusters.
A brain also has bundles of neurons with relatively long axons to link areas separated by centimeters.
Thus a hierarchical network of processing elements is formed.
4.5 Neural network status
The status of a network of neurons is both the way the neurons are connected and the signals at all of the synapses.
It is unclear how much status information would be lost if a brain were tranquilized into total inactivity for any amount of time.
One can imagine information sustained only by signals moving through the network, and not by the network connectivity itself, as in cellular automata simulations such as Conway's "Game of Life", simple Dynamic Random Access Memory (DRAM) chips, and echoes in a chamber.
4.6 Neural network learning
Conventional learning occurs when the properties of dendrites change at a synapse to become more or less efficient at receiving chemical signals from an axon.
The reasons for such changes are complicated, but the result is that a neuron requires a different combination of synapse inputs to trigger an output signal.
5. Artificial neuron
5.1 Definition
An "artificial neuron" is an algorithm or a physical device that implements a mathematical model inspired by the basic behavior of a biological neuron.
A neuron accumulates signals received from other neurons or inputs (e.g., sensors), and if the total signal accumulation exceeds a threshold, the neuron transmits a signal to other neurons or outputs (e.g., effectors).
Any mathematical model that incorporates the idea of accumulating multiple inputs and yielding a single output (one that accentuates the intensity of the accumulated input relative to some nominal level) can be used for pattern recognition.
Such models can be the basis of an artificial neuron.
If the influence of each input can be modified, then the model can support learning.
5.2 Activation function
An "activation function" is a mathematical function that converts input values below a particular value to a relatively low output value, and converts input values above a particular value to a relatively high output value.
An "activation function" is used to convert the weighted sum of input values of a neuron to a value that represents the output of the neuron.
A "sigmoid" function is a general class of smooth functions that asymptotically approach a lower limit for input values approaching negative infinity, and asymptotically approach an upper limit for input values approaching positive infinity.
One specific sigmoid function is the "logistic sigmoid" function:

The "Logistic Sigmoid" function: 1 / ( 1 + Exp( -x ) )
The "logistic sigmoid" function can be used as an "activation function" for a mathematical model of a neuron.
The mathematical derivative of the "logistic sigmoid" can be expressed in terms of the output value of the function itself, as Output * (1 - Output), making it easy to compute an associated learning formula.
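The logistic sigmoid and its derivative can be sketched in a few lines. The class and method names below are hypothetical and are not taken from the computer code that accompanies this document; the derivative is written in terms of the output value y of the function, y * (1 - y).

```csharp
using System;

// Hypothetical helper class, for illustration only.
public static class LogisticSigmoidSketch
{
    // The logistic sigmoid: 1 / ( 1 + Exp( -x ) ).
    public static double Logistic(double x)
    {
        return 1.0 / (1.0 + Math.Exp(-x));
    }

    // Derivative of the logistic sigmoid, expressed using the
    // output value y = Logistic(x) instead of the input x.
    public static double LogisticDerivativeFromOutput(double y)
    {
        return y * (1.0 - y);
    }

    public static void Main()
    {
        foreach (double x in new[] { -4.0, 0.0, 4.0 })
        {
            double y = Logistic(x);
            Console.WriteLine("x = {0,5:F1}  y = {1:F4}  dy/dx = {2:F4}",
                x, y, LogisticDerivativeFromOutput(y));
        }
    }
}
```

The derivative is largest (0.25) where the output is 0.5, and approaches zero as the output approaches 0.0 or 1.0.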
5.3 Neural network input
A "neural network input" represents an input to a neural network.

Neural network input
"Input" is the numeric value of the input.
5.4 Neural network output
A "neural network output" represents an output of a neural network.

Neural network output
"Output" is the numeric value of the output.
"Error" is a numeric value that represents the difference between the output value and a "Desired" value:
Error = (Output - Desired); // Derived from: Output = Desired + Error;
The "Desired" value represents a desired value, or an ideal value, or a correct value, that the neural network should produce as an output in response to specific inputs.
The error value is computed and assigned to "Error" by a training algorithm.
The error value is feedback to the neural network.
The neural network can adapt to reduce the difference between its outputs and the desired values; i.e., the neural network can learn, and can thus reduce future errors.
5.5 Neuron body
A "neuron body" represents the body of a neuron, which accumulates input contributions, and adds a bias, and transforms the resulting value by the "activation function" to produce an output value.

Neuron body
"InputAccumulator" is a value that represents the accumulated input from neuron links whose outputs are connected to the neuron body.
"Bias" is an adjustable value that is combined with the accumulated input value.
"Output" is a numeric value representing the output value of the neuron.
The output value is computed using the following formula:
Output = ActivationFunction( Bias + InputAccumulator );
"ErrorAccumulator" is a numeric value representing accumulated error.
Given a specific output value of the neuron body, and given a specific output error value, the accumulated error value is adjusted according to the following formula:
ErrorAccumulator += Output * (1 - Output) * OutputError;
"Rate" is a value that affects how the "Bias" value changes in response to the "ErrorAccumulator" value:
Bias += (-1) * Rate * ErrorAccumulator;
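The neuron body formulas above can be collected into a small class. This is only a sketch under assumptions: the class name, the method names, and the rate value of 0.5 are hypothetical, the logistic sigmoid is assumed as the activation function, and none of this is taken from the computer code that accompanies this document.

```csharp
using System;

// Hypothetical class; the field names follow the names used in this section.
public class NeuronBodySketch
{
    public double InputAccumulator;
    public double Bias;
    public double Output;
    public double ErrorAccumulator;
    public double Rate = 0.5; // Assumed learning rate.

    // Assumption: the logistic sigmoid is the activation function.
    public static double ActivationFunction(double x)
    {
        return 1.0 / (1.0 + Math.Exp(-x));
    }

    // Output = ActivationFunction( Bias + InputAccumulator );
    public void ComputeOutput()
    {
        Output = ActivationFunction(Bias + InputAccumulator);
    }

    // ErrorAccumulator += Output * (1 - Output) * OutputError;
    public void AccumulateError(double outputError)
    {
        ErrorAccumulator += Output * (1.0 - Output) * outputError;
    }

    // Bias += (-1) * Rate * ErrorAccumulator;
    public void UpdateBias()
    {
        Bias += (-1.0) * Rate * ErrorAccumulator;
    }
}

public static class NeuronBodyDemo
{
    public static void Main()
    {
        var body = new NeuronBodySketch { InputAccumulator = 0.0, Bias = 0.0 };
        body.ComputeOutput();
        body.AccumulateError(0.5);
        body.UpdateBias();
        Console.WriteLine("Output = {0}  ErrorAccumulator = {1}  Bias = {2}",
            body.Output, body.ErrorAccumulator, body.Bias);
    }
}
```

The sketch does not show when the accumulators are cleared; presumably "InputAccumulator" and "ErrorAccumulator" must be reset to zero before the next training item is processed.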
5.6 Neuron link
A "neuron link" represents a link between:
(1) an input of the neural network and an input of a neuron body;
or,
(2) an output of a neuron body and an input of another neuron body;
or,
(3) an output of a neuron body and an output of the neural network.

Neuron link
"Input" is a cache of the input to the link.
"Weight" is an adjustable value that affects how signal values and error values propagate through the link.
"Output" is a cache of the output of the link.
The value is computed using the following formula:
Output = Weight * Input;
"Error" is a cache of error of the link.
"WeightedError" is a cache of the error of the link, weighted by the weight factor:
WeightedError = Weight * Error;
"Rate" is a value that affects how the "Weight" value changes in response to the "Error" value and the "Input" value.
During neural network learning, the "Weight" value is adjusted in the following manner:
Weight += (-1) * Rate * Input * Error;
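The neuron link formulas above can be sketched in the same way. The class name, the method names, and the rate value are hypothetical. The sketch assumes that the weighted error is computed before the weight is adjusted, so that the error passed backwards uses the same weight that was used during the forward simulation.

```csharp
using System;

// Hypothetical class; the field names follow the names used in this section.
public class NeuronLinkSketch
{
    public double Input;
    public double Weight;
    public double Output;
    public double Error;
    public double WeightedError;
    public double Rate = 0.5; // Assumed learning rate.

    // Output = Weight * Input;
    public void PropagateForward(double input)
    {
        Input = input;
        Output = Weight * Input;
    }

    // WeightedError = Weight * Error;
    public void PropagateBackward(double error)
    {
        Error = error;
        WeightedError = Weight * Error;
    }

    // Weight += (-1) * Rate * Input * Error;
    public void UpdateWeight()
    {
        Weight += (-1.0) * Rate * Input * Error;
    }
}

public static class NeuronLinkDemo
{
    public static void Main()
    {
        var link = new NeuronLinkSketch { Weight = 2.0 };
        link.PropagateForward(0.5);   // Signal moves from link input to link output.
        link.PropagateBackward(0.25); // Error moves from link output to link input.
        link.UpdateWeight();          // Learning adjusts the weight.
        Console.WriteLine("Output = {0}  WeightedError = {1}  Weight = {2}",
            link.Output, link.WeightedError, link.Weight);
    }
}
```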
5.7 Neural network
A "neural network" contains inputs, outputs, neuron bodies, and links.
The following image depicts a simple neural network, with two inputs, and two neuron bodies in a first layer, and a single neuron in a second layer, and one output.

Example of a neural network
During a simulation of a neural network, input values propagate forward through links and neuron bodies, and eventually arrive at outputs.

Example of forward propagation in a neural network
During training, error values are provided at the outputs, and these errors propagate backwards through the neural network, resulting in the modification of weights and biases in neuron bodies and links.

Example of backward error propagation in a neural network
5.8 Neural network simulation
Definition:
"Network simulation" is the procedure used to propagate network inputs through the links and neuron bodies until reaching the network outputs.
Network simulation involves the simulation of all of its constituent links and neuron bodies.
Simulations without loops or time:
There are many possible network configurations involving loops.
There are many neuron models that depend on time.
But some of the most common applications of artificial neurons involve neither loops nor time.
The following is a mathematical model of a neuron body:
Output = ActivationFunction( Bias + InputAccumulator );
With this neuron model, and a network without "loops", we simply start from the external inputs, compute outputs of the first layer of neurons, and supply those outputs as inputs to the next layer, compute outputs for that layer, and continue through layers of neurons until the final outputs are computed.
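The layer-by-layer procedure described above can be sketched with plain arrays. This is an illustration under assumptions, not the structure of the computer code that accompanies this document: the names are hypothetical, the logistic sigmoid is assumed as the activation function, and the weights in the example are arbitrary.

```csharp
using System;

public static class FeedForwardSketch
{
    public static double ActivationFunction(double x)
    {
        return 1.0 / (1.0 + Math.Exp(-x));
    }

    // weights[n][i] is the weight of the link from input i to neuron n of the layer.
    public static double[] SimulateLayer(double[] inputs, double[][] weights, double[] biases)
    {
        var outputs = new double[biases.Length];
        for (int n = 0; n < biases.Length; n++)
        {
            double inputAccumulator = 0.0;
            for (int i = 0; i < inputs.Length; i++)
            {
                inputAccumulator += weights[n][i] * inputs[i];
            }
            outputs[n] = ActivationFunction(biases[n] + inputAccumulator);
        }
        return outputs;
    }

    // The outputs of each layer become the inputs of the next layer.
    public static double[] Simulate(double[] networkInputs, double[][][] layerWeights, double[][] layerBiases)
    {
        double[] signal = networkInputs;
        for (int layer = 0; layer < layerWeights.Length; layer++)
        {
            signal = SimulateLayer(signal, layerWeights[layer], layerBiases[layer]);
        }
        return signal;
    }

    public static void Main()
    {
        // Two inputs, two neurons in a first layer, one neuron in a second layer.
        var layerWeights = new double[][][]
        {
            new double[][] { new double[] { 1.0, -1.0 }, new double[] { -1.0, 1.0 } },
            new double[][] { new double[] { 1.0, 1.0 } }
        };
        var layerBiases = new double[][] { new double[] { 0.0, 0.0 }, new double[] { 0.0 } };
        double[] outputs = Simulate(new double[] { 1.0, 0.0 }, layerWeights, layerBiases);
        Console.WriteLine("y = {0:F4}", outputs[0]);
    }
}
```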
Loops:
A network can have connections in the form of loops (or "cycles").
For example, the output of a neuron can be connected directly to an input of that same neuron, causing "feedback".
Another example is the output of neuron #1 being connected to the input of neuron #2, and the output of neuron #2 being connected to the input of neuron #1.
If you can start from some point in a network and trace a path through neurons and connections, obeying the one-way flow of the signals, and eventually arrive at that same starting point, then the path is a loop.
Loops introduce the interesting possibility of signals flowing around the network for indefinite periods of time.
Some simple models assume that it takes a specific amount of time for signals to pass through individual neurons.
In such models, signals circulate through loops with few neurons faster than signals circulate through loops with many neurons.
A neuron connected to itself will have the fastest signal circulation rate.
If a neuron has an input X, a weight W, a bias B, and a non-negative output Y (e.g., in the range 0.0 to 1.0), then we can form an oscillator simply by setting W = (-8) and B = +4 and connecting Y to X;
each time we simulate the neuron, the signal will be toggled to the opposite state.
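The self-connected oscillator can be tried directly. The sketch below assumes the logistic sigmoid as the activation function and an arbitrary starting output of 1.0; the class and method names are hypothetical.

```csharp
using System;

public static class SelfLoopOscillatorSketch
{
    // One simulation step of a neuron with W = (-8) and B = +4,
    // whose output Y is connected to its own input X.
    public static double Step(double y)
    {
        const double W = -8.0;
        const double B = 4.0;
        return 1.0 / (1.0 + Math.Exp(-(B + W * y)));
    }

    public static void Main()
    {
        double y = 1.0; // Assumed initial output.
        for (int iteration = 0; iteration < 6; iteration++)
        {
            Console.WriteLine("iteration {0}: Y = {1:F4}", iteration, y);
            y = Step(y);
        }
    }
}
```

The output alternates between a value near 0.0 and a value near 1.0. An output of exactly 0.5 is a balance point of this formula (B + W * 0.5 = 0), so a neuron that starts at exactly 0.5 does not toggle.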
A network with loops can be busy with activity even when it does not accept external signals (stimuli) as inputs.
The cellular automata rules of Conway's "Game of Life" could be implemented in a neural network, which gives you a small hint of the diversity of activity that can happen in a neural network with loops.
Finite-state machines (FSM), oscillators, and volatile memory (in contrast to learning patterns via changing weights) are all made possible by loops.
If a network has loops, we cannot update any outputs until we compute all outputs; thus, we require a temporary buffer to store computed outputs until we compute all outputs, and then we can commit the new output values to the neurons in the network.
Any method that updates outputs in the actual network in a progressive way, instead of an all-at-once way, introduces an arbitrary ordering in time, and the results then depend on which neurons happen to be updated first.
Physics simulations involving coupled entities, such as planets orbiting a star with mutual gravitational forces between all bodies, require the same kind of approach: compute the net forces on all bodies before updating any velocity and position.
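The need for a temporary buffer can be seen with two neurons that are connected to each other. In the following sketch, all names and all weight values are hypothetical: each neuron roughly copies the output of the other neuron (weight +8, bias -4), so the two outputs should swap on each step. The in-place method is included only for contrast.

```csharp
using System;

public static class SynchronousUpdateSketch
{
    static double ActivationFunction(double x)
    {
        return 1.0 / (1.0 + Math.Exp(-x));
    }

    // weights[n][m] is the weight of the link from the output of neuron m to neuron n.
    // All new outputs are computed in a temporary buffer; the caller commits them.
    public static double[] StepAllAtOnce(double[] outputs, double[][] weights, double[] biases)
    {
        var buffer = new double[outputs.Length];
        for (int n = 0; n < outputs.Length; n++)
        {
            double inputAccumulator = 0.0;
            for (int m = 0; m < outputs.Length; m++)
            {
                inputAccumulator += weights[n][m] * outputs[m];
            }
            buffer[n] = ActivationFunction(biases[n] + inputAccumulator);
        }
        return buffer;
    }

    // For contrast only: each output is overwritten as soon as it is computed,
    // so later neurons see a mixture of old and new values.
    public static void StepInPlace(double[] outputs, double[][] weights, double[] biases)
    {
        for (int n = 0; n < outputs.Length; n++)
        {
            double inputAccumulator = 0.0;
            for (int m = 0; m < outputs.Length; m++)
            {
                inputAccumulator += weights[n][m] * outputs[m];
            }
            outputs[n] = ActivationFunction(biases[n] + inputAccumulator);
        }
    }

    public static void Main()
    {
        var weights = new double[][] { new double[] { 0.0, 8.0 }, new double[] { 8.0, 0.0 } };
        var biases = new double[] { -4.0, -4.0 };

        double[] buffered = StepAllAtOnce(new double[] { 1.0, 0.0 }, weights, biases);
        Console.WriteLine("all at once: {0:F3}, {1:F3}", buffered[0], buffered[1]);

        double[] progressive = { 1.0, 0.0 };
        StepInPlace(progressive, weights, biases);
        Console.WriteLine("in place:    {0:F3}, {1:F3}", progressive[0], progressive[1]);
    }
}
```

With the buffer, the high value moves from the first neuron to the second neuron. Without the buffer, the high value is overwritten before the second neuron can read it, and both outputs end up low.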
Time-dependence:
A simple network simulation typically involves inputs causing the desired outputs after a single simulation time step.
In such a simulation, we think in terms of "number of iterations" rather than "time in seconds".
There need not be any correspondence between iterations and a time scale.
A system might be designed to do a network simulation (iteration) only when new input is available, which might occur at irregular intervals of time.
However, consider a mathematical model of a neuron that attempts to simulate the pulsing output aspect of a biological neuron.
The pulsing might be characterized in terms of time, such as pulsing at a particular frequency or having pulses whose curve extends for a particular amount of time.
We can have other time-dependent elements in a mathematical model of a neuron, such as an input accumulator whose value gets contributions from inputs but has a leak proportional to its current value.
In general, we can find an electrical circuit analogy for elements that obey certain mathematical equations, and so one can regard a neuron as a circuit with resistors, capacitors, and a non-linear amplifier.
Just as a circuit can exhibit complex time-dependent behavior, the output of a neuron can be regarded as a function that depends on its inputs and time in a complicated way.
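As one example of a time-dependent element, an input accumulator with a leak proportional to its current value can be sketched as follows. The names, the constants, and the choice of a simple fixed time step are assumptions made for this illustration; they do not come from any particular neuron model.

```csharp
using System;

public static class LeakyAccumulatorSketch
{
    // One time step of: d(accumulator)/dt = input - leak * accumulator
    // (a simple fixed-step approximation).
    public static double Step(double accumulator, double input, double leak, double dt)
    {
        return accumulator + dt * (input - leak * accumulator);
    }

    public static void Main()
    {
        double accumulator = 0.0;
        const double input = 1.0; // Assumed constant input.
        const double leak = 2.0;  // Assumed leak factor.
        const double dt = 0.01;   // Assumed time step, in seconds.

        for (int step = 0; step <= 500; step++)
        {
            if (step % 100 == 0)
            {
                Console.WriteLine("t = {0:F2} s  accumulator = {1:F4}", step * dt, accumulator);
            }
            accumulator = Step(accumulator, input, leak, dt);
        }
    }
}
```

With a constant input, the accumulator rises and then levels off where the leak balances the input, at input / leak. This is the same behavior as a capacitor charged through a resistor.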
5.9 Backward error propagation
Definition:
"Backward error propagation" is a mathematical procedure that starts with the error at the output of a neural network and propagates this error backwards through the network to yield output error values for all neurons in the network.
Backward error propagation formulae:
The error values at the neural network outputs are computed using the following formula:
Error = (Output - Desired); // Derived from: Output = Desired + Error;
The error accumulation in a neuron body is adjusted according to the output of the neuron body and the output error (specified by links connected to the neuron body).
Each output error value contributes to the error accumulator in the following manner:
ErrorAccumulator += Output * (1 - Output) * OutputError;
In a sense, all of the output errors at the next layer leak backwards through the input weights and accumulate at the output of a neuron in a previous layer.
This accumulated value is multiplied by a value that is greatest when the current output of the neuron is most neutral (most "undecided") and is least when the output of the neuron is most extreme (very "certain").
Weight change and bias change formulae:
The basis of learning is the adjustment of weights and bias values in an attempt to reduce future output errors.
Learning "Rate" is a numerical value that essentially indicates how quickly a neuron adjusts weight and bias values according to error values.
The following formula indicates how to change the weights of a neuron with a particular set of input values and its output error value:
Weight += (-1) * Rate * Input * Error;
The following formula indicates how to change the bias of a neuron given the current accumulated error for the neuron (the "ErrorAccumulator" value of the neuron body):
Bias += (-1) * Rate * ErrorAccumulator;
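One way to check the weight change and bias change formulae is to compare them with a measured slope. The sketch below makes two assumptions that are not stated in this document: the quantity being reduced is half of the squared output error, and the link error used in the weight formula is the accumulated error of the neuron body that the link feeds. All names and numeric values are hypothetical.

```csharp
using System;

public static class SingleNeuronGradientCheck
{
    public static double ActivationFunction(double x)
    {
        return 1.0 / (1.0 + Math.Exp(-x));
    }

    public static double Output(double[] w, double bias, double[] x)
    {
        double inputAccumulator = 0.0;
        for (int i = 0; i < x.Length; i++)
        {
            inputAccumulator += w[i] * x[i];
        }
        return ActivationFunction(bias + inputAccumulator);
    }

    // Assumed error measure: half of the squared output error.
    public static double HalfSquaredError(double[] w, double bias, double[] x, double desired)
    {
        double error = Output(w, bias, x) - desired;
        return 0.5 * error * error;
    }

    // Output * (1 - Output) * OutputError, with OutputError = (Output - Desired).
    public static double ErrorAccumulator(double[] w, double bias, double[] x, double desired)
    {
        double output = Output(w, bias, x);
        return output * (1.0 - output) * (output - desired);
    }

    public static void Main()
    {
        double[] w = { 0.3, -0.2 };
        double bias = 0.1;
        double[] x = { 1.0, 0.5 };
        double desired = 1.0;
        double h = 1e-6;

        double accumulated = ErrorAccumulator(w, bias, x, desired);
        for (int i = 0; i < w.Length; i++)
        {
            double[] up = (double[])w.Clone();
            double[] down = (double[])w.Clone();
            up[i] += h;
            down[i] -= h;
            double slope = (HalfSquaredError(up, bias, x, desired)
                - HalfSquaredError(down, bias, x, desired)) / (2.0 * h);
            Console.WriteLine("w{0}: Input * Error = {1:F6}  measured slope = {2:F6}",
                i + 1, x[i] * accumulated, slope);
        }
        double biasSlope = (HalfSquaredError(w, bias + h, x, desired)
            - HalfSquaredError(w, bias - h, x, desired)) / (2.0 * h);
        Console.WriteLine("bias: Error = {0:F6}  measured slope = {1:F6}", accumulated, biasSlope);
    }
}
```

Under these assumptions the two columns agree, which suggests that each update moves the weights and the bias a distance proportional to "Rate" in the direction that reduces the squared error most steeply.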
6. Training a neural network
6.1 Training procedure
One can start with a trained network and continue to reduce output error with further training, but one often starts with an untrained network.
Before training, choose random values for all weights of all neurons in the network.
I observed problems when I randomly selected values in the interval [ -1.0, +1.0 ], and I did not have problems when I selected random values from the interval [ +0.1, +1.0 ].
I mention these observations, but they might be due to my mistakes.
The purpose of random weights is to mitigate the possibility of any pathological situations in a network.
If all neurons in a network started with the same weights, the network would have no basis for increasing differentiation between neurons.
I have observed that setting all bias values to zero (0.0) is acceptable.
A training session involves going through a training set many times, perhaps hundreds or thousands of times.
For each pass through the training set, we consider each item in the training set.
A training set item has a set of inputs, and a set of desired outputs.
We simulate the network, using the set of inputs specified by the training item.
The simulation yields output values.
We compute the error at each network output (Error = Output - Desired), and then we propagate the errors backwards through the neural network to compute the output errors for all neurons.
We update all weights and biases.
Caution: One academic text that discussed neural networks advocated going through the entire training set and only summing up weight changes and bias changes, without applying them.
After going through the entire training set we have a set of sums of weight changes and bias changes.
We take these sums and update all weights and biases.
Such sums could be huge for large training sets -- and the resulting jump in weight-space would be unreasonably large.
So I think dividing by the number of training items, to get average weight change values and average bias change values, would be reasonable.
There is something appealing about computing a single weight change vector that somehow takes the entire training set into consideration.
I don't know if I simply made a mistake in implementing the idea, but I nearly gave up entirely on neural networks because of how poorly things were turning out.
Then, when I tried the naive alternative, namely making updates upon every training item, things worked perfectly.
Considering the entire training set before doing an update has some advantages and disadvantages:
Advantage:
A single training item with extreme error (i.e., a bad training item) will not make a big contribution to the update, because it will be overwhelmed by the influence of the "good" data;
Disadvantage:
If N is the number of items in your training set, your rate of progress to the optimal weight vector will be divided by N.
Or, for a given distance you will have only a fraction of direction hints along the way compared to the naive approach;
Perhaps this technique will work for you, but try out the naive approach before you give up on neural networks in utter frustration!
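The two update schedules can be compared on a very small problem. The following sketch trains a single neuron with two inputs, either updating after every training item or averaging the changes over the whole training set before updating. The training set (the Boolean "or" function), the starting weights, the rate, and all names are assumptions chosen only for this illustration.

```csharp
using System;

public static class UpdateScheduleSketch
{
    // Assumed training set: the Boolean "or" function.
    static readonly double[][] Inputs =
    {
        new[] { 0.0, 0.0 }, new[] { 0.0, 1.0 }, new[] { 1.0, 0.0 }, new[] { 1.0, 1.0 }
    };
    static readonly double[] Desired = { 0.0, 1.0, 1.0, 1.0 };

    static double ActivationFunction(double x)
    {
        return 1.0 / (1.0 + Math.Exp(-x));
    }

    // p = { w1, w2, bias }.
    public static double RmsError(double[] p)
    {
        double squaredError = 0.0;
        for (int item = 0; item < 4; item++)
        {
            double[] x = Inputs[item];
            double error = ActivationFunction(p[2] + p[0] * x[0] + p[1] * x[1]) - Desired[item];
            squaredError += error * error;
        }
        return Math.Sqrt(squaredError / 4.0);
    }

    // Trains one neuron and returns the RMS error after the specified number of passes.
    public static double Train(bool updateAfterEveryItem, int passes, double rate)
    {
        double[] p = { 0.5, 0.5, 0.0 }; // Fixed (not random) start, for repeatability.
        for (int pass = 0; pass < passes; pass++)
        {
            double[] sum = new double[3];
            for (int item = 0; item < 4; item++)
            {
                double[] x = Inputs[item];
                double y = ActivationFunction(p[2] + p[0] * x[0] + p[1] * x[1]);
                double accumulated = y * (1.0 - y) * (y - Desired[item]);
                double[] change =
                {
                    -rate * x[0] * accumulated, -rate * x[1] * accumulated, -rate * accumulated
                };
                for (int k = 0; k < 3; k++)
                {
                    if (updateAfterEveryItem) p[k] += change[k];
                    else sum[k] += change[k];
                }
            }
            if (!updateAfterEveryItem)
            {
                // Apply the average of the changes once per pass.
                for (int k = 0; k < 3; k++) p[k] += sum[k] / 4.0;
            }
        }
        return RmsError(p);
    }

    public static void Main()
    {
        Console.WriteLine("untrained:          RMS = {0:F4}", Train(true, 0, 0.5));
        Console.WriteLine("update every item:  RMS = {0:F4}", Train(true, 200, 0.5));
        Console.WriteLine("averaged per pass:  RMS = {0:F4}", Train(false, 200, 0.5));
    }
}
```

On this small problem both schedules reduce the error, but the averaged schedule makes only one step per pass, which illustrates the slower progress described above. Results on larger networks and training sets may differ.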
6.2 Failure to reduce error
Training may fail to reduce the overall error for the training set.
It is important to detect a failure to reduce error.
The following list describes causes of failure to reduce error, and possible solutions.
The items in the list are listed in approximate order of probability, with the first item being most probable.
(1) The weight combination has reached a local minimum of the error surface, and is "stuck";
Solution : Start a new simulation with new random weights.
(2) The network has too few neurons or layers to encode all of the patterns in your training set;
Solution : Cautiously entertain the possibility of adding layers or neurons.
(3) One or more items in your training set contradicts or is grossly inconsistent with your other training items;
Solution : Check your data set for irregularities.
Find the test items that yield the most error for your trained network.
Look into techniques to average weight changes over the entire data set to reduce the influence of any bad cases.
(4) The learning rate is too high (anything over 1.0 is probably excessive), and the updates always overshoot the goal;
Solution : Reduce learning rate.
(5) The learning rate is too low (anything below 0.01 might be too small), and the network really is converging on the ideal weight combination -- but is too slow;
Solution : Increase learning rate.
Training a two-layer, three-neuron network to match the exclusive-or (xor) function can fail to converge, despite the simplicity of the function.
This can be surprising and frustrating.
However, the solution is to simply set all neuron link weights to new random values and then attempt to train the network again.
In the case of training a network to match the exclusive-or (xor) function, random positive weights seem to lead to successful learning every time, whereas certain combinations of positive and negative weights sometimes cause the training to fail dramatically.
The need to select new random initial weights to recover from a failure to converge is an unfortunate consequence of the combination of the learning procedure, which is essentially a search for a global minimum by steepest descent on a surface, and the potential for the presence of a "local minimum" in which the search can become trapped.
6.3 Overall training error
The overall error of a network can be characterized by the square-root of the average of squared errors, or "root-mean-square" (RMS).
The error at any specific network output is given by the following formula:
Error = (Output - Desired);
The sum of squared errors for a single training item is given by the following formula:
double squaredError = 0.0;
foreach (NeuralNetworkOutput output in ListOfOutputs)
{
    squaredError += (output.Error * output.Error);
}
The sum of squared errors for the entire set of items in a training set is the sum of squared errors of the individual items. The following code shows how the squared errors for the entire set of training items can be computed:
double squaredError = 0.0;
for (int trainingItemIndex = 0; trainingItemIndex < totalTrainingItems; trainingItemIndex++)
{
    // Present the inputs of this training item and simulate the network.
    trainingSet.SetAllInputNodeValues( trainingItemIndex );
    Simulate( propagationIterations );
    // Assign the output errors, propagate them backwards, and learn.
    trainingSet.SetAllOutputNodeErrorValues( trainingItemIndex );
    PropagateErrors( propagationIterations );
    UpdateWeightsAndBiases();
    // Accumulate the squared output errors of this training item.
    foreach (NeuralNetworkOutput output in ListOfOutputs)
    {
        squaredError += (output.Error * output.Error);
    }
}
The overall root-mean-square (RMS) of the error is given by the square root of the average of the squared errors:
double rmsError = Math.Sqrt( squaredError / (double)totalTrainingItems );
This value is one way to characterize the overall error of a network considering all training cases.
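The fragments above depend on classes that are not shown here, so the following self-contained sketch computes the same quantity from a plain array of error values. The class and method names are hypothetical. As in the formula above, the sum of squared errors is divided by the number of training items. The error values in the example are the four values listed in the exclusive-or (xor) example later in this document.

```csharp
using System;

public static class RmsErrorSketch
{
    // errors[item][output] is (Output - Desired) for one network output
    // of one training item.
    public static double RmsError(double[][] errors)
    {
        double squaredError = 0.0;
        foreach (double[] itemErrors in errors)
        {
            foreach (double error in itemErrors)
            {
                squaredError += (error * error);
            }
        }
        return Math.Sqrt(squaredError / (double)errors.Length);
    }

    public static void Main()
    {
        // One output per training item.
        var errors = new double[][]
        {
            new double[] { 0.0172 },
            new double[] { -0.0198 },
            new double[] { -0.0161 },
            new double[] { 0.0154 }
        };
        Console.WriteLine("RMS error = {0:F4}", RmsError(errors));
    }
}
```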
7. Learning
Learning occurs when the weight and bias values of neuron links and neuron bodies are adjusted in accordance with specified network inputs and the output error values.
Consider a neural network with two inputs (x1 and x2), and two links (with weights w1 and w2), and one neuron body, and one output (y).

Neural network with two inputs, and one neuron body, and one output
We train this neuron by supplying inputs, computing the output, computing the error, computing weight and bias changes, and updating the weights and the bias, arriving at new weights ( w1', w2' ).
There is a very interesting way to visualize this process.
We can regard the set of weights as a vector in a multi-dimensional space. For example, for two weights we have the vector W = (w1, w2) in a two-dimensional "weight space".
When weights are adjusted, we have a new weight vector W' = (w1',w2').
We can visualize this as a point W moving to a new point W' as part of a process to minimize output error.
Normally one wouldn't compute the output error for all possible weight combinations, because the hope is that the weight adjustment process will efficiently head toward the best combination.
However, let us plot the surface that essentially shows how well a neuron satisfies all items in a training set as a function of its two weights:

Sum of squared errors for a specified training set as a function of two weights (w1, w2)
Basically, the goal of learning is to descend to the lowest level of this surface, where error is minimized.
Once we find the point W = (w1, w2) that yields the minimum value on this surface, learning is finished and then we can simply use the trained neuron.
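A surface of this kind can be sampled directly for a small case. The sketch below evaluates the sum of squared errors of a single neuron on a grid of weights (w1, w2). The training set (the Boolean "or" function), the fixed bias of -2, the grid range, and all names are assumptions; they are not the values used to produce the graph in this document.

```csharp
using System;

public static class ErrorSurfaceSketch
{
    // Assumed training set: the Boolean "or" function.
    static readonly double[][] Inputs =
    {
        new[] { 0.0, 0.0 }, new[] { 0.0, 1.0 }, new[] { 1.0, 0.0 }, new[] { 1.0, 1.0 }
    };
    static readonly double[] Desired = { 0.0, 1.0, 1.0, 1.0 };

    public const double AssumedBias = -2.0;

    // Height of the error surface at the point (w1, w2) for a given bias.
    public static double SumOfSquaredErrors(double w1, double w2, double bias)
    {
        double squaredError = 0.0;
        for (int item = 0; item < 4; item++)
        {
            double d = bias + w1 * Inputs[item][0] + w2 * Inputs[item][1];
            double output = 1.0 / (1.0 + Math.Exp(-d));
            double error = output - Desired[item];
            squaredError += error * error;
        }
        return squaredError;
    }

    // Returns { w1, w2, error } for the lowest sampled point on a grid
    // of whole-number weights from -8 to +8.
    public static double[] LowestGridPoint()
    {
        double[] best = { 0.0, 0.0, double.MaxValue };
        for (int a = -8; a <= 8; a++)
        {
            for (int b = -8; b <= 8; b++)
            {
                double error = SumOfSquaredErrors(a, b, AssumedBias);
                if (error < best[2])
                {
                    best = new double[] { a, b, error };
                }
            }
        }
        return best;
    }

    public static void Main()
    {
        double[] best = LowestGridPoint();
        Console.WriteLine("lowest sampled point: w1 = {0}  w2 = {1}  error = {2:F4}",
            best[0], best[1], best[2]);
    }
}
```

Sampling every weight combination is practical only for a few weights, which is why training instead descends the surface step by step. For this assumed training set the error keeps decreasing as both weights grow, so the lowest sampled point lies at the edge of the grid.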
The following graph shows the output of a trained neuron as a function of all possible inputs X = (x1, x2):

Neuron output as a function of two inputs (x1, x2) for a weight combination that minimizes squared error
Even though the weighted sum for this two-input neuron is simply (w1*x1 + w2*x2), the activation function turns a simple rotated plane into a cliff.
This surface has the correct output values for all input combinations (x1, x2) specified by our training set.
But you can imagine how input vectors X = (x1, x2) similar to training values would also lead to the proper output values; this feature of neural networks is called "generalization" and is the main value of neural networks.
As we attempt to "descend" the surface of squared error, we must "leap before we look"!
We update the weight vector and bias, and then we evaluate the "height" of the surface at our new location.
One consequence of this is that we might move to a point with a more extreme error.
Another consequence is that it might take a while to descend back to the depth of our previous location.
The possibility of "leaping" to more extreme peaks and valleys of the error surface is directly related to the "learning rate", because the learning rate determines how much influence error values have on our weight and bias changes.
The following graph shows how increasing the learning rate hastens our arrival at lower positions on the squared error surface, where error is minimized.
The graph also shows that increasing the learning rate also introduces the possibility of making bad steps:

Short term trend of root-mean-squared (RMS) error for the entire training set over several training iterations, for learning rates 0.1, 0.5, 1.0, and 2.0.
Here is a graph of the root-mean-squared output error of a multi-layer network with a training set with 19386 items that experienced many bad steps on the path to the best weight vectors:

Training sometimes encounters spikes in the root-mean-squared (RMS) error, when error increases for some iterations before resuming a decreasing trend.
Sometimes the trend is simply smooth convergence to the desired set of weights:

Trend of root-mean-squared (RMS) error for the entire training set over several training iterations, for learning rates 0.1, 0.5, 1.0, and 2.0.
8. Example: Exclusive-or (xor)
"Exclusive-or" (xor) is a function that accepts two Boolean inputs and yields a single Boolean output according to the following table:
| X1 | X2 | Y = xor( X1, X2 ) |
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
In general, a single neuron has inputs {x1, x2, ...}, entering through links with weights {w1, w2, ...}.
The neuron computes an intermediate quantity d = bias + (w1*x1 + w2*x2 + ...), which can be regarded as identifying which plane, in an infinite set of parallel planes, contains a specified point with coordinates {x1, x2, ...}.
The neuron computes an output value, y = ActivationFunction( d ), which has the effect of splitting the infinite set of parallel planes into two sets, with one set producing low output values, and the other set producing high output values.
Thus, a single neuron splits multidimensional space into two regions, separated by the plane bias + w1*x1 + w2*x2 + ... = 0, and assigns low output values to points in the region on one side of the plane, and assigns high output values to points in the region on the opposite side of the plane.
Thus, if two sets of points in multidimensional space have distinct classifications and can be completely separated by a plane, then a single neuron can be used to correctly classify points from those sets as belonging to one set or the other.
The exclusive-or (xor) function classifies points in two-dimensional space (with coordinates (x1, x2)) such that points in the set { (0,0), (1,1) } are classified as producing an output of "0", and points in the set { (0,1), (1,0) } are classified as producing an output of "1".
There is no single "plane" (in this case, a line) that can separate those four points into the two sets.
Therefore, a single neuron cannot be used to classify points according to the exclusive-or (xor) function.
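The limitation of a single neuron can be illustrated by a brute-force search. The sketch below tries many combinations of (w1, w2, bias) on a coarse grid and records which of the 16 possible truth tables of two Boolean inputs a single neuron produces. The grid, the names, and the encoding of truth tables as 4-bit integers are assumptions. The neuron output is treated as "high" when d = bias + w1*x1 + w2*x2 is greater than zero, which is when the logistic sigmoid exceeds 0.5. A search over a grid is an illustration, not a proof.

```csharp
using System;
using System.Collections.Generic;

public static class SingleNeuronTruthTables
{
    // Bit (2*x1 + x2) of a truth table is set when the output for (x1, x2) is high.
    public const int Xor = (1 << 1) | (1 << 2);  // High for (0,1) and (1,0).
    public const int Xnor = (1 << 0) | (1 << 3); // High for (0,0) and (1,1).

    // Returns every truth table produced by some (w1, w2, bias) with
    // values from -2.0 to +2.0 in steps of 0.5.
    public static HashSet<int> ReachableTruthTables()
    {
        var tables = new HashSet<int>();
        for (int a = -4; a <= 4; a++)
        {
            for (int b = -4; b <= 4; b++)
            {
                for (int c = -4; c <= 4; c++)
                {
                    double w1 = 0.5 * a, w2 = 0.5 * b, bias = 0.5 * c;
                    int table = 0;
                    for (int point = 0; point < 4; point++)
                    {
                        int x1 = point >> 1;
                        int x2 = point & 1;
                        if (bias + w1 * x1 + w2 * x2 > 0.0)
                        {
                            table |= 1 << point;
                        }
                    }
                    tables.Add(table);
                }
            }
        }
        return tables;
    }

    public static void Main()
    {
        HashSet<int> tables = ReachableTruthTables();
        Console.WriteLine("truth tables found: {0} of 16", tables.Count);
        Console.WriteLine("xor found: {0}", tables.Contains(Xor));
    }
}
```

The search finds functions such as "and" and "or", but it never finds exclusive-or (or its complement), which is consistent with the argument above.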
A single neuron can only split a space of points into two regions.
The exclusive-or (xor) function classifies points in a manner that essentially divides a two-dimensional space into three regions (or, alternatively, four regions).
Two neurons can split two-dimensional space into three regions (e.g., by two distinct parallel lines), and can thus be used to classify points according to the exclusive-or (xor) function.
A third neuron can be used to combine the outputs of the other two neurons into a single output.
The following neural network, with two inputs, and two neuron bodies in a first layer, and a single neuron in a second layer, and a single output, can be used to classify points according to the exclusive-or (xor) function.
The following neural network can either be trained to compute the exclusive-or (xor) function, or the neural network can simply have its weight and bias values assigned in a manner that produces the desired behavior.
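As an illustration of assigning weight and bias values directly, the following sketch uses one possible set of values. These values are assumptions chosen by hand for this example; they are not values learned or used by the computer code that accompanies this document. The two first-layer neurons divide the plane with the parallel lines x1 + x2 = 0.5 and x1 + x2 = 1.5, and the second-layer neuron produces a high output only for points between those lines.

```csharp
using System;

public static class HandAssignedXorSketch
{
    static double ActivationFunction(double x)
    {
        return 1.0 / (1.0 + Math.Exp(-x));
    }

    public static double Xor(double x1, double x2)
    {
        // First layer: high when x1 + x2 > 0.5, and high when x1 + x2 > 1.5.
        double h1 = ActivationFunction(-5.0 + 10.0 * x1 + 10.0 * x2);
        double h2 = ActivationFunction(-15.0 + 10.0 * x1 + 10.0 * x2);
        // Second layer: high only when h1 is high and h2 is low.
        return ActivationFunction(-5.0 + 10.0 * h1 - 10.0 * h2);
    }

    public static void Main()
    {
        double[][] points =
        {
            new[] { 0.0, 0.0 }, new[] { 1.0, 0.0 }, new[] { 0.0, 1.0 }, new[] { 1.0, 1.0 }
        };
        foreach (double[] p in points)
        {
            Console.WriteLine("x1 = {0:F4} x2 = {1:F4} y = {2:F4}", p[0], p[1], Xor(p[0], p[1]));
        }
    }
}
```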

A neural network capable of classifying points according to exclusive-or (xor)
The computer code associated with this document demonstrates training the neural network shown in the diagram above to match the exclusive-or (xor) function.
The neural network sometimes fails to learn the function, but the software can simply be restarted to try learning with a new set of initial weights.
If the software successfully learns the exclusive-or (xor) function, then the output resembles the following:
x1 = 0.0000 x2 = 0.0000 y = 0.0172 error = 0.0172
x1 = 1.0000 x2 = 0.0000 y = 0.9802 error = -0.0198
x1 = 0.0000 x2 = 1.0000 y = 0.9839 error = -0.0161
x1 = 1.0000 x2 = 1.0000 y = 0.0154 error = 0.0154
The output (y) is within 2% of the desired value for each of the four combinations of the variables (x1, x2).
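The training procedure, including restarting with new random weights after a failure, can be sketched in Python as follows. This is a minimal sketch that assumes plain gradient descent on the squared error with sigmoid neurons. The learning rate, number of iterations, tolerance, and all names are assumptions made for illustration, and the computer code associated with this document may differ in its details.

```python
import math
import random

LESSONS = [((0.0, 0.0), 0.0), ((1.0, 0.0), 1.0),
           ((0.0, 1.0), 1.0), ((1.0, 1.0), 0.0)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def new_network(rng):
    # Each neuron is stored as [weight1, weight2, bias].
    return {"hidden": [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)],
            "output": [rng.uniform(-1, 1) for _ in range(3)]}

def forward(net, x):
    h = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in net["hidden"]]
    o = net["output"]
    return h, sigmoid(o[0] * h[0] + o[1] * h[1] + o[2])

def train_epoch(net, rate):
    for x, target in LESSONS:
        h, y = forward(net, x)
        # Error signal at the output neuron for E = 0.5 * (y - target)^2.
        d_out = (y - target) * y * (1.0 - y)
        # Propagate the error backward before changing the output weights.
        d_hid = [d_out * net["output"][j] * h[j] * (1.0 - h[j]) for j in range(2)]
        for j in range(2):
            net["output"][j] -= rate * d_out * h[j]
            net["hidden"][j][0] -= rate * d_hid[j] * x[0]
            net["hidden"][j][1] -= rate * d_hid[j] * x[1]
            net["hidden"][j][2] -= rate * d_hid[j]
        net["output"][2] -= rate * d_out

def worst_error(net):
    return max(abs(forward(net, x)[1] - t) for x, t in LESSONS)

def train_with_restarts(max_restarts=20, epochs=5000, rate=0.5,
                        tolerance=0.1, seed=0):
    # Returns (network, attempt number), or (last network, None) on failure.
    rng = random.Random(seed)
    net = None
    for attempt in range(1, max_restarts + 1):
        net = new_network(rng)
        for _ in range(epochs):
            train_epoch(net, rate)
        if worst_error(net) < tolerance:
            return net, attempt
    return net, None

if __name__ == "__main__":
    net, attempt = train_with_restarts()
    print("successful attempt:", attempt)
    for x, t in LESSONS:
        y = forward(net, x)[1]
        print("x1 = %.4f x2 = %.4f y = %.4f error = %.4f" % (x[0], x[1], y, y - t))
```

The error printed on each line is the output minus the desired value, which is the sign convention used in the sample output above.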
Although the network was trained to learn output values for only four combinations of variables (with values 0.0 and 1.0, representing Boolean values), the inputs to the neural network can be set to any arbitrary floating-point values.
The following image shows the output of the trained neural network for many combinations of input values:

A neural network capable of classifying points according to exclusive-or (xor)
The surface represents the output of the neural network for all possible input combinations (x1, x2) in the ranges [ -2.0, +2.0 ].
The output is close to 0.0 at the lowest areas of the surface, and the output is close to 1.0 at the highest areas of the surface.
Note that the surface is low at the points { (0,0), (1,1) }, and the surface is high at the points { (0,1), (1,0) }.
The network was only trained to produce desired outputs for four specific combinations of input variables, but the neural network also produces outputs for all other possible combinations of input values.
The ability of neural networks to produce reasonable responses for general cases after being trained for specific cases can be regarded as "generalization".
Any process that fits a model to data points, such as fitting a line or other curve to points, also produces a "generalizing" effect. It is therefore not extraordinary that fitting a neural network to produce desired outputs for specific lessons results in a kind of generalization, but it is nonetheless interesting to observe the ability to generalize from specific cases.
9. Example: Tic-tac-toe ("Naughts and Crosses")
9.1 Introduction
Tic-tac-toe ("Naughts and Crosses") is a simple game played on a 3 * 3 grid of cells that can be marked with "O" or "X".
Players alternately place "O" and "X" marks in unoccupied cells until one of the players completes a row, column, or diagonal, or until no unoccupied cells remain.
Because there are 3 rows, and 3 columns, and 2 diagonals, there are eight winning patterns for each player.

Tic-Tac-Toe board and winning patterns
It is trivial to write a recursive function that explores all possible Tic-Tac-Toe games, because the maximum duration of the game is nine moves.
At each point in the game we simply examine the results of moving in each of the remaining unoccupied cells.
Such a function can confirm that a Tic-Tac-Toe game played with "perfect players" will end with no winner.
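A recursive function of this kind can be sketched in Python as follows. The board representation (a tuple of nine values, with +1 and -1 for the two players and 0 for an unoccupied cell) and the function names are assumptions made for this example. Results are cached so that each board configuration is examined only once.

```python
from functools import lru_cache

# The eight winning patterns: 3 rows, 3 columns, 2 diagonals.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

EMPTY = (0,) * 9

def winner(board):
    for a, b, c in LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0

@lru_cache(maxsize=None)
def game_value(board, player):
    # Result for `player` (who moves next) if both sides play perfectly:
    # +1 = win, 0 = no winner, -1 = loss.
    w = winner(board)
    if w != 0:
        return 1 if w == player else -1
    if 0 not in board:
        return 0
    best = -1
    for i in range(9):
        if board[i] == 0:
            child = board[:i] + (player,) + board[i + 1:]
            best = max(best, -game_value(child, -player))
    return best

@lru_cache(maxsize=None)
def count_games(board, player):
    # Number of distinct move sequences from this board to the end of a game.
    if winner(board) != 0 or 0 not in board:
        return 1
    return sum(count_games(board[:i] + (player,) + board[i + 1:], -player)
               for i in range(9) if board[i] == 0)

print("value with perfect play:", game_value(EMPTY, 1))
print("number of possible games:", count_games(EMPTY, 1))
```

A value of 0 for the empty board corresponds to a game that ends with no winner.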
9.2 Training a neural network to indicate the best moves
A recursive function can explore all possible games and determine the best move for each board configuration.
We add each board configuration (inputs), and the best move (desired outputs), to a list of training items.
We then train the network to produce the desired outputs for each set of inputs.
The network will have 9 inputs, one for each cell of the grid, and each input will be limited to the following values:
0 : Unoccupied cell
+1 : Protagonist player
-1 : Opponent player
The network will have 9 outputs, one for each cell of the grid, and each desired output will be limited to the following values:
0 : Do not move here
1 : Move here
For each training item, eight desired outputs will be set to "0", and one desired output will be set to "1".
Thus, after training the neural network, a board configuration can be specified as input, and the neural network will indicate the best move.
The output closest to "1" will indicate the best move, and all other outputs should be close to "0".
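The construction of the training items can be sketched in Python as follows. Several details here are assumptions, because the text does not specify them: every reachable board configuration in which the game is not yet over becomes one training item, the player who moves next is always encoded as +1, a "best" move prefers a faster win (or a slower loss), and ties are broken in favor of the lowest cell index. Under these assumptions the enumeration should yield 4520 items, which agrees with the training set size reported for the associated computer code.

```python
from functools import lru_cache

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0

def mover(board):
    # Assumption: the mark +1 moves first, so +1 moves when marks balance.
    return 1 if sum(board) == 0 else -1

def play(board, cell, player):
    return board[:cell] + (player,) + board[cell + 1:]

@lru_cache(maxsize=None)
def value(board):
    # Score for the player who moves next; larger magnitude = sooner result.
    if winner(board) != 0:
        return -(1 + board.count(0))  # the previous player just won
    if 0 not in board:
        return 0
    p = mover(board)
    return max(-value(play(board, i, p)) for i in range(9) if board[i] == 0)

def best_move(board):
    p = mover(board)
    cells = [i for i in range(9) if board[i] == 0]
    return max(cells, key=lambda i: -value(play(board, i, p)))

def reachable_positions():
    # All boards reachable from the empty board where the game is not over.
    seen, stack = set(), [(0,) * 9]
    while stack:
        b = stack.pop()
        if b in seen or winner(b) != 0 or 0 not in b:
            continue
        seen.add(b)
        stack.extend(play(b, i, mover(b)) for i in range(9) if b[i] == 0)
    return seen

def training_items():
    items = []
    for b in sorted(reachable_positions()):
        p = mover(b)
        inputs = [float(c * p) for c in b]  # player to move becomes +1
        outputs = [0.0] * 9
        outputs[best_move(b)] = 1.0         # one output is 1, eight are 0
        items.append((inputs, outputs))
    return items

items = training_items()
print("training items:", len(items))
```

A different definition of "best" move or a different tie-breaking rule would change some desired outputs, but not the number of training items.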
In general, any function with Boolean parameters and Boolean outputs can be represented by a neural network with two layers of neurons.
The first layer of neurons can divide the multidimensional space into regions, and the second layer combines the region classifications to produce the appropriate output values.
The Tic-Tac-Toe neural network produces Boolean outputs, and although the inputs have three states ( -1, 0, +1 ), we could, in principle, convert these few discrete input values to a set of Boolean inputs.
Therefore, two layers of neurons should be sufficient to learn Tic-Tac-Toe.
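The claim about Boolean functions can be illustrated with a direct construction, sketched here in Python. This sketch assumes neurons with a hard threshold instead of a sigmoid, and all names are chosen for this example. The first layer has one neuron for each input combination that should produce "1", and each such neuron responds only to its own combination; the second layer responds if any first-layer neuron responds.

```python
from itertools import product

def step(z):
    # Hard threshold (an assumption; a steep sigmoid behaves similarly).
    return 1 if z > 0 else 0

def build_first_layer(truth_function, n):
    # One neuron per input pattern for which the function is true.
    layer = []
    for pattern in product((0, 1), repeat=n):
        if truth_function(*pattern):
            weights = [1 if bit else -1 for bit in pattern]
            bias = 0.5 - sum(pattern)  # positive sum only for an exact match
            layer.append((weights, bias))
    return layer

def evaluate(first_layer, inputs):
    h = [step(sum(w * x for w, x in zip(weights, inputs)) + bias)
         for weights, bias in first_layer]
    # Second layer: a single neuron that acts as a logical "or".
    return step(sum(h) - 0.5)

xor_layer = build_first_layer(lambda a, b: a != b, 2)
for a, b in product((0, 1), repeat=2):
    print(a, b, evaluate(xor_layer, (a, b)))
```

This construction can require as many first-layer neurons as there are input combinations that produce "1", so it demonstrates that two layers are sufficient, not that they are efficient. A trained network usually needs far fewer neurons. A function with several Boolean outputs would use one second-layer neuron per output.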
Because the network has 9 outputs, there are 9 neuron bodies in the final (second) layer.
The only remaining neural network design decision is deciding the number of neuron bodies to put in the first layer of the neural network.
To make this decision, computer code can generate and train a neural network with N neurons in the first layer.
The ability of the neural network to learn the complete training set for Tic-Tac-Toe can be graphed.
The following graph shows the overall training set error during training for each of 48 different neural networks, with N = 1,2,...,48 neurons in the first layer.

Overall training set error during training, for N = 1,2,...,48 neurons in the first layer (N = 1 is at the top, and N = 48 is at the bottom, and most intermediate curves are lower for higher values of N)
Another way to visualize this trend is to form a surface from the sequence of curves:

Overall training set error during training, for N = 1,2,...,48 neurons in the first layer (N = 1 is at the back, and N = 48 is at the front)
Thus, we see that as we approach N = 48 neurons in the first layer, the network seems to be able to learn all training cases.
Anything fewer than 48 neurons seems insufficient to learn the complete set of cases.
For low numbers of neurons, each additional neuron significantly reduces the overall error.
However, when the number of neurons is close to the number required to learn the entire set of lessons, each additional neuron only reduces the error by a relatively small amount.
The following image shows a neural network with 9 inputs, and 48 neuron bodies in the first layer, and 9 neuron bodies in a second layer, and 9 outputs.

A neural network capable of learning to play tic-tac-toe
The computer code associated with this document includes code to build and train the neural network shown above.
The training set has 4520 training items.
In 200 training iterations (each involving 3 propagation steps per training item, for a total of 200 * 4520 * 3 = 2712000 simulation steps and the same number of error propagation steps), the overall error decreased from 1.520 to 0.153.
(Those numbers can vary according to random initial conditions.)
The training required several minutes.
The following are two examples of specified inputs and produced outputs of the trained neural network:
Scenario #1:
Input:
1.0000 -1.0000 0.0000
0.0000 1.0000 -1.0000
-1.0000 0.0000 0.0000
Best move:
0.0001 0.0000 0.0676
0.0001 0.0000 0.0000
0.0000 0.0000 0.9870
Scenario #2:
Input:
-1.0000 -1.0000 0.0000
1.0000 1.0000 0.0000
0.0000 0.0000 0.0000
Best move:
0.0000 0.0000 0.0859
0.0000 0.0000 0.9819
0.0000 0.0000 0.0000
The network was trained to produce the best moves for the player whose mark corresponds to "+1".
The best move for the opponent player, whose mark corresponds to "-1", can be found by multiplying all inputs by (-1) before simulating the neural network.
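Reading a move from the outputs, and reusing the network for the opponent player, can be sketched in Python as follows. The function names are hypothetical, and restricting the choice to unoccupied cells is an added safeguard assumed for this example. The values are copied from the two scenarios above.

```python
def choose_move(inputs, outputs):
    # Index (0..8, row by row) of the unoccupied cell with the largest output.
    candidates = [i for i in range(9) if inputs[i] == 0]
    return max(candidates, key=lambda i: outputs[i])

def opponent_view(inputs):
    # Multiply all inputs by (-1) so that the opponent becomes "+1".
    return [-v if v else 0.0 for v in inputs]

scenario_1_inputs = [1.0, -1.0, 0.0,
                     0.0, 1.0, -1.0,
                     -1.0, 0.0, 0.0]
scenario_1_outputs = [0.0001, 0.0000, 0.0676,
                      0.0001, 0.0000, 0.0000,
                      0.0000, 0.0000, 0.9870]

scenario_2_inputs = [-1.0, -1.0, 0.0,
                     1.0, 1.0, 0.0,
                     0.0, 0.0, 0.0]
scenario_2_outputs = [0.0000, 0.0000, 0.0859,
                      0.0000, 0.0000, 0.9819,
                      0.0000, 0.0000, 0.0000]

print("scenario 1 move:", choose_move(scenario_1_inputs, scenario_1_outputs))
print("scenario 2 move:", choose_move(scenario_2_inputs, scenario_2_outputs))
```

In both scenarios the selected cell completes a line for the "+1" player: the diagonal in scenario #1 and the middle row in scenario #2. To ask the network for the opponent's best move, the result of opponent_view would be supplied as the network inputs.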
10. Training neural networks
The following is a quote from "Artificial Intelligence" (3rd edition; Addison Wesley; 1993), by Patrick Henry Winston, chapter 22, Learning by Training Neural Nets, p. 468.
Neural-Net Training is an Art
You now know that you face many choices after you decide to work on a problem by training a neural net using back propagation:
* How can you repr