## Transcribed Text

The resulting optimal pattern x* from the equation x* = argmin_{x ∈ D_train} ‖x − x_t‖, where D_train is the training dataset and x_t the unseen/query pattern, is used for which learning purpose?
Used in multi-layer feed-forward networks to train on the query pattern x_t and find its output x*.
Used in memory-based learning to locate the stored pattern closest to the query pattern x_t, and assign it the known corresponding class/label of x*.
Used in Boltzmann learning to derive the probability of flipping the state of a given neuron input x_t.
Used in self-organising maps to locate how closely ordered the query pattern x_t is to the training dataset.
In radial basis function neural networks, when the centres and bandwidth parameters of the radial basis functions are not given, the user not only has to train the network weights, but also the unknown centres and parameters. In this context, which of the following statements is not correct?
The centres can be chosen using memory-based learning, the bandwidths using matrix pseudo-inversion, and then the weights using the standard least-squares fit.
The centres, the bandwidths and the weights can all be trained simultaneously using gradient descent on the error function with respect to each of these
parameters.
The centres can be chosen using a clustering method, the bandwidths using the cluster spreads, and then the weights using the standard least-squares fit.
The centres can be chosen from random data samples, the bandwidths using distance statistics from the dataset, and then the weights using the standard
least-squares fit.
You are given a single-layer perceptron with two inputs x1 and x2, and one threshold corresponding to a fixed input denoted as x0 = -1. The perceptron is based on a signum activation function of output values +1 (for non-negative input) and -1 (for negative input).
At some point during the task of training, you observe your network to have weights:
w0 = +1.5 (the threshold), w1 = +2.5, w2 = +0.4 (weight indices correspond to input indices).
Then, you need to train for pattern (x1, x2) = (0, 0) with desired outcome d = -1, and adjust the weights accordingly, with a learning rate of 0.5. What will those weights become as soon as you train for this pattern?
(+2.5, +2.5, +2.9)
(-1.5, -2.5, -1.4)
(+0.5, +0.5, +0.5)
They will not change at all.
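The update can be verified with a minimal sketch, assuming the standard error-correction rule Δw = η(d − y)x with the threshold treated as weight w0 on the fixed input x0 = −1 (all names are illustrative):

```python
# Signum-perceptron single-step update sketch for this question.
def signum(v):
    return 1 if v >= 0 else -1

def perceptron_step(w, x, d, eta):
    # x includes the fixed threshold input x0 = -1 as its first element.
    v = sum(wi * xi for wi, xi in zip(w, x))
    y = signum(v)
    return [wi + eta * (d - y) * xi for wi, xi in zip(w, x)]

w = [1.5, 2.5, 0.4]      # (w0, w1, w2)
x = [-1.0, 0.0, 0.0]     # threshold input plus pattern (x1, x2) = (0, 0)
new_w = perceptron_step(w, x, d=-1, eta=0.5)
# v = 1.5*(-1) = -1.5 < 0, so y = -1 already equals d and no change occurs.
```

Since the output already matches the desired response, the error term d − y is zero and the weights stay put.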
Question 3
2 points
What is the main reason multi-layer perceptrons have multiple layers?
To learn nonlinearly separable problems.
To enable the use of nonlinear activation functions, such as the sigmoid one.
To allow easy calculation of the derivatives of the error with respect to the weights, when not all patterns are available simultaneously.
To accelerate convergence and learning.
Why does the linear-least-squares algorithm allow the minimisation of the error and therefore the recovery of the optimal weights in a single step, when all patterns
are available simultaneously?
The single step is permitted because multiple layers are not available; the procedure is applicable to single-layer neurons only.
Because direct application of the optimisation leads to a linearly separable case where all patterns can be separated perfectly, so an iterative procedure is not
necessary for the task of regression.
Because direct application of the Gauss-Newton method to minimise the error leads to a formula that depends on the Jacobian matrix, which does not need knowledge of all the patterns in the available training dataset.
Because direct application of the Gauss-Newton method to minimise the error leads to a formula that depends on the Jacobian matrix, which is equal to the negative of the data matrix; therefore the final weight-change formula does not depend on the current weights.
Consider the logistic sigmoid activation function φ(x) = 1 / (1 + exp(−a x)) for multi-layer feed-forward networks, where x is the induced local field and a the shape parameter. This is plotted in the figure together with its derivative φ'(x). Why is the middle area of "maximal weight change increase" important?
[Figure: the logistic sigmoid φ(x) with a = 0.5, plotted together with its derivative φ'(x); the "maximal weight change area" is the middle region around x = 0.]
Because when the activation's input x is very low or very high, the neuron saturates and this requires that the weight changes are maximal to avoid overfitting.
Because when the activation's input x is close to zero, then its derivative is maximised and therefore the net classification error of the neural network is maximum, and consequently the weight changes are the largest.
Because when the activation's input x is very high positive, the neuron's derivative becomes almost zero and this forces the weights to move away from their maximal weight change area.
Because when the activation's input x is close to zero, then its derivative is maximised and therefore the local gradients are maximal, and consequently the weight changes are the largest.
You are given weights w_ji multiplying the signal from neuron indexed i to another neuron indexed j, an activation function φ with derivative φ', output at pattern or time n for neuron indexed i denoted by y_i(n), and error e_j at neuron indexed j. The indices i and j may refer to different layers. Considering the local gradient δ_j(n) = e_j(n) φ'(v_j(n)), which statement is correct for multi-layer feed-forward networks?
This equation cannot be applied to the training of this type of neural network, but is only suitable to radial basis function networks since they have a single hidden layer.
This equation can be applied to both hidden and output layers of the network and can then be used to calculate the weight change Δw_ji.
This equation can be applied only to the hidden layers of the network and can then be used to calculate the weight change Δw_ji.
This equation can be applied only to the output layer of the network and can then be used to calculate the weight change Δw_ji.
One excellent example of the credit assignment problem is:
The case when in single-layer perceptrons, decision boundaries cannot converge when the data is not linearly separable.
The case when the correct kernel cannot be set in support vector machines to allow an accurate optimisation of the dual problem.
The case when the centers and bandwidths of the basis functions in radial basis function neural networks are not always assigned credible values by the users, to allow accurate weight calculations.
The case when in multi-layer feed-forward networks, the errors of the hidden neurons are not explicitly known, since their desired responses are obviously not directly available.
In radial basis function neural networks, when the centres and the parameters of the radial basis functions are known and all patterns are available, what is the fastest and most typical way to train the weights w? Assume that you are given a vector y of corresponding outputs and a matrix Φ with elements the values of the radial basis functions on the different patterns in the dataset.
Setting w = Φ⁺y to perform a least-squares fit using matrix pseudo-inversion.
By performing back-propagation to minimise the fitting error given by ‖y − Φw‖ across the different layers, in a layer-by-layer fashion.
By finding the most suitable type of radial basis function (e.g., multiquadric, inverse multiquadric or Gaussian) to solve the problem and acquire the smallest
possible error on the given dataset.
By solving the linear system Φw = y, which corresponds to setting w = Φ⁻¹y using matrix inversion operations.
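The pseudo-inversion option can be illustrated with a minimal NumPy sketch (the Gaussian basis, the centres and the bandwidth below are assumed purely for illustration):

```python
import numpy as np

# Least-squares fit of RBF output weights via pseudo-inversion: w = pinv(Phi) @ y,
# where Phi[i, j] holds the value of basis function j on pattern i.
rng = np.random.default_rng(0)
centres = np.array([0.0, 1.0, 2.0])   # assumed known centres
sigma = 1.0                           # assumed known bandwidth
X = rng.uniform(0, 2, size=20)        # 20 one-dimensional patterns
y = np.sin(X)                         # illustrative regression targets

Phi = np.exp(-((X[:, None] - centres[None, :]) ** 2) / (2 * sigma ** 2))
w = np.linalg.pinv(Phi) @ y           # single-step weight recovery
residual = np.linalg.norm(Phi @ w - y)
```

The pseudo-inverse gives the same weights as a direct least-squares solve, but in one step and without iterating over patterns.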
You are given weights w_ji multiplying the signal from neuron indexed i to another neuron indexed j, an activation function φ with derivative φ', and output at pattern or time n for neuron indexed i denoted by y_i(n). Other quantities are similarly defined. The indices i and j may refer to different layers. Considering the local gradient δ_j(n), which statement is correct for multi-layer feed-forward networks?
In each training iteration, the equation can only be applied first to the last network layer (the output layer), then the last hidden layer, then the layer before that, carrying on till the first hidden layer is processed.
In each training iteration, the equation can only be applied to the very first, and the very last hidden layers, so the error is backpropagated accordingly.
In each training iteration, the equation can be applied to all hidden layers in any order until all hidden layers are processed exactly once, but it is never applied to the last (output) network layer.
In each training iteration, the equation can only be applied first to the last hidden layer, then the layer before that one, carrying on till the first hidden layer is processed.
What type of activation function do single-layer perceptrons for classification typically use?
The sigmoid function defined as φ(v) = 1 / (1 + exp(−a v)), where v is the induced local field and a the activation parameter.
The signum function that outputs +1 or -1 depending on the sign of the induced local field.
The piecewise linear activation defined as φ(v) = 1 if v ≥ 0.5; 0 if v < -0.5; v + 0.5 otherwise.
The linear activation function φ(v) = v.
If you provide a nonlinearly separable dataset, such as the XOR problem, to a single-layer perceptron with signum activation and start training pattern-by-pattern,
when will the training terminate?
It will terminate after a number of epochs have passed, and all patterns have been completely trained with zero error.
It will terminate after a number of epochs have passed, and the error gradient is zero.
It will terminate when the number of training epochs is equal to the number of layers multiplied by the number of problem inputs.
It will never terminate.
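A short sketch of why training never terminates on XOR: no weight vector separates the four patterns, so every epoch produces at least one misclassification and therefore at least one update (the epoch cap below exists only so the sketch halts):

```python
def signum(v):
    return 1 if v >= 0 else -1

# XOR with a threshold input x0 = -1; targets in {-1, +1}.
patterns = [([-1, 0, 0], -1), ([-1, 0, 1], 1), ([-1, 1, 0], 1), ([-1, 1, 1], -1)]
w = [0.0, 0.0, 0.0]
eta = 0.5
for epoch in range(100):  # in reality pattern-by-pattern training would loop forever
    errors = 0
    for x, d in patterns:
        y = signum(sum(wi * xi for wi, xi in zip(w, x)))
        if y != d:
            errors += 1
            w = [wi + eta * (d - y) * xi for wi, xi in zip(w, x)]
    # errors never reaches 0 for XOR: an error-free epoch would mean the
    # fixed weights classify all four patterns, contradicting nonseparability
```

If an epoch ever finished with zero errors, no updates would have occurred in it, meaning a single fixed weight vector classified all four XOR patterns, which is impossible.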
You are given weights w_ji multiplying the signal from neuron indexed i to another neuron j, an activation function φ, and output at pattern or time n for neuron indexed j denoted by y_j(n). The indices i and j may refer to different layers. Using the equation y_j(n) = φ(Σ_i w_ji y_i(n)) to estimate the output of the various neurons, which statement is correct for multi-layer feed-forward networks?
This is the equation for the forward-pass, where the inputs are presented to the first layer, then the next layer calculates the neuron outputs using the weights
and inputs, and then this carries on to the next layer till we reach the output layer. This pass is part of the training procedure and is also useful to the online
operation of the network.
This is the basic equation adequate for performing all weight adjustments in all such networks using logistic activation functions.
This is the equation for the backward-pass, where the inputs are presented to the last layer, then the previous layer, and then this carries on till we reach the
first layer. This pass is part of the training procedure only.
This is the equation for the forward-pass, where the inputs are presented to the first layer, then the next layer calculates the neuron outputs using the weights and inputs, and then this carries on to the next layer till we reach the output. This pass is not part of the training procedure, and it is used for online operation only.
Consider the Delta rule Δw_kj(n) = η e_k(n) x_j(n), where e_k is the error of a pattern at time n at neuron k, x_j the signal from an input or neuron indexed by j to neuron k, and η is the user-defined learning rate. Which statement is correct?
With this rule, the synaptic weight adjustment is proportional to the location of support vectors and oriented distance from decision hyperplanes in support
vector machines.
This is the fundamental equation for self-organising maps that produce topologically ordered pattern representations in the feature space controlled by error
ek.
With this rule, the synaptic weight adjustment is proportional to the error signal and the input signal of the synapse under adjustment, and is generally referred
to as error-correction learning.
This is the fundamental equation for Hebbian learning that uses time-dependent, highly local and strongly interactive mechanisms to increase synaptic efficiency.
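A one-line numeric sketch of the Delta rule Δw_kj = η e_k x_j (all values purely illustrative):

```python
# Error-correction (Delta rule) sketch: each incoming weight of neuron k
# changes in proportion to the error e_k and the input signal x_j.
eta = 0.1
e_k = 0.5                 # error at neuron k for the current pattern
x = [1.0, -2.0, 0.5]      # signals arriving at neuron k
delta_w = [eta * e_k * x_j for x_j in x]
```

Each synapse is adjusted by the product of the shared error signal and its own input signal, which is exactly what "error-correction learning" refers to.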
Question 17
2 points
An activation function for a neuron of a neural network:
Can allow the rate for adapting the weights to be equal to the derivative of the error.
Is used to transform the hidden weights to the output sent out from that neuron to other neurons or to the user.
Is used to transform the induced local field to the output sent out from that neuron to other neurons or to the user.
Is always a linear function with respect to the weights.
The adaptive process defined for self-organising maps is designed to do what exactly?
It allows adaptation of the neural network to have multiple hidden layers together with kernel inner products, such that they can deploy nonlinear classification
boundaries.
It adapts the weights of the network so that the complexity is reduced and overfitting to the provided dataset is minimised.
It adapts the weights of a set of cooperating neurons so that their weights are moved closer to the current input pattern.
It adapts the neighbourhood of the cooperative neurons so that they are controlled by a Gaussian function of their neighbourhood size.
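The adaptation step, together with the competition and cooperation that precede it, can be sketched as follows (a 1-D lattice of three neurons and a Gaussian neighbourhood are assumed purely for illustration):

```python
import math

# SOM sketch: the winner and its lattice neighbours move toward the input.
weights = [[0.0, 0.0], [1.0, 0.0], [4.0, 4.0]]   # 3 neurons on a 1-D lattice
x = [1.0, 1.0]                                    # current input pattern
eta, sigma = 0.5, 1.0

def dist2(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

# Competition: the neuron with the closest weight vector wins.
winner = min(range(len(weights)), key=lambda k: dist2(weights[k], x))
# Cooperation + adaptation: neighbourhood h decays with lattice distance.
for k in range(len(weights)):
    h = math.exp(-((k - winner) ** 2) / (2 * sigma ** 2))
    weights[k] = [wk + eta * h * (xi - wk) for wk, xi in zip(weights[k], x)]
```

The winner moves furthest toward the pattern, while neurons further away on the lattice move proportionally less.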
Which of the following statements is correct?
The least-mean-squares algorithm becomes identical to the linear-least-squares algorithm when the patterns are all linearly separable.
The least-mean-squares algorithm for single-layer perceptrons adapts the weights by iterative applications of the update Δw = η(x(n) − w(n)), on a pattern-by-pattern basis, where the distance between patterns x and weights w at time n (for some user-defined learning rate η) is used to reduce the error.
The least-mean-squares algorithm for single-layer perceptrons adapts the weights by iterative applications of the update Δw = η e(n) x(n), on a pattern-by-pattern basis, where the gradient ∂e(n)/∂w of the error at time/pattern n (for some user-defined learning rate η) is used to reduce the error.
The least-mean-squares algorithm is suitable for both supervised and unsupervised learning when the dataset is linearly separable.
We have the weight update rule w(t+1) = w(t) + η[d(t) − y(t)]x(t), where η is the learning rate, t the time/iteration of training, d the desired response and y the neuron's output from a signum activation, for input pattern x. For what type of network is this rule suitable?
For single neurons performing two-dimensional function regression.
For support vector machines performing classification with polynomial kernels for up to two inputs.
For single neurons performing binary classification.
For neurons relying on Hebbian learning.
You are given a single-layer perceptron with two inputs x1 and x2, and one threshold corresponding to a fixed input denoted as x0 = -1. The perceptron is based on a signum activation function of output values +1 (for non-negative input) and -1 (for negative input).
At some point during the task of training, you observe your network to have weights:
w0 = +1.5 (the threshold), w1 = +2.5, w2 = +0.4 (weight indices correspond to input indices).
Then, you need to train for pattern (x1, x2)=(1,0) with desired outcome d=-1, and adjust the weights accordingly, with a learning rate of 0.5. What
will those weights become as soon as you train for this pattern?
(+2.5, +1.5, +0.4)
(+2.5, +1.5, +1.4)
(+3.5, +1.5, +0.4)
(+2.5, +2.5, +0.4)
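A minimal check of the update, assuming the standard error-correction rule Δw = η(d − y)x with the threshold treated as weight w0 on the fixed input x0 = −1 (names are illustrative):

```python
def signum(v):
    return 1 if v >= 0 else -1

w = [1.5, 2.5, 0.4]      # (w0, w1, w2)
x = [-1.0, 1.0, 0.0]     # threshold input plus pattern (x1, x2) = (1, 0)
d, eta = -1, 0.5
y = signum(sum(wi * xi for wi, xi in zip(w, x)))       # v = -1.5 + 2.5 = +1.0, so y = +1
w = [wi + eta * (d - y) * xi for wi, xi in zip(w, x)]  # e = d - y = -2
```

Here the output disagrees with the desired response, so the weights move by η·e·x = −1·(−1, 1, 0) = (+1, −1, 0).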
Given the following neural network with 2 input units, 3 hidden units, 2 output units, and logistic sigmoid activation functions in the hidden layer and linear
activations in the output layer, how many weights does the network contain in total (including all weights and biases everywhere)?
[Figure: the network, with inputs on the left and outputs on the right.]
16
17
18
19
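The count can be verified in one line, assuming the figure shows a fully connected 2-3-2 network with one bias per hidden and output unit:

```python
# Fully connected 2-3-2 network: weights plus one bias per hidden/output unit.
n_in, n_hidden, n_out = 2, 3, 2
total = (n_in * n_hidden + n_hidden) + (n_hidden * n_out + n_out)
# (6 + 3) + (6 + 2) = 17
```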
You are given a single-layer perceptron with two inputs x1 and x2, and one threshold corresponding to a fixed input denoted as x0 = -1. The perceptron is based on a signum activation function of output values +1 (for non-negative input) and -1 (for negative input).
At some point during the task of training, you observe your network to have weights:
w0 = +2.5 (the threshold), w1 = +1.5, w2 = +0.4 (weight indices correspond to input indices).
Then, you need to train for pattern (x1, x2) = (1, 1) with desired outcome d = +1, and adjust the weights accordingly, with a learning rate of 0.5. What will those weights become as soon as you train for this pattern?
(+1.5, +2.5,+1.4)
(+1.5, -2.5,-1.4)
(+0.5, +0.5, +1.4)
(+2.5, +2.5, +2.4)
You are given a single-layer perceptron with two inputs x1 and x2, and one threshold corresponding to a fixed input denoted as x0 = -1. The perceptron is based on a signum activation function of output values +1 (for non-negative input) and -1 (for negative input).
At some point, during the task of training, you observe your network to have weights:
w0 = +1.5 (the threshold), w1 = +2.5, w2 = -0.6 (weight indices correspond to input indices).
Then, you need to train for pattern (x1, x2)=(1, 0) with desired outcome d=-1, and adjust the weights accordingly, with a learning rate of 0.5. What will those
weights become as soon as you train for this pattern?
(+1.5, -0.6, +2.5)
(-0.6, +2.5, +1.5)
(+2.5, +1.5, -0.6)
(0, 0, -0.6)
If E(w) is the classification error of a multi-layer network given weights w and a dataset, in order to avoid overfitting various possibilities have been proposed.
Which of the following is not one of them?
A smoothness error Es(w) is added to E(w) to enable penalisation of large values of weights.
A smoothness error Es(w) is added to E(w) to enable penalisation of derivatives of the output of the neural network within a region of the input space.
A smoothness error Es(w) is added to E(w) to enable penalisation of non-linearly separable patterns closely located in the input space.
After training is complete, a selective pruning of connections and weights is deployed based on the analysis of the optimum w.
The competitive process defined for self-organising maps is designed to do what exactly?
Given a d-dimensional input pattern x, it assigns the most competitive kernel that deploys the most flexible classification boundaries for nonlinearly separable data.
Given a d-dimensional input pattern x, it trains the neurons to adapt their weight vector w and reduce the fitting error when all patterns are available at once.
Given a d-dimensional input pattern x, it finds the neuron indexed k with the closest weight vector w_k, so that it maps the continuous input space to a discrete neuron space.
Given a d-dimensional input pattern x, it finds all neighbouring neurons that have weight vectors w that are targeting a common competition task.
Which one is the correct statement for multi-layer feed-forward networks?
In the sequential training mode the forward and backward passes are executed for each sample independently, while in the batch training mode weights are
adapted after all patterns are processed. The former requires less storage and may avoid local minima, but the latter is better to parallelise and employ for
mathematical analysis.
In the batch training mode the forward and backward passes are executed for each sample independently, while in the sequential training mode weights are
adapted after all patterns are processed. The former requires less storage and may avoid local minima, but the latter is better to parallelise and employ for
mathematical analysis.
In the sequential training mode the forward and backward passes are executed for samples with higher error only, while in the batch training mode weights are
adapted after all patterns with low error are processed. The former requires less storage and may avoid local minima, but the latter is better to parallelise and
employ for mathematical analysis.
In the batch training mode the forward and backward passes are executed for samples with higher error only, while in the sequential training mode weights are
adapted after all patterns with low error are processed. The former requires less storage and may avoid local minima, but the latter is better to parallelise and
employ for mathematical analysis.
The training of support vector machines relies on what type of optimisation?
A quadratic optimisation problem in the space of weights subject to linear constraints.
Least-squares optimisation to minimise the fitting error.
Back-propagation through the layers of support vectors with momentum and regularisation for better complexity management.
Self-organisation of neural units to deploy support vectors that establish the separation margin.
Which of the following statements related to support vector machines (SVMs) is correct?
SVMs find a linear boundary that perfectly separates the classes even when the data is not linearly separable.
SVMs perform classification in both linearly and nonlinearly separable datasets without the need to train weights because it can employ kernels.
SVMs find any legitimate linear boundary that perfectly separates the two classes when the data is linearly separable.
SVMs find the unique decision hyperplane that provides the widest separation zone between the classes when the data is linearly separable.
To train support vector machines a dual problem is optimised based on the maximisation max_λ Q(λ) = Σ_i λ_i − ½ Σ_i Σ_j λ_i λ_j d_i d_j x_iᵀx_j, where λ_i are the Lagrange multipliers, d_i the desired responses for the i-th pattern x_i and similarly for j, together with some constraints for the optimisation. How is this formulation helping with extending the standard machines to handle nonlinearly separable datasets?
Because the dual problem is a quadratic constrained one that does not depend on the weights and also has a different number of constraints compared to the
primal problem.
Because the inner product between patterns x_i and x_j appearing in the objective function allows its replacement with an inner product kernel that introduces nonlinearities to the original feature space.
Because the product between the multipliers λ_i and λ_j appearing in the objective function allows us to ignore the original weights of the design.
Because the inner product between patterns x_i and x_j appearing in the objective function allows the use of slack variables and regularisation that introduce nonlinearities to the original feature space.
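The kernel substitution can be illustrated with the homogeneous quadratic kernel K(x, z) = (xᵀz)², which equals an ordinary inner product in an explicit nonlinear feature space (the 2-D feature map below is one standard choice, used purely for illustration):

```python
import math

# Kernel trick sketch: evaluating (x.z)^2 directly gives the same number as
# mapping both patterns into a nonlinear feature space and taking the inner
# product there, without ever forming that space explicitly.
def kernel(x, z):
    return sum(xi * zi for xi, zi in zip(x, z)) ** 2

def feature_map(x):
    # Explicit phi(x) for the homogeneous quadratic kernel in 2-D.
    return [x[0] ** 2, x[1] ** 2, math.sqrt(2) * x[0] * x[1]]

x, z = [1.0, 2.0], [3.0, -1.0]
k_direct = kernel(x, z)                                   # (1*3 + 2*(-1))^2 = 1
k_feature = sum(a * b for a, b in zip(feature_map(x), feature_map(z)))
```

Because only inner products of patterns appear in the dual objective, swapping xᵀz for K(x, z) moves the machine into a nonlinear feature space at no extra cost.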
Which of the following is not a difference between multi-layer perceptrons (MLPs) and radial basis function networks (RBFs)?
MLPs are suitable for nonlinearly separable classification problems, whereas RBFs are mainly designed for linearly separable problems.
The RBF hidden nodes are nonlinear but its output nodes linear in weights; however, both hidden and output MLP nodes can be nonlinear.
RBFs have activation functions based on distances between inputs and basis function centres, while MLPs are based on the inner product between weights
and inputs.
MLPs can have multiple hidden layers, but RBFs just a single hidden one.
For the development of support vector machines, using the definitions implied within the figure below
[Figure: axes x1 and x2; a pattern x at distance r from the optimal hyperplane, with x_p its projection onto the hyperplane and w0 the normal direction through (0,0).]
what is the distance r between pattern x and the hyperplane w0ᵀx + b0 = 0?
The distance is given by r = (w0ᵀx + b0) / ‖w0‖₂².
The distance is given by r = (‖w0‖₂² + b0) / ‖w0‖₂².
The distance is given by r = exp((w0ᵀx + 2 b0) / ‖w0‖₂²).
The distance is given by r = (w0ᵀx + b0) / ‖w0‖₂.
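The distance formula r = (w0ᵀx + b0)/‖w0‖ can be checked numerically (the weight vector, bias and pattern below are chosen purely for illustration):

```python
import math

# Signed distance from pattern x to the hyperplane w0.x + b0 = 0:
# r = (w0.x + b0) / ||w0||; |r| is the geometric distance.
w0 = [3.0, 4.0]
b0 = -5.0
x = [4.0, 3.0]
r = (sum(wi * xi for wi, xi in zip(w0, x)) + b0) / math.sqrt(sum(wi ** 2 for wi in w0))
# (12 + 12 - 5) / 5 = 3.8
```

The sign of r tells on which side of the hyperplane the pattern lies, which is exactly the quantity the SVM margin is built from.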
The optimal brain damage method relies on second-order differential information of the error function E(w) with respect to the weights w. How does this work?
Before training is completed, it uses Taylor-series expansion to express E(w) as a quadratic approximation around a randomly selected point with diagonal
Hessian matrix and removes the weights that have small effect in the error reduction.
After training is completed, it uses Taylor-series expansion to express E(w) as a quadratic approximation around the optimum value of w with a full Hessian matrix and removes the weights that cause overfitting.
After training is completed, it uses Taylor-series expansion to express E(w) as a quadratic approximation around the optimum value of w with diagonal Hessian matrix and removes the weights that have small effect in the error reduction.
After training is completed, it uses Taylor-series expansion to express E(w) as a quadratic approximation around the optimum value of w with diagonal Hessian matrix and removes the weights from neurons whose activation values are in the middle range.
Ask for a larger fish dataset with more species than just sea basses and salmons, so that the system can generalise and detect other fish species in the future.
Use no more than two fish features (such as fish length and width) to pass them to a linear or nonlinear classifier, so that the system is fast enough on the fast-
running fish conveyor belt.
Use a neural network with a large enough number of neurons so that the classification accuracy for the training dataset reaches as close as possible to 100%.
Use any machine learning algorithm or neural network that is powerful enough to reduce the testing classification error, after its weights are adjusted using the
training dataset.
Which type of learning based on neural networks is calculating the neural energy as the quadratic E = −½ Σ_k Σ_{j≠k} w_kj x_k x_j, where w_kj is the synaptic weight connecting neurons indexed j and k, with corresponding states x_j and x_k?
Memory-based learning that uses the energy to calculate the proximity of query patterns.
Boltzmann machines that use the energy to calculate state-flipping probabilities.
This is the fundamental equation for reinforcement learning.
Competitive learning that uses the energy to regulate inhibitory neuron behaviour.
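The quadratic energy E = −½ Σ_{k≠j} w_kj x_k x_j can be evaluated directly for a tiny symmetric network (the weights and states below are purely illustrative):

```python
# Energy of a state in a Boltzmann-machine-style network: symmetric weights,
# no self-connections, bipolar states x in {-1, +1}.
w = {(0, 1): 1.0, (1, 0): 1.0,
     (0, 2): -0.5, (2, 0): -0.5,
     (1, 2): 0.0, (2, 1): 0.0}
x = [1, -1, 1]
E = -0.5 * sum(w[(k, j)] * x[k] * x[j] for (k, j) in w)
```

In Boltzmann learning, the change in this energy under flipping a single state determines the flipping probability of that neuron.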
The training of self-organising maps, and in particular, the way their weights adapt to the provided input data patterns, is largely based on:
The derivation of memory-based learning from competitive learning.
The derivation of inner product kernel from standard inner product.
The derivation of a forgetting term from a user-defined learning rate.
The derivation of competitive learning from Hebbian learning.
The cooperative process defined for self-organising maps is designed to do what exactly?
This process locates other excited neurons in the vicinity of the winning neuron from the competitive process, that will be later needed to participate to the
weight adjustment phase for the current pattern.
This process locates all other neurons in the same layer that cooperate and collaborate in transforming the outputs of the previous layer to new outputs to be
sent to the subsequent layers.
This process performs the weight adjustment of the cooperating neurons so that the overall fitting or classification error reduces maximally given a set of
patterns.
This process locates other excited neurons in the vicinity of the winning neuron from the competitive process, that will need to be pruned so that the complexity of the network is reduced for the current pattern.
You are given a matrix X with rows the available n (not very large) patterns of d dimensions each, and a vector d with n desired responses that are real numbers. You are required to solve a regression problem, using a single-layer perceptron with no activation function. How can the weights w be calculated as quickly as possible?
By multiplying the pseudo-inverse of X with the vector d to obtain all the weights w of the neuron in a single step.
By performing a least-squares optimisation using gradient descent on the error function that combines X and d.
By performing a momentum-based gradient descent in the error space with respect to the weights.
By controlling the complexity of the network to avoid overfitting during the training of the weights in the different layers.
You are given weights w_ji multiplying the signal from neuron indexed i to another neuron indexed j, an activation function φ with derivative φ', output at time n for neuron indexed i denoted by y_i(n), and error e_j at neuron indexed j. Note that the indices i and j may refer to different layers, and remaining equation quantities are defined similarly as above. Using these definitions, there are two versions of the local gradient:
Eq.(A): δ_j(n) = e_j(n) φ'(v_j(n)), and
Eq.(B): δ_j(n) = φ'(v_j(n)) Σ_k δ_k(n) w_kj(n).
Which statement is correct for multi-layer feed-forward networks?
Eq.(B) is applied first to the output network layer to produce the local gradients δ_j for all neurons in the output layer. Then, Eq.(A) is applied to the layer before, which is the last hidden layer. Then the process back-propagates till all local gradients have been calculated.
Eq.(B) is applied first to the last hidden layer. Then the process back-propagates till all local gradients have been calculated. When the back-propagation process has arrived at the first hidden layer, Eq.(A) is used. So, finally all local gradients δ_j for all neurons are used for the weight updates.
These two equations are identical when the neurons use linear instead of logistic or hyperbolic tangent activation functions, so either can be used to estimate the local gradients δ_j across the neural network.
Eq.(A) is applied first to the output network layer to produce the local gradients δ_j for all neurons in the output layer. Then, Eq.(B) is applied to the layer before, which is the last hidden layer. Then the process back-propagates till all local gradients have been calculated.
In general, competitive learning in neural networks refers to cases where:
The different activation functions compete to obtain more rapid weight changes in the different neurons and regions of the input space.
The different inputs to the first layer of the neural network compete to maximise the weight adaptation of the neurons with the highest errors.
Many support vectors compete with each other to reach closer to the decision boundary and establish a better separation geometry.
A neuron can have feed-forward and lateral connections to account for excitations and inhibitions, respectively, so that for a given input pattern, only one
neuron wins to be proximate to that pattern.
What is the difference between interpolation using radial basis functions, and neural networks based on radial basis functions?
They both work using superposition of radial basis functions, but interpolation needs one basis centred at each pattern in the dataset whereas the neural net
can have a much smaller set of basis functions placed arbitrarily or selectively in the input space.
They are identical concepts but interpolation uses gradient descent, whereas radial basis function networks use least-squares optimisation for training.
They both work using superposition of radial basis functions, but interpolation is used for regression while the neural net is designed mainly for classification
tasks in machine learning, when the dataset is not very large.
Interpolation based on radial basis functions is mainly operating in a single layer with error correction learning, while the neural net is based on the existence of
multiple hidden layers and least-squares optimisation.
Which of the following is not a characteristic of artificial neural networks?
They provide insight to biological interpretability.
They are capable of rapidly accessing centralised computer memory.
They are capable of tolerance to damage and faulty operation.
They can adapt to different learning tasks and problems.
You are given weights w_ji multiplying the signal from neuron indexed i to another neuron indexed j, an activation function φ, and output at pattern or time n for neuron indexed j denoted by y_j(n). The indices i and j may refer to different layers. Using the equation y_j(n) = φ(Σ_i w_ji y_i(n)) to estimate the output of the various neurons, which statement is correct for multi-layer feed-forward networks?
This equation can be applied selectively to any neuron with high error to enable weight adjustment, and does not need to be applied in a layer-by-layer fashion.
This equation can be applied sequentially from the first layer, then to the second, then to the third, carrying on till we reach the last. During training, only
misclassified patterns are applied to this equation to calculate the output of all neurons.
This equation can be applied sequentially from the first layer, then to the second, then to the third, carrying on till we reach the last. During training, all patterns
are subjected to this equation to calculate the output values of all and every neuron.
This equation can be applied to the output layer and never in any hidden layer of the network, unless the error is already known to be very small and less than
a user-defined threshold.
Assume that Δw_ji(n) is the change of the weight of the synapse connecting neuron indexed i to neuron j at pattern or time n, η is the learning rate, α a momentum rate, δ_j the local gradient of neuron j, φ the activation function and y_i(n) the output of neuron i at pattern or time n. What is the purpose of the update equation Δw_ji(n) = η δ_j(n) y_i(n) + α Δw_ji(n−1) in multi-layer feed-forward networks?
The first term of the equation is the momentum weight update, while the second term the standard weight update aiming at accelerating learning while
suppressing oscillations.
The first term of the equation is the standard weight update, while the second term a momentum quantity aiming at increasing the gradient for the middle low-gradient areas of the activation functions.
The first term of the equation is the standard weight update, while the second term a momentum quantity aiming at controlling complexity and overfitting of the dataset, when the number of neurons is high.
The first term of the equation is the standard weight update, while the second term a momentum quantity aiming at accelerating learning while suppressing
oscillations.
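A one-step numeric sketch of the momentum update Δw_ji(n) = η δ_j(n) y_i(n) + α Δw_ji(n−1) (all values purely illustrative):

```python
# Momentum update: the standard term eta*delta_j*y_i plus a fraction alpha of
# the previous weight change, which smooths and accelerates learning.
eta, alpha = 0.1, 0.9
delta_j, y_i = 0.5, 2.0    # local gradient and presynaptic output at step n
dw_prev = 0.25             # weight change applied at step n-1
dw = eta * delta_j * y_i + alpha * dw_prev
# standard term 0.1 plus momentum term 0.225 -> dw = 0.325
```

When consecutive gradients point the same way the momentum term compounds and speeds learning; when they alternate in sign it averages them out and damps oscillations.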
In Hebbian learning the covariance hypothesis weight adaptation rule is used to:
Avoid the issue of inadequate data for training the network, and cause overfitting to the small sample set.
Avoid the problem of synaptic saturation resulting from the exponential growth of the activity product rule.
Avoid the credit assignment problem in neural designs.
Avoid the problem of overfitting of patterns from high training error in nonlinear datasets.
The output of neuron k is shown here to depend on inputs x_i and weights w_kj and w_ji, through the equation y_k = φ(Σ_j w_kj φ(Σ_i w_ji x_i)), for some known activation function φ. This equation corresponds to which following configuration?
A single hidden layer feed-forward network receives inputs, then transforms them through hidden neurons indexed by j, and then reaches an output neuron indexed by k.
A two hidden layer feed-forward network receives inputs, then transforms them through hidden neurons indexed by j, then proceeds to a second hidden layer indexed by i, and then reaches an output neuron indexed by k.
A single layer perceptron is used to facilitate conversion from inputs to outputs twice and obtain a statistical average.
A recurrent neural network combines current inputs and past inputs through the activation function to obtain the output indexed by k.
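A nested equation of the form y_k = φ(Σ_j w_kj φ(Σ_i w_ji x_i)) corresponds to the forward pass sketched below (the weights and inputs are illustrative; a logistic φ is assumed):

```python
import math

# Forward pass for a single-hidden-layer network:
# inputs (index i) -> hidden neurons (index j) -> output neuron k.
def phi(v):
    # logistic activation with shape parameter a = 1
    return 1.0 / (1.0 + math.exp(-v))

x = [1.0, -1.0]
w_ji = [[0.5, -0.5], [1.0, 1.0]]  # one row of input weights per hidden neuron j
w_kj = [1.0, -2.0]                # weights into the output neuron k

hidden = [phi(sum(wv * xv for wv, xv in zip(row, x))) for row in w_ji]
y_k = phi(sum(wv * hv for wv, hv in zip(w_kj, hidden)))
```

The inner sum and φ produce the hidden outputs indexed by j; the outer sum and φ combine them into the single output indexed by k, matching the single-hidden-layer reading of the equation.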