Under construction: raw and unedited without links!
Part IV
If the human brain were so simple that we
could understand it, we would be so simple that we couldn't!!!
If you think to build a tower, first reckon
up the cost. - St. Jerome
So far the networks we have looked at have
consisted of only one or two layers of neurodes: an input layer
and possibly an output layer. Because they have so few layers,
they are only able to take advantage of any natural coding of
information that already exists in their input. These networks
do not have the ability to interpret their input data or to organize
them into any kind of internal worldview.
About twenty years ago, Marvin Minsky and Seymour Papert of MIT wrote a book, Perceptrons, which proved that 1- or 2-layered perceptron networks were inadequate for many real-world problems. Their book, combined with other contributing factors of the time, was so influential that neural network research and development was brought to a near-standstill for almost two decades. Only a few die-hard researchers continued to work in the field, and they had a great deal of difficulty in obtaining funding, tenure, and promotions. We need to look at Minsky and Papert's arguments to understand why they wrote what they did and why it had such a tremendous impact on the field.
The discussion in Perceptrons was a thorough piece of reasoning. Minsky and Papert performed a careful analysis of the problem of mapping one pattern to another. In this context, mapping simply means association. That is, when we map A to 1, B to 2, and so on, we are correlating the letters with numbers. In this view, a mathematical function is a mapping of the function's value to each value of the variable(s). In many cases, all we want a neural network to do is to provide such a mapping.
...
Minsky and Papert concluded that it would
be impossible for simple perceptron networks ever to solve problems
with this characteristic. In other words, it appeared at the time
that neural networks could solve only problems where similar input
patterns mapped to similar output patterns. Unfortunately, many
real-world problems, such as the parity problem and the exclusive
OR problem, do not have this characteristic. The outlook appeared
gloomy for neural network researchers. Minsky and Papert were
correct in their analysis of perceptron neural networks. It eventually
became clear, however, that what was needed to correct the problem
was to make the networks slightly more complex. In other words,
although a two-layer network cannot solve such problems, a three-layer
network can. While Minsky and Papert recognized that this was
possible, they felt it unlikely that a training method could be
developed to find a multi-layered network that could solve these
problems. As it turns out, there is strong evidence that multilayered
networks intrinsically have significantly greater capabilities
than one- or two-layered networks. A mathematical theorem exists
that proves that a three-layer network can perform all real mappings;
it is called Kolmogorov's theorem.
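To make the exclusive OR example concrete, here is a minimal sketch (not from the book) of a three-layer network with hand-chosen weights that computes XOR, something no one- or two-layer perceptron can do. The weights and thresholds below are illustrative choices only.

# Minimal sketch: a 2-2-1 network with hand-picked weights computes XOR,
# a mapping that no single-layer perceptron can realize.
# The weights below are illustrative choices, not values from the text.

def step(x):
    """Hard threshold: fire (1) if net input is positive, else 0."""
    return 1 if x > 0 else 0

def xor_net(x1, x2):
    # Hidden layer: h1 fires for "x1 OR x2", h2 fires for "x1 AND x2".
    h1 = step(1.0 * x1 + 1.0 * x2 - 0.5)
    h2 = step(1.0 * x1 + 1.0 * x2 - 1.5)
    # Output layer: fire when OR is true but AND is not -- exclusive OR.
    return step(1.0 * h1 - 2.0 * h2 - 0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(a, b))   # last column reads 0, 1, 1, 0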
Kolmogorov's Theorem
In the mid-1950s Soviet mathematician A. N. Kolmogorov published the proof of a mathematical theorem that provides a sound basis for mapping networks.
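The theorem's usual modern statement, given here for reference rather than quoted from the book, says that any continuous function of n variables on the unit cube can be written as

f(x_1, \dots, x_n) = \sum_{q=0}^{2n} \Phi_q\!\left( \sum_{p=1}^{n} \psi_{q,p}(x_p) \right)

where the \Phi_q and \psi_{q,p} are continuous functions of a single variable. Read as a network, the inner sums play the role of a hidden layer of at most 2n + 1 neurodes, and the outer sum plays the role of the output layer.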
...
Let's step back from this theoretical discussion and try to describe Kolmogorov's result more concretely. When we build a neural network of three layers, we are generating a system that performs the desired mapping in a two-step process. First, in moving from layer 1 to layer 2, the input pattern is translated into an internal representation that is specific and private to this network (thus the frequently used term hidden for the middle layer of a multilayer network).
Second, when the activity of the network moves
from layer 2 to layer 3, this internal representation of the pattern
is translated into the desired output pattern. The middle layer
of the network somehow implements an internal code that the network
uses to store the correct mapping instructions. This is important
to understand because it is one of the chief reasons that a hierarchical,
multilayer neural network is so much more powerful than a simple
neural network. Adding a hierarchy of layers to the system allows
for complex internal representations of the input patterns that
are not possible with simpler systems. The internal representation
generated by the hierarchical network may or may not be one that
is meaningful to us as humans. Researchers have spent a great
deal of time reverse engineering trained, multilayer networks
to try to decipher the codes they use. A couple of important points
emerge from such studies. First, the representation that the network
develops is not cast in concrete. If the network is reinitialized
(the weights are randomly set to new initial values) and the network
retrained on the same training data in the same training regimen,
the internal representation developed the second time will generally
be similar to but not identical with the first representation.
Furthermore, there is no way to predict which neurode will encode
any specific portion of the representation. The second important
point is that the encoding used by the network may or may not
have a bearing on any encoding scheme animals use in their brains,
nor need it make any particular sense to us. While reverse engineering
a trained neural network can provide clues to the operation of
biological networks, it is dangerous to take such clues too seriously
and assume that biological networks have to work the same way.
What Kolmogorov Didn't Say
There are some questions that Kolmogorov's
theorem does not answer. For example, it does not tell us that
the network described is the most efficient implementation of
a network for this mapping. Nor does it tell us whether there
is a network with fewer neurodes that can also do this mapping.
And, of course, the functions used in the Kolmogorov network are
not specifically defined by the theorem.
Kolmogorov's theorem assures us that we need not go to hundreds or thousands of layers to make a good mapping; there is no need for neural network skyscrapers. Instead the theorem demonstrates that there is in fact a way to do any mapping we choose in as few as three layers.
This agrees with our knowledge about biological systems.
Our brains are incredibly complex, but the
number of processing layers for any particular subsystem is remarkably
small for the power of its operation.
You may notice something else about the Kolmogorov
multilayer network. It appears that within each layer of the network,
there is little interaction among neurodes. Instead neurodes take
inputs from the previous layer and fan their output to the next
layer, but they do not receive inputs or generate outputs to other
neurodes on the same layer. This is quite typical of many neural
network architectures.
Such a hierarchical structure is similar to
the organization of biological systems. In fact, it now appears
that much of the brain is organized as functional modules arranged
in a parallel set of increasing hierarchies. At many vertical
levels within a given subsystem of the brain, additional interaction
occurs from other hierarchical subsystems, allowing, for example,
the visual system to interact with the motor system.
In the next chapters, we take a close look
at some hierarchical neural network architectures. ...
Application: The Neocognitron
...
Let's first review the neocognitron's general
operation. The analog pattern to be recognized (one of a set of
memorized symbols, for instance) is presented to the input stage.
The second stage recognizes the constituent small features, or
geometric primitives, of the symbol, and each succeeding stage
responds to larger and larger groupings of these small features.
Finally, one neurode in the output stage recognizes the complete
symbol. There must be as many neurodes in the output stage as
there are symbols in the set to be recognized. ...
Like the pace of a crab, backward. - Robert Greene
Arguably the most successful and certainly
one of the most studied learning systems in the neural network
arena is backward error propagation learning or, more commonly,
backpropagation. Backpropagation has perhaps more near-term application
potential than any other learning system in use today. Researchers
have used this method to teach neural networks to speak, play
games such as backgammon, and distinguish between sonar echoes
from rocks and mines. In these and other applications, backpropagation
has demonstrated impressive performance, often comparable only
to the far more complex adaptive resonance systems discussed in
chapter 16.
Features of Backpropagation Systems
Each training pattern presented to a backpropagation
network is processed in two stages. In the first stage, the input
pattern presented to the network generates a forward flow of activation
from the input to the output layer. In the second stage, errors
in the network's output generate a flow of information from the
output layer backward to the input layer. It is this feature that
gives the network its name. This backward propagation of errors
is used to modify the weights on the interconnects of the network,
allowing the network to learn.
Backpropagation is actually a learning algorithm
rather than a network design. It can be used with a variety of
architectures. The architectures themselves are less important
than their common features: all are hierarchical, all use a high
degree of interconnection between layers, and all have nonlinear
transfer functions. ... Another way to think of the action of
the middle layer is that it creates a map relating each input
pattern to a unique output response. We have already seen one
such mapping network, the Kohonen feature map. Because the backward
transmission of errors allows backpropagation networks to generate
a sophisticated and accurate internal representation of the input
data, they are often more versatile than these other mapping networks.
For example, counterpropagation networks, discussed in chapter
15, can have difficulty with discontinuous mappings or with mappings
in which small changes in the input do not correspond to small
changes in the output. Backpropagation networks can generally
learn these mappings, often with great reliability. The middle
layer, and thus the hierarchical structure of the system, is the
source of the improved internal representation of backpropagation
networks. Physically this representation exists within the synapses
of the interconnects of the middle layers of the network; the
higher the interlayer connectivity, the better the ability
of the network to build a representation or model of the input
data.
Building a Backpropagation System
Let's construct a typical backpropagation
system in our minds and see the way in which it works. First we
need to select an appropriate architecture. For our purposes,
three feed-forward hierarchical layers are sufficient, with each
layer fully connected to the following layer.
...
The Backpropagation Process
To teach our imaginary network something using backpropagation, we must start by setting all the adaptive weights on all the neurodes in it to random values. It won't matter what those values are, as long as they are not all the same and not equal to 1. To train the network, we need a training set of input patterns along with a corresponding set of desired output patterns. The first step is to apply the first pattern from the training set and observe the resulting output pattern. Since we know what result we are supposed to get, we can easily compute an error for the network's output; it is the difference between the actual and the desired outputs.
This should sound familiar. We encountered the same rationale when we talked about the adaline in chapter 6. In that chapter, we used the delta rule, which computed the distance of the current weight vector from some ideal value and then adjusted the weights according to that computed distance. The learning rule that backpropagation uses is quite similar. It is a variation of the original delta rule called the "generalized delta rule." The original delta rule allowed us to adjust the weights using the following formula: multiply the error on each neurode's output by the size of that output and by a learning constant to determine the amount to change each neurode's weights. By this formula, the change in the weight vector is proportional to the error and parallel to the input vector. ...
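The generalized delta rule extends this idea by propagating each output neurode's error term backward through the weights to the hidden layer. The following is a compact sketch of that two-pass process on a tiny 2-2-1 sigmoid network; the layer sizes, learning constant, and XOR training set are illustrative assumptions, not values from the text.

# Sketch of backpropagation (the generalized delta rule) on a 2-2-1 network
# with sigmoid neurodes. All sizes and constants are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.uniform(-0.5, 0.5, size=(2, 2))   # input -> hidden weights
b1 = rng.uniform(-0.5, 0.5, size=2)
W2 = rng.uniform(-0.5, 0.5, size=(2, 1))   # hidden -> output weights
b2 = rng.uniform(-0.5, 0.5, size=1)
eta = 0.5                                   # learning constant

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)   # desired outputs (XOR)

for _ in range(5000):
    for x, t in zip(X, T):
        # Forward pass: activity flows from the input layer to the output layer.
        h = sigmoid(x @ W1 + b1)
        y = sigmoid(h @ W2 + b2)

        # Backward pass: error terms flow from the output back toward the input.
        delta_out = (t - y) * y * (1 - y)             # output-layer error term
        delta_hid = (delta_out @ W2.T) * h * (1 - h)  # hidden-layer error term

        # Generalized delta rule: each weight change is proportional to the
        # error term and to the activity feeding that connection.
        W2 += eta * np.outer(h, delta_out)
        b2 += eta * delta_out
        W1 += eta * np.outer(x, delta_hid)
        b1 += eta * delta_hid

# Outputs should move toward 0, 1, 1, 0; as discussed later in this chapter,
# gradient descent can occasionally stall in a local minimum instead.
print(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).round(2))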
...
Let's review the entire backpropagation process.
First we present an input pattern at the input layer of the network,
and this generates activity in the input-layer neurodes. We allow
this activity to propagate forward through each of the layers
of the network until the output layer generates an output pattern.
Remember that we initially set the weights on all modifiable interconnects
randomly, so we are almost guaranteed that the first pass through
the network will generate the wrong output. We compare this output
pattern to the desired output pattern in order to evaluate errors
that are propagated backward through the layers of the network,
changing the weights of each layer as it passes.
This complete round of forward activation
propagation and backward error propagation constitutes one iteration
of the network. From here, we can present the same input pattern
to the network again, or we can modify it and present a different
input pattern, depending on what we are trying to accomplish.
In any event, we do complete iterations of the network every time
we present an input pattern for as long as we are training the
network.
Limitations of Backpropagation Networks
Backpropagation is computationally quite complex.
Many iterations, often hundreds or even thousands, are usually
required for the network to learn a set of input patterns. This
causes a backpropagation network to be a risky choice for applications
that require learning to occur in real time, that is, on the fly.
Of course, many applications do not require that learning occur
in real time, only that the network be able to respond to input
patterns as they are presented after it has been trained.
There is still more bad news about backpropagation
systems, however. Backpropagation, unlike the counterpropagation
system we will look at in chapter 15, is not guaranteed to arrive
at a correct solution. It is a gradient descent system, just as
the adaline was. For this reason it is bound by the problems of
a class of systems called hill-climbing algorithms, or rather hill
descent in this case. Readers familiar with AI jargon will have
heard this term before. The hill descent problem asks, "How
do you find the bottom of a hill?"
One commonsense solution is simply to always
walk downhill, which is exactly what gradient-descent algorithms
do. If you have ever tried this on a real hill, however, you know
that there is a hitch: you sometimes find yourself in the bottom
of a dip halfway down the hill and are forced to climb out of
the dip in order to continue downhill. If the dip has steep sides,
this commonsense approach to getting down the hill may not actually
get you there at all but may strand you in a dip part way down.
The feature corresponding to a dip in the gradient descent method
is called a local minimum.
Gradient-descent algorithms are always subject
to this problem. There is no guarantee that they will lead the
network to the bottom of the hill; they can be sidetracked by
local minima and end up stranded with no further means of arriving
at the bottom of the hill. In particular, although the adaline
had a smooth parabolic bowl for its error function, the complex
architecture of the backpropagation network has an equally complex
error function with the potential of having many local hills and
dips between the network's starting error and the desired minimum
error position. In practice, backpropagation systems have been
found to be remarkably good at finding the bottom of the hill.
Even so, nearly every researcher has found that some trials do
not work, and their backpropagation system fails to find the correct
answer. A great deal of research is being conducted to determine
how to identify such cases in advance or otherwise escape from
local minima.
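A toy numerical illustration, not taken from the book, of how a pure "walk downhill" rule can strand a system in a dip: the one-dimensional error function and step size below are arbitrary choices.

# Toy illustration of the hill-descent (local minimum) problem.
# The function and step size are arbitrary choices for illustration only.

def error(w):
    # A "hill" with a shallow dip partway down and a deeper true bottom.
    return w**4 - 3 * w**2 + w

def gradient(w):
    return 4 * w**3 - 6 * w + 1

w = 2.0                         # starting position on the hill
for _ in range(1000):
    w -= 0.01 * gradient(w)     # always walk downhill

# Settles near the local minimum around w = 1.1; the deeper minimum
# near w = -1.3 is never reached from this starting point.
print(w, error(w))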
One method modifies the delta rule still more
than the generalized delta rule does by adding a feature called
the "momentum term." Consider how this might work. A
sled moving down a snow-covered hill is a perfect example of a
gradient-descent system. It has no internal power to move itself
uphill unless the rider gets off and pulls it. However, a sled
can overcome small bumps or even short rises in its path if it
has generated enough physical momentum to carry it over such perturbations
and to allow it to continue in its original direction, downhill.
The momentum term in the modified delta rule works in the same
fashion.
Adding momentum to the weight change law is
easy. We just add a term to the existing formula that depends
on the size and direction of the weight change in the previous
iteration of the network. To use our sled analogy, each new iteration
"remembers" the direction and speed it had in the last
iteration. If the algorithm finds itself in a local minimum, this
momentum term may make enough contribution to the formula to carry
the system out the other side so it can continue on its way "downhill."
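In symbols (our notation, not the book's), one common form of the momentum-augmented weight change is

\Delta w_{ij}(t) = \eta \, \delta_j \, x_i + \alpha \, \Delta w_{ij}(t-1)

where \eta is the learning constant, \delta_j is the error term for the receiving neurode, x_i is the activity on the incoming connection, and \alpha (a constant somewhat less than 1) weights the previous iteration's change, the "memory" that carries the sled over small bumps.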
This momentum term adds to the complexity
of an already tedious calculation. Do we really need it? The answer
is that we do not need it, but we may want it. Research suggests
that we can accomplish nearly the same result by making the weight
change for each learning step very small. Of course, this action
causes the network to require many more iterations to learn a
pattern, and including the momentum term is usually the more desirable
solution of the two.
The generalized delta rule is the most common
implementation of backpropagation; however, there are variations
on this scheme.
Variations of the
Generalized Delta Rule
Many researchers have offered variations on
the generalized delta rule theme. In general, these attempt to
decrease the number of iterations required for the network to
learn, to reduce the computational complexity of the network,
or to improve the local computability of the network.
...
Scaling Problems
There is one more serious drawback to backpropagation
networks: they do not scale up very well from small research systems
to the larger ones required for real-world uses. ...
This scaling problem restricts the applicability of backpropagation to problems that can be solved with relatively small networks. There are many such problems, however, and sometimes a collection of small backpropagation networks can be used to solve large problems. Also even small backpropagation networks can master surprisingly difficult tasks. ...
Biological Arguments against Backpropagation
In addition to these pragmatic difficulties,
backpropagation also faces other objections. One argument is that
this learning system is not biologically plausible. These critics
deem it unlikely that animal brains use backpropagation for learning.
One reason they believe this is that while our brains do have
reverse neural pathways-for instance, from the brain back to the
eye-these are not the same interconnects and synapses that provide
the "forward" activity. Recall that a backpropagation
system traditionally uses the same interconnects for both the
forward and backward passes through the network. A second and
more serious reason that some critics believe backpropagation
is not biologically plausible is that it requires each neurode
in the network to compute its output and weight changes based
on conditions that are not local to that neurode. Specifically
the neurode relies on knowledge of the errors in the next higher
layer of the network to compute changes in its own weights. Most
current research seems to support the idea that real synapses
change their weights in response only to locally available information
rather than relying on information about the activity of neurons
farther up the computational chain. This lack of reliance on locally
available information is thought by many to disqualify backpropagation
systems as serious biological models.
Recently researchers investigating this issue have suggested that there is a backpropagation formulation that is consistent with biological systems and requires only locally available information at each neurode to adjust synapse weights. If this proves to be the case, some critics of backpropagation will be silenced. To many supporters of the method, however, nothing will be changed because they were never concerned that it did not possess biological plausibility. We suggest that biological plausibility need not be weighted too heavily in the development of neural network paradigms. It is true that biological systems are good models for network architectures; they furnish architectures we know will work. Except for researchers who are actually trying to model the brain, however, there seems little reason to reject an effective system just because it is unlikely to be an accurate model of biological systems.
pg. 194...
[So what is the bias of Samsara? Since it is built into the structure of the neural networks by which we transform the holoprocess into a Self that experiences a reality of objects and consciousness, it is almost impossible to conceive that the brain is a secondary processor whose context is not the reality out there but a primary holoprocess. This holoprocess is like a hypercube ("Wish fulfilling Gem") that the Self reduces to lower dimensional slices. So from the perspective of lower dimensional slices, we model the brain's activities in adjustment of weights as being about "learning, memory and data acquisition", when in my model it is about creating and modifying neural network systems. Thus we are growing ways to create knowledge, not learning knowledge that is already present.
This next section is about other ways competition
and filtering can take place. Why all this fuss about inhibition
and winner take all? The driving force of human culture is the
skill of ignorance! The act of ignoring is what we call concentration
and what I call taking a slice in a lower dimension of a higher
dimensional object. Being able to inhibit and ignore is the basis
of separation from a chaos of unity of consciousness in the holomind
and the basis of an ego.]
pg. 202-205
The Counterpropagation
Network
The name counterpropagation derives from the
initial presentation of this network as a five-layered network
with data flowing inward from both sides, through the middle layer
and out the opposite sides. There is literally a counterflow of
data through the network. Although this is an accurate picture
of the network, it is unnecessarily complex; we can simplify it
considerably with no loss of accuracy. In the simpler view of
the counterpropagation network, it is a three-layered network.
The input layer is a simple fan-out layer presenting the input
pattern to every neurode in the middle layer. The middle layer
is a straightforward Kohonen layer, using the competitive filter
learning scheme discussed in chapter 7. Such a scheme ensures
that the middle layer will categorize the input patterns presented
to it and will model the statistical distribution of the input
pattern vectors. The third, or output layer of the counterpropagation
network is a simple outstar array. The outstar, you may recall,
can be used to associate a stimulus from a single neurode with
an output pattern of arbitrary complexity. In operation, an input
pattern is presented to the counterpropagation network and distributed
by the input layer to the middle, Kohonen layer. Here the neurodes
compete to determine which neurode has the strongest response
(the closest weight vector) to the input pattern vector. That
winning neurode generates a strong output signal (usually a +1)
to the next layer; all other neurodes transmit nothing. At the
output layer we have a collection of outstar grid neurodes. These
are neurodes that have been trained by classical (Pavlovian) conditioning
to generate specific output patterns in response to specific stimuli
from the middle layer. The neurode from the middle layer that
has fired is the hub neurode of the outstar, and it corresponds
to some pattern of outputs. Because the outstar-layer neurodes
have been trained to do so, they obediently reproduce the appropriate
pattern at the output layer of the network. In essence then, the
counterpropagation network is exquisitely simple: the Kohonen
layer categorizes each input pattern, and the outstar layer reproduces
whatever output pattern is appropriate to that category. What
do we really have here? The counterpropagation network boils down
to a simple lookup table. An input pattern is presented to the
net, which causes one particular winning neurode in the middle
layer to fire. The output layer has learned to reproduce some
specific output pattern when it is stimulated by a signal from
this winner. Presenting the input stimulus merely causes the network
to determine that this stimulus is closest to stored pattern 17,
for example, and the output layer obediently reproduces pattern
17. The counterpropagation network thus performs a direct mapping
of the input to the output.
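A minimal sketch of that lookup-table behavior in recall mode, with small made-up weight matrices standing in for a trained Kohonen layer and outstar layer:

# Minimal sketch of a trained counterpropagation network in recall mode:
# the middle (Kohonen) layer picks the single closest stored category,
# and the outstar layer reproduces that category's output pattern.
# Weight values here are arbitrary illustrative stand-ins.
import numpy as np

kohonen_weights = np.array([   # one row per middle-layer neurode
    [1.0, 0.0],                # category 0
    [0.0, 1.0],                # category 1
    [0.7, 0.7],                # category 2
])
outstar_weights = np.array([   # output pattern stored for each category
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

def recall(input_vector):
    # Competition: the neurode whose weight vector is closest to the input wins.
    distances = np.linalg.norm(kohonen_weights - input_vector, axis=1)
    winner = int(np.argmin(distances))
    # The winner sends a +1 to the outstar layer, which reproduces its pattern.
    return outstar_weights[winner]

print(recall(np.array([0.9, 0.1])))   # closest to category 0 -> [1, 0, 0]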
Training Techniques and Problems
...
The Size of the Middle Layer
Now let's consider the size of the middle
layer not in the context of training issues but in terms of the
accuracy of the network's response. If we are trying to model a
mapping with 100 possible patterns, and we set up a counterpropagation
network with 10 middle-layer neurodes, then we can expect some
inaccuracies in the network's answer. It is not, by the way, so
straightforward as saying we will get 10 percent accuracy; we
might find a much higher degree of accuracy depending on how densely
packed the probability density distribution of the input pattern
data is. In the simplest case, the input patterns form a uniform
probability density function. If the data patterns are evenly
distributed throughout the unit circle, we expect that the weight
vectors will also be equally distributed after training. In the
two-dimensional case, each weight vector will have to cover about
36 degrees (360 degrees divided by 10 vectors) of the unit circle.
In other words, any input vector within this 36 degree arc of
the circle will generate precisely the same output. If we use
100 weight vectors to cover this 360 degree span, we would expect
that each weight vector will correspond to about 3.6 degrees of
arc, so any input vector within 3.6 degrees of a weight vector
will count as a hit. The situation becomes more complex if the
input patterns are not evenly distributed about the unit circle.
In this case the weight vectors are clustered during training
about the areas of the unit circle most likely to contain input
vectors. Regions outside the most common input areas may end up
with very few weight vectors in their region. Thus the occasional
input vector that occurs in one of these sparsely populated regions
may end up being approximated by a weight vector that is only
a gross estimate of the actual input vector. On the other hand,
input vectors that do occur in the area densely clustered with
weight vectors will be quite closely approximated. In reality,
a nonuniform distribution of input patterns is much more likely
in a real application than a uniform one. This means that the
accuracy of the network's mapping is better in those parts of
the unit circle that are more likely to contain input vectors.
Rather than having a uniform accuracy, counterpropagation networks
have higher accuracy in the more commonly used areas of the unit
circle and lower accuracy in the areas less likely to receive
input vectors. For many applications, this is quite acceptable
and possibly preferable. We still have not answered the question
of how many neurodes we need to have in the middle layer. We have
only indicated that the answer depends on how accurate the network's
output needs to be.
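A quick numerical check of the arc argument above, assuming evenly spaced weight vectors around the unit circle; the trial counts and the Monte Carlo approach are our own choices, not the book's.

# Numerical check of the arc argument: with K weight vectors spaced evenly
# around the unit circle, the worst-case angular error is about half the arc
# each vector covers. The K values and trial count are illustrative.
import numpy as np

def worst_case_error_degrees(K, trials=20000):
    weights = np.arange(K) * 360.0 / K            # evenly spaced weight angles
    inputs = np.random.uniform(0, 360, trials)     # random input directions
    # Angular distance from each input to its nearest weight vector.
    diffs = np.abs(inputs[:, None] - weights[None, :])
    diffs = np.minimum(diffs, 360 - diffs)
    return diffs.min(axis=1).max()

for K in (10, 100):
    print(K, "weight vectors -> worst error ~", round(worst_case_error_degrees(K), 1), "degrees")
# 10 vectors: each covers a 36 degree arc (worst error ~ 18 degrees);
# 100 vectors: each covers a 3.6 degree arc (worst error ~ 1.8 degrees).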
The more neurodes we have, the more accurate our mapping can be. This is one of the key drawbacks to the counterpropagation network (and the Kohonen network as well), in fact, because real problems may well demand middle-layer sizes too large to build today. If we can afford to have only a limited number of neurodes, the mapping will still work, of course, but it may be less precise than we need. There is no hard and fast rule to apply to this question. As in many other situations with neural networks, the answer is: It depends.
There is a way to improve the counterpropagation network's accuracy without requiring an unacceptably large middle layer: we can allow the network to interpolate its output. In other words, if we have trained the network to respond with a 1.0 to a blue input and a 2.0 to a red input, we would like the network to output, say, a 1.5 to a magenta input. It is quite simple to implement this kind of interpolation in the counterpropagation network. All we have to do is to change the middle layer to allow more than one winner. For example, we might allow the middle layer to have two winners: the two neurodes with weight vectors closest to the input vector. In this case, the network's output will be a melding of the outputs from the neurode categories in the middle layer. If we want to broaden its response, we might allow three winners. We must be careful not to allow too many winners, or the output pattern will be too ambiguous to be useful. However, permitting multiple winners in the middle layer does give the network the ability to interpolate between known patterns.
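Here is a small sketch of the two-winner idea; the book does not spell out how the two outputs are melded, so the inverse-distance blend and the color-coded weights below are purely illustrative assumptions.

# Sketch of interpolation with two winners in the middle layer. The blending
# rule (inverse-distance weighting) and all weight values are stand-ins.
import numpy as np

kohonen_weights = np.array([[1.0, 0.0],    # "blue" category
                            [0.0, 1.0]])   # "red"  category
outstar_outputs = np.array([1.0,           # trained response to blue
                            2.0])          # trained response to red

def recall_two_winners(input_vector):
    d = np.linalg.norm(kohonen_weights - input_vector, axis=1)
    top2 = np.argsort(d)[:2]                    # the two closest neurodes
    blend = 1.0 / (d[top2] + 1e-9)              # nearer winner counts more
    blend /= blend.sum()
    return float(np.dot(blend, outstar_outputs[top2]))

print(recall_two_winners(np.array([0.7, 0.7])))   # "magenta" input -> about 1.5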
pg. 205...
[The last excerpt from this book is adaptive
resonance. Resonance plays such an obvious part of normal human
existence that I look to all the different implementations to
explain such feelings as love, beauty, respect, awe and religious
worship to name a few. The "problems" that are listed
with these nets at the end of this chapter are to me only indications
that this model is very close to how we have constructed them
in our brains. The problem of "noise" in corrupting
a net's operation is not present in other nets that can "see
thru" noise to recognize degraded information in our consciousness
of sight and hearing. But when it comes to conceptualization within
the context of cultural shared wisdom, it seems humans play a
game of whisper, where a secret is whispered down a line of people
and never turns out the same when repeated by the last one to
hear it. Also the problem of fineness of adjustment fits my model
of levels of resolution and learning which requires finer and
finer levels of adjustment before humans acknowledge mastery of
human skills. We do not (usually) object if a person is more or
less skillful at walking or looking at a landscape, but we expect
a high level of skill for a doctor or electrician.]
pg. 207-223
The ideas of adaptive resonance theory can
be confusing initially, but the effort expended in understanding
them is well spent. ART supplies a foundation upon which we may
eventually be able to build genuinely autonomous machines. These
networks are as close as anyone has yet come to achieving the
goals for the autonomous systems listed in chapter 12.
The Principle of
Adaptive Resonance
We can best present the basic idea of adaptive
resonance with the two-layer network shown in figure 16.1. The
broad arrows in the figure are a shorthand way of indicating that
the network layers are fully interconnected with modifiable synaptic
weights. Although it will not be important to our immediate discussion,
let's assume our net uses the outstar learning model.
Each pattern presented to the network initially
stimulates a pattern of activity in the input layer. We call this
the "bottom-up" pattern; it is also called the "trial"
pattern. Because of the outstar structure, this bottom-up pattern
is presented to every neurode of the upper, storage layer. This
pattern is modified (in the normal weighted-sum fashion) during
its transmission through the synapses to the upper layer and stimulates
a response pattern in the storage layer. We call this new activity
the "top-down" pattern of activity; it may also be called
the "expectation" pattern. It generally is quite different
from the bottom-up pattern. Since the two layers are fully interconnected,
this top-down pattern is in turn presented (by the synapses on
the top-down interconnects) back to the input layer.
We can think of the operation of these two
layers in another way. The basic mode of operation is one of hypothesis
testing. The input pattern is passed to the upper layer, which
attempts to recognize it. The upper layer makes a guess about
the category this bottom-up pattern belongs in and sends it, in
the guise of the top-down pattern, to the lower layer. The result
is then compared to the original pattern; if the guess is correct
(or, rather, is close enough as determined by a network parameter),
the bottom-up trial pattern and the top-down guess mutually reinforce
each other and all is well. If the guess is incorrect (too far
away from the correct category), the upper layer will make another
guess. Eventually either the pattern will be placed in an existing
category or learned as the first example of a new category. Thus,
the upper layer forms a "hypothesis" of the correct
category for each input pattern; this hypothesis is then tested
by sending it back down to the lower layer to see if a correct
match has been made. A good match results in a validated hypothesis;
a poor match results in a new hypothesis.
If the pattern of activity excited in the
input-layer neurodes by the top-down input is a close match to
the pattern excited in the input layer by the external input-if
the guess is correct, in other words-then the system is said to
be in adaptive resonance. The ART systems that we will describe
are built on this principle. We will see, however, that we must
introduce several complexities into this basic scheme in order
to make a working neural network design.
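A compact sketch of this hypothesis-testing cycle, stripped of the network machinery: stored prototypes stand in for the storage layer, and the match test and its threshold are assumptions chosen only to make the loop concrete (the actual ART matching and vigilance mechanisms are described later in the chapter).

# Conceptual sketch of ART-style hypothesis testing with binary patterns.
# The overlap-based match measure and the threshold are illustrative choices.
import numpy as np

stored = []            # top-down "expectation" prototypes learned so far
MATCH_THRESHOLD = 0.6  # how close a guess must be to count as resonance

def present(trial):
    tried = set()
    while True:
        # The upper layer guesses the best untried category for the trial pattern.
        candidates = [i for i in range(len(stored)) if i not in tried]
        if not candidates:
            stored.append(trial.copy())          # learn a brand-new category
            return len(stored) - 1
        best = max(candidates, key=lambda i: int(np.sum(stored[i] & trial)))
        guess = stored[best]
        # Send the guess back down and compare it with the original trial pattern.
        match = np.sum(guess & trial) / max(np.sum(trial), 1)
        if match >= MATCH_THRESHOLD:
            return best                          # adaptive resonance: accept the guess
        tried.add(best)                          # poor match: suppress it and guess again

print(present(np.array([1, 1, 0, 0])))   # nothing stored yet -> learned as category 0
print(present(np.array([1, 1, 0, 1])))   # close enough -> resonates with category 0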
[The statement "the upper layer forms
a "hypothesis" of the correct category for each input
pattern; this hypothesis is then tested by sending it back down
to the lower layer to see if a correct match has been made"
and the others in this book that I have previously noted confirm
what my "mind of enlightenment" has been pointing to
since 1958. Since my social / cultural mind is that of a scientist
who became involved in the Arts, I have not expected to get any
results or support for publishing this until there is the kind
of support provided by research of the last 20 years. I could
not have put this new vision into any but mythic language prior
to the beautiful work outlined here in this chapter. But beyond
this work, I see the connection with the cosmos as a meta model
that can form all "hypothesis", but forms specific "hypothesis"
for each individual and other hard wired species specific "hypothesis".
Other species are restricted in their ability to form "hypothesis",
whereas humans can use language to simulate the category of "hypothesis".]
Before we go on, we need to make a note of the internal architecture of these two layers. Recall that the lower layer is devoted to processing the input pattern and achieving adaptive resonance and that the top layer is devoted to pattern storage. In our discussion of the adaptive resonance principle, we concentrated on the interconnections between the input and storage layers. We now need to concentrate on the interconnections within layers, the intralayer connections, of the input and storage layers.
For the minimal ART 1 structure we will consider
here, the nodes of the input layer are individual neurodes connected
into a competitive internal architecture of the type presented
in chapter 7. To simplify the discussion, only one winner will
be allowed. In general, multiple winners can be permitted, but
nearly all ART systems actually built force a single winner. The
rivalry for activity inherent in the competitive structure is
essential to the adaptive resonance process. (While this single-winner
strategy is not inherent in the design of the network, it is much
easier to implement and operate than a multiwinner strategy. ...)
...
Operation of the ART 1 Network
Let's move on to the nitty-gritty of the network
and see how an ART 1 network operates in detail. Such an exploration
will tell us much about the delicate balances necessary to build
a good hybrid network.
An external pattern presented to the input
layer causes some of the nodes in that layer to become active.
Because ART 1 allows only binary inputs and the neurodes have
binary output functions, this pattern will be identical to the
external input pattern. Nevertheless, for consistency of discussion
we call the pattern that becomes active in the input layer the
"trial pattern."
Each of the neurodes in the input layer is
connected to each of the neurodes in the upper, storage layer.
By this means, the pattern of activity in the input layer is transmitted
to the storage layer. In the process of moving across the synaptic
junctions between the two layers, the pattern is modified so that
the activity generated in the storage layer differs from the trial
pattern.
The pattern arriving at the storage layer
signals the beginning of a competition among the nodes in this
layer. The winner of this competition generates an output signal
and all others are suppressed. The resulting output-layer activity
pattern is called the "expectation" pattern. For the
moment we will assume that this pattern consists of exactly one
active node because we have designed it so that only one node
can win the competition. This expectation pattern is, of course,
transmitted (via the top-down synaptic junctions) back to the
input layer.
[ The "expectation pattern" structure is the start of no-choice Fate and Karma viewpoint of self fulfilling predictions. This is the kind of backpropagation that exists in human "Software" networks constructed in language and culture.]
Everything we have said about the interconnection
of the input layer to the storage layer is true of the return
path from the storage layer to the input layer. Because the expectation
pattern must pass through the synaptic junctions, it will be modified
en route; thus, the pattern of activity generated in the input
layer will be quite different from the expectation pattern itself.
The pattern generated by the expectation pattern in general will
involve a number of nodes in the input layer.
The input layer now has two inputs presented
to it: the external input, which originally excited the trial
pattern, and the top-down expectation pattern. These merge and
generate a new pattern of activity, which replaces the old trial
pattern. If this new and the original trial pattern are very similar,
the network is in adaptive resonance, and the output of the storage
layer becomes stable. The corresponding expectation (top-down)
pattern is the stored symbol, or icon, for the external input
pattern presented. We have not yet considered all possibilities,
however. What if the expectation pattern excites a pattern in
the input layer entirely different from the original trial pattern?
This will happen, for example, when the ART network is presented
with a pattern very different from any it has yet seen. Such an
input does not match any of the network's "known" patterns
and results in a mismatch between the trial and expectation patterns.
The ART network must be able to cope with the novelty arising
from unusual patterns. We have already seen that its method of
coping is hypothesis testing. In order to understand how it actually
implements this strategy, we need to add a subsystem to the basic
two-layer system of figure 16.1.
The Reset Subsystem
Figure 16.2 is nearly identical to figure 16.1, with the exception of a subsystem we call the "reset unit." The reset unit has two sets of inputs: the external input pattern and the pattern from the input layer. The structure of the reset unit depends on the details of the ART 1 system being considered. For our network, we can assume that it is a single neurode with a fixed-weight, inhibitory input from every node in the input layer and a fixed-weight, excitatory input from each external input line. The reset unit's output is simple; it is linked by a fixed-weight connection to every node in the storage layer. By now, you know another way of saying this: the reset unit is the hub of an outstar whose border is all the nodes of the storage layer. There are no reverse connections from the storage layer to the reset unit.
Before we go on, we must fully understand the structure of the nodes in this storage layer. We have briefly mentioned that these nodes are little groupings of neurodes called dipoles or toggles. Let's explore what that implies. These toggles have several useful properties. First, they act just like an individual neurode most of the time. Second, and the one of interest to us here, is the property called "reset." Reset is the process of persistently shutting off all currently active nodes in a layer without interfering with the ability of inactive nodes in that layer to become active.
The reset action of a toggle can be summarized
in two statements: (1) If the toggle is active and it receives
a special signal, called a "global reset," then it will
become inactive, and it will be inhibited from reactivating for
some period of time. (2) If the toggle is not active when it receives
a global reset, then it is not inhibited from becoming active
in the immediate future. With these two characteristics, it should
be evident that sending a global reset signal to every toggle
in the storage layer will shut off only the currently active toggles
and furthermore will prevent only those toggles from becoming
active in the immediate future.
In general, we can think of the storage layer as simply being made of special nodes that have this reset characteristic. It is not especially important here whether we call them toggles, dipoles, or nodes. It is important to realize, however, that no matter what we call them, they act as outstars and instars, just as an individual neurode does in any network.
Now we are ready to see what happens if the network does not reach adaptive resonance, that is, if the original bottom-up and the newly generated bottom-up patterns do not match. If they do match, the two inputs to the reset unit (one from the external pattern and one from the input layer) balance, and it produces no output. If the original and the new trial patterns do not match, the activity of the input layer temporarily decreases as its nodes try to reconcile these two patterns. In fact, for reasons we will see, we can be absolutely guaranteed that if the bottom-up and top-down patterns do not match, the net activity (that is, the total number of active nodes) in the input layer will always decrease. As a result, the inhibitory input to the reset unit no longer exactly balances the excitatory input from the external pattern, and thus the reset unit becomes active. The active reset unit now sends its global reset signal to the nodes of the storage layer. Because these nodes are toggles, this reset signal causes any active nodes to turn off and stay off for some period of time. This destroys the pattern then active in that layer and suppresses its immediate reemergence. With the old pattern destroyed, a new pattern of nodes is now free to attempt to reach resonance with the input layer's pattern. In effect, when the old and new trial patterns do not match, the reset subsystem signals the storage layer that that particular guess was wrong. That guess is "turned off," allowing another one to take its place. The cycle repeats as many times as necessary. When resonance is reached and the guess is deemed acceptable, the search automatically terminates. This is not the only way a search may terminate: the system can end its search by learning the unfamiliar pattern being presented.
As each trial of the parallel search is carried out, small weight changes occur in the synapses of both the bottom-up and the top-down pathways. These weight changes mean that the next time the trial pattern is passed up to the storage layer, a slightly different activity pattern will be received, providing a mechanism for the storage layer to change its guess. If a match is quickly found, the amount of modification of these synapses is insignificant. If the system cannot find a match, however, and if the input pattern persists long enough, the synapse weights eventually will modify enough that an uncommitted node in the storage layer learns to respond to the new pattern. This also explains why the storage layer's second or third guess may prove to be a better choice than the original one. The small weight changes ensure that the activity generated by the bottom-up pattern in the second pass is somewhat different from the activity generated in the first pass. Thus a node that was second-best the first time may well prove to be the best guess the second time. If the input is a slightly noisy version of a stored pattern, it may require a few synaptic weight changes before the truly best guess can be matched.
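A small sketch of the reset unit's balance test just described, for binary patterns. The weight choices, and the way the vigilance setting is folded in as a gain on the excitatory side rather than as an explicit threshold, are illustrative assumptions, not the book's exact circuitry.

# Sketch of the reset unit: excitation from the external input lines is
# balanced against inhibition from the active input-layer nodes. A mismatch
# with the top-down pattern shrinks input-layer activity, breaks the balance,
# and lets the reset unit fire a global reset. Weights here are illustrative.
import numpy as np

def reset_fires(external_pattern, input_layer_pattern, vigilance=1.0):
    excitation = vigilance * np.sum(external_pattern)   # from external lines
    inhibition = np.sum(input_layer_pattern)             # from input-layer nodes
    return excitation > inhibition

external = np.array([1, 1, 1, 0])
print(reset_fires(external, np.array([1, 1, 1, 0])))  # good match -> False (no reset)
print(reset_fires(external, np.array([1, 0, 0, 0])))  # poor match -> True (reset storage layer)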
[In real social world contexts, my model provides complementary levels of resolution as a decision structure. The patterns are at different scales of a fractal with greater or less "detail" attached or ignored. If the middle layer of internal representation is at "x" level of resolution and the input pattern is at "y" level, and the patterns do not match, then instead of looking for a matching pattern, the middle layer can change the input layer's resolution level until a close match is found. It will also move to other vectors of the fractal until the features left after the ignorance function of that vector have a best match. Here I model the star of David with each side generated by a different subset of the whole and each scale having more or less "selected" detail. Thus humans can "force" most any input to match their internal model and become agitated by the remaining mismatch as a threat. Hence the model of sin and Bad people etc. This is how humans can construct a global model which can fit any situation: a general theory or unified model. This is clearly what happens in Astrology and other prescientific "religious" models. Yet the very dysfunction of these models led to the scientific revolution and was a "first step" that can now be discarded.]
We can also supply ART 1 with the property
of vigilance. This means that the accuracy with which the network
guesses the correct match can be varied. By setting a new value
for the reset threshold, we can control whether the network fusses
with trifling details or concerns itself only with global features.
Because of the way vigilance is defined, a low reset threshold
implies high vigilance and thus close attention to detail, while
a high threshold implies low vigilance and a more global view
of the pattern. By controlling the threshold of the reset unit,
we thus govern what the system calls "insignificant noise"
and what it identifies as a "significant new pattern."
We can interpret vigilance in another way. In effect, by setting the threshold of the reset unit, we choose the coarseness of the categories into which the system sorts patterns. A low threshold (high vigilance) forces the system to separate patterns into a large number of fine categories, and a high threshold (low vigilance) causes the same set of patterns to be lumped into a small number of coarse categories.
[Describing my model of levels of resolution
with emotional terms like "vigilance"!!]
The Gain Control Subsystem and the
2/3 Rule
We still do not have an operational ART 1
system. We must provide a way for the input layer to tell genuine
input signals from spurious top-down signals that might be present
even when no real-world input is being presented. Such a situation
would exist, for instance, if some random system noise or other
extraneous inputs activated the storage layer even when no external
input was present. We must also make sure that a genuine external
input always creates a pattern in the input layer in order to
start the adaptive resonance process. Furthermore, we have not
yet justified the assurance we made that the input layer's total
activity is guaranteed to decrease in the event of a mismatch.
...
Ideally the external input pattern is presented
to the input layer, the gain control system, and the reset system
more or less simultaneously. The gain control system turns on,
providing the second necessary source of stimulus to the input
layer and in turn allowing that layer to become active and generate
the original trial pattern. In the meantime, the external input
has also turned on the reset system, which shuts off any active
pattern in the storage layer. The input layer's activity is translated
into a bottom-up pattern and sent to the storage layer. In addition,
it goes to the reset system where it matches the external input
and shuts off the global reset. This combination of actions allows
the storage layer to respond to the bottom up pattern.
The storage layer now generates a top-down expectation pattern, which it sends to the input layer. This same expectation pattern is also sent as an inhibitory signal to the gain control; as a result the gain control system turns off. This removes one of the input layer's two sources of stimulation, but because the input layer now sees the top-down pattern (the new trial pattern) as well as the external pattern, it has sufficient stimulation to stay active.
[More levels of resolution!!]
...
The 2/3 rule also keeps noise damped in the
network. Any activity in the storage layer keeps the input gain
control from exciting the input layer. If the storage layer is
firing, the only other available stimulus for the input layer
is the external pattern; if this is present, the input layer's
neurodes can activate. If the storage layer fires spontaneously,
without an appropriate external pattern being present, the 2/3
rule will not be met and the input layer will not be stimulated
into action. Noise from the storage layer thus immediately damps
out.
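As the 2/3 rule is usually stated, an input-layer neurode fires only when at least two of its three possible stimulation sources are active. A tiny sketch of that rule (our formulation, not the book's exact circuitry):

# Sketch of the 2/3 rule for an input-layer neurode: of its three possible
# stimulation sources (external input, gain control signal, top-down
# expectation), at least two must be active for the neurode to fire.
def input_neurode_fires(external, gain_control, top_down):
    return (int(external) + int(gain_control) + int(top_down)) >= 2

# External input plus gain control: fires, so the trial pattern can form.
print(input_neurode_fires(external=True, gain_control=True, top_down=False))   # True
# Spurious top-down signal with no external input (gain control off): damped out.
print(input_neurode_fires(external=False, gain_control=False, top_down=True))  # False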
The input layer can also be the source of
noise and spontaneous firings. If this happens when there is no
external pattern to support it, the noise pattern gets transmitted
to the storage layer. But the top-down pattern will shut off the
gain control (assuming it was on), so that the input layer will
be left with only one source of input (the top-down pattern response)
and the noise will be once more damped out. This keeps the storage
layer from being bombarded with meaningless bottom-up signals
that do not correspond to real inputs. If this were not done,
the storage layer would constantly be learning nonsense, and its
stored patterns would not be stable. We have so far not addressed
the bias gain control shown in figure 16.3. Its function is to
allow the presence of an external input to predispose the nodes
of the storage layer toward activity even before receiving a trial
pattern from the input layer. It does this by applying a small
excitatory signal to the nodes of the storage layer when an external
input signal is applied to the network. In this way, the activity
of the system is correlated with, or paced by, the rate of presentation
of external inputs. In some implementations it also helps mediate
the competition of the nodes in the storage layer, enhancing the
activity of any node that gets an edge on its competitors and
suppressing other nodes.
Troubles with ART 1
ART 1, even the simple version described,
is a pretty good system. It possesses several of the characteristics
listed in chapter 11 for a system or machine capable of truly
autonomous learning. It learns constantly but learns only significant
information and does not have to be told what information is significant;
new knowledge does not destroy already learned information; it
rapidly recalls an input pattern it has already learned; it functions
as an autonomous associative memory; it can (by changing the vigilance
parameter) learn more detail if that becomes necessary; and it
reorganizes its associative categories as needed. Theoretically
it can even be made to have an unrestricted storage capacity by
moving away from single-node patterns in the storage layer. This
leaves, however, at least one major unsatisfied criterion: a truly
autonomous machine must place no restriction on the form of its
input signal. Unfortunately, an ART 1 network can handle only
binary patterns. This limitation is built into the way the network
subsystems interact, implying that it is fundamental to this architecture.
This has two ramifications. First, it limits the total number
of distinct input patterns that an n-node input layer can allow
to 2^n. This is an important limitation only for small networks.
Ten input nodes, for instance, can handle only 1023 distinct patterns.
One hundred nodes, however, allow input of over 10^30 separate
patterns. The second ramification of binary input nodes is that
they place requirements on the type of preprocessing we must give
real-world input signals. Under some circumstances, this can be
costly in hardware complexity, hardware price, processing time,
and total system power. For an application in which the ART 1
system is used with equipment already operating in a digital mode,
however, this need not be a serious restriction.
[SO?? This so-called restriction may be the
very essence of fractal I Ching: binary structures of many dimensions
which seem to abound in the real non-language world!]
ART 2
ART 2 removes the binary input limitation
of ART 1; it can process gray-scale input signals. The cost of
this fix, however, is a considerable increase in complexity. ART
2 systems proposed so far have as many as five input sublayers,
each with its own gain control subsystem. Further, each sublayer
typically contains two networks.
[Thus later evolution speaks of the 5 elements!
With this structure 5 elements allows fuzzy systems!]
There are two major reasons for ART 2's added
complexity. The first concerns noise immunity. The noise problem
of a network such as ART 1 designed to recognize binary patterns
is relatively minor. Individual pattern elements have a sharp
"yes-or-no" nature. Thus, to change one element of a
binary pattern, a noise signal must be of virtually the same magnitude
as the pattern element. The noise problem of a network designed
to recognize gray-scale patterns is much more severe, however.
Individual elements of such patterns do not have the sharp yes-or-no
nature binary patterns possess; instead they can take on a range
of values. Two patterns that differ by only one gray-scale value
in one element are treated as quantitatively different. As a result,
noise that changes only a single pattern element by one gray-scale
value may make the input pattern unrecognizable.
The second reason for ART 2's added complexity
is that alikeness becomes a much fuzzier concept with analog signals,
even with no noise present, because each input element can take
on as many values as are in the gray scale being used. Identical,
similar, different, and very different are ambiguous terms that
must be quantified by the network in some way.
Several ART 2 architectures have been designed
by Carpenter and Grossberg and by others. All of these systems
accept gray-scale inputs, and they differ mainly in the number
and function of the sublayers in the input superlayer.
Grandmother Nodes and ART
As they are usually presented, both ART 1
and ART 2 use "grandmother" nodes in their storage layer.
A grandmother node is one that alone represents a particular input
pattern. This winner-take-all feature does not appear to be essential
to the operation of either type of ART system. We have seen that
more than one winner may be allowed in a competitive architecture.
However, nearly all actual implementations of ART networks (both
ART 1 and ART 2) use this scheme. If the output of a system must
operate a relay or be viewed by a human, having a single output
node correspond to a particular class of input patterns may make
sense. If the output of the system is to be interfaced to other
neural networks or to a digital system, however, using grandmother
nodes may make no sense at all. Let's see why.
First, using grandmother nodes limits the
storage capacity of an ART system to the number of nodes in its
storage layer. The severity of this limitation becomes clear when
we realize that even with binary storage-layer nodes, the memory
capacity we could obtain if each combination of n nodes could
be used to store a separate binary pattern would be 2^n. ...
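Putting rough numbers on that comparison (my arithmetic, not the authors'): n grandmother nodes can label only n pattern classes, while a distributed binary code over the same n nodes could in principle label 2^n.

# Grandmother coding: one node per stored class, so n nodes hold n patterns.
# Distributed binary coding: each of the 2**n on/off combinations of the same
# n nodes could in principle label a different pattern.
for n in (10, 20, 30):
    print(f"n = {n:2d}:  grandmother capacity = {n:2d},  "
          f"distributed capacity = 2^{n} = {2**n:,}")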
A second way in which grandmother nodes may
be a problem is that they reduce the reliability of a system.
Failure of a single component in one node can cause the pattern
coded into that node to be lost. Although there are ways to compensate
for this danger, it is one we would rather not have in machines
that we expect to use in applications requiring high fault tolerance.
Third, restricting the storage layer to patterns
containing only one node limits the ability of the network to
represent a hierarchy of concepts or objects. To understand this,
we must look at the effect of pattern complexity not on how many
input patterns we can store but on the way those stored patterns
can be associated.
It may not be intuitively obvious, but it
takes fewer storage nodes to encode an input pattern representing
a high-level general concept or complex object than it does to
encode an input pattern representing a single specific concept
or object. The broad concept "tree" needs fewer nodes
to encode it than the concept "cherry tree." Treeness
must be part of the pattern for cherry tree, as must fruit treeness,
hardwoodness, deciduousness, and a host of other concepts needed
to distinguish cherry trees from other kinds and classes of trees.
"The cherry tree in my front yard," a specific object,
requires still more nodes to encode because its coding pattern
must carry the extra information required to identify it as a
particular tree yet still contain the subpatterns representing
"cherry tree" and "tree." Thus, a layer designed
to allow storage of hierarchical information must have the ability
to form patterns consisting of different numbers of nodes. A storage
layer that consists only of grandmother nodes simply will not
do. Are there cases in which we would want to use the same complexity
in both the input and storage layers of an ART network? Perhaps,
but not normally. We usually want to reduce the complexity, or
the dimensionality, of the input signal before we store it in the
storage layer, so that only the essential features of the input
pattern are preserved. That is one reason that we use a competitive
structure in the storage layer; allowing competition in that layer
guarantees just this outcome.
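A toy way to picture the hierarchy argument (my own sketch, not the book's notation) is to treat each stored pattern as the set of storage-layer nodes it turns on: the broader the concept, the smaller the set, and each specific object contains its more general ancestors as subpatterns.

# Toy encoding: each concept is the set of storage-layer nodes it activates.
tree        = {"treeness"}
cherry_tree = tree | {"fruit_treeness", "hardwoodness", "deciduousness"}
my_tree     = cherry_tree | {"in_my_front_yard", "particular_individual"}

# Broader concepts need fewer nodes ...
print(len(tree), len(cherry_tree), len(my_tree))    # 1 4 6

# ... and the specific object still contains its ancestors as subpatterns,
# something a single winner-take-all grandmother node cannot express.
print(tree <= cherry_tree <= my_tree)               # True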
...
Changing certain parameter values by as little
as 5 percent can have disastrous consequences for the network's
operation. Such fine-tuning requirements make clear that ART 2
has serious problems as a model of our fuzzy, imprecise biological
brains.
In fact, both ART 1 and ART 2 have one more
serious drawback as a biological model: the requirement that the
input and storage layers, as well as the reset and gain subsystems,
must be fully interconnected with each other. The connectivity
requirements implied by this, and comparisons to the structure
of the brain, make it unlikely that the brain widely uses an ART-like
architecture. Added to the drawbacks of the grandmother-node architecture,
these connectivity requirements leave ART 1 and ART 2 with serious
shortcomings as models of the biological brain.
[Note to personal friends: I may not be able
to directly include all these quotes in a published version, but
in a CD ROM or network version I will include even more depth.
The point is to finish this book to express my ideas about this
material. We no longer need to apologize for inclusion when hand-held
books will soon be an odd custom from a wonderful part of our
history. Anyway, I can include what is needed so the so-called
beginner is not handicapped by not having these resources available.
Also, these "sources" from 1990 are already dated and
being replaced by new research.]
end excerpt from Naturally Intelligent
Systems.
Biological neural nets and their structuring logics are the central organizing principle of this book.
Since biological n.nets are self-organizing, and cannot be "influenced" or supervised by outside "forces", any reference to creator gods, to the influence of the stars, to heredity, to survival of the fittest, or to the environment is seen as only a culturally determined model of a redundant cultural system. This means that structures and elements of the cosmos must be developmentally incorporated into self-organized systems to model the environment and new functions and skills. For instance, the development of 3-dimensional binocular vision needed a new dimensional computational apparatus. Since living systems use or alter successful structures to fit new circumstances, it follows that if previous levels of development incorporated the cosmos in predictive computational procedures, new procedures would follow this same line of development. The mathematics of chaos theory shows that small permutations over the hundreds of millions of years of evolution of single-celled life could be incorporated into awareness structures. This means that single cells may detect and respond to small energy changes undetectable by larger multi-cellular organizations. It is my contention that multi-cellular organizations, whether at the scale of protozoa, of animal organ systems, or of neural nets and social networks, can be based on the STRUCTURE of these small energy permutations. These structures are themselves chaotic and cyclical in nature, which would tend to produce all forms of life structures rather than confine them to particular "grooves" and determinacy. [Doesn't bode well for the future of monotheism, of determinacy by God or by survival of the fittest!]
10-1-94: 8 am
Control as a byproduct of lateral inhibition on layers of competitive n.nets.
Control as self-control needs the construction of a self to be controlled, a self that does the controlling, and something that needs controlling. This control also points to the ability to concentrate by the procedure of ignoring anything and everything outside some prescribed boundary. From the point of view of n.net surfaces with multiple competing models of behavior, one model of the "self" is allowed to learn and to inhibit or control all competing models. Thus the pattern of having a self with very well defined boundaries can be described as a "together" self. All that means is that the other inner models are ignored in favor of the "rules" of a single isolated self. All input is channeled through this self, and any solutions that originate in other areas of the net are ignored. These other solutions are perceived as feelings, intuitions, emotions, and in general are labeled as disruptive.
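A minimal sketch of the mechanism I have in mind (toy dynamics of my own, not taken from any particular model): give each competing inner model an activation, let every unit inhibit all the others, and iterate. The strongest model silences the rest and becomes the single channel through which input is interpreted.

import numpy as np

def compete(activations, inhibition=0.2, steps=100):
    """Iterated lateral inhibition: each unit is pushed down in proportion to
    the summed activity of all the other units, and cannot go below zero."""
    a = np.array(activations, dtype=float)
    for _ in range(steps):
        others = a.sum() - a                      # inhibition arriving at each unit
        a = np.maximum(0.0, a - inhibition * others)
    return a

# Four competing inner "models of behavior"; the 0.9 model is the learned "self".
models = [0.9, 0.7, 0.6, 0.3]
print(compete(models))    # only the strongest survives; the others are driven to 0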
[Hear notes on tape. Include the following after the explanation of fractals.]
Another property of self organizing neural
nets is that given networks in several different individuals with
each network having identical inputs and outputs, the different
nets organize the solution path differently yet come out with
identical solutions: they are chaotically structured; no model
of their organization is "correct" or standard even
though their behavior is identical. This has been traditionally
stated as "there are as many different ways to truth as there
are individuals". Since the end behavior of this "black
box" is identical across the culture, a single pointer called
"name/word" can be assigned to this network. Thus the
illusion of equating meaning with structure. Thus the study of
knowledge [epistemology] in and by humans that finds the "one
right way" for the [global] organization of social, religious
(especially monotheistic), political, or learning institutions is
not based on the biology of the human or any other species. In
my model, such efforts are attempts by some individuals to isolate
the "One Mind" from its ability to self organize and to
substitute centralized self control and bivalent "rationality".
Since this is impossible, all that results is the creation of phase
filters that arbitrarily select some individuals as competitively
more suitable and assign greater value to them among biologically
identical, functioning individuals, because of their developmental
similarity to the "standard model", which may have been
developed in a religious context or in emperors or other "hero
models".
Neural nets are like a traffic system in a closed or limited-access network of streets:
if one closes or changes the flow at a single
node, it changes the pattern of conductivity throughout the system,
if the baseline is the pattern of flow and not individual signals!!
This models the holographic, distributed nature of mind [harmony
is the message: not single objects] and models brain "activity"
as single neurons producing a whole pattern, an image, because of
their placement. Likewise the whole-pattern flow of "Qi"
in the "meridians" is a network pattern.
ccc13 The "experience" of emotion
is pattern! Alien language is harmonic, cyclic, discrete "words",
as is the genetic pattern, and proteins as folded patterns. [See
what Terry recognizes as pattern and information.]
Neural Nets - "Naturally Intelligent System
Maureen Cadill - Charles Butler
The training method of competitive systems, applied to culture, moves over peaks and valleys toward minima. This applies to the level of incompetence as a local minimum, one for each side: right and left, Republican and Democrat. These two sides tend to converge to a middle of stability and peacefulness: Status Quo, stagnation, standstill, #12. Going to divergence means conceptualizing: making objects and nominalizations.
[Hand-drawn sketch comparing "US ECONOMY" and "JAPAN".]
This is also a general systems theory of emotion and competition: brain theory applied to sociological systems, political and economic, i.e., cultural, which are trained and supervised. Untrained, self-organizing systems are the spiritual, creative, and artistic.
(Establishment).