Excerpts from Eric Baum's What is Thought?

Any typos, omissions, or errors are my own

3: Semantics is equivalent to capturing and exploiting the compact structure of the world, and thought is all about semantics.

11: Language is usually treated as pure syntax before one begins, so there is little hope of recovering the semantics.

13: The neural circuitry is, in my view, akin to an executable. The DNA is more like the source code.

14: For instance, if you go to a friend’s house and sit at her table, how do you identify as a cup a collection of atoms that you have never before seen and that may have a different color, shape, and size than any other collection of atoms you have ever seen?

17: A typical research paper in computer science consists of applying one or more of these tricks to a novel problem. A breakthrough research paper finds a new trick.

22: …human thought is fast because we search only possibilities that make sense.

28: Many experimenters have found empirically that it is very hard to evolve programs that communicate usefully between modules (or even between a program and the computer’s memory) because one has a chicken-and-egg problem. The evolution cannot discover that it is useful to speak (or write) a word until some agent knows how to understand it and make profitable use of the knowledge, and it cannot produce an agent that can understand the word until someone is speaking it.

33: Turing visualized a mathematician sitting and working on proving theorems, writing on a big pad of paper. The mathematician might do three things. He might read what is on the paper, he might erase it and write something else, and he might think. Paper is ordinarily two-dimensional, but to simplify the discussion Turing suggested considering the paper as one-dimensional, so that the mathematician would really just write on a long paper tape which Turing assumed was divided into squares.

41: [Equivalent computing systems:] General recursive functions, Lambda calculus, production systems.

51: So, von Neumann proposed that the machine he would copy would be analogous not only to the car but to the car and the algorithm for constructing it. This is a much easier thing to duplicate than just the car. You begin by copying the algorithm. This is easy — it’s a linear sequence of instructions. In fact, it is reducible to a Turing machine program, since I’ve argued that every algorithm is equivalent to a Turing machine program. Thus it is a sequence of 1s and 0s that can be copied straightforwardly. Then you follow the algorithm to copy the car. The already constructed car is superfluous, just a distraction. All you have to do is follow the instructions to get a copy of everything you began with.

52: Having thus designed the reproducing automaton (denoted here by A + B + C), von Neumann realized immediately one other thing: his theory immediately gives rise to a microscopic description of evolution. Again, I don’t think I can do better than to close the section with another quotation from von Neumann:

You can do one more thing. Let X be A + B + C + D, where D is any automaton. Then (A + B + C) + phi(A + B + C + D) produces (A + B + C + D) + phi(A + B + C + D). In other words, our constructing automaton is now of such a nature that in its normal operation it produces another object D as well as making a copy of itself…. The system (A + B + C + D) + phi(A + B + C + D) can undergo processes similar to the process of mutation… By mutation I simply mean a random change of one element anywhere… If there is a change in the description phi(A + B + C + D) the system will produce, not itself, but a modification of itself. Whether the next generation can produce anything or not depends on where the mutation is. If the change is in A, B, or C the next generation will be sterile. If the change occurs in D, the system with the mutation is exactly like the original system except that D has been replaced by D’. This system can reproduce itself, but its by-product will be D’ rather than D. This is the normal pattern of an inheritable mutation.

54: Why is information copied into RNA before further processing? One theory is that this structure arose as an artifact of evolutionary history. Because RNA can serve as a catalyst as well as carry information, it is believed that first life was purely RNA-based, involving no DNA. Thus, the mechanisms of life evolved using RNA. As creatures became more complex, evolution discovered that it could use DNA, which is chemically more stable than RNA, as long-term storage for the valuable information. But it continues to translate this information back into RNA to employ descendants of the previously evolved mechanisms for manipulating information in RNA.

55: Exons are usually swapped whole, surrounded by portions of introns. Thus, the exons can serve as building blocks, and crossover can serve to mix the building blocks in search of ideal combinations. If it weren’t for the introns, crossover would break the building blocks in the middle, resulting in many more inoperable programs. Suitably positioned introns thus plausibly render creatures more evolvable.

444: [of Michal’s 1998 3’ x 10’ Biochemical Pathways poster that “shows the entire flowchart of the metabolism (or, at least, all the portions that have been worked out)”] My next thought is that the fact that this flowchart can be shown on a plane — albeit the picture is not logically planar inasmuch as the lines go under or over others — shows that there is a compact, modular structure to the metabolism. My guess is that a random graph with nodes representing as many chemical products, but with edges drawn connecting random nodes, could not be comprehensibly drawn on a two-dimensional sheet of paper.

61: As development proceeds, it utilizes memory to control the program flow — memory, for example, in the form of DNA rearrangements and molecular modifications such as methylation patterns. Methylation occurs when a molecule called a methyl is attached to a protein or a nucleic acid. For example, some of the cytosines in a gene may be methylated, and which particular cytosines are methylated influences whether the gene is active. When a liver cell divides, it produces new liver cells, which implies that it remembers which genes should be active and which inactive in a liver cell. Other genes would be active in a neuron or a skin cell. An important way that this memory is stored is in the methylation patterns, which are carefully conserved when the DNA is replicated. Stem cells, which famously can develop into any type of cell, have all these memory mechanisms initialized blank.

All this logic is implemented in a Post-like production system. For example, a repressor, which is a protein that binds to DNA, represses the expression of a gene by matching a specific pattern on the DNA and attaching itself where it is matched. This then has some effect that suppresses expression of the DNA, such as covering up a nearby location on the DNA where a promoter might otherwise match to induce expression.

62: Development in all the bilaterally symmetric animals, including, for instance, insects and human beings, is controlled by closely related toolkit genes. Human toolkit genes code for proteins that are quite similar to those of very simple animals, and indeed most of our proteins are similar. The toolkit genes are sufficiently related that the expression of the mouse eye gene on the wing of a fly will cause a well-formed fly eye to grow there. Note that the mouse eye gene, when artificially inserted and expressed on the fly wing, causes a fly eye to form, not a mouse eye, so the semantics of the mouse eye gene depends on its context: what it says is “activate the pathway for making an eye,” or in other terms, “call the subroutine for making an eye,” and then the subroutine (coded elsewhere in the genome) creates the appropriate eye.

Because the genes and proteins themselves are so similar, most of the evolution in development from the simplest bilaterally symmetric animals, through insects, through human beings is thought to have occurred through evolution in gene regulation: additions, mutations, and deletions of promoter and repressor sites near genes, especially near the toolkit genes themselves. […] These kinds of changes can swap whole existing subroutines, calling them at different times, or slightly modify subroutines, essentially changing a few lines of code.

68: Evolution has been like a scientist questing for a simple explanation of the world.

76: The conclusion to be drawn from Searle’s argument is that how a computer program might come to contain semantics is a central question that we must address. Searle’s point 3 should not be an axiom but a research program: how can a syntactic system come to contain semantics?

80: If one plots random points, they won’t lie in a straight line; they will scatter all over the page. We cannot fit a straight line unless it really captures something about the data, unless its slope has some real meaning in the world. The fact that we are able to fit a straight line tells us that the process that produced the data is very special and very simple, and that the line has captured the simplicity, including the correct parameters.

85: [If you want to have a neural net spit out 1,024 numbers but only have a middle layer with 10 neurons, those neurons will come to represent the binary encoding of the number.]

86: This bottleneck concept is analogous to a small parameter curve’s fitting a large number of data points. In the neural net, the number of parameters is the number of adjustable weights. If we fit a number of data points that is much larger than the number of weights, we might begin to have confidence that the resulting net captures something about the real process generating the data.

92: There are 2^m^ different possible ways one could classify a collection of m data points as positive or negative examples without restricting the function class at all. If the function class is so large and powerful that for each of these possible ways of classifying the m points there is a function that computes it, then the function class is said to shatter the collection of points. In that case, finding a function in the set consistent with the classification of the data tells nothing at all about the data.

Now, the VC dimension of a function class is simply the size of the largest set of points it can shatter.
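[My illustration, not Baum's: a brute-force check of the shattering definition above. The function class here is one-dimensional threshold functions, which can shatter any single point but no pair of points, so its VC dimension is 1. All names in the sketch are mine.]

```python
from itertools import product

def shatters(points, functions):
    """True if every 0/1 labeling of the points is realized by some function."""
    achievable = {tuple(f(x) for x in points) for f in functions}
    return all(tuple(lab) in achievable for lab in product([0, 1], repeat=len(points)))

# The class: f_t(x) = 1 if x >= t else 0.  Thresholds between and around the
# sample points are enough to enumerate its distinct behaviors on them.
points = [1.0, 2.0]
functions = [lambda x, t=t: int(x >= t) for t in (0.5, 1.5, 2.5)]

print(shatters(points[:1], functions))  # True: one point can be labeled both ways
print(shatters(points, functions))      # False: no threshold gives 1.0 -> 1, 2.0 -> 0
```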

101: Say there are many possible outcomes and the probability that we actually see data D is P(D|H). If the different outcomes occur with different probabilities, we could encode most efficiently by reserving the short strings for very probable messages. That is why e, the most common letter in English, is encoded in Morse code as a single dot, the shortest string. This allows a Morse coder to send shorter transmissions on average.

[…]

Recall that Bayesianism says that we should decide which model is correct given some data by maximizing P(D|H_i)P(H_i). Maximizing this is, of course, identical to minimizing the negative log of this, that is, minimizing -log(P(D|H_i)) - log(P(H_i)). But MDL, minimizing the description length, just minimizes this negative log — the sum of the encoding length of the model plus the data given the model. So Bayesianism and MDL are just different ways of talking about the same thing; they calculate exactly the same quantity. The difference is only that MDL simply says to choose the simplest model consistent with the data, whereas Bayesianism gives a numerical estimate of how much the simplest model is to be preferred over competing, less simple models. In practice, the amount by which the simplest model is preferred is typically astronomical, so Bayesianism and MDL are essentially identical.
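[My addition: the identity in the passage written out in symbols, using the same notation.]

```latex
\arg\max_i \; P(D \mid H_i)\, P(H_i)
  \;=\; \arg\min_i \; \bigl[ -\log P(D \mid H_i) \;-\; \log P(H_i) \bigr]
% Shannon coding assigns an outcome of probability p a code of length -log p, so
% -log P(H_i) is the encoding length of the model and -log P(D | H_i) the length
% of the data encoded given the model; the right-hand side is the MDL two-part
% description length L(H_i) + L(D | H_i).
```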

109: The simplest and most widely applicable heuristic, called hill climbing, is similar to the process of evolution itself. Take a candidate solution and evolve it to improve: mutate it, check to see if the mutation is an improvement, and if it is, keep the mutated solution.
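[My illustration: the mutate-evaluate-keep loop just described, applied to a toy bit-string problem. Names and parameters are mine.]

```python
import random

def hill_climb(candidate, fitness, mutate, steps=10_000):
    best, best_fit = candidate, fitness(candidate)
    for _ in range(steps):
        trial = mutate(best)
        trial_fit = fitness(trial)
        if trial_fit > best_fit:          # keep the mutation only if it improves
            best, best_fit = trial, trial_fit
    return best, best_fit

# Toy problem: maximize the number of 1s in a 50-bit string.
fitness = sum
mutate = lambda bits: [b ^ (random.random() < 0.05) for b in bits]  # flip ~5% of bits
start = [random.randint(0, 1) for _ in range(50)]
print(hill_climb(start, fitness, mutate)[1])   # usually reaches 50
```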

111: It is clear why this hill-climbing procedure is not guaranteed to get the most optimal solution, for example, to give the best collection of weights for a neural network. The fitness landscape may have many peaks, some higher than others. The hill-climbing procedure climbs only one of them. When it gets to the top of a peak, it goes no further, even if there is a higher peak somewhere else. If it goes up the “wrong peak,” it will not find the optimal value.

117: Instead of discrete neurons that just sum their weighted inputs and take output value 1 for a positive sum and 0 for a negative sum, consider using neurons that vary smoothly through values between 0 and 1. For sums below some value, say -1, these neurons take value 0 and for sums above some value, say 1, they take value 1, but for sums with value between -1 and 1 their output varies smoothly, gradually growing larger as the weighted sum of their inputs increases. When you make this change in the neurons, the values of the output neurons now depend smoothly on the values of the weights and on the values of the input neurons, so it is now possible to make an infinitesimal change in the outputs of the net by making an infinitesimal change in the value of a weight. It then turns out that by using the methods of differential calculus it is possible to feel out in which direction the hillside slopes, so that one can head directly uphill in the steepest way possible.
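[My illustration: a sigmoid neuron of the kind described, with a finite-difference gradient standing in for the calculus that "feels out" which way the hillside slopes. Names and numbers are mine.]

```python
import math

def step_neuron(weights, inputs):
    return 1.0 if sum(w * x for w, x in zip(weights, inputs)) > 0 else 0.0

def smooth_neuron(weights, inputs):
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-s))          # output varies smoothly with the weights

def gradient(weights, inputs, eps=1e-6):
    """Finite-difference estimate of d(output)/d(weight_i)."""
    base = smooth_neuron(weights, inputs)
    grads = []
    for i in range(len(weights)):
        bumped = list(weights)
        bumped[i] += eps
        grads.append((smooth_neuron(bumped, inputs) - base) / eps)
    return grads

w, x = [0.3, -0.8], [1.0, 0.5]
print(step_neuron(w, x), smooth_neuron(w, x))  # 0 or 1 versus a value in between
print(gradient(w, x))                          # the direction of steepest ascent
```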

118: With hill climbing, the parachutist lands somewhere and starts climbing.

120: One way in which biological evolution had incredibly vast computational resources is that it was able to evaluate each change in the context of the whole solution. This meant evaluating the fitness of a creature. Biological evolution made new, slightly changed creatures, and then evaluated how well they worked. To simulate that on a computer, one would have to simulate the interaction of the creature with the world, which implies simulating the world. Thus, biological evolution effectively did a great deal of computation on each creature and also evaluated a vast number of creatures. Human programmers will not be able to simulate biological evolution soon, if ever.

121: We can imagine that this kind of search can be extremely powerful. Instead of trying various meaningless base changes, almost all of which do nothing useful, evolution effectively searched over combinations of meaningful macros. Add long legs, and see if that helps. Try a shell, and see which is better: shell and long legs, just long legs, or just shell. Take the brain and scale it up, to see if that improves performance.

It doesn’t seem that it would be possible to evolve a human from an ape in a million years with only a few hundred changes in the genome unless evolution had created a situation where changes had semantic meaning, such that a few-bit change might program growth of complex brain areas. Indeed, adding an extra copy of a single gene has been shown to make mice grow much larger brains, which then wrinkle like human brains to fit inside the mouse skull.

[…]

If the exons contain semantically sensible units, say, coding for protein domains that fold nicely, they could form useful building blocks. Then crossover might combine these building blocks, greatly speeding search for new, useful proteins. It’s as if, for the Traveling Salesman Problem, big blocks of tours — the best ways to travel around New England, the best ways to travel around the Midwest — were discovered and then searched over in order to determine how to combine them in the best way.

124: The problem with genetic algorithm approaches that simulate sexual recombination is that they usually abandon credit assignment. As mentioned, hill climbing does credit assignment by making small changes and evaluating the small changes in the context of the rest of the solution. This is the main reason why hill climbing is successful.

But when one changes half the components, as in the usual genetic algorithm approach, the fitness is largely randomized. Such a change is not a small change at all, so arguments about walking uphill don’t apply. This kind of change abandons credit assignment: it doesn’t slowly adjust the components so that they work well in the context of all the other components.

Biological evolution, by contrast, has largely learned to swap genes that correspond to real modular structure in the world.

138: Recall that the VC dimension is determined by the largest set one can shatter (map in all possible ways). But almost all these ways are highly discontinuous. The nets readily reached by back-propagation are ones that vary more smoothly. Neural nets have not been nearly as able to handle problems involving highly discontinuous decisions like those that might be made by complex algorithmic processes.

140: A second analogy, which is more clearly present and may be more important, is that biological evolution works in a context where something much like early stopping happens automatically. Each new creature is tested on new data (the world) and found wanting or not. Training thus proceeds to optimize performance on new data. That is, biological evolution tweaks the DNA program (the learning algorithm that develops the mind) and tests each new modification on new data. Because new data are always used, there is effectively no possibility of overfitting. And because the criterion is always performance on the new data, evolution is evidently self-regulated. It must learn to use sufficiently compact representations. It can exploit only as much computational flexibility in whatever representation it uses as is consistent with acting correctly in new situations.

145: What thought is ultimately about is behavior, taking actions.

157: We can look at pictures of random objects: the Eiffel Tower, Albert Einstein’s face, a dog, a stone, and identify them within tenths of a second, after enough time for a neural cascade at most a few dozen neurons deep. [100 serial ops. per sec.]

165: Moreover, this compact description can be used to do rapid calculations. You can, for example, exploit the structure of the natural numbers to decide rapidly whether any given number, say 128933982023845, is evenly divisible by 5. Of course, you do not personally reason explicitly from the axioms, which you may not know, when you decide whether some number is divisible by 5, or when you decide whether your restaurant bill is correct, or generally whenever you use numbers. However, you use the tricks you have learned or figured out that can be derived from these axioms, tricks that are implied by the axioms, tricks that work because of the way the axioms constrain the natural numbers, tricks that exploit the compact structure of the natural numbers.
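[My illustration: the divisibility-by-5 trick depends only on the last decimal digit, which is exactly the kind of shortcut the place-value structure of the naturals licenses.]

```python
def divisible_by_5(n: int) -> bool:
    return str(n)[-1] in "05"      # no division needed: only the last digit matters

print(divisible_by_5(128933982023845))   # True
```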

168: The description of the world in terms of objects and simple interactions is an enormously compressed description.

170: If the world weren’t really organized into objects, or if there weren’t objects in the world like cups, no compact description of the data would exist that had such objects in it.

171: Most of our knowledge about cups is thus knowledge about how to use them.

181: Note that this recursion module acts on program space, that is, it outputs a program for solving problems. It does not act directly on the problem space (the space of block configurations) and output a sequence of block transfers.

182: In this way, relatively large computational resources are brought to bear on the problem of constructing new algorithms: the computer science community (tens of thousands strong) searches in parallel for new techniques, and when one is found adds it to the communal program through teaching. Evolution does something similar, searching through mutation in parallel over millions or billions of different creatures of a given species and incorporating the successful mutations in future generations.

187: Computer scientists addressed this problem [chess] by bringing to bear two of the tricks in their bag: evaluation functions and branch-and-bound.

188: Rather, Deep Blue cut off its search after 11 or so moves deep and estimated the value of the position there using an evaluation function crafted by human programmers.

[…]

The evaluation function used was some simple, readily computable function of the position. Chess evaluation functions include an estimate of material balance, adding 9 for the queen, 5 for each rook, 3 for each knight, and 1 for each pawn, and subtracting the same for the opponent’s pieces. Using material balance alone, with deep search, is already enough to play pretty good chess, but programmers generally add a few more terms. To material balance they sometimes add a measure of mobility: how many moves can one make from the position. A few simple positional terms may be added, for instance, a king safety term: add 1 if one is castled and there are pawns in front of the King.
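[My illustration of a material-balance term with the weights quoted above; the bishop weight of 3 is my assumption, since the passage does not mention bishops.]

```python
PIECE_VALUES = {"queen": 9, "rook": 5, "bishop": 3, "knight": 3, "pawn": 1}

def material_balance(my_pieces, opponent_pieces):
    """Each argument maps piece name -> count on the board."""
    def total(pieces):
        return sum(PIECE_VALUES.get(p, 0) * n for p, n in pieces.items())
    return total(my_pieces) - total(opponent_pieces)

# A queen and six pawns balance two rooks and five pawns: 15 - 15 = 0.
print(material_balance({"queen": 1, "pawn": 6}, {"rook": 2, "pawn": 5}))
```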

189: However far it searches ahead, the computer can thus find some totally braindead line where it always captures with its last move and thinks it is ahead, ignoring the fact that its opponent could immediately recapture if it looked one further move deeper. This is known as the horizon effect. Fortunately for computer chess programs, the horizon effect can be substantially mitigated in practice by extending search along captures or checks and only evaluating positions that are reasonably quiescent (where there are no advantageous checks or captures).

191: Note that branch-and-bound exploits specialized knowledge about game trees but no knowledge whatsoever special to chess. It is a general-purpose approach capable of speeding up search on any game tree that has leaves labeled with values. This is the sense in which Deep Blue’s approach is brute force. It doesn’t even know it is playing chess except inasmuch as a human programmer coded in an evaluation function chosen to be relevant specifically to chess.
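[My illustration: the branch-and-bound idea on game trees is usually implemented as alpha-beta pruning; a minimal sketch over an explicit toy tree. Nothing in it is specific to chess; only the leaf values carry knowledge of the game.]

```python
def alphabeta(node, maximizing=True, alpha=float("-inf"), beta=float("inf")):
    if not isinstance(node, list):            # leaf: its value comes from an evaluator
        return node
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:                 # bound: the opponent will avoid this line
                break
        return value
    value = float("inf")
    for child in node:
        value = min(value, alphabeta(child, True, alpha, beta))
        beta = min(beta, value)
        if alpha >= beta:
            break
    return value

tree = [[3, 5], [6, [9, 1]], [1, 2]]          # internal nodes are lists of children
print(alphabeta(tree))                        # 6: the root's value with best play by both sides
```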

[…]

How, by comparison, do people play chess? We don’t know for sure, but introspection yields clues. Human chess players also look ahead in an attempt to decide what move to make. But they seem to search a very limited tree. Grand masters claim to mentally examine only some 100 or so positions before moving (Kotov 1971). Some of these positions may, however, be 35 moves or more deep in the search tree.

192: By looking at a position deep in the tree, human players gain insight into how some thematic idea will play out.

The computer treats the game as a game tree of otherwise unconnected positions. The human player has an understanding of the game that seems rather more profound. She decomposes the position into interacting themes and concepts. She may recognize that there is some particular stable feature of the position, for example, that because of the nature of the pawn structure and the opponent’s remaining pieces, the opponent can only weakly defend some particular squares. The human player may then look far ahead with the goal of exploiting these weak squares, an analysis that will be exceedingly narrow because it ignores many possible moves not seen as related to this goal. Such an analysis may yield information as to the importance and realizability of the goal, which are relevant quite aside from any possibility that one player or another might actually transpose into the line analyzed.

194: …only examine moves causally related to the checkmate…

197: The computer approach I have discussed for chess cannot deal with Go because there can be 300 or more legal moves from a given Go position, so the search cannot go very deep, and more important, no one has any good idea how to write down a simple evaluation function estimating the value of a Go position.

201: But it is subjectively clear that virtually every Go move is made because of specific causal effects the player perceives it will have on one (and for stronger players, almost always more than one) goal, such as strengthening a group or denying the opponent some territory.

[…]

A simplified form of these quantitative questions has been formalized and analyzed by Conway (1976) and others as a mathematical theory of games. Conway analyzed sums of unconnected games: you and I might, for example, play a sum of three games of chess, two of checkers, and one of tic-tac-toe. When it’s your turn to move, you pick one of the games and move in it. When it’s my turn, I do likewise. So you might choose to move twice in a row in the chess game, while I move once in the tic-tac-toe and once in the checkers. At the end, we might add the scores up among all the games to see who wins.

205: I suggest that the human mind is engaged in a difficult optimization. The production of programs that understand requires massive computation that human programmers simply are not capable of. AI researchers may someday produce programs that understand, but these programs will themselves be produced by execution of programs using vast amounts of computation to learn and optimize rather than being directly programmed by people.

A hallmark of human analysis is that it is causal. People expect objects to interact in a causal manner, and they only analyze moves to the extent that they perceive them as potentially able to cause desired goals, such as making a group live. No one knows how to reproduce this in computer programs.

210: One reasonable possibility is that the intractability was created by the academic division itself, that the subproblems are hard because we have thrown out too much of the structure of the problem in dividing it up, whereas the full problem may be solvable. In fact, computer scientists have constructed rigorous mathematical models that show this is possible. Khardon and Roth (1994) have constructed models in which reasoning is provably NP-complete and learning is provably NP-complete, but learning to reason is tractable.

215: Computer programmers have been surprised at the longevity of programs: witness the alarm over the Y2K bug, which arose in part because programmers in the 1970s had no idea their programs would still be in use 30 years later.

[…]

There are individuals who, after suffering a localized stroke, can write but not read. There are individuals who lose most memory of living things but not nonliving things, and vice versa.

217: The Wason selection test [used to show that we have a strong cheating detection module.] …we are fast and accurate at solving the problem only if it involves verifying social obligations. In some versions of the game, the fraction of people who get the right answer is changed dramatically by just inserting the word must into the problem description. Must invokes social rules.

225: We repeatedly understand abstract concepts in terms of physical, concrete ones, and often in terms of concrete spatial reasoning: how we see objects move, how we move around, orientation (healthy people stand up, dead ones fall down), and so on. Such spatialized metaphors, grounded in the body, are exactly what could be expected if the human mind evolved out of simpler minds, starting with minds for controlling single-celled creatures, moving on to minds for controlling slugs, and so on through monkeys and people. Many of these spatial computations must have been done by our prelingual ancestors. Bacteria, slugs, and invertebrates all need to behave in three-dimensional space. Spatial concepts and reasoning have no doubt become more sophisticated through evolution, but to a large extent their evolution, like all evolution, was presumably a gradual process. And as creatures evolved to deal with more abstract concepts, what could be more natural than for them to build on the existing computational structures they had?

Lakoff and Johnson [in Metaphors We Live By] give many examples…

231: …how visual cortex and auditory cortex are closely related and, at least in the ferret, develop their differences only because of the different sensory stimuli they receive.

238: The question posed is, What good is half an eye? If it is no good at all, the eye cannot have evolved, because to have it occur all at once in one big mutation would be an incredibly rare event and would never happen in the age of the universe. The answer given by evolutionists is that half an eye does not mean the lens and light-sensitive neurons but no cortex; it means a poorly evolved eye that functions half as well: perhaps a light-sensitive neuron connected to muscles, with no lens or cortex, but still something functioning to sense the world and produce an appropriate action, more appropriate than could be produced without the primitive eye. Once some advantage is gained by having even a poor eye, the feedback allows hill climbing, a long sequence of small mutations each conferring some slight further advantage, which can gradually produce a more powerful eye.

240: If we can credit each module for its contribution, then we need no longer address the full problem of learning to solve Blocks World instances but can simply let each module evolve to maximize its own reward.

241: To take a well-known example (Read 1958), consider going into a stationery store to buy an ordinary number 2 pencil. How did the pencil come to be in the store? In fact, there is no one person on earth who knows how to make that pencil. There are lumberjacks skilled at cutting down trees, chemists who know all about how yellow paint is produced, miners in Ceylon who know how to extract graphite, smelters who contribute to the brass ring holding the eraser on, and farmers in Indonesia who grow a special rape seed for oil that is processed into the eraser. These people don’t know each other. They have no common language or purpose. Apparently, Adam Smith’s invisible hand organizes their widely distributed knowledge and efforts to perform a computation, the mass production and distribution of pencils, that would be called cognitive if an individual human being were capable of it.

242: What rules can be imposed so that each individual agent will be rewarded if and only if the performance of the system improves? The answer, I suggest, is that two simple rules are critical.

The first rule is conservation of money: what one agent pays, other agents get. Money neither vanishes in transactions nor is created, except that there will be pay-ins to the system from the world. The second rule is property rights: everything is owned by some agent, and all agents’ property rights are strictly respected. No property of any agent is trespassed on, unless the agent consents.
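[My illustration, not the actual Hayek machine: a toy auction that obeys the two rules. Money is conserved in every transfer, and control changes hands only by the owner accepting the winning bid. All names are mine.]

```python
class Agent:
    def __init__(self, name, wealth):
        self.name, self.wealth = name, wealth

def auction(owner, bidders, bids):
    """Highest bidder pays the current owner; total money is unchanged."""
    winner = max(bidders, key=lambda a: bids[a.name])
    price = bids[winner.name]
    winner.wealth -= price
    owner.wealth += price              # what one agent pays, another agent gets
    return winner                      # ownership transfers with the owner's consent

a, b, c = Agent("A", 10.0), Agent("B", 10.0), Agent("C", 10.0)
total_before = a.wealth + b.wealth + c.wealth
new_owner = auction(a, [b, c], {"B": 3.0, "C": 4.0})
assert a.wealth + b.wealth + c.wealth == total_before   # conservation of money
print(new_owner.name, a.wealth, b.wealth, c.wealth)     # C 14.0 10.0 6.0
```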

245: The Tragedy of the Commons occurs in the ecosystem as well. For example, consider the forest. The vast majority of the biomass in the forest is in tree trunks, the sole purpose of which is to put one tree’s leaves higher than another’s. The reason for this arms race is that the sun has no owner; sunlight is a resource held in common. If the sun had an owner, with the right to dispose of the sunlight as she saw fit, building a trunk would be irrelevant. The sun’s owner would want to be paid for the resource. She would auction off the sunlight to the highest bidder. Money and sunlight would trade hands, but there would be no wasted investment in tree trunks. From the point of view of efficiently extracting wealth from the world, grass is a more efficient solution.

246: A program cannot hope to form long chains of agents unless conservation of money is imposed and property rights are enforced. To evolve a long chain, where the final agent achieves something in the world and is paid by the world, reward has to be pushed back up the long chain so that the first agents are compensated as well. But pushing money up a long chain will not succeed if money is being created or is leaking, or if agents can steal money or somehow act to exploit other agents. If money is leaking, the early agents will go broke. If money is being created or being stolen, the system will evolve in directions to exploit this rather than creating deep chains to solve the externally posed problems. Solving the world’s problems is very hard, and an evolutionary program will always discover any loophole to earn money more simply.

248: Cells, for instance, cooperate because each has exactly the same interest: each cell contains exactly the same DNA. Since biological evolution is all about propagating DNA, this effectively means that each cell has the same interest. What is good for the DNA of the entire creature is good for the DNA in the cell, so DNA in the cell that promotes the reproduction of the creature is selected for. However, one still sees cheaters arising from time to time in cells. What happens when a cell ceases to cooperate is called cancer. An elaborate enforcement mechanism has evolved to try to prevent cancer, for example, by identifying and destroying cancerous cells.

249: …mitochondria (thought to be originally bacteria that invaded the cell and then evolved to cooperate symbiotically)…

250: [Description of how the Hayek machine works. See paper as well.]

257: …But until the system can solve one-block instances, it gets no feedback at all from the world…

262: We thought that Post production systems might be a good language for evolutionary programming because they are universal and because they build in pattern matching. Pattern matching is built in as follows. When we look at a production to see if its condition matches the current conclusion, we have to scan along the current conclusion to see if there is any place the variables can be made to match. This involves a search for patterns: if the condition has the pattern red blue red we have to scan down the string to see if that pattern occurs anywhere. We had some experience that said that pattern matching is useful for constructing programs. Giving it to the system as a primitive, built into the machinery that we supply before evolution even starts, thus seemed likely to give the evolution a head start.

265: Notice that these four agents are all profitable (given the size of the instances we were presenting). Agent A, which closes, bids the most. B always precedes A and so gets paid a large margin when A wins. Agent B always follows Agent C and bids higher than it, so C is profitable as well. And either B or C always follows D, so D is profitable, too.

286: This is not at all atypical of NP completeness arguments. They often (or perhaps always) rely on some analog of a gizmo, and they all map into structured, highly atypical examples of the class they are mapping into. Thus, we can map all the problems in NP into a tiny, very special subset of the graph-coloring problems or of the TSP problems or of the satisfiability problems.

I suggest that this shows how little structure the different graph-coloring problems have in common. They are sufficiently unstructured to be used to encode many diverse problems that no one would expect to have a sufficient common structure to exploit for a rapid solution.

287: So, all we really need in order to produce intelligence is to find a good enough solution, and only a good enough solution for those cases that arise in practice. We do not need a solution that always works; indeed, it is clear that our minds cannot solve every problem and that evolution has not solved every problem in producing us. To produce understanding the Occam results do not require the most compact representation; they merely require a representation much more compact than the size of the data. Thus, there is some reason to hope that the difficulties implied by NP-hardness won’t arise.

290: However, if one does not attempt to divide the problem into classification and control subproblems but rather simply evolves the control net to perform a behavior such as avoiding the wall and staying near a cylinder, the net learns easily.

291: This difficulty in classifying cylinders is an example of a more widely known problem called perceptual aliasing. Often one does not seem to have enough sensory data to distinguish exactly the state of the world. But this example shows that perceptual aliasing can often be solved by active sensing, where one takes actions in the world, such as moving around and sensing from different locations, or even acts on the world to see what happens. And the problem can often be totally avoided by evolving behavior that avoids confusing situations.

…They succeeded at this because they were able to choose in which states to trigger the grip action: they had evolved to grip only in states where they could confidently and correctly recognize cylinders.

297: The suggestion that propagation of constraints in related fashion is important to human intelligence is an old one. One of the earliest clear statements of this idea is due to David Waltz, in his 1972 Ph.D. dissertation (Waltz 1975). He wrote an algorithm for analyzing line drawings of three-dimensional objects such as cubes and polygons. Given a two-dimensional drawing showing the edges of such objects and the edges of the shadows they produce, Waltz’s algorithm used constraint propagation to rapidly produce a correct labeling of the figure showing the three-dimensional structure (see Figure 11.5).

Waltz’s procedure works by attempting to label each edge as one of the following: it is a crack in an object, the boundary of a shadow, or the boundary of a convex object, or it bounds a concave object. The labeling further indicates illumination information: the bounded surface is directly illuminated, shadowed by an object, or self-shadowed by virtue of facing away from the light source. Putting the illumination information together with the edge-type information gives over 50 possible edge labels. Waltz made as well a catalogue of possible junctions of edges: edge lines can meet at an L, at a T, at a fork, and so on, through a list of about ten kinds of intersections. Waltz further exhaustively catalogued the ways the edges coming into a junction can be labeled. And here is the key point: the number of junction labelings that are physically possible if the junction is to appear in a line drawing corresponding to a real three-dimensional collection of objects is much smaller than the total number of possible junction labelings one could imagine. For example, one could imagine that either of the two edges in an L junction could be labeled in any of 50 possible ways, so that L junctions could be imagined to have 2,500 = 50^2^ possible labelings. But most of these are not physically possible; if the edge coming in from one side bounds a convex, shaded object, then the edge coming in from the other side also does. In fact, there are only 80 legal types of L junctions.
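[My illustration, much reduced: the pruning idea with an invented three-label alphabet and a two-junction catalog, not Waltz's real labels or catalog. Each edge keeps a set of still-possible labels, and each junction's list of legal combinations prunes those sets until nothing changes.]

```python
LABELS = {"+", "-", ">"}                      # toy stand-ins for convex/concave/occluding
# junction -> (the edges it joins, the set of legal label tuples for those edges)
JUNCTIONS = {
    "j1": (("e1", "e2"), {("+", "+"), (">", ">")}),
    "j2": (("e2", "e3"), {("+", "-"), ("+", "+")}),
}

domains = {e: set(LABELS) for e in ("e1", "e2", "e3")}

changed = True
while changed:                                # propagate until a fixed point
    changed = False
    for edges, legal in JUNCTIONS.values():
        for i, e in enumerate(edges):
            supported = {combo[i] for combo in legal
                         if all(combo[j] in domains[edges[j]] for j in range(len(edges)))}
            if domains[e] - supported:
                domains[e] &= supported
                changed = True

print(domains)   # e1 and e2 collapse to {'+'}; e3 keeps {'+', '-'}
```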

299: Indeed, it seems plausible that some kinds of constraint propagation similar to those proposed by Waltz and by Blum underlie many of our thought processes. We don’t engage in huge breadth-first searches when we think; our thoughts run along carefully pruned, highly likely paths only. We don’t, for example, look at all possible paths in a chess game, as computer programs do. We don’t look at all possible next actions when we plan. We jump to consider only certain alternatives. Such an ability to suggest plausible lines of thought is reminiscent of Blum’s algorithm: at any given time what we have already figured out constrains us to consider only one or at most a few lines of thought in continuing to analyze the world or to compute what to do next.

Several lines of evidence support and clarify this picture in regard to language understanding. First, there is the difficulty we have in understanding what are known as garden-path sentences because they lead the listener down the garden path and lose her there. Examples, taken from Steven Pinker’s book The Language Instinct, include “the horse raced past the barn fell” or “fat people eat accumulates.” These sentences are hard to understand but have perfectly valid interpretations… Apparently, the problem here is that we jump to a conclusion early in the sentence that later proves to be incorrect. When we see “the horse raced,” we believe that the meaning has been constrained and fix on an interpretation. Later, when that interpretation proves wrong, we are stuck.

[…]

If, shortly after we hear a word, another word is flashed on a screen, we recognize the second word faster if it is related to the first. The mind is primed to receive it. Apparently, we are propagating expectations, as in the constraint propagation approach. However, if the priming (first) word has multiple unconnected meanings, it primes related words for all of its meanings.

300: The sentence “The defendant examined by the lawyer turned out to be unreliable” has a garden-path quality until we read the word by: we may first have posited that the defendant was examining rather than being examined. Readers of this sentence glance back to the beginning when they hit the word by, presumably checking their understanding. By contrast, the sentence “The evidence examined by the lawyer turned out to be unreliable” is much easier to parse, and readers’ eyes do not glance back to check understanding. The mind must be exploiting its understanding of semantics to make this distinction (Pinker 1994).

304: Our DNA is a program that has been evolved and thus encodes much knowledge (the first part), and this program itself codes for a program that learns. But this learning does not proceed from a blank slate. I argue that the crucial element in our learning during life is encoded in the DNA, which predisposes and guides our learning.

…There is thus a smooth gradation between simple development and learning.

…Like our development, this learning is predetermined in the sense that it is reliable, fast, and automatic, and would not occur absent the DNA programming.

305: Computational learning theory tells us that one cannot learn without an appropriate inductive bias and that the particular inductive bias chosen is critical to what one can learn.

…Also, as discussed in Chapter 11, learning is a computationally hard problem. It requires vast computational resources. But the overwhelming majority of the computational resources have been applied to evolving the DNA, not to learning during life. Most of what we learn, we learn rapidly with little computation. Moreover, the amount of computation that went into evolution is truly vast. Not only have there been many, many trials, but each trial involves a life interacting with the world. The amount of computation in even a single such trial is immense…

307: Many linguists of the Chomsky school argue that our DNA contains code for a grammar module with about 100 binary switches that becomes the grammar for English upon appropriate choice of these switches and the grammar for Swahili upon a different choice.

…each of us effortlessly and automatically learns roughly ten words a day throughout our childhood (Bloom 2000).

308: The DNA does not code directly for the heart or for the brain in a transparent way. Rather the DNA codes for a complex chemical process that when executed results in the construction of the heart and the brain.

310: We are not born with breasts or teeth, yet no one would say we learn these.

…Instead, there is a smooth gradation, with some things that depend more on the environment or that lead to more visible differences between creatures considered to be learning, and other changes that are more independent of the environment or vary less from creature to creature considered to be development.

311: C. elegans is a simple worm with a fixed nervous system.

312: Classic experiments by Hubel and Wiesel (1962) showed that if one of a kitten’s eyes is covered in the first few months of its life, the visual cortex will never form proper connections to see through that eye. Blakemore and Cooper (1970) showed that if kittens are raised in a room with only vertical black and white lines, they forever lose the ability to see horizontal objects.

313: Moreover, there are obvious reasons why using visual stimulus in designing the cortex should be helpful. One example is stereo vision. Stereo vision allows estimation of depths. It relies on the fact that objects project differently onto the two eyes. The disparity, the extent to which the projection is different, depends on the distance of the object. The mind uses this fact, which arises from simple geometry, to estimate distances. But this effect also depends on the length of the separation between the two eyes. To utilize stereo vision, depth perception thus has to be accurately tuned to this distance.

But the width of the eyes may be controlled by some other gene or genes than are controlling the visual cortex. Moreover, the width between the eyes is controlled in part by how big the creature grows, which depends on how much food it gets to eat and on how good the food is. The only sensible way to design a creature, the only way it will arrive at a system that is evolutionarily robust as genes are swapped around and mutated, and as food conditions improve and worsen, is to adjust the cortex during development, taking into account how wide apart the eyes are. That is, the only way to get depth perception right is to effectively learn the width between the eyes, embedding this learning in the development.
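[My addition: the standard pinhole-stereo geometry behind this passage, with depth Z, eye separation (baseline) b, and focal length f.]

```latex
d = \frac{f\, b}{Z} \qquad\Longrightarrow\qquad Z = \frac{f\, b}{d}
% d is the disparity between the two projections.  Recovering depth Z from d
% requires knowing b, which is why the developing visual system must in effect
% learn the width between the eyes.
```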

314: In 1797, a 12-year-old “wolf-child” was found in France, a child deprived of human contact from infancy, presumably reared by wolves. By the time of his death nearly thirty years later, he had learned only a few words. Numerous other examples of such children exist, including Genie, an abused child raised in a closet in Los Angeles, discovered in 1970 at the age of 13, who later learned a reasonable vocabulary but never learned to form grammatical sentences (see Pinker 1994).

315: What determines whether a cell grows into a liver cell or a skin cell is different interactions of the DNA with its chemical environment. Typically, the critical factor is the environment of the DNA within the cell, where the presence or absence of various chemicals (chiefly signaling proteins) turns on or off genetic regulatory networks.

[Somewhat dubious:] 320: By contrast, when computer scientists come to write learning algorithms, they most naturally treat the world as a string of n bits. A string of bits has no particular topology. There is no necessary notion that the third bit is near the fourth. There is only a collection of 2^n^ points with no ordering whatsoever. If we want to write a learning program for the game of Go or chess, say, we might as computer scientists naturally start by encoding the world as a string of bits. If we do this, we lose any notion that one point on the board is close to another point.

…Surely, when a person learns to play chess or Go, he brings an enormous amount of topological knowledge to it…

324: Evolution became smarter about searching in other ways. Maynard Smith and Szathmary (1999) have suggested that evolution has learned new methods of manipulating information, passing through eight major transitions: from “replicating molecules to populations of molecules in compartments,” from “independent replicators to chromosomes,” from “RNA as gene and enzyme to DNA and proteins,” from “prokaryotes to eukaryotes,” from “asexual populations to sexual populations,” from “protists to animals, plants, and fungi,” from “solitary individuals to colonies,” and from “primate societies to human societies and the origin of language” (16-19).

327: When the monkeys were shown, just one time, a video of a monkey reacting in terror to a snake, they developed a fear of snakes. When they were shown instead a video of a monkey reacting in terror to a flower, they did not develop a fear of flowers.

…Take two mobbing birds and put them in nearby cages. Show bird A a stuffed owl and at the same time show bird B a stuffed nectar-feeding bird it has never seen before. Bird A will make the mobbing call, and from this, bird B will learn to mob the nectar-feeding bird. Bird B can then pass on this mobbing of nectar-feeding birds to other birds the same way. The birds are not particularly predisposed to mob particular kinds of birds; in fact, they can be taught in this way to mob a laundry detergent bottle. But they are pre-programmed to learn to mob whatever potential predator they see the first time they see or hear another bird mob it (Gould and Gould 1994).

…Experiments show that it is predisposed to attend to songs with the right syllable structure.

328: Vervet monkeys are predisposed to learn to make warning calls in the presence of predators. Vervets have about four calls that are understood by other vervets as warning for particular predators. The eagle call causes vervets to take cover against a flying predator, retreating from exposed tops of trees. The snake call is ignored by vervets in trees but causes vervets on the ground to stand on their hind legs and look for snakes. The particular calls made are innate; the sound does not vary from region to region. However, the occasion for calling, and the behavior upon hearing the call, must be learned. Young vervets instinctively make alarm calls in response to a range of stimuli. For example, young vervets will make the eagle call in response to a range of flying objects, including a stork or even a falling leaf. But, in time, the vervets learn to call as the adults do, at the locally present predators only. In one region they may respond to eagles and in another to hawks; in one region they may call to baboons and in another to dogs (Hauser 2000; Gould and Gould 1994).

329: Another place information is encoded, however, is in the regulatory regions, e.g., the promoter or repressor patterns upstream of the genes that determine when the genes are expressed. Almost all human proteins are quite similar to those of other creatures, but the timing of expression is somewhat different. Much of the evolution of development that has occurred since the first bilaterally symmetric animals is thought to have been evolution of the regulatory regions and thus evolution of the genomic networks that control the timing of gene expression (Carroll, Grenier, and Weatherbee 2001).

331: In a paper that improved on ideas proposed by J. Mark Baldwin in 1896, Hinton and Nowlan (1987) suggested a mechanism by which learning during life could feed back information into the genome. Suppose there is some adaptation a creature could acquire that would improve fitness, and that accomplishing this adaptation requires discovering the setting for 30 binary switches, and unless they are all correct, the creature doesn’t get any benefit at all. It would be very hard for evolution to find such an adaptation, because it could only discover it through search, not hill climbing. Evolution cannot get started on hill climbing because it does not get any feedback until it gets all settings right. So it has to create roughly 2^30^, or a billion, creatures with different settings before it is likely to create one with the right settings. Even when a creature is created with the right settings, evolution is not home free. The creature may not survive, or its children may not share its settings. In the (perhaps unrealistic) case where having most settings right does not help (the adaptation is not useful unless all the settings are right), the genome can easily find the correct settings in one individual, lose them in her children, and then undergo a random walk with no driving force back toward the correct settings. If it randomly walks any distance, it is almost like beginning the search from scratch again. The genome will have to find the settings many times before they are prevalent in the population, and even finding them once may require a prohibitive search.

But now suppose that the creatures have the ability to set 10 of these switches during life and that they search through the settings of these switches looking for a correct setting. Now evolution only has to set 20 of the switches correctly and the creature will do the rest. Evolution only has to create a million or so creatures to get these right, not a billion.

Moreover, the more switches that are preset correctly, the fewer learning trials the creature needs, thus the faster it can learn the concept, and the more likely the creature is to learn it. Thus, a creature with 22 switches preset correctly will be fitter than one with 21 switches preset correctly, which in turn will be fitter than one with 20 switches preset correctly. So, now evolution gets feedback and can hill-climb to set the switches correctly in the genome. This feedback from learned knowledge into the genome is called the Baldwin effect.
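[My illustration, in the spirit of the model just described (Hinton and Nowlan 1987): the genome presets some of the 30 switches correctly and the remaining "plastic" switches are guessed during life, so creatures with more correct presets succeed earlier on average and are fitter, which is the gradient evolution can then climb. The trial count and fitness form are my assumptions.]

```python
import random

LIFETIME_TRIALS = 1000   # guesses a creature gets during its life

def lifetime_fitness(num_plastic):
    """All preset switches are assumed correct; the num_plastic remaining
    switches are guessed at random each trial.  Earlier success -> higher fitness."""
    for trial in range(1, LIFETIME_TRIALS + 1):
        if all(random.randint(0, 1) == 1 for _ in range(num_plastic)):
            return 1.0 - trial / LIFETIME_TRIALS
    return 0.0             # never found the setting: no benefit at all

# More correct presets (fewer plastic switches) -> higher average fitness.
for plastic in (12, 10, 8):
    avg = sum(lifetime_fitness(plastic) for _ in range(500)) / 500
    print(30 - plastic, "switches preset correctly -> average fitness", round(avg, 3))
```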

334: Examining such runs in detail, they observed explicit manifestations of the Baldwin effect: populations began by learning important criteria such as to move toward food, and later evolution pushed these behaviors into the genome so that the creatures were born with the behavior.

…one can more compactly and easily specify behaviors by specifying goals than by specifying actions.

335: …With instruction from the parent, the child may be able to fix 20 settings immediately with no search whatsoever. Exploiting this kind of phenomenon, evolution only has to discover a creature program that enables useful knowledge to be passed on from parent to child, and some creature has once to discover the knowledge. From then on it can reliably be passed on. So enormous parallelism is obtained: concepts that can once be discovered by some creature (and communicated) can be passed to all its descendants. Then, even when the adaptation requires that too many switches be set for any given creature to reasonably be able to discover them (say, in our example, 20 instead of 10), the adaptation can still be locked in if once discovered. So evolution can discover things that involved even greater searches, setting more during life than might otherwise be expected.

Such exploitation of culture is omnipresent in the animal kingdom. Compare, for example, the Alaskan brown bear and the grizzly bear. The two creatures are genetically indistinguishable and often live within a mile of each other. However, they look substantially different, the brown bear being much bigger and heavier with huge shoulder muscles. The difference stems solely from parental instruction. The brown bears live on the coast and have been instructed by their parents how to harvest the rich food sources there; thus they eat better and behave differently. So, the difference between the brown bear and the grizzly is due to nurture, not nature.

337: One possible consequence of this is expressed in the saying, Ontogeny recapitulates phylogeny, that is, development of creatures seems to pass through different stages in their evolutionary history. As a human embryo develops, for example, it passes through stages where it has gills and then a tail, both of which later go away.

338: our learning of language is so biased and so automatic, so turned on and off by our genes, that in many ways it makes more sense to consider it development rather than learning.

…Experiments have shown that, like the birds that can identify their own species’ language by its syllable structure, human infants innately recognize the more than two dozen consonant sounds of human speech, including those not present in the language they are learning (Gould and Gould 1994, 203-209).

339: Another school of linguists has proposed an even more powerful model of inductive bias, dubbed optimality theory (Tesar and Smolensky 2000). In this model, grammar consists of a collection of constraints. These constrain word order, phonology, agreement among the parts of speech, and everything else necessary to determine a human grammar. The set of constraints is identical from language to language and built into the human genome. What differs from language to language is the ordering of the constraints… To speak grammatically, one must obey as many high-order constraints as possible, but one violates lower-order constraints whenever it is impossible to satisfy them without violating a higher-ranking constraint.

342: When groups of adults with no common language come together, they develop what is known as a pidgin. A pidgin has no grammar. The word order is unimportant, and even moderately complex statements cannot be accurately expressed. Such constructions of pidgins have happened numerous times, particularly when workers from around the world have been imported into some location such as Hawaii. The children of these workers grow up hearing the pidgin, and they speak a new language that is called a creole. The creole is a whole new language with, initially, the lexicon of the pidgin (somewhat extended) but with a full grammar.

351: E. coli cannot precisely control the direction it swims: all it can do is swim forward. How does it navigate? If it decides it “likes” the direction in which it is swimming, it swims (forward). If it decides it does “not like” the direction in which it is swimming, it modifies a protein (to be precise, it phosphorylates a protein called CheY, attaching a phosphorus atom to it), which stops the motor. It then tumbles and resumes swimming in a random direction. By swimming purposefully toward attractive stimuli and undergoing a random walk otherwise, it is capable of directing itself efficiently. Its speed in terms of body lengths per second, scaled up to human size, would be about 50 miles per hour.
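The navigation strategy is simple enough to sketch in a few lines; the attractant field, step size, and trial counts below are stand-ins of my own, not measured values.

```python
import random, math

# Run-and-tumble chemotaxis, minimally sketched: keep swimming forward while
# the attractant concentration is rising, otherwise tumble to a random heading.

def concentration(x, y):
    # Hypothetical attractant field peaking at the origin.
    return -math.hypot(x, y)

x, y, heading = 10.0, 10.0, 0.0
last = concentration(x, y)
for _ in range(2000):
    x += math.cos(heading); y += math.sin(heading)   # run: swim forward one step
    now = concentration(x, y)
    if now <= last:                                  # "does not like" the direction:
        heading = random.uniform(0, 2 * math.pi)     # tumble to a random new heading
    last = now
print(round(math.hypot(x, y), 1))  # typically ends within a few steps of the peak
```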

353: The squid use the luminescence to avoid casting a shadow as they cruise the sea floor looking for prey in the moonlight.

355: Bees can communicate with each other using a dance that indicates to watching bees the direction and distance of food, the quality of the food, and incidentally its odor, which is communicated to the observing bees from odor sticking to the waxy hairs of the dancing foragers. The communication makes use of an elaborate but mechanical code that maps the bees’ sensory input straightforwardly into polar coordinates. If the distance to the food is less than 75 meters, the bees dance a “round dance,” and if it is farther, they dance a “waggle dance.” In the waggle dance, distance is indicated by the rate at which the bees waggle. The angle between the direction one must fly and the sun is indicated by the direction of the axis of the dance. A bee may continue dancing for several hours, during which time the sun moves across the sky. The dancing bee rotates the axis of the dance so that the angular direction remains correct as the sun moves, even though it dances inside the hive where it can’t see the sun.
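Read as a code, the dance is a mapping from (bearing relative to the sun, distance) into a small set of dance parameters. The sketch below is my own paraphrase of that mapping; the 75-meter threshold comes from the passage, while the particular waggle-rate function is only illustrative.

```python
import math  # imported for completeness if angles need further processing

# Toy encoder (my own construction) for the polar-coordinate code the passage
# describes: distance and direction of food -> dance type and dance axis.

ROUND_DANCE_LIMIT_M = 75   # below this distance the bees use the round dance

def encode_dance(food_bearing_deg, sun_bearing_deg, distance_m):
    if distance_m < ROUND_DANCE_LIMIT_M:
        return {"dance": "round"}
    return {
        "dance": "waggle",
        # axis of the dance = angle between the flight direction and the sun
        "axis_deg": (food_bearing_deg - sun_bearing_deg) % 360,
        # waggle rate varies with distance; this inverse relation is illustrative
        "waggle_rate": 1000.0 / distance_m,
    }

# As the sun moves, a long-dancing bee re-encodes so the axis stays correct.
print(encode_dance(food_bearing_deg=40, sun_bearing_deg=10, distance_m=300))
print(encode_dance(food_bearing_deg=40, sun_bearing_deg=25, distance_m=300))
```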

357: The decision of where to start a new hive, for example, is collective. Scouts search for possible sites, and when they find a promising candidate, return to the hive and dance to indicate the location of the proposed site to other bees. They then return to the site, and if it still pleases them, come back and dance again. But these same scouts also break off returning to their own site and speaking to its merits in order to observe the dances of other scouts. They will visit the sites found by other scouts, as indicated in their dances, and weigh the merits of the other sites against their own. Again the individual bees weigh a host of factors, including cavity volume, shape, entrance direction, dampness, and draftiness. After the scouts look at the other sites and compare them, and go back and forth to compare the most popular sites for a while, a consensus builds, with all the scouts going back and forth to one preferred site. Then the swarm goes off en masse to found a new hive.
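One way to see why such a procedure converges is to simulate a bare-bones version of it. The switching rule and quality scores below are simplifying assumptions of my own, not the book's or the bees': each round a random scout watches another scout's dance, revisits the advertised site, and switches allegiance if it judges that site better.

```python
import random

# Bare-bones consensus among scouts (my own simplified dynamics).
site_quality = {"hollow_oak": 0.9, "rock_crevice": 0.6, "old_barn": 0.4}
scouts = [random.choice(list(site_quality)) for _ in range(20)]

for _ in range(1000):
    dancer, watcher = random.sample(range(len(scouts)), 2)
    if site_quality[scouts[dancer]] > site_quality[scouts[watcher]]:
        scouts[watcher] = scouts[dancer]      # adopt the better-judged site

print(set(scouts))   # almost always collapses to {'hollow_oak'}
```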

361: Even assuming that the bees already understand concepts of “same” and “different,” the fact that they realize after only a handful of trials that this is what they are being paid for implies that the bees consider these concepts quite high on the list of possible hypotheses for how to find food.

363: There are (at least) two ways one may imagine coming up with such a new module: discovering it or being instructed. It is reasonable to conjecture that these are often closely related, in the following sense. When instructed, say, to build a recursion module, a learner must still figure out how to construct the code, how to connect which preexisting modules in what way. The verbal instructions do not specify exactly how to wire the brain to solve new problems with recursion; rather, the learner sees examples of problems like Blocks World solved with recursion, and he solves simple examples in order to learn. The burden of actually constructing the code in his mind still falls upon the learner. The instruction may be mainly suggesting intermediate steps, so the code does not have to be constructed all at once; or the instruction may consist of guidance that the learner is on the right path, thus reducing the size of any search to be done and inducing him to continue with the construction.
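As a concrete (and entirely invented) example of the small recursive routine a learner might end up wiring together from such worked examples, consider clearing a block in Blocks World: to clear a block, first clear and remove whatever is stacked on top of it.

```python
# Minimal Blocks World recursion (my own illustration, not code from the book).

on_top_of = {"A": "B", "B": "C"}   # hypothetical state: A sits on B, B sits on C

def clear(block, state):
    """Recursively move everything above `block` to the table."""
    above = [b for b, below in state.items() if below == block]
    for b in above:
        clear(b, state)            # recursive call: clear the obstructer first
        state[b] = "table"         # then move it to the table
        print(f"move {b} to table")

clear("C", dict(on_top_of))
# Expected output:
# move A to table
# move B to table
```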

364: In 1984 the psychologists Bennett Galef, Jr., and David Sherry showed that chickadees would learn to peel back foil tops from milk bottles if given the example of a peeled-back top without seeing the peeling process. This should not be surprising; peeling bark and pecking are routines birds know well. The key to the discovery was presumably just learning that there was food to be had in milk bottles. Once this subgoal was discovered, probably by a bird randomly peeling back the foil (and tits are known to peel back wallpaper on occasion), constructing the behavior of pecking open the foil and drinking was straightforward, involving a very short chain of existing routines (Gould and Gould 1994, 74-76; Hauser 2000, 130).

A similar example is provided by rats learning to consume Jerusalem pine cones. The tough cones protect tasty seeds. Naive rats are unable to extract the seeds efficiently, although they know there is food to be had, and will sometimes demolish the cones in an effort to extract the seeds. Rats raised by parents who know how to peel back the layers pick up the effective technique. But, alternatively, naive rats can learn the technique in the laboratory if given a partially stripped pine cone. Evidently, the technique is too long and involved to discover from scratch, but given an intermediate subgoal, the animals can construct a program.

…But much or perhaps all such learning could plausibly be explained simply in terms of cutting down the size and complexity of a program that needs to be constructed in one chunk.

[Lack of inductive bias:] 368: The philosopher Willard Van Orman Quine (1960) famously asked how one could ever possibly learn any word. Say you went to a foreign land, and a native there was showing you around, and all of a sudden a gazelle ran across the field. Your guide shouts out, “Gavagai!” You now think you know the word for gazelle, but as Quine pointed out, this could instead logically be the word for running, or for a gazelle’s left foreleg, or for either a gazelle running or a bear asleep, or an infinitude of other possibilities.

[Compare “There goes a Gavagai running” and “There goes an antelope Gavagai.”]

373: Moreover, I don’t believe people need to use language to think about things that aren’t present or to reason more generally. I can, for example, easily visualize my backyard and mentally walk around in it to figure out what my kids’ treehouse looks like from various angles, and the whole computation seems completely nonverbal. Similarly, numerous mathematicians, introspecting about their reasoning processes, have said the process is nonverbal. Einstein, for example, said, “Words and language, whether written or spoken, do not seem to play any part in my thought processes. The psychological entities that serve as building blocks for my thought are certain signs or images, more or less clear, that I can reproduce and combine at will” (Devlin 2000, 124).

378: Finally, it’s worth noting that in our technology and culture, people have invented several modules, several inductive biases, that have greatly aided the search for new progress. These include discoveries such as the axiomatic method and the scientific method… Other discoveries, like money and markets and printing presses, can also be seen as greatly facilitating the evolution of ideas.

380: These arguments suggest that there might be an evolutionary hurdle to get over in order to develop multiphoneme words or speech with syntax. Vervet monkeys, who can convey a relatively small number of messages such as “eagle,” “leopard,” and “snake,” are on one side of this hurdle, evolving to convey their limited repertoire effectively using single-word messages. They are perhaps stuck in an evolutionary dead end where they can’t convey many words because they don’t break words down into phonemes, and can’t convey flexible messages because they don’t use syntax, but they can’t evolve to use digital encoding because they are not conveying enough messages. With the number of messages they are conveying, it is fitter to use single-word utterances, and with single-word utterances it is harder to use many of them, so they are evolutionarily trapped. Once we break through the barrier by needing to convey enough messages, it becomes fitter to use syntax to convey them, and then we can smoothly evolve all the flexibility of human language.
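A quick back-of-the-envelope comparison (my own numbers, purely illustrative) shows how steep the payoff is once digital encoding is adopted:

```python
# Illustrative arithmetic (not from the book): holistic calls vs. phoneme strings.
phonemes, word_length = 30, 3
holistic_repertoire = 5                    # roughly the vervets' handful of calls
combinatorial_capacity = phonemes ** word_length
print(holistic_repertoire, combinatorial_capacity)   # 5 versus 27,000 messages
```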

381: But if one species adopts the behavior of word learning, if some individuals start expecting others to learn words they use, and they start expecting others to use words that they therefore should learn, they pass over the hurdle and could swiftly evolve to improve their abilities.

…A subcontractor agent cannot compute something useful until the system knows how to use the result, and the system cannot know how to use the result until the result is computed.

410: Consider, for example, the color phi phenomenon (Kolers and von Grunau 1976). If two differently located, differently colored spots are lit for 150 milliseconds each, first one and then the other with a 150 millisecond interval between, you subjectively see one spot moving from the position of the first to the position of the second and changing color midway through.

412: What you actually have in your mind is a model that is not at all detailed. When you want detail about any given region, say, if you really want to know that there is print on a particular portion of this page, you flick your fixation point there and look closely. Each flick of your fixation point, which you do unconsciously two or three times a second, is called a saccade. The mental model you maintain contains summary information produced by darting your eyes around and filling in based on your knowledge about the world. If you are looking at a forest, and you see leaves at your fixation point, and saccade (move your fixation point to a number of places) and see leaves at each point, you fill in your mental model with “there are leaves everywhere.” This contains very little information: you do not actually have in your mind all the bits that would be necessary to specify the positions of all the separate leaves. The information you have is that there is a big region with leafy-looking texture. The information you have about this page is that there is a big region of wordy-looking texture, not any knowledge about any particular words that you have not foveated. The picture you have in your mind is thus a summary picture, breaking the world up into objects and listing just a little information about each object.

418: Possibly short-term memory is so limited because we have the capability of applying all our modules to the tokens in short-term memory, and the wiring for this may be expensive.

419: Hopfield (1982) proposed a specific model of memory storage that viewed brain circuitry as a dynamic system and memories as attractors placed into that system. Such an attractor can be viewed as a small valley in a surface like a fitness landscape. The memory is then retrieved by placing the system anywhere in the valley and allowing it to flow to the bottom, retrieving the memory. But as new memories are inserted into such a system (which can be viewed as a process of pushing down the surface to make a new valley), unwanted valleys, corresponding to spurious memories, are inadvertently created between desired valleys. Crick and Mitchison’s proposal is that dreams are a process by which such unwanted modes are unlearned.
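A minimal version of the Hopfield construction can be written in a few lines. This is the standard textbook recipe (Hebbian outer-product weights, sign-threshold updates), not code from the book, and the pattern sizes are arbitrary; storing many more patterns than the network's capacity is what produces the spurious valleys the passage mentions.

```python
import numpy as np

# Hopfield-style associative memory: memories stored as attractors, retrieval
# flows a noisy cue "downhill" to the nearest stored pattern.

rng = np.random.default_rng(0)
patterns = rng.choice([-1, 1], size=(3, 64))          # three +/-1 memories

# Hebbian storage: sum of outer products, zero diagonal.
W = sum(np.outer(p, p) for p in patterns).astype(float)
np.fill_diagonal(W, 0)

def recall(cue, steps=20):
    state = cue.copy().astype(float)
    for _ in range(steps):                            # repeated threshold updates
        state = np.sign(W @ state)
        state[state == 0] = 1
    return state

noisy = patterns[0].copy()
noisy[:10] *= -1                                      # corrupt 10 of 64 bits
print(np.array_equal(recall(noisy), patterns[0]))     # usually True: memory retrieved
```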

425: The tastes of fruits are co-evolved as well: edible fruits have evolved attractive tastes because being eaten disperses their seeds to the benefit of their genes (Pollan 2002). Another interesting example of a scented flower is the black lily, which smells like fecal matter. It evolved that way because it is pollinated by flies, which are attracted to that smell.

427: In brief, the argument is this. AIT defines the information in a string of bits as the length of the smallest computer program that would print the string and then halt (information in AIT is identical to the definition of minimum description length; see section 4.2). It follows immediately from this definition that no computer program, and equivalently no mathematical proof, can generate information. Any string that the computer program generates, even if it is longer than the computer program itself, obviously has no more information than the program did.
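In the standard AIT notation (my rendering, not the book's), the definition and the consequence the passage draws can be stated compactly:

```latex
% Kolmogorov complexity: the length of the shortest program p that makes a
% fixed universal machine U print x and then halt.
K(x) = \min \{\, |p| : U(p) = x \ \text{and}\ U \ \text{halts} \,\}

% If a string y is the output of a fixed program q, then q itself serves as a
% description of y, so running programs (or proofs) cannot create information:
K(y) \le |q| + O(1)
```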